A comprehensive analysis of Machine Learning and Transformer models for multi-label emotion detection on the GoEmotions dataset.
This repository contains the code and resources for the project "Emotion Detection: A Comparative Analysis," which evaluates classic machine learning models against modern Transformer architectures. The key result is a fine-tuned RoBERTa model with high sensitivity: a weighted average recall of 0.66.
- Comprehensive Analysis: Compares four models (Logistic Regression, Random Forest, DistilBERT, and RoBERTa).
- Multi-Label Classification: Tackles the complex task of predicting multiple emotions for a single piece of text.
- In-Depth EDA: Includes detailed Exploratory Data Analysis on the GoEmotions dataset, highlighting the severe class imbalance.
- High-Recall Model: The fine-tuned RoBERTa model achieves a weighted average recall of 0.66, demonstrating high sensitivity.
- Code: All code is provided in easy-to-follow Jupyter/Colab notebooks.
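For readers unfamiliar with the metric quoted above, here is a minimal sketch of how a weighted average recall can be computed for multi-label predictions. The function name `weighted_recall`, the three label columns, and the toy data are illustrative assumptions, not code or results from the notebooks. Each label's recall is weighted by its support (the number of true examples of that label), which is also how scikit-learn's `recall_score(average="weighted")` averages.

```python
def weighted_recall(y_true, y_pred):
    """Support-weighted mean of per-label recall for multi-hot label rows."""
    n_labels = len(y_true[0])
    weighted_sum = 0.0
    total_support = 0
    for j in range(n_labels):
        # Support: how many examples truly carry label j.
        support = sum(row[j] for row in y_true)
        if support == 0:
            continue  # a label with no true examples contributes nothing
        true_pos = sum(
            1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1
        )
        weighted_sum += support * (true_pos / support)
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


# Hypothetical label columns: joy, anger, sadness.
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]

score = weighted_recall(y_true, y_pred)
print(round(score, 2))  # 3 of the 5 true labels are recovered
```

Because frequent labels carry more weight, this number can look healthy even when rare emotions are recalled poorly, which is worth keeping in mind given the class imbalance noted above.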
The project workflow is divided into two parallel approaches:
- Classic Machine Learning Baseline:
  - Text is vectorized using TF-IDF.
  - Logistic Regression and Random Forest models are trained using a MultiOutputClassifier.
- Transformer Fine-Tuning:
  - Text is tokenized using specific tokenizers for DistilBERT and RoBERTa.
  - The pre-trained models are fine-tuned on the GoEmotions dataset using PyTorch and Hugging Face.
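The classic baseline described above can be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions rather than the notebooks' actual code: the toy sentences, the two label columns, and the hyperparameters (such as `max_iter=1000`) are invented for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical). Columns of `labels`: joy, anger.
texts = [
    "I love this so much",
    "This makes me so angry",
    "What a wonderful surprise",
    "I hate waiting in line",
    "So happy but also furious about the delay",
    "The meeting is at noon",
]
labels = np.array([
    [1, 0],
    [0, 1],
    [1, 0],
    [0, 1],
    [1, 1],  # a single text can carry several emotions
    [0, 0],  # or none at all
])

# TF-IDF turns text into sparse features; MultiOutputClassifier then
# fits one independent binary classifier per label column.
baseline = make_pipeline(
    TfidfVectorizer(),
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
)
baseline.fit(texts, labels)

# The prediction is a multi-hot row with one 0/1 entry per label.
predictions = baseline.predict(["I love surprises"])
print(predictions.shape)
```

Swapping `LogisticRegression` for `RandomForestClassifier` gives the second baseline with the same pipeline shape.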
The Transformer models outperformed the classic machine learning baselines. Our fine-tuned RoBERTa model achieved the best performance, most notably a weighted average recall of 0.66, indicating its effectiveness at identifying emotions.
A key finding was the precision-recall trade-off, a direct consequence of the dataset's class imbalance. While our model excels at finding emotions (high recall), it sometimes over-predicts, leading to lower precision.
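One way to see this trade-off is to vary the decision threshold applied to per-label probabilities. The sketch below is illustrative only: the function `micro_precision_recall`, the probabilities, and both thresholds are assumptions made for the example, not outputs of the fine-tuned model. Lowering the threshold recovers more true labels (higher recall) but also admits more false positives (lower precision).

```python
def micro_precision_recall(y_true, probs, threshold):
    """Micro-averaged precision and recall at a given decision threshold."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, probs):
        for truth, prob in zip(true_row, prob_row):
            predicted = prob >= threshold
            if predicted and truth:
                tp += 1
            elif predicted and not truth:
                fp += 1
            elif truth and not predicted:
                fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical multi-hot labels and model probabilities (3 labels).
y_true = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 0]]
probs = [
    [0.90, 0.40, 0.10],
    [0.32, 0.60, 0.35],
    [0.70, 0.20, 0.40],
    [0.45, 0.10, 0.05],
]

for threshold in (0.5, 0.3):
    precision, recall = micro_precision_recall(y_true, probs, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.5: precision=1.00, recall=0.75
# threshold=0.3: precision=0.50, recall=1.00
```

On an imbalanced dataset, tuning the threshold per label rather than using a single global value is a common option, though whether it helps here would need to be checked against the notebooks' results.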
To run this project, follow these steps:
- Clone the repository.
- Install the dependencies:
  pip install -r requirements.txt
- Download the dataset: The GoEmotions.csv dataset is available from Kaggle or Google's GitHub.
- Run the notebooks: Open the files in the notebooks/ directory using Jupyter Notebook, JupyterLab, or Google Colab.