A Diagnostic and Predictive Exploration of Crowdsourced Emotional Labels
This project explores emotional patterns, rater bias, and label reliability in the GoEmotions dataset using Python and data science tools.
The EDA investigates:
- Annotation inconsistency and “neutral spamming” (per-rater check sketched below)
- Emotion co-occurrence and correlation (heatmap sketch below)
- Contradictions between labels and textual signals
- Predictive performance using TF-IDF + Logistic Regression (baseline sketch below)
- Semantic structure using Word2Vec + t-SNE (projection sketch below)
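A minimal sketch of the per-rater “neutral spamming” check. It assumes the full Kaggle export (e.g. `goemotions_1.csv`) with a `rater_id` column and a binary `neutral` column; the file name and the 100-label threshold are illustrative, not the notebook's exact settings.

```python
import pandas as pd

# Assumption: the full Kaggle GoEmotions export, which includes a 'rater_id'
# column and a binary 'neutral' column; adjust names if your copy differs.
df = pd.read_csv("goemotions_1.csv")

# Share of each rater's annotations labelled neutral, for raters with enough volume.
per_rater = (
    df.groupby("rater_id")["neutral"]
      .agg(neutral_rate="mean", n_labels="count")
      .query("n_labels >= 100")          # illustrative volume threshold
      .sort_values("neutral_rate", ascending=False)
)

print(per_rater.head(10))  # candidates for "neutral spamming"
```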
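A sketch of the emotion co-occurrence analysis, assuming the same export with one binary 0/1 column per emotion; the subset of emotion columns shown is illustrative.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("goemotions_1.csv")  # path is an assumption

# Illustrative subset of the 27 emotions + neutral.
emotions = ["admiration", "amusement", "anger", "annoyance", "approval",
            "gratitude", "joy", "sadness", "surprise", "neutral"]

# Pearson correlation between co-assigned binary labels.
corr = df[emotions].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Emotion label correlation")
plt.tight_layout()
plt.show()
```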
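A hedged single-emotion baseline for the TF-IDF + Logistic Regression step. The notebook's actual setup may differ (for example, a multi-label One-vs-Rest wrapper over all 28 labels); the `joy` column is used here only as an example target.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("goemotions_1.csv")  # path is an assumption

# Binary classification for a single example emotion.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["joy"], test_size=0.2, random_state=42, stratify=df["joy"])

clf = make_pipeline(
    TfidfVectorizer(max_features=20_000, ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```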
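A sketch of the Word2Vec + t-SNE step: train a small embedding model on the comments themselves, then project the most frequent words to 2-D. Hyperparameters are illustrative, not the notebook's exact values.

```python
import pandas as pd
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from sklearn.manifold import TSNE

df = pd.read_csv("goemotions_1.csv")  # path is an assumption
sentences = [simple_preprocess(t) for t in df["text"].astype(str)]

# Small Word2Vec model trained on the comment corpus.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=10, workers=4, epochs=5)

# Project the 500 most frequent words into 2-D with t-SNE.
words = w2v.wv.index_to_key[:500]
coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=42).fit_transform(w2v.wv[words])

plt.figure(figsize=(10, 8))
plt.scatter(coords[:, 0], coords[:, 1], s=5, alpha=0.5)
for i, word in enumerate(words[:100]):  # label only a subset for readability
    plt.annotate(word, (coords[i, 0], coords[i, 1]), fontsize=7)
plt.title("t-SNE projection of Word2Vec embeddings")
plt.tight_layout()
plt.show()
```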
This project was presented as the final EDA project at TovTech.
- `emotion-architecture-in-reddit-comments-by-d.ipynb` — contains the full annotated analysis, visuals, and conclusions
As part of this EDA, I studied statistics, data science, and modeling using the DataCamp platform. I learned to:
- Apply diagnostic thinking to real-world datasets
- Identify annotation bias and contradictions
- Use statistical tools to interpret NLP structures
- Communicate findings clearly under pressure
Coming from a nontraditional background, I found that this project showed me the power of clean logic, ethical scrutiny, and practical data skills.
- Source: GoEmotions on Kaggle
Feel free to reach out or fork this repo if you're interested in:
- Emotion AI quality control
- Annotator profiling
- Ethical NLP systems