This project explores multimodal emotion recognition using facial and vocal cues from the RAVDESS dataset. We compare three fusion strategies built on transformer and attention-based architectures, with the goal of building affective computing systems that better understand human emotions.
Emotion recognition is central to improving human-computer interaction, sentiment analysis, and adaptive interfaces. Our system processes both facial expressions (video) and speech signals (audio) using:
- A vision branch powered by EfficientFace (pre-trained on AffectNet)
- An audio branch based on MFCCs and convolutional layers
- Three distinct modality fusion strategies:
  - Late Transformer Fusion
  - Intermediate Transformer Fusion
  - Intermediate Attention-Based Fusion (best performance)
Our best model, the Intermediate Attention-Based Fusion, achieved:
- Top-1 Accuracy: 33.96%
- Top-5 Accuracy: 98.12%
Project structure:
```
├── data/         # Preprocessed RAVDESS dataset
├── models/       # PyTorch models and architecture definitions
├── utils/        # Preprocessing, dataloaders, and utilities
├── experiments/  # Training scripts and result logs
├── notebooks/    # Exploratory analysis and visualizations
├── results/      # Output metrics and model predictions
└── README.md     # Project documentation
```
We used the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS):
- 24 actors (12 female, 12 male)
- 7,356 audio-video recordings
- 8 emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised
- Available in three formats: audio-only, video-only, and audiovisual; all labels are encoded in the file names (see the sketch below)
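RAVDESS encodes its metadata in the file name itself (e.g., `02-01-06-01-02-01-12.mp4`, where the third field is the emotion code). Below is a minimal sketch of parsing that naming convention, assuming the standard RAVDESS field order; the `parse_ravdess_filename` helper is illustrative rather than the exact loader used in `utils/`.

```python
from pathlib import Path

# RAVDESS emotion codes (third field of the file name).
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess_filename(path: str) -> dict:
    """Map a RAVDESS file name to its metadata fields.

    File names look like 02-01-06-01-02-01-12.mp4:
    modality-vocal_channel-emotion-intensity-statement-repetition-actor.
    """
    fields = Path(path).stem.split("-")
    return {
        "modality": fields[0],       # 01 audio-visual, 02 video-only, 03 audio-only
        "vocal_channel": fields[1],  # 01 speech, 02 song
        "emotion": EMOTIONS[fields[2]],
        "intensity": fields[3],      # 01 normal, 02 strong
        "actor": int(fields[6]),     # 1-24; even-numbered actors are female
    }

print(parse_ravdess_filename("02-01-06-01-02-01-12.mp4"))
# {'modality': '02', 'vocal_channel': '01', 'emotion': 'fearful', ...}
```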
- 🎧 Audio: Resampled to 16 kHz, mono conversion, MFCC extraction, amplitude normalization
- 📹 Video: Extracted 15 frames per video, resized to 224×224, augmented using OpenFace
- 🔄 Audio and video streams synced and standardized for transformer compatibility (see the preprocessing sketch below)
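A rough sketch of this preprocessing, assuming librosa for MFCC extraction and OpenCV for frame sampling; parameter choices such as `n_mfcc=40` are illustrative defaults, and the OpenFace-based augmentation step is omitted here:

```python
import cv2
import librosa
import numpy as np

def extract_mfcc(audio_path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    """Load audio as 16 kHz mono, normalize amplitude, and compute MFCCs."""
    y, _ = librosa.load(audio_path, sr=sr, mono=True)
    y = y / (np.max(np.abs(y)) + 1e-8)                       # amplitude normalization
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, time_frames)

def extract_frames(video_path: str, num_frames: int = 15, size: int = 224) -> np.ndarray:
    """Sample `num_frames` evenly spaced frames and resize them to size x size."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (size, size)))
    cap.release()
    return np.stack(frames)                                  # (num_frames, 224, 224, 3)
```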
- Vision Branch:
  - EfficientFace (pre-trained on AffectNet) + temporal 1D conv layers
- Audio Branch:
  - MFCC input → 4 conv blocks → global average pooling (see the audio-branch sketch below)
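A minimal PyTorch sketch of the audio branch as described (MFCC input → 4 conv blocks → global average pooling); channel widths, kernel sizes, and the embedding dimension are assumptions rather than the exact definitions in `models/`:

```python
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """MFCC input -> 4 Conv1d blocks -> global average pooling -> embedding."""

    def __init__(self, n_mfcc: int = 40, embed_dim: int = 128):
        super().__init__()
        channels = [n_mfcc, 64, 128, 256, embed_dim]          # 4 conv blocks
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [
                nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm1d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool1d(2),
            ]
        self.conv = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool1d(1)                    # global average pooling over time

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, n_mfcc, time) -> (batch, embed_dim)
        return self.pool(self.conv(mfcc)).squeeze(-1)

# Example: a batch of 8 clips with 40 MFCC coefficients over 128 time steps.
emb = AudioBranch()(torch.randn(8, 40, 128))
print(emb.shape)  # torch.Size([8, 128])
```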
| Fusion Type | Description |
|---|---|
| Late Transformer Fusion | Independent unimodal processing; features fused at the transformer stage |
| Intermediate Transformer | Fusion at mid-level feature layers with transformer-based cross-attention |
| Intermediate Attention 🏆 | Scaled dot-product attention between modalities, with no feature entanglement (sketched below) |
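The winning strategy can be illustrated with a short sketch: each modality attends to the other via scaled dot-product attention, and the attended streams are pooled and averaged rather than concatenated. Shapes and the final pooling step are assumptions, not the exact implementation in `models/`:

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention: `query` tokens attend over `context` tokens.

    query:   (batch, T_q, d)    context: (batch, T_c, d)
    returns: (batch, T_q, d)
    """
    d = query.size(-1)
    scores = torch.matmul(query, context.transpose(1, 2)) / d ** 0.5   # (B, T_q, T_c)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, context)

# video_feats: per-frame embeddings, audio_feats: per-time-step embeddings
video_feats = torch.randn(8, 15, 128)
audio_feats = torch.randn(8, 8, 128)

video_attended = cross_modal_attention(video_feats, audio_feats)  # video attends to audio
audio_attended = cross_modal_attention(audio_feats, video_feats)  # audio attends to video

# Pool each attended stream and average; the modalities stay separate until
# this final step (no feature concatenation or entanglement).
fused = 0.5 * (video_attended.mean(dim=1) + audio_attended.mean(dim=1))  # (8, 128)
```

Keeping the two streams separate until the final pooling is what distinguishes this variant from the transformer-based fusions, which mix features earlier.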
- Optimizer: SGD (lr = 0.04, momentum = 0.9, weight decay = 1e-3)
- Epochs: 100
- Data Augmentation: Random horizontal flips and rotations (see the training-setup sketch below)
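These settings map directly onto PyTorch. In the sketch below the model, data, and loss are stand-ins so the snippet runs on its own; the project's actual training scripts live in `experiments/`, and the flip/rotation augmentation (applied to video frames) is omitted:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in classifier and synthetic MFCC data; in the project the model is one
# of the fusion architectures and the loader comes from utils/.
model = nn.Sequential(nn.Flatten(), nn.Linear(40 * 128, 8))
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 40, 128), torch.randint(0, 8, (64,))),
    batch_size=16, shuffle=True,
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.04, momentum=0.9, weight_decay=1e-3
)

for epoch in range(100):
    for mfcc, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(mfcc), labels)
        loss.backward()
        optimizer.step()
```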
| Method | Loss | Top-1 Accuracy (%) | Top-5 Accuracy (%) |
|---|---|---|---|
| Late Transformer Fusion | 16.699 | 14.375 | 59.167 |
| Intermediate Transformer | 35.392 | 13.958 | 85.208 |
| Intermediate Attention 🏆 | 2.393 | 33.958 | 98.125 |
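For reference, Top-1/Top-5 accuracy can be computed from model logits as in the sketch below; the reported numbers come from the logs in `experiments/`, not from this snippet:

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, ks=(1, 5)) -> dict:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    max_k = max(ks)
    _, pred = logits.topk(max_k, dim=1)            # (batch, max_k) class indices
    correct = pred.eq(labels.unsqueeze(1))         # (batch, max_k) boolean hits
    return {f"top{k}": correct[:, :k].any(dim=1).float().mean().item() for k in ks}

logits = torch.randn(32, 8)                        # 8 emotion classes
labels = torch.randint(0, 8, (32,))
print(topk_accuracy(logits, labels))               # e.g. {'top1': 0.12, 'top5': 0.66}
```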
- Transformer-based fusion offers solid cross-modal alignment but risks overfitting.
- The simpler attention-based fusion, which exchanges information through scaled dot-product attention without directly concatenating features, performs best in this setup.
- Audio and facial features complement each other; their joint learning improves robustness.
- Vaswani et al., Attention is All You Need
- Kumar et al., Multimodal Emotion Recognition on RAVDESS
- Yue et al., Multi-task learning for emotion and intensity recognition
- Sherman et al., Speech Emotion Recognition with BLSTM + Attention
- Integrate self-supervised embeddings (e.g., Wav2Vec 2.0)
- Expand to larger datasets (IEMOCAP, CREMA-D)
- Explore graph-based fusion or temporal transformers
- Real-time deployment in adaptive user interfaces
Venkata Revanth Jyothula
📍 New York City
📫 jyorevanth@gmail.com