Commit 0b4c1d2

README.md
1 parent 2071a38 commit 0b4c1d2

File tree: 1 file changed (+82, -1 lines)


README.md

Lines changed: 82 additions & 1 deletion

# Human Language Technologies

## Academic Year Project 2023/2024

# Cyberbullying Classification

## Introduction

This project, conducted as part of the Human Language Technologies (HLT) course, aims to develop and evaluate a Natural Language Processing (NLP) model that classifies tweets from the social media platform X (formerly Twitter) as potential acts of cyberbullying or offensive behavior.

## Motivations

The project is driven by three primary motivations:

1. **Application of Knowledge**: To apply the theoretical and methodological concepts learned during the HLT course.
2. **Social Relevance**: To address the growing social and psychological issue of cyberbullying, which has become more prevalent since the Covid-19 pandemic.
3. **Challenging Goals**: To meet the challenging goals set by the authors of the dataset and to contribute meaningful insights to the domain of cyberbullying detection.

## Project Structure

The project is organized into several key directories:

- **`_chunckdevs`**: A custom library developed by the team specifically for this project.
- **`data`**: Contains all datasets used for training and evaluation.
- **`notebooks`**: Includes commented Jupyter notebooks for preprocessing, baseline models, advanced models, and transformer-based models.
- **`outputs`**: Stores generated outputs, including trained models and other relevant files.
- **`requirements.txt`**: Lists the libraries needed to run the code.

## Dataset and Goal

The dataset, sourced from Kaggle, consists of over 47,000 tweets, each labeled with a type of cyberbullying. The dataset is balanced, with each class containing approximately 8,000 tweets. Tweets are categorized either as descriptions of bullying events or as the bullying acts themselves. The primary objectives are:

1. **Binary Classification**: To identify whether a tweet constitutes an act of cyberbullying or not.
2. **Multiclass Classification**: To detect the specific type of discriminatory act, with labels including:
   - Age
   - Ethnicity
   - Gender
   - Religion
   - Other types of cyberbullying
   - Not cyberbullying

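
The fine-grained labels above also determine the binary task: everything except "not cyberbullying" counts as an act of cyberbullying. A minimal sketch of that mapping (the label strings are assumptions for illustration, not taken from the repository):

```python
# Hypothetical label set mirroring the six classes listed above.
CLASSES = [
    "age", "ethnicity", "gender", "religion",
    "other_cyberbullying", "not_cyberbullying",
]

def to_binary(label: str) -> int:
    """1 = act of cyberbullying, 0 = not cyberbullying."""
    return 0 if label == "not_cyberbullying" else 1

print([to_binary(c) for c in CLASSES])  # -> [1, 1, 1, 1, 1, 0]
```
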
## Data Understanding and Preparation

### Data Understanding

Initial exploration included the creation of word clouds for each class, revealing significant semantic differences related to cyberbullying. Hashtags, initially considered as features, were eventually excluded due to their low frequency and lack of specificity.

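
The information a per-class word cloud visualizes is essentially a term-frequency ranking. A toy sketch of that computation (the example tweets are invented, not drawn from the dataset):

```python
from collections import Counter
import re

def top_terms(tweets, k=3):
    """Return the k most frequent word-like tokens across a list of tweets."""
    counts = Counter()
    for t in tweets:
        counts.update(re.findall(r"[a-z']+", t.lower()))
    return [w for w, _ in counts.most_common(k)]

# Invented examples standing in for one class of the corpus.
religion_tweets = [
    "mocking someone's faith is still bullying",
    "faith should never be a target",
]
print(top_terms(religion_tweets))  # "faith" ranks first (appears twice)
```
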
### Data Preprocessing

Two versions of the dataset were prepared: one containing all tweets and another with only English texts. Both versions were split into development and test sets. Normalization was applied exclusively to the development-set tweets. Duplicate tweets, particularly those labeled as "other cyberbullying," were identified and removed.

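
A hypothetical sketch of the kind of normalization and duplicate removal described above; the actual rules live in the project's preprocessing notebook, so the regexes here are assumptions:

```python
import re

def normalize(tweet: str) -> str:
    """Assumed normalization: lowercase, drop URLs/mentions, collapse spaces."""
    t = tweet.lower()
    t = re.sub(r"https?://\S+", " ", t)  # remove URLs
    t = re.sub(r"@\w+", " ", t)          # remove user mentions
    return re.sub(r"\s+", " ", t).strip()

def dedupe(tweets):
    """Keep the first occurrence of each normalized tweet."""
    seen, out = set(), []
    for t in map(normalize, tweets):
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

# Both inputs normalize to the same text, so only one entry survives.
print(dedupe(["@user You are AWFUL http://t.co/x", "you are awful"]))
```
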
## Classification

### Models Implemented

A variety of models were implemented and evaluated:

- Baseline models
- Advanced models
- Transformer-based models
- Ensemble models

### Feature Engineering

Features were engineered for both baseline and advanced models, with extensive hyperparameter tuning to optimize performance.

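
As one illustration of lexical feature engineering, here is a from-scratch TF-IDF sketch; the project's notebooks presumably rely on a library implementation (e.g. scikit-learn's `TfidfVectorizer`), so this is illustrative only:

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            w: (c / len(toks)) * math.log(n / df[w])
            for w, c in tf.items()
        })
    return vectors

vecs = tfidf(["go away loser", "go home", "home sweet home"])
# "go" appears in 2 of 3 documents, so its idf is log(3/2) > 0.
```
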
### Evaluation Metrics

Model performance was evaluated using metrics such as precision, recall, and F1-score.

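
These metrics can be computed per class from true/false positives and false negatives. A self-contained sketch on toy predictions (the labels and values are illustrative, not project results):

```python
def prf1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive label."""
    tp = sum(t == positive == p for t, p in zip(y_true, y_pred))
    fp = sum(p == positive != t for t, p in zip(y_true, y_pred))
    fn = sum(t == positive != p for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = ["age", "age", "gender", "age"]
y_pred = ["age", "gender", "gender", "age"]
print(prf1(y_true, y_pred, positive="age"))  # precision 1.0, recall 2/3, F1 ~0.8
```
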
## Results for Classification

### Baseline and Advanced Models

Ensemble models achieved the highest F1-scores, although achieving high precision on certain classes remained challenging.

### Comparison with State-of-the-Art

Our models were benchmarked against state-of-the-art (SOTA) models to evaluate their relative performance.

## Conclusions

Our analysis demonstrates that while machine learning models can effectively distinguish between different types of cyberbullying, they struggle with context and intent, particularly in distinguishing non-cyberbullying tweets from harmful messages. This underscores the need for further research into context disambiguation and intent understanding to improve the efficacy of cyberbullying detection models.
