The increasing spread of misinformation in digital media presents a major challenge. This project aims to develop a machine learning model that classifies Arabic news articles as either real or fake, using natural language processing (NLP) techniques. Our goal is to automate and enhance media verification by leveraging text-based features.
- Rows: 5,352 news articles
- Columns:
- 🆔 Id: Unique identifier
- 🗓️ date: Date of publication
- 🌐 platform: News source (e.g., Aljazeera)
- 📰 title: Article headline
- 📄 News content: Full news body (Arabic)
- 🧪 Label: Either real or fake
- ✅ Real: 3,913 articles
- ⚠️ Fake: 1,439 articles
| 🌐 Platform | 📈 Number of Articles |
|---|---|
| 🟢 Aljazeera | 3,422 |
| 🔵 Misbar | 1,426 |
| 🟣 Tibyan | 247 |
| ⚪ Other | 257 |
💡 Note: "Other" includes several smaller news channels grouped together.
- ✅ Total duplicate rows: 0
- ✅ No null values in any column
| Metric | Value |
|---|---|
| 🔢 Mean | 55.76 |
| 🔽 Min | 7 |
| 🔼 Max | 379 |
| Metric | Value |
|---|---|
| 🔢 Mean | 1,363.51 |
| 🔽 Min | 7 |
| 🔼 Max | 64,878 |
- Preprocessed dataset contains 2 columns: `processed_text` (tokenized text) and `Label`
- All rows contain tokenized Arabic text stored as stringified lists
- Converted token lists to plain text
- Removed Arabic stopwords
- Normalized words (removed punctuation & diacritics)
- Applied TF-IDF vectorization with up to 5,000 features using unigram + bigram
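The cleaning and vectorization steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the stopword set and the sample row are stand-ins (the project likely uses a fuller Arabic stopword resource), and only the TF-IDF settings (`max_features=5000`, unigram + bigram) are taken from the README.

```python
import ast
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative stand-in; the project presumably uses a fuller Arabic stopword list.
ARABIC_STOPWORDS = {"في", "من", "على", "عن", "إلى"}

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # Arabic tashkeel range
PUNCT = re.compile(r"[^\w\s]")

def normalize(token: str) -> str:
    """Strip diacritics, then punctuation, from a single token."""
    return PUNCT.sub("", DIACRITICS.sub("", token))

def prepare(stringified_tokens: str) -> str:
    """Parse a stringified token list, normalize, drop stopwords, re-join as text."""
    tokens = ast.literal_eval(stringified_tokens)
    cleaned = [normalize(t) for t in tokens]
    return " ".join(t for t in cleaned if t and t not in ARABIC_STOPWORDS)

# Toy one-row frame mirroring the preprocessed dataset's two columns.
df = pd.DataFrame({"processed_text": ["['خبر', 'عاجلٌ', 'في', 'غزة']"],
                   "Label": ["real"]})
texts = df["processed_text"].map(prepare)

# Unigrams + bigrams, capped at 5,000 features, as stated above.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)
```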
- Real: 3,913 articles (~73%)
- Fake: 1,439 articles (~27%)
⚠️ Dataset is imbalanced — may affect recall for fake news.
The dataset is fairly clean and well-labeled.
Tokenization and stopword removal produced cleaner input for feature extraction.
The imbalance remains a challenge but can be addressed in future work with SMOTE or class weighting.
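Of the two mitigations mentioned, class weighting is the simpler to bolt onto the existing scikit-learn pipeline. A sketch, using toy labels that mirror the dataset's roughly 73/27 split (the project has not yet adopted this):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy labels mirroring the ~73% real / ~27% fake split.
y = np.array(["real"] * 73 + ["fake"] * 27)

# 'balanced' reweights each class inversely to its frequency:
# weight = n_samples / (n_classes * n_class_samples)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y), y=y)
# The minority (fake) class gets the larger weight, so errors on it
# cost more during training.

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
```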
We trained multiple ML models:
- Naive Bayes → Lightweight, efficient with word counts
- Logistic Regression → Strong linear baseline
- Linear SVM → Performs well with sparse TF-IDF features
- Random Forest → Captures non-linear patterns & avoids overfitting
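A sketch of the compare-four-models loop, with the hyperparameters the README reports for the Random Forest. The synthetic feature matrix is a stand-in for the 5,000-dimensional TF-IDF matrix, so the scores it yields are not the project's:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Stand-in features; the project uses the TF-IDF matrix from preprocessing.
X, y = make_classification(n_samples=400, n_features=50, random_state=42)
X = np.abs(X)  # MultinomialNB needs non-negative inputs (TF-IDF already is)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = (accuracy_score(y_te, pred), f1_score(y_te, pred))
```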
| Model | Accuracy | F1 (Fake) | F1 (Real) |
|---|---|---|---|
| Naive Bayes | 87.4% | 75.4% | 91.5% |
| Logistic Regression | 87.6% | 74.9% | 91.7% |
| Linear SVM | 87.9% | 77.1% | 91.7% |
| Random Forest | 88.8% | 79.4% | 92.3% |
- Random Forest Classifier (`n_estimators=100, random_state=42`)
- Achieved highest overall accuracy (88.8%)
- Balanced performance across fake & real labels
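End to end, the selected model can be wrapped with the vectorizer in a single scikit-learn pipeline for inference on raw text. The four-document corpus here is a toy placeholder (the real pipeline is fit on the 5,352 preprocessed articles); only the vectorizer and classifier settings come from the README:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy labeled corpus, stand-in for the preprocessed dataset.
texts = ["خبر صحيح عن الاقتصاد", "خبر مزيف عن الاقتصاد",
         "تقرير صحيح عن الرياضة", "تقرير مزيف عن الرياضة"]
labels = ["real", "fake", "real", "fake"]

# Vectorizer + best model chained, so predict() accepts raw strings.
pipe = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
pipe.fit(texts, labels)

prediction = pipe.predict(["خبر مزيف عن الرياضة"])
```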
- Ensemble methods (Random Forest) outperformed linear models.
- TF-IDF bigram features improved sensitivity to context.
- Imbalance impacted minority class (fake) detection.
We successfully built a machine learning pipeline that classifies Arabic news with ~89% accuracy.
Future Improvements:
- Apply resampling techniques (SMOTE, class-weighted models)
- Experiment with deep learning (e.g., BERT for Arabic)
- Deploy as a real-time fake news detection app
- Clone the repository: `git clone https://github.com/FaresAlnamla/Palestine-Fake-News-Detection.git`
- Enter the project folder: `cd Palestine-Fake-News-Detection`
- Launch Jupyter: `jupyter notebook`
- Open `Fake News Detection Model.ipynb` and execute the cells step by step.
- Python 3.11.9
- pandas, numpy
- scikit-learn
- matplotlib, seaborn, plotly
- Jupyter Notebook
- 📁 Dataset: `cleaned_news_dataset.csv`
- 📄 Notebook: `Fake News Detection Model.ipynb`
- 📊 Vectorizer: `TfidfVectorizer(max_features=5000, ngram_range=(1,2))`
- 📤 Best Model: `RandomForestClassifier(n_estimators=100, random_state=42)`
✨ Built with data & code ❤️