The increasing spread of misinformation in digital media presents a major challenge. This project aims to develop a machine learning model that classifies Arabic news articles as either real or fake, using natural language processing (NLP) techniques. Our goal is to automate and enhance media verification by leveraging text-based features.
- Rows: 5,352 news articles
- Columns:
- 🆔 Id: Unique identifier
- 🗓️ date: Date of publication
- 🌐 platform: News source (e.g., Aljazeera)
- 📰 title: Article headline
- 📄 News content: Full news body (Arabic)
- 🧪 Label: Either real or fake
- ✅ Real: 3,913 articles
- ⚠️ Fake: 1,439 articles
| 🌐 Platform | 📈 Number of Articles |
|---|---|
| 🟢 Aljazeera | 3,422 |
| 🔵 Misbar | 1,426 |
| 🟣 Tibyan | 247 |
| ⚪ Other | 257 |
💡 Note: "Other" includes several smaller news channels grouped together.
- ✅ Total duplicate rows: 0
- ✅ No null values in any column
| Metric | Value |
|---|---|
| 🔢 Mean | 55.76 |
| 🔽 Min | 7 |
| 🔼 Max | 379 |
| Metric | Value |
|---|---|
| 🔢 Mean | 1,363.51 |
| 🔽 Min | 7 |
| 🔼 Max | 64,878 |
- Preprocessed dataset contains 2 columns: `processed_text` (tokenized text) and `Label`
- All rows contain tokenized Arabic text stored as stringified lists
- Converted token lists to plain text
- Removed Arabic stopwords
- Normalized words (removed punctuation & diacritics)
- Applied TF-IDF vectorization with up to 5,000 features using unigram + bigram
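The cleaning and vectorization steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the stopword set and the sample row are stand-ins (the project likely uses a fuller Arabic stopword resource), and only the TF-IDF settings (`max_features=5000`, unigram + bigram) are taken from the README.

```python
import ast
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative stand-in; the project presumably uses a fuller Arabic stopword list.
ARABIC_STOPWORDS = {"في", "من", "على", "عن", "إلى"}

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # Arabic tashkeel range
PUNCT = re.compile(r"[^\w\s]")

def normalize(token: str) -> str:
    """Strip diacritics, then punctuation, from a single token."""
    return PUNCT.sub("", DIACRITICS.sub("", token))

def prepare(stringified_tokens: str) -> str:
    """Parse a stringified token list, normalize, drop stopwords, re-join as text."""
    tokens = ast.literal_eval(stringified_tokens)
    cleaned = [normalize(t) for t in tokens]
    return " ".join(t for t in cleaned if t and t not in ARABIC_STOPWORDS)

# Toy one-row frame mirroring the preprocessed dataset's two columns.
df = pd.DataFrame({"processed_text": ["['خبر', 'عاجلٌ', 'في', 'غزة']"],
                   "Label": ["real"]})
texts = df["processed_text"].map(prepare)

# Unigrams + bigrams, capped at 5,000 features, as stated above.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)
```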
- Real: 3,913 articles (~73%)
- Fake: 1,439 articles (~27%)
⚠️ Dataset is imbalanced — may affect recall for fake news.
The dataset is fairly clean and well-labeled.
Tokenization and stopword removal produced cleaner input for feature extraction.
The imbalance remains a challenge but can be addressed in future work with SMOTE or class weighting.
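Of the two mitigations mentioned, class weighting is the simpler to bolt onto the existing scikit-learn pipeline. A sketch, using toy labels that mirror the dataset's roughly 73/27 split (the project has not yet adopted this):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy labels mirroring the ~73% real / ~27% fake split.
y = np.array(["real"] * 73 + ["fake"] * 27)

# 'balanced' reweights each class inversely to its frequency:
# weight = n_samples / (n_classes * n_class_samples)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y), y=y)
# The minority (fake) class gets the larger weight, so errors on it
# cost more during training.

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
```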
We trained multiple ML models:
- Naive Bayes → Lightweight, efficient with word counts
- Logistic Regression → Strong linear baseline
- Linear SVM → Performs well with sparse TF-IDF features
- Random Forest → Captures non-linear patterns & avoids overfitting
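A sketch of the compare-four-models loop, with the hyperparameters the README reports for the Random Forest. The synthetic feature matrix is a stand-in for the 5,000-dimensional TF-IDF matrix, so the scores it yields are not the project's:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Stand-in features; the project uses the TF-IDF matrix from preprocessing.
X, y = make_classification(n_samples=400, n_features=50, random_state=42)
X = np.abs(X)  # MultinomialNB needs non-negative inputs (TF-IDF already is)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = (accuracy_score(y_te, pred), f1_score(y_te, pred))
```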
| Model | Accuracy | F1 (Fake) | F1 (Real) |
|---|---|---|---|
| Naive Bayes | 87.4% | 75.4% | 91.5% |
| Logistic Regression | 87.6% | 74.9% | 91.7% |
| Linear SVM | 87.9% | 77.1% | 91.7% |
| Random Forest | 88.8% | 79.4% | 92.3% |
- Random Forest Classifier (`n_estimators=100, random_state=42`)
- Achieved highest overall accuracy (88.8%)
- Balanced performance across fake & real labels
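End to end, the selected model can be wrapped with the vectorizer in a single scikit-learn pipeline for inference on raw text. The four-document corpus here is a toy placeholder (the real pipeline is fit on the 5,352 preprocessed articles); only the vectorizer and classifier settings come from the README:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy labeled corpus, stand-in for the preprocessed dataset.
texts = ["خبر صحيح عن الاقتصاد", "خبر مزيف عن الاقتصاد",
         "تقرير صحيح عن الرياضة", "تقرير مزيف عن الرياضة"]
labels = ["real", "fake", "real", "fake"]

# Vectorizer + best model chained, so predict() accepts raw strings.
pipe = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
pipe.fit(texts, labels)

prediction = pipe.predict(["خبر مزيف عن الرياضة"])
```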
- Ensemble methods (Random Forest) outperformed linear models.
- TF-IDF bigram features improved sensitivity to context.
- Imbalance impacted minority class (fake) detection.
We successfully built a machine learning pipeline that classifies Arabic news with ~89% accuracy.
Future Improvements:
- Apply resampling techniques (SMOTE, class-weighted models)
- Experiment with deep learning (e.g., BERT for Arabic)
- Deploy as a real-time fake news detection app
- Clone the repository: `git clone https://github.com/FaresAlnamla/Palestine-Fake-News-Detection.git`
- Enter the project folder: `cd Palestine-Fake-News-Detection`
- Launch Jupyter: `jupyter notebook`
- Open `Fake News Detection Model.ipynb` and execute the cells step by step.
- Python 3.11.9
- pandas, numpy
- scikit-learn
- matplotlib, seaborn, plotly
- Jupyter Notebook
- 📁 Dataset: `cleaned_news_dataset.csv`
- 📄 Notebook: `Fake News Detection Model.ipynb`
- 📊 Vectorizer: `TfidfVectorizer(max_features=5000, ngram_range=(1,2))`
- 📤 Best Model: `RandomForestClassifier(n_estimators=100, random_state=42)`
✨ Built with data & code ❤️