This project implements Named Entity Recognition (NER) on news articles using both rule-based and model-based approaches. The goal is to identify entities such as people, organizations, locations, dates, emails, and URLs, and visualize them.
Bonus tasks completed:
- ✅ Visualizations with displaCy for sample documents
- ✅ Comparison of two spaCy models (`en_core_web_sm` vs `en_core_web_lg`) using Jaccard similarity and label counts
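For reference, the Jaccard comparison can be sketched as below. The helper names and the (text, label) normalization are illustrative assumptions; the actual aggregation lives in `ner_news_pipeline.py`.

```python
# Minimal sketch of the Jaccard comparison between the two spaCy models.
# The helpers and the (lowercased text, label) normalization are assumptions
# for illustration; the script's own implementation may differ.
import spacy

nlp_sm = spacy.load("en_core_web_sm")
nlp_lg = spacy.load("en_core_web_lg")

def entity_set(doc):
    # Represent each entity as a (lowercased text, label) pair.
    return {(ent.text.lower(), ent.label_) for ent in doc.ents}

def jaccard(a, b):
    # Jaccard similarity: |A ∩ B| / |A ∪ B|, defined as 1.0 when both sets are empty.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

text = "Apple opened a new office in Berlin on 3 May 2021."
score = jaccard(entity_set(nlp_sm(text)), entity_set(nlp_lg(text)))
print(f"sm vs lg Jaccard: {score:.2f}")
```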
Recommended dataset: CoNLL-2003 (Kaggle)
- Install Python dependencies:

  ```bash
  pip install pandas spacy tqdm matplotlib
  python -m spacy download en_core_web_sm
  python -m spacy download en_core_web_lg
  ```

- Edit the CONFIG block in `ner_news_pipeline.py`:

  ```python
  DATA_DIR = r"./sample_data"
  OUTPUT_DIR = r"./ner_outputs"
  SAMPLES_TO_VISUALIZE = 15
  MAX_DOCS_PER_SPLIT = None  # or a smaller number for testing
  ```

- Run the script:

  ```bash
  python ner_news_pipeline.py
  ```

Outputs

Outputs will be saved in the `ner_outputs/` folder.
| File / Folder | Description |
|---|---|
| `ner_results_all_splits.csv` | Master CSV with text previews, entities from rule-based, small model, large model, and Jaccard similarity. |
| `*_rule_label_counts.csv` | Entity counts per label from rule-based NER. |
| `*_sm_label_counts.csv` | Entity counts per label from `en_core_web_sm`. |
| `*_lg_label_counts.csv` | Entity counts per label from `en_core_web_lg`. |
| `model_comparison_summary.csv` | Comparison of top 5 labels, unique labels per model, and number of documents per split. |
| `sm_vs_lg_jaccard_by_split.csv` | Jaccard similarity between small and large models per split. |
| `displacy_html/` | HTML visualizations highlighting entities per model/sample. |
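To take a quick look at one of the per-split label count files, a minimal sketch with pandas and matplotlib; the `train_sm_label_counts.csv` filename and the `label`/`count` column names are assumptions, so check the headers of the CSVs the script actually writes:

```python
# Sketch: plot entity label counts from one of the per-split CSVs.
# The filename and the "label"/"count" column names are assumptions;
# adjust them to match the files the pipeline produces.
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.read_csv("./ner_outputs/train_sm_label_counts.csv")
counts = counts.sort_values("count", ascending=False)

plt.figure(figsize=(8, 4))
plt.bar(counts["label"], counts["count"])
plt.title("Entity label counts (en_core_web_sm, train split)")
plt.xlabel("Entity label")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```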
Load master results in Python:

```python
import pandas as pd

df = pd.read_csv("./ner_outputs/ner_results_all_splits.csv")
print(df.head())
```

Visualize an HTML file:

Open any file in `ner_outputs/displacy_html/` with your browser.
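To generate a fresh visualization for your own text instead of the pre-built samples, a minimal sketch using spaCy's displaCy (the example sentence and output path are only illustrations):

```python
# Sketch: render a displaCy entity visualization for a custom text and
# save it as a standalone HTML file. The output path is an example.
import os
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Reuters reported that Angela Merkel visited Paris on Monday.")

# page=True wraps the markup in a full HTML page; jupyter=False returns a string.
html = displacy.render(doc, style="ent", page=True, jupyter=False)

os.makedirs("./ner_outputs/displacy_html", exist_ok=True)
with open("./ner_outputs/displacy_html/custom_example.html", "w", encoding="utf-8") as f:
    f.write(html)
```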
Here are some sample visualizations:
- Small model (`en_core_web_sm`)
- Large model (`en_core_web_lg`)

Tip: Open the `.html` files in a browser to interactively explore highlighted entities.
Implemented features:
- Load multiple file formats (CSV, TSV, JSONL, CoNLL)
- Rule-based NER with gazetteers + regex (see the sketch after this list)
- Model-based NER with two spaCy models
- Save unified CSV of entities
- Save per-split entity frequency CSVs
- Generate displaCy HTML visualizations
- Compare two spaCy models (counts + Jaccard)
- Bonus: sample visualizations included
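To illustrate the rule-based component, here is a minimal sketch combining a tiny gazetteer with regexes for emails and URLs; the gazetteer entries and patterns below are examples, not the ones shipped in `ner_news_pipeline.py`:

```python
# Sketch of rule-based NER: a tiny gazetteer plus regexes for emails and URLs.
# The gazetteer entries and patterns are illustrative; the script's own
# gazetteer is separate and can be extended with more entities.
import re

GAZETTEER = {
    "ORG": {"Reuters", "United Nations", "Google"},
    "GPE": {"London", "Berlin", "New York"},
}
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
URL_RE = re.compile(r"https?://\S+")

def rule_based_entities(text):
    entities = []
    # Gazetteer lookup: exact string matches for known names.
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                entities.append((match.group(), label, match.start(), match.end()))
    # Regex patterns for emails and URLs.
    for match in EMAIL_RE.finditer(text):
        entities.append((match.group(), "EMAIL", match.start(), match.end()))
    for match in URL_RE.finditer(text):
        entities.append((match.group(), "URL", match.start(), match.end()))
    return sorted(entities, key=lambda e: e[2])

print(rule_based_entities("Contact Reuters at press@reuters.com or see https://reuters.com."))
```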
- For large datasets, consider setting `MAX_DOCS_PER_SPLIT` to limit runtime.
- The rule-based NER uses a small static gazetteer; you can extend it with more entities.
- HTML visualizations are generated for the first `SAMPLES_TO_VISUALIZE` documents per split.
This project is licensed under the MIT License - see the LICENSE file for details.

