This project implements Named Entity Recognition (NER) on news articles using both rule-based and model-based approaches. The goal is to identify entities such as people, organizations, locations, dates, emails, and URLs, and visualize them.
Bonus tasks completed:
- ✅ Visualizations with displaCy for sample documents
- ✅ Comparison of two spaCy models (`en_core_web_sm` vs `en_core_web_lg`) using Jaccard similarity and label counts
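For reference, the Jaccard comparison can be sketched as below. The helper names and the (text, label) normalization are illustrative assumptions; the actual aggregation lives in `ner_news_pipeline.py`.

```python
# Minimal sketch of the Jaccard comparison between the two spaCy models.
# The helpers and the (lowercased text, label) normalization are assumptions
# for illustration; the script's own implementation may differ.
import spacy

nlp_sm = spacy.load("en_core_web_sm")
nlp_lg = spacy.load("en_core_web_lg")

def entity_set(doc):
    # Represent each entity as a (lowercased text, label) pair.
    return {(ent.text.lower(), ent.label_) for ent in doc.ents}

def jaccard(a, b):
    # Jaccard similarity: |A ∩ B| / |A ∪ B|, defined as 1.0 when both sets are empty.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

text = "Apple opened a new office in Berlin on 3 May 2021."
score = jaccard(entity_set(nlp_sm(text)), entity_set(nlp_lg(text)))
print(f"sm vs lg Jaccard: {score:.2f}")
```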
Recommended dataset: CoNLL-2003 (Kaggle)
- Install Python dependencies:

  ```bash
  pip install pandas spacy tqdm matplotlib
  python -m spacy download en_core_web_sm
  python -m spacy download en_core_web_lg
  ```

- Edit the CONFIG block in `ner_news_pipeline.py`:

  ```python
  DATA_DIR = r"./sample_data"
  OUTPUT_DIR = r"./ner_outputs"
  SAMPLES_TO_VISUALIZE = 15
  MAX_DOCS_PER_SPLIT = None  # or a smaller number for testing
  ```

- Run the script:

  ```bash
  python ner_news_pipeline.py
  ```

Outputs

Outputs will be saved in the `ner_outputs/` folder.
| File / Folder | Description |
|---|---|
| `ner_results_all_splits.csv` | Master CSV with text previews, entities from rule-based, small model, large model, and Jaccard similarity. |
| `*_rule_label_counts.csv` | Entity counts per label from rule-based NER. |
| `*_sm_label_counts.csv` | Entity counts per label from `en_core_web_sm`. |
| `*_lg_label_counts.csv` | Entity counts per label from `en_core_web_lg`. |
| `model_comparison_summary.csv` | Comparison of top 5 labels, unique labels per model, and number of documents per split. |
| `sm_vs_lg_jaccard_by_split.csv` | Jaccard similarity between small and large models per split. |
| `displacy_html/` | HTML visualizations highlighting entities per model/sample. |
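To take a quick look at one of the per-split label count files, a minimal sketch with pandas and matplotlib; the `train_sm_label_counts.csv` filename and the `label`/`count` column names are assumptions, so check the headers of the CSVs the script actually writes:

```python
# Sketch: plot entity label counts from one of the per-split CSVs.
# The filename and the "label"/"count" column names are assumptions;
# adjust them to match the files the pipeline produces.
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.read_csv("./ner_outputs/train_sm_label_counts.csv")
counts = counts.sort_values("count", ascending=False)

plt.figure(figsize=(8, 4))
plt.bar(counts["label"], counts["count"])
plt.title("Entity label counts (en_core_web_sm, train split)")
plt.xlabel("Entity label")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```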
Load master results in Python:

```python
import pandas as pd

df = pd.read_csv("./ner_outputs/ner_results_all_splits.csv")
print(df.head())
```

Visualize an HTML file:

Open any file in `ner_outputs/displacy_html/` with your browser.
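To generate a fresh visualization for your own text instead of the pre-built samples, a minimal sketch using spaCy's displaCy (the example sentence and output path are only illustrations):

```python
# Sketch: render a displaCy entity visualization for a custom text and
# save it as a standalone HTML file. The output path is an example.
import os
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Reuters reported that Angela Merkel visited Paris on Monday.")

# page=True wraps the markup in a full HTML page; jupyter=False returns a string.
html = displacy.render(doc, style="ent", page=True, jupyter=False)

os.makedirs("./ner_outputs/displacy_html", exist_ok=True)
with open("./ner_outputs/displacy_html/custom_example.html", "w", encoding="utf-8") as f:
    f.write(html)
```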
Here are some sample visualizations:
- Small model (`en_core_web_sm`)
- Large model (`en_core_web_lg`)

Tip: Open the `.html` files in a browser to interactively explore highlighted entities.
Implemented features:
- Load multiple file formats (CSV, TSV, JSONL, CoNLL)
- Rule-based NER with gazetteers + regex (see the sketch after this list)
- Model-based NER with two spaCy models
- Save unified CSV of entities
- Save per-split entity frequency CSVs
- Generate displaCy HTML visualizations
- Compare two spaCy models (counts + Jaccard)
- Bonus: sample visualizations included
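To illustrate the rule-based component, here is a minimal sketch combining a tiny gazetteer with regexes for emails and URLs; the gazetteer entries and patterns below are examples, not the ones shipped in `ner_news_pipeline.py`:

```python
# Sketch of rule-based NER: a tiny gazetteer plus regexes for emails and URLs.
# The gazetteer entries and patterns are illustrative; the script's own
# gazetteer is separate and can be extended with more entities.
import re

GAZETTEER = {
    "ORG": {"Reuters", "United Nations", "Google"},
    "GPE": {"London", "Berlin", "New York"},
}
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
URL_RE = re.compile(r"https?://\S+")

def rule_based_entities(text):
    entities = []
    # Gazetteer lookup: exact string matches for known names.
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                entities.append((match.group(), label, match.start(), match.end()))
    # Regex patterns for emails and URLs.
    for match in EMAIL_RE.finditer(text):
        entities.append((match.group(), "EMAIL", match.start(), match.end()))
    for match in URL_RE.finditer(text):
        entities.append((match.group(), "URL", match.start(), match.end()))
    return sorted(entities, key=lambda e: e[2])

print(rule_based_entities("Contact Reuters at press@reuters.com or see https://reuters.com."))
```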
- For large datasets, consider setting `MAX_DOCS_PER_SPLIT` to limit runtime.
- The rule-based NER uses a small static gazetteer; you can extend it with more entities.
- HTML visualizations are generated for the first `SAMPLES_TO_VISUALIZE` documents per split.
This project is licensed under the MIT License - see the LICENSE file for details.

