A list of datasets I collected for NLP tasks. Fork or clone this repository to get all of them.
git clone https://github.com/nluninja/nlp_datasets
Name | Description | Classes | Format | Language |
---|---|---|---|---|
20 Newsgroups dataset | file set arranged into 20 topic folders | see corpus page | files | en |
The Anatomical Entity Mention (AnEM) corpus | PubMed dataset | Anatomical_system, Cell, Cellular_component, Developing_anatomical_structure, Immaterial_anatomical_entity, Multi-tissue_structure, Organ, Organism_subdivision, Organism_substance, Pathological_formation, Tissue | conll/iob2 | |
AG News Topic dataset | News Topic Classification dataset - Antonio Gulli - UniPi | World, Sports, Business, Sci/Tech | csv | en |
BBC News | BBC News Classification dataset | business, entertainment, politics, sport, tech | csv | en |
CoNLL 2003 | named entity recognition dataset | People, Location, Organization, Misc | conll/iob2 | en |
emotions classification dataset | emotion classification dataset which contains tweets labeled into 6 categories | joy, sadness, anger, fear, love, surprise | csv | en |
Georgetown University Multilayer corpus in CoNLL | CoNLL tagged corpus for entity extraction | 23 classes (including person, substance, quantity, time, place, organization) | conll/iob2 | en |
Relationship and Entity Extraction Evaluation Dataset in CoNLL | CoNLL tagged corpus for entity extraction | 21 classes (including person, temporal, weapon, MilitaryPlatform, quantity, organization) | conll/iob2 | en |
sentiment140 dataset | dataset which contains tweets labeled according to their polarity | negative, neutral, positive | csv | en |
Toxic Comments dataset Reviews | Wikipedia comments labeled into 6 categories with score | toxic, severe_toxic, obscene, threat, insult, identity_hate | csv | en |
WikiGold Dataset | named entity recognition dataset | People, Location, Organization, Misc | conll/iob2 | en |
Wikipedia Movie Plots dataset | descriptions of movies from around the world scraped from Wikipedia | Genre Classes | csv | en |
WNUT 17 Emerging Entities Dataset | Twitter/StackOverflow data for discovering emerging entities | Entity Classes | conll/iob2 | en |
Yelp! Reviews | reviews dataset from Yelp! for classification/sentiment analysis tasks | ratings from 1 to 5 | csv | en |
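
Several corpora above are in conll/iob2 format. As a rough illustration of how such a file could be read, here is a minimal Python sketch. It assumes one token per line, whitespace-separated columns with the token first and the IOB2 tag last, and a blank line between sentences. The `read_conll` helper and the sample text are hypothetical and not part of this repo, and individual corpora may use a different column layout, so check each file before relying on it.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # A blank line or a -DOCSTART- marker closes the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: the token is the first column, the IOB2 tag the last.
        current.append((columns[0], columns[-1]))
    # Flush the last sentence if the input does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


# Small in-memory sample in the assumed layout (token POS chunk tag).
sample = """\
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

sentences = read_conll(sample.splitlines())
print(len(sentences))
print(sentences[1])
```

To read a real file, pass an open file object instead of the sample lines, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to one of the conll/iob2 files in your local clone.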
I appreciate your contributions to this repo, so don't hesitate to submit a pull request to fix a bug or add a new dataset!
pull request https://github.com/nluninja/nlp_datasets
Use the corpus_template when uploading a new dataset. I look forward to seeing your contribution! 🙏 😘