20 Newsgroups dataset - Classification

Description

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang for his paper "Newsweeder: Learning to filter netnews", though the paper does not explicitly mention this collection. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

Entries

document set: 19,997 docs

URL

https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
http://qwone.com/~jason/20Newsgroups/

File Format

The data is a list of folders, one per topic/class; each folder name is the name of the topic. The 20 topics are:

  • alt.atheism
  • comp.graphics
  • comp.os.ms-windows.misc
  • comp.sys.ibm.pc.hardware
  • comp.sys.mac.hardware
  • comp.windows.x
  • misc.forsale
  • rec.autos
  • rec.motorcycles
  • rec.sport.baseball
  • rec.sport.hockey
  • sci.crypt
  • sci.electronics
  • sci.med
  • sci.space
  • soc.religion.christian
  • talk.politics.guns
  • talk.politics.mideast
  • talk.politics.misc
  • talk.religion.misc
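
The folder-per-class layout described above can be read with a few lines of Python. The following is a minimal sketch, not the loader used by this repository: the function names `load_folder_dataset` and `class_counts` are hypothetical, and it assumes every file directly inside a topic folder is one document. Decoding as latin-1 is also an assumption, chosen because it never fails on arbitrary bytes.

```python
from collections import Counter
from pathlib import Path


def load_folder_dataset(root):
    """Read a folder-per-class corpus laid out as root/<class_name>/<document>.

    Hypothetical helper: returns (texts, labels, class_names), where
    labels[i] is the index of the class of texts[i] in class_names.
    """
    root = Path(root)
    # Sorted folder names give a stable class-name -> integer-label mapping.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    texts, labels = [], []
    for label, name in enumerate(class_names):
        for doc in sorted((root / name).iterdir()):
            if doc.is_file():
                # Assumption: latin-1 maps every byte to a character,
                # so decoding cannot raise on unusual input.
                texts.append(doc.read_text(encoding="latin-1"))
                labels.append(label)
    return texts, labels, class_names


def class_counts(labels, class_names):
    """Count documents per class name, including classes with no documents."""
    counts = Counter(labels)
    return {name: counts.get(i, 0) for i, name in enumerate(class_names)}
```

A call such as `texts, labels, names = load_folder_dataset("path/to/20_newsgroup")` (the path is a placeholder) would yield one label per document, and `class_counts(labels, names)` can be used to check that the classes are nearly balanced. scikit-learn's `sklearn.datasets.load_files` reads the same kind of layout if that library is available.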