Description
ScanCode's programming language detection is not as accurate as it could be, and getting it right is important to drive further automation. We also need to automatically classify each file into facets when possible.
The goal of this ticket is twofold: to improve the quality of programming language detection (which today relies only on Pygments and could use another tool, e.g. a Bayesian classifier such as GitHub Linguist or enry), and to design and implement a flexible framework of rules to automate assigning files to facets, possibly backed by machine learning and a classifier.
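As a rough illustration of the facet idea (not an existing ScanCode API; the rule list, facet names and function names below are hypothetical), a minimal rule-driven facet classifier could look like this:

```python
# Hypothetical sketch of a rule-driven facet classifier.
# None of these names exist in ScanCode today; the patterns and
# facet values are purely illustrative.
import fnmatch

# Each rule maps a glob pattern on the file path to a facet.
FACET_RULES = [
    ('*/tests/*', 'tests'),
    ('*/docs/*', 'docs'),
    ('*.md', 'docs'),
    ('*/src/*', 'core'),
]

def facets_for_path(path, rules=FACET_RULES, default='core'):
    """Return the set of facets whose pattern matches ``path``."""
    matched = {facet for pattern, facet in rules if fnmatch.fnmatch(path, pattern)}
    return matched or {default}

print(facets_for_path('project/tests/test_api.py'))  # {'tests'}
print(facets_for_path('project/docs/index.md'))      # {'docs'}
```

A classifier could later replace or complement such hand-written rules, but a plain rule table keeps the behavior easy to audit.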
Here are some existing tools for general file type and programming language detection:
In use today:
- Python stdlib mimetypes module: MIME type detection based on file extensions only, AFAIK. We use it today.
- libmagic: we use it through our own ctypes binding; it would need to be upgraded to the latest libmagic as part of the project.
- Pygments lexers: a code lexing and syntax highlighting library that also detects programming languages as a side effect. This is also what GitHub used in Linguist a while back.
(We also use a Shannon entropy detector and binaryornot to detect binaries; see the combined sketch below.)
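For reference, a minimal sketch (assuming the binaryornot and Pygments packages are installed) of how the pieces listed above can be combined; this is illustrative only and not ScanCode's actual detection code:

```python
# Illustrative combination of the detectors in use today:
# stdlib mimetypes (extension-based), binaryornot and Pygments lexers.
import mimetypes

from binaryornot.check import is_binary
from pygments.lexers import guess_lexer_for_filename
from pygments.util import ClassNotFound

def detect(path):
    """Return (mimetype, programming_language) guesses for a file."""
    mimetype, _encoding = mimetypes.guess_type(path)  # extension-based only
    if is_binary(path):
        return mimetype, None
    with open(path, encoding='utf-8', errors='replace') as f:
        text = f.read(8192)  # a sample is enough for lexer guessing
    try:
        language = guess_lexer_for_filename(path, text).name
    except ClassNotFound:
        language = None
    return mimetype, language

print(detect('setup.py'))  # e.g. ('text/x-python', 'Python')
```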
Things to look at could include:
- freedesktop shared-mime-info: a signature-based approach and the gold standard on Linux desktops and beyond. There are a few Python libraries that support it.
- GitHub Linguist: written in Ruby, used to count LOC and detect languages. Uses a combination of signatures/lexers from Sublime Text with a naive Bayes classifier on top, AFAICR (a minimal classifier sketch follows this list).
- douban linguist: a Python port of GitHub Linguist; interesting but not very active.
- enry: a Go port of GH linguist
- ohcount: uses Ragel-based lexers.
- guesslang (https://github.com/yoeo/guesslang): uses TensorFlow.
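To make the classifier-on-top idea concrete, here is a toy sketch of the Linguist-style approach: a naive Bayes model over token counts, using scikit-learn purely for illustration. The tiny training set and labels are made up; a real implementation would train on a large labeled corpus and combine the classifier with extension and signature heuristics.

```python
# Toy naive Bayes language classifier over token counts (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up, minimal training data: (source text, language label) pairs.
samples = [
    ('def main():\n    print("hello")', 'Python'),
    ('import os\nfor x in range(3): pass', 'Python'),
    ('function main() { console.log("hi"); }', 'JavaScript'),
    ('const x = require("fs");', 'JavaScript'),
]
texts, labels = zip(*samples)

# Tokenize on word characters so keywords and identifiers become features.
model = make_pipeline(CountVectorizer(token_pattern=r'\w+'), MultinomialNB())
model.fit(texts, labels)

print(model.predict(['console.log("bye");']))  # likely ['JavaScript']
```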