Description
ScanCode's programming language detection is not as accurate as it could be, and getting it right is important to drive further automation. We also need to automatically classify each file into facets when possible.
The goal of this ticket is twofold: to improve the quality of programming language detection (which today relies only on Pygments and could use another tool, e.g. a Bayesian classifier such as GitHub Linguist or enry), and to design and implement a flexible framework of rules to automate assigning files to facets, possibly backed by machine learning and a classifier.
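As a rough illustration of the facet idea (not an existing ScanCode API; the rule list, facet names and function names below are hypothetical), a minimal rule-driven facet classifier could look like this:

```python
# Hypothetical sketch of a rule-driven facet classifier.
# None of these names exist in ScanCode today; the patterns and
# facet values are purely illustrative.
import fnmatch

# Each rule maps a glob pattern on the file path to a facet.
FACET_RULES = [
    ('*/tests/*', 'tests'),
    ('*/docs/*', 'docs'),
    ('*.md', 'docs'),
    ('*/src/*', 'core'),
]

def facets_for_path(path, rules=FACET_RULES, default='core'):
    """Return the set of facets whose pattern matches ``path``."""
    matched = {facet for pattern, facet in rules if fnmatch.fnmatch(path, pattern)}
    return matched or {default}

print(facets_for_path('project/tests/test_api.py'))  # {'tests'}
print(facets_for_path('project/docs/index.md'))      # {'docs'}
```

A classifier could later replace or complement such hand-written rules, but a plain rule table keeps the behavior easy to audit.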
Here are some existing tools for general file type and programming language detection:
In use today:
- Python stdlib mimetypes module: MIME type detection based on file extensions only, AFAIK. We use it today.
- libmagic: we use it through our own ctypes binding; it would need to be upgraded to the latest libmagic as part of the project.
- Pygments lexers: a code lexing and syntax highlighting library that also detects programming languages as a side effect. This is also what GitHub used in Linguist a while back.
(We also use a Shannon entropy detector and binaryornot to detect binaries; see the combined sketch below.)
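For reference, a minimal sketch (assuming the binaryornot and Pygments packages are installed) of how the pieces listed above can be combined; this is illustrative only and not ScanCode's actual detection code:

```python
# Illustrative combination of the detectors in use today:
# stdlib mimetypes (extension-based), binaryornot and Pygments lexers.
import mimetypes

from binaryornot.check import is_binary
from pygments.lexers import guess_lexer_for_filename
from pygments.util import ClassNotFound

def detect(path):
    """Return (mimetype, programming_language) guesses for a file."""
    mimetype, _encoding = mimetypes.guess_type(path)  # extension-based only
    if is_binary(path):
        return mimetype, None
    with open(path, encoding='utf-8', errors='replace') as f:
        text = f.read(8192)  # a sample is enough for lexer guessing
    try:
        language = guess_lexer_for_filename(path, text).name
    except ClassNotFound:
        language = None
    return mimetype, language

print(detect('setup.py'))  # e.g. ('text/x-python', 'Python')
```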
Things to look at could include:
- freedesktop shared-mime-info: a signature-based approach and the gold standard on Linux desktops and beyond. There are a few Python libraries that support it.
- GitHub Linguist: written in Ruby, used to count LOC and detect languages. Uses a combination of signatures/lexers from Sublime Text with a naive Bayes classifier on top, AFAICR (a minimal classifier sketch follows this list).
- douban linguist: a Python port of GitHub Linguist; interesting but not very active.
- enry: a Go port of GH linguist
- ohcount: uses Ragel-based lexers.
- guesslang (https://github.com/yoeo/guesslang): uses TensorFlow.
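To make the classifier-on-top idea concrete, here is a toy sketch of the Linguist-style approach: a naive Bayes model over token counts, using scikit-learn purely for illustration. The tiny training set and labels are made up; a real implementation would train on a large labeled corpus and combine the classifier with extension and signature heuristics.

```python
# Toy naive Bayes language classifier over token counts (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up, minimal training data: (source text, language label) pairs.
samples = [
    ('def main():\n    print("hello")', 'Python'),
    ('import os\nfor x in range(3): pass', 'Python'),
    ('function main() { console.log("hi"); }', 'JavaScript'),
    ('const x = require("fs");', 'JavaScript'),
]
texts, labels = zip(*samples)

# Tokenize on word characters so keywords and identifiers become features.
model = make_pipeline(CountVectorizer(token_pattern=r'\w+'), MultinomialNB())
model.fit(texts, labels)

print(model.predict(['console.log("bye");']))  # likely ['JavaScript']
```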