Improve Programming language detection and classification #1445

Open
@pombredanne

Description

ScanCode's programming language detection is not as accurate as it could be, and getting it right is important to drive further automation. We also need to automatically classify each file into facets when possible.

The goal of this ticket is twofold: improve the quality of programming language detection (which relies only on Pygments today and could use another tool, e.g. a Bayesian classifier such as GitHub Linguist or enry), and design and implement a flexible framework of rules to automate assigning files to facets, possibly using machine learning and classifiers. A rough sketch of what such a rule framework could look like is shown below.

See https://github.com/nexB/aboutcode/wiki/GSOC-2019#improve-programming-language-detection-and-classification-in-scancode
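As a rough illustration only (the facet names and path patterns below are hypothetical placeholders, not an agreed design), such a rule framework could start as a simple mapping from glob-style path patterns to facets, on top of which a learned classifier could later be layered:

```python
# Hypothetical sketch of a rule-driven facet classifier; the patterns and
# facet names are illustrative placeholders, not ScanCode code.
from fnmatch import fnmatch

# Each rule maps a glob-style path pattern to a facet name.
FACET_RULES = [
    ('*/tests/*', 'tests'),
    ('*/test/*', 'tests'),
    ('*/docs/*', 'docs'),
    ('*.md', 'docs'),
    ('*/examples/*', 'examples'),
]


def classify_facet(path, default='core'):
    """Return the facet of the first matching rule, or ``default``."""
    for pattern, facet in FACET_RULES:
        if fnmatch(path, pattern):
            return facet
    return default


print(classify_facet('project/tests/test_api.py'))  # tests
print(classify_facet('project/src/main.c'))         # core
```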

Here are some tools for general filetype and programming language detection.

In use today:

  • Python stdlib mimetypes detection: based on file extensions only; we use it.
  • libmagic: we use it through our own ctypes binding, and it would need to be upgraded to the latest libmagic as part of this project.
  • Pygments lexers: this is a code lexing and highlighting library, so it also detects programming languages as a side effect. It is also what GitHub's Linguist used a while back.

(We also use a Shannon entropy detector and binaryornot to detect binaries; a minimal sketch combining some of these tools follows below.)
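For reference, here is a minimal sketch of how a first-pass guess could combine the Python stdlib, Pygments and binaryornot (libmagic is left out for brevity). This is only an illustration under those assumptions, not ScanCode's actual detection code:

```python
# Minimal sketch of a first-pass filetype/language guess; this is an
# illustration only, not ScanCode's actual detection pipeline.
import mimetypes

from binaryornot.check import is_binary
from pygments.lexers import guess_lexer_for_filename
from pygments.util import ClassNotFound


def guess_file_type(path):
    """Return a (mimetype, language, is_binary) triple for ``path``."""
    # Extension-based only, as noted above for the stdlib detection.
    mimetype, _encoding = mimetypes.guess_type(path)
    binary = is_binary(path)

    language = None
    if not binary:
        try:
            with open(path, encoding='utf-8', errors='replace') as f:
                text = f.read(8192)  # a small sample is enough for guessing
            # Pygments picks a lexer from the filename and content;
            # the lexer name doubles as the detected language.
            language = guess_lexer_for_filename(path, text).name
        except ClassNotFound:
            pass

    return mimetype, language, binary
```

A real implementation would also fold in libmagic results and weigh the different detectors against each other rather than trusting any single one.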

Things to look at could include:

See also: #1036, #1012, #426, #1355, #1201
