Description
ScanCode currently reports the following 9 fields for determining the content type of a file:
- MIME Type
- File Type
- Language
- Is Binary
- Is Text
- Is Archive
- Is Media
- Is Source
- Script
When organizing a codebase for analysis, it would be useful to consolidate this data into a single field that indicates the level and type of analysis to apply. The focus should be on identifying "program/code" files, both source and binary, that are copyrightable and likely to be licensed. Many specialized file types (e.g. those used only by a proprietary software program) will not be covered by this feature.
In the prototype phase, we will use a ScanCode plugin to analyze a scan and annotate it with a new primary_content_type field. This field designates the primary Content Type (primary for analysis purposes) in the format Language-Type.
An initial test could be to report a primary Content Type that distinguishes between SourceCode and Scripts for files written in programming languages that have both (e.g. Python, Ruby). The current set of Content Type data should contain patterns that make this distinction possible.
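To make the proposal concrete, here is a minimal sketch of the Language-Type labeling applied to scan results that have already been loaded as a Python dict. It is not a ScanCode plugin and does not use the plugin API. The keys `files`, `programming_language`, `is_source` and `is_script` are assumptions about the shape of the scan data, and the rule that a script flag takes precedence over a source flag is an illustrative guess, not an agreed design.

```python
from typing import Optional


def primary_content_type(resource: dict) -> Optional[str]:
    """Return a "Language-Type" label for one scanned file, or None.

    Assumed keys (not confirmed against ScanCode's output):
    programming_language, is_source, is_script.
    """
    language = resource.get("programming_language")
    if not language:
        # No detected language: outside the scope of this initial test.
        return None
    if resource.get("is_script"):
        # Assumption: a file flagged as a script is labeled Script
        # even if it is also flagged as source.
        kind = "Script"
    elif resource.get("is_source"):
        kind = "SourceCode"
    else:
        return None
    return f"{language}-{kind}"


def annotate(scan: dict) -> dict:
    """Add a primary_content_type field to every file entry, in place."""
    for resource in scan.get("files", []):
        resource["primary_content_type"] = primary_content_type(resource)
    return scan


# Hypothetical scan data, for illustration only.
scan = annotate({
    "files": [
        {"path": "tool/cli.py", "programming_language": "Python",
         "is_source": True, "is_script": True},
        {"path": "tool/core.py", "programming_language": "Python",
         "is_source": True, "is_script": False},
        {"path": "logo.png", "programming_language": None,
         "is_source": False, "is_script": False},
    ]
})
labels = [f["primary_content_type"] for f in scan["files"]]
print(labels)  # ['Python-Script', 'Python-SourceCode', None]
```

Files with no detected language are left as None here, which matches the stated focus on program/code files. Binary, archive and media types would need their own rules.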