New summary/primary Content Type prototype #1754

ScanCode currently reports the following 9 fields for determining the content type of a file:
- MIME Type
- File Type
- Language
- Is Binary
- Is Text
- Is Archive
- Is Media
- Is Source
- Script

When organizing a codebase for analysis, it would be useful to consolidate this data into a single field that indicates the level and type of analysis to apply. The focus should be on identifying source and binary "program/code" files that are copyrightable and likely to be licensed. There will be many specialized file types (e.g. files used only by a proprietary software program) that will not be covered by this feature.

In the prototype phase, we will use a ScanCode plugin to analyze a Scan and annotate each file with a new primary_content_type field that designates the primary Content Type (primary for analysis) in the format: Language-Type.
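As a rough illustration of the Language-Type format, here is a minimal Python sketch that annotates already-loaded scan data. It is not the plugin itself and does not use the ScanCode plugin API. It assumes the scan is a dict with a "files" list whose entries carry keys such as programming_language, is_script and is_source. The helper names, the precedence order and the type labels other than SourceCode and Script are hypothetical.

```python
def primary_content_type(resource: dict) -> str:
    """Return a hypothetical 'Language-Type' label for one scanned file."""
    # Assumption: the language is stored under "programming_language".
    language = resource.get("programming_language") or "Unknown"

    # Assumption: this precedence order is only one plausible choice.
    # Scripts are checked before source because a script may carry both flags.
    if resource.get("is_archive"):
        kind = "Archive"
    elif resource.get("is_media"):
        kind = "Media"
    elif resource.get("is_script"):
        kind = "Script"
    elif resource.get("is_source"):
        kind = "SourceCode"
    elif resource.get("is_binary"):
        kind = "Binary"
    elif resource.get("is_text"):
        kind = "Text"
    else:
        kind = "Other"
    return f"{language}-{kind}"


def annotate_scan(scan: dict) -> dict:
    """Add a primary_content_type field to every file entry of a scan, in place."""
    for resource in scan.get("files", []):
        # Directories have no content type of their own, so skip them.
        if resource.get("type") == "file":
            resource["primary_content_type"] = primary_content_type(resource)
    return scan
```

The precedence order is the main design decision here: a file flagged as both source and script gets the more specific Script label.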

An initial test can be to report a primary Content Type that distinguishes between SourceCode and Scripts for files written in Programming Languages (e.g. Python, Ruby) that have both. There should be patterns in the current set of Content Type data that make this distinction possible.
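One way to start that test is to check which languages in a scan actually show both kinds of file. The sketch below assumes each file entry is a dict with programming_language, is_source and is_script keys. The function names are hypothetical, and treating is_script as taking priority over is_source is an assumption, not something the existing data guarantees.

```python
from collections import defaultdict


def kinds_by_language(resources: list) -> dict:
    """Map each language to the set of kinds ("SourceCode", "Script") seen for it."""
    kinds = defaultdict(set)
    for resource in resources:
        language = resource.get("programming_language")
        if not language:
            continue  # no detected language, nothing to classify
        # Assumption: a script may also be flagged as source, so check it first.
        if resource.get("is_script"):
            kinds[language].add("Script")
        elif resource.get("is_source"):
            kinds[language].add("SourceCode")
    return dict(kinds)


def languages_with_both(resources: list) -> list:
    """Return the sorted languages that have both SourceCode and Script files."""
    both = {"SourceCode", "Script"}
    return sorted(
        language
        for language, seen in kinds_by_language(resources).items()
        if both <= seen
    )
```

Running languages_with_both over the "files" list of a real scan would show whether the existing flags separate the two kinds at all, before any new field is added.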

