Proposal: high level file classification #426

Open
@pombredanne

To support #377 and other scan-based deduction and related refinements, an important step is to "classify" the files in the codebase being scanned. This would mean defining a few high-level buckets and heuristics to classify a file into a bucket.

With such a classification, smarter results could be provided: for instance, the license of documentation files or build scripts does not have the same impact as the license of the main code (and these files are often not part of a build or of the software redistributed in a system or app).

I am opening this up for discussion to define the classifications. I think there should be as few classifications as possible. They could be part of a hierarchy, but flat is probably better and simpler.

Here is a first shot at what these classes could be:

  • main code: all the code proper that is actually built and used when a piece of software is used.
  • build scripts: such as Makefiles, POMs, CMake lists, etc.
  • test code: any code used for testing the main code, whether unit, integration or other tests. In many cases this is stored in a tree separate from the main code and is often not part of the build meant to be used, but is instead invoked during a build step (make check or similar).
  • doc: any code and documents that document the code and often are not part of the built code. This often includes generated API docs.
  • assets and media: such as images, video, sounds, fonts, etc., often used in GUI and web apps. They often have different licenses and origins from the main code.
  • dev tools: scripts, binaries, packages, etc. present in the codebase but meant for development and not production. Frequently, their provenance and license have little impact on the resulting licensing of the main built code.
  • metadata/metafiles: such as package manifests, LICENSE or COPYING files, etc. that describe top-level information for a codebase or a subset of it.
  • generated code: such as the output of a parser generator such as Bison/lex, an ORM such as Hibernate, or WSDL, among others. This may represent a large volume of code at times and may not have a directly identified provenance, which needs to be traced to the "descriptor" used to generate the code. It may or may not contain injected code under various licenses (such as a Bison skeleton).
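To make the discussion concrete, here is a minimal sketch of what path-based heuristics for a flat classification could look like. Everything in it is an assumption for illustration: the bucket names, the `RULES` table, the `classify` function and the chosen directory names, file names and suffixes are hypothetical and not part of any existing scan code. Generated code is left out because spotting it would likely need a look at file content rather than the path alone.

```python
from pathlib import PurePosixPath

# Hypothetical rules, for illustration only:
# (bucket, directory names, file names, file suffixes), all lowercase.
# Order matters: the first matching bucket wins.
RULES = [
    ("test", {"test", "tests", "testing"}, set(), set()),
    ("metadata", set(), {"license", "copying", "notice", "package.json"}, set()),
    ("build", set(), {"makefile", "pom.xml", "cmakelists.txt"}, {".mk", ".cmake"}),
    ("doc", {"doc", "docs"}, set(), {".md", ".rst"}),
    ("media", {"assets", "images", "fonts"}, set(),
     {".png", ".jpg", ".svg", ".ttf", ".mp3", ".mp4"}),
    ("dev", {"tools", "scripts"}, set(), set()),
]


def classify(path: str) -> str:
    """Return a single bucket for a POSIX-style path relative to the codebase root."""
    p = PurePosixPath(path.lower())
    dirs = set(p.parts[:-1])  # every path segment except the file name
    for bucket, dir_names, file_names, suffixes in RULES:
        if dirs & dir_names or p.name in file_names or p.suffix in suffixes:
            return bucket
    # Anything that matches no rule is assumed to be main code.
    return "main"


for sample in ("src/app/core.c", "Makefile", "tests/test_core.py", "LICENSE"):
    print(sample, "->", classify(sample))
```

Falling back to "main code" when nothing matches keeps the rule table short, at the cost of over-reporting main code in unusual layouts.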

Note that a file may end up in more than one class... not sure this would be a good thing.
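As a small illustration of how a file could land in more than one class, the sketch below collects every class hinted at by a path instead of picking one. The hint tables and the `classes_for` function are hypothetical names made up for this example, and it assumes Python 3.9 or later for the `set[str]` annotation.

```python
from pathlib import PurePosixPath

# Hypothetical hints, for illustration only.
DIR_HINTS = {"tests": "test", "docs": "doc", "tools": "dev"}
SUFFIX_HINTS = {".png": "media", ".ttf": "media", ".md": "doc"}


def classes_for(path: str) -> set[str]:
    """Return every class suggested by the directories and suffix of a path."""
    p = PurePosixPath(path.lower())
    found = {DIR_HINTS[d] for d in p.parts[:-1] if d in DIR_HINTS}
    if p.suffix in SUFFIX_HINTS:
        found.add(SUFFIX_HINTS[p.suffix])
    # No hint at all: assume main code.
    return found or {"main"}


# An image used as a test fixture is both test material and a media asset.
print(sorted(classes_for("tests/data/logo.png")))
```

Whether such a file should keep both classes or be resolved to a single one by some priority order is exactly the open question here.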

Besides this classification, determining whether a file is built or not, and deployed or not as part of a production build, is another topic altogether that is not covered explicitly here.
