Skip to content

feat(scanner): Merge duplicate scan results that share a provenance #10502

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 29, 2025

Conversation

maennchen
Copy link
Contributor

@maennchen maennchen commented Jun 19, 2025

TODO

When the SpdxDocumentFile package manager is used, the project and all contained packages often resolve to the same VCS provenance (e.g. the root of the Git repository).
Before this change ORT stored two separate ScanResults for such a provenance – one keyed to the project, one keyed to the package.

That caused two follow-on problems:

  • Both results appeared in the OrtResult, so evaluators saw duplicate findings for the same source tree.
  • Because projects and packages are handled by different rules the package result was additionally padded with a SpdxConstants.NONE finding whenever includeFilesWithoutFindings was enabled. The evaluator therefore compared real license findings from the project result with NONE from the package result and failed with a violation.

This patch

  • merges ScannerRun, and
  • performs the "pad with NONE" step only after deduplication, so every path is represented exactly once.

As a consequence the evaluator now receives one consistent set of license findings per provenance / scanner, eliminating the false mismatch.

This is the first time for me writing Kotlin. Sorry if the code is not up to the usual standards.

@maennchen maennchen requested a review from a team as a code owner June 19, 2025 18:38
@maennchen maennchen force-pushed the jm/scan_deduplication branch from 038c082 to 40190e3 Compare June 19, 2025 18:42
@maennchen maennchen changed the title scanner: Merge duplicate scan results that share a provenance feat(scanner): Merge duplicate scan results that share a provenance Jun 19, 2025
Copy link

codecov bot commented Jun 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.30%. Comparing base (e2dd087) to head (d536a3e).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #10502      +/-   ##
============================================
+ Coverage     57.28%   57.30%   +0.01%     
- Complexity     1644     1648       +4     
============================================
  Files           341      341              
  Lines         12722    12722              
  Branches       1206     1206              
============================================
+ Hits           7288     7290       +2     
+ Misses         4971     4969       -2     
  Partials        463      463              
Flag Coverage Δ
funTest-docker 71.28% <ø> (ø)
funTest-non-docker 33.00% <ø> (-0.01%) ⬇️
test-ubuntu-24.04 41.77% <ø> (ø)
test-windows-2022 41.75% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@maennchen maennchen force-pushed the jm/scan_deduplication branch from 40190e3 to ed1df5f Compare June 19, 2025 19:37
Copy link
Member

@MarcelBochtler MarcelBochtler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @maennchen,
thanks for your contribution!
Do you mind adding a test that demonstrates the issue with the duplicated scan results and that your changes fix these issues?

Edit: As there is currently no existing test for the scan function that use an ORT result, I won't insist on this change.

Edit 2: I Found these existing tests which already test the scan() function:

"Scanning all packages corresponding to a single VCS" should {
val analyzerResult = createAnalyzerResult(pkg0, pkg1, pkg2, pkg3, pkg4)
val ortResult = createScanner().scan(analyzerResult, skipExcluded = false, emptyMap())
"return the expected ORT result" {
val expectedResult = readResource("/scanner-integration-all-pkgs-expected-ort-result.yml")
patchActualResult(ortResult.toYaml(), patchStartAndEndTime = true) shouldBe
patchExpectedResult(expectedResult)
}
"return the expected (merged) scan results" {
val expectedResult = readResource("/scanner-integration-expected-scan-results.yml")
val scanResults = ortResult.getScanResults().toSortedMap()
patchActualResult(scanResults.toYaml(), patchStartAndEndTime = true) shouldBe
patchExpectedResult(expectedResult)
}
"return the expected (merged) file lists" {
val expectedResult = readResource("/scanner-integration-expected-file-lists.yml")
val fileLists = ortResult.getFileLists().toSortedMap()
fileLists.toYaml() shouldBe patchExpectedResult(expectedResult)
}
}
"Scanning a subset of the packages corresponding to a single VCS" should {
"return the expected ORT result" {
val analyzerResult = createAnalyzerResult(pkg1, pkg3)
val expectedResult = readResource("/scanner-integration-subset-pkgs-expected-ort-result.yml")
val ortResult = createScanner().scan(analyzerResult, skipExcluded = false, emptyMap())
patchActualResult(ortResult.toYaml(), patchStartAndEndTime = true) shouldBe
patchExpectedResult(expectedResult)
}
}

Can you have a look at them and extend them to test your use-case?

@maennchen
Copy link
Contributor Author

maennchen commented Jun 25, 2025

@MarcelBochtler I added a test to test the deduplication. Running the same test on main will cause duplicated information.

$ ./gradlew scanner:funTest --tests "org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest"

Parallel Configuration Cache is an incubating feature.
Calculating task graph as configuration cache cannot be reused because a build logic input of type 'SemInfoVersionValueSource' has changed.
Type-safe project accessors is an incubating feature.

> Configure project :
Building ORT version 61.1.0.

> Task :scanner:funTest

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning all packages corresponding to a single VCS should > return the expected ORT result STARTED

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning all packages corresponding to a single VCS should > return the expected ORT result PASSED

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning all packages corresponding to a single VCS should > return the expected (merged) scan results STARTED

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning all packages corresponding to a single VCS should > return the expected (merged) scan results PASSED

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning all packages corresponding to a single VCS should > return the expected (merged) file lists STARTED

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning all packages corresponding to a single VCS should > return the expected (merged) file lists PASSED

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning a subset of the packages corresponding to a single VCS should > return the expected ORT result STARTED

org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest > Scanning a subset of the packages corresponding to a single VCS should > return the expected ORT result FAILED
    io.kotest.assertions.AssertionFailedError: expected:<[Deletion at line 298]           end_line: -1
      scanners:
        Dummy::pkg1:1.0.0:
        - "Dummy"
        Dummy::pkg3:1.0.0:
        - "Dummy"
        Dummy::project:1.0.0:
        - "Dummy"
      files:
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner-subrepo.git"
            revision: "a732695e03efcbd74539208af98c297ee86e49d5"
            path: ""
          resolved_revision: "a732695e03efcbd74539208af98c297ee86e49d5"
        files:
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "ae8044f7fce7ee914a853c30c3085895e9be8b9c"
        - path: "pkg-s1/pkg-s1.txt"
          sha1: "e5fb17f8f4f4ef0748bb5ba137fd0e091dd5a1f6"
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner-subrepo2.git"
            revision: "6431fd85188db22b942deb66c7a8c1a53921fc35"
            path: ""
          resolved_revision: "6431fd85188db22b942deb66c7a8c1a53921fc35"
        files:
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "ae8044f7fce7ee914a853c30c3085895e9be8b9c"
        - path: "pkg-s2/pkg-s2.txt"
          sha1: "37996d13eceb6b29db43a381ce8df375b5eee8e9"
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner.git"
            revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
            path: ""
          resolved_revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
        files:
        - path: ".gitmodules"
          sha1: "d7f070ddbe0b6dd8a173714d565a1240dd96eacd"
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "82cfc115138054ce5b5e6839f38687c9d7186710"
        - path: "pkg1/pkg1.txt"
          sha1: "22eb73bd30d47540a4e05781f0f6e07640857cae"
        - path: "pkg2/pkg2.txt"
          sha1: "cc8f97cebe1dc0ed889a31f504bcf491d5241aaa"
        - path: "pkg3/pkg3.txt"
          sha1: "859d66be2d153968cdaa8ec7265270c241eea024"
        - path: "pkg4/pkg4.txt"
          sha1: "3cba29011be2b9d59f6204d6fa0a386b1b2dbd90"
    advisor: null
    evaluator: null
    resolved_configuration: {}


    [Deletion at line 386] > but was:<[Deletion at line 298]           end_line: -1
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner.git"
            revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
            path: ""
          resolved_revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
        scanner:
          name: "Dummy"
          version: "1.0.0"
          configuration: ""
        summary:
          start_time: "1970-01-01T00:00:00Z"
          end_time: "1970-01-01T00:00:00Z"
          licenses:
          - license: "NOASSERTION"
            location:
              path: "LICENSE"
              start_line: -1
              end_line: -1
          - license: "NOASSERTION"
            location:
              path: "pkg1/pkg1.txt"
              start_line: -1
              end_line: -1
          - license: "NOASSERTION"
            location:
              path: "pkg3/pkg3.txt"
              start_line: -1
              end_line: -1
      scanners:
        Dummy::pkg1:1.0.0:
        - "Dummy"
        Dummy::pkg3:1.0.0:
        - "Dummy"
        Dummy::project:1.0.0:
        - "Dummy"
      files:
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner-subrepo.git"
            revision: "a732695e03efcbd74539208af98c297ee86e49d5"
            path: ""
          resolved_revision: "a732695e03efcbd74539208af98c297ee86e49d5"
        files:
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "ae8044f7fce7ee914a853c30c3085895e9be8b9c"
        - path: "pkg-s1/pkg-s1.txt"
          sha1: "e5fb17f8f4f4ef0748bb5ba137fd0e091dd5a1f6"
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner-subrepo2.git"
            revision: "6431fd85188db22b942deb66c7a8c1a53921fc35"
            path: ""
          resolved_revision: "6431fd85188db22b942deb66c7a8c1a53921fc35"
        files:
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "ae8044f7fce7ee914a853c30c3085895e9be8b9c"
        - path: "pkg-s2/pkg-s2.txt"
          sha1: "37996d13eceb6b29db43a381ce8df375b5eee8e9"
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner.git"
            revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
            path: ""
          resolved_revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
        files:
        - path: ".gitmodules"
          sha1: "d7f070ddbe0b6dd8a173714d565a1240dd96eacd"
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "82cfc115138054ce5b5e6839f38687c9d7186710"
        - path: "pkg1/pkg1.txt"
          sha1: "22eb73bd30d47540a4e05781f0f6e07640857cae"
        - path: "pkg2/pkg2.txt"
          sha1: "cc8f97cebe1dc0ed889a31f504bcf491d5241aaa"
        - path: "pkg3/pkg3.txt"
          sha1: "859d66be2d153968cdaa8ec7265270c241eea024"
        - path: "pkg4/pkg4.txt"
          sha1: "3cba29011be2b9d59f6204d6fa0a386b1b2dbd90"
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner.git"
            revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
            path: ""
          resolved_revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
        files:
        - path: "pkg1/pkg1.txt"
          sha1: "22eb73bd30d47540a4e05781f0f6e07640857cae"
        - path: "pkg3/pkg3.txt"
          sha1: "859d66be2d153968cdaa8ec7265270c241eea024"
    advisor: null
    evaluator: null
    resolved_configuration: {}


    [Deletion at line 356]         path: ""
          resolved_revision: "6431fd85188db22b942deb66c7a8c1a53921fc35"
        files:
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "ae8044f7fce7ee914a853c30c3085895e9be8b9c"
        - path: "pkg-s2/pkg-s2.txt"
          sha1: "37996d13eceb6b29db43a381ce8df375b5eee8e9"
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner.git"
            revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
            path: ""
          resolved_revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
        files:
        - path: ".gitmodules"
          sha1: "d7f070ddbe0b6dd8a173714d565a1240dd96eacd"
        - path: "LICENSE"
          sha1: "7df059597099bb7dcf25d2a9aedfaf4465f72d8d"
        - path: "README"
          sha1: "82cfc115138054ce5b5e6839f38687c9d7186710"
        - path: "pkg1/pkg1.txt"
          sha1: "22eb73bd30d47540a4e05781f0f6e07640857cae"
        - path: "pkg2/pkg2.txt"
          sha1: "cc8f97cebe1dc0ed889a31f504bcf491d5241aaa"
        - path: "pkg3/pkg3.txt"
          sha1: "859d66be2d153968cdaa8ec7265270c241eea024"
        - path: "pkg4/pkg4.txt"
          sha1: "3cba29011be2b9d59f6204d6fa0a386b1b2dbd90"
      - provenance:
          vcs_info:
            type: "Git"
            url: "https://github.com/oss-review-toolkit/ort-test-data-scanner.git"
            revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
            path: ""
          resolved_revision: "97d57bb4795bc41f496e1a8e2c7751cefc7da7ec"
        files:
        - path: "pkg1/pkg1.txt"
          sha1: "22eb73bd30d47540a4e05781f0f6e07640857cae"
        - path: "pkg3/pkg3.txt"
          sha1: "859d66be2d153968cdaa8ec7265270c241eea024"
    advisor: null
    evaluator: null
    resolved_configuration: {}
    >

16:50:31.855 [ForkJoinPool-1-worker-1] DEBUG org.eclipse.jgit.internal.util.ShutdownHook - Cleanup org.eclipse.jgit.util.FS$FileStoreAttributes$$Lambda/0x00007eabf0386c10@e239dec during JVM shutdown

> Task :scanner:funTest FAILED

4 tests completed, 1 failed

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':scanner:funTest'.
> There were failing tests. See the report at: file:///workspaces/ort/scanner/build/reports/tests/funTest/index.html

* Try:
> Run with --scan to get full insights.

BUILD FAILED in 53s
81 actionable tasks: 52 executed, 29 up-to-date
Configuration cache entry stored.

@maennchen maennchen force-pushed the jm/scan_deduplication branch from 4432de1 to 263be95 Compare June 25, 2025 17:06
@maennchen maennchen force-pushed the jm/scan_deduplication branch from 263be95 to 276bc2f Compare June 25, 2025 18:59
MarcelBochtler
MarcelBochtler previously approved these changes Jun 25, 2025
Copy link
Member

@MarcelBochtler MarcelBochtler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the test. This really helps to understand the necessity of this change.

The change looks good to me. Any other opinions @oss-review-toolkit/kotlin-devs ?

@maennchen
Copy link
Contributor Author

The code can be seen in action here: https://github.com/elixir-lang/elixir/actions/runs/16086802339/job/45398947306
(See the Action Artifacts for Details)

@maennchen maennchen marked this pull request as draft July 10, 2025 09:02
maennchen added a commit to maennchen/ort that referenced this pull request Jul 10, 2025
Prep work for PR oss-review-toolkit#10502.

* Add ScanResult.padNoneLicenseFindings()
  Inserts one synthetic
  LicenseFinding(SpdxConstants.NONE, UNKNOWN_LINE) for every path that lacks
  a license finding.

* Add ScannerRun.padNoneLicenseFindings()
  Calls the above helper for each contained ScanResult, using the run’s
  FileList data to obtain the full set of paths.

* Replace the in-place code in Scanner with a single call to
  ScannerRun.padNoneLicenseFindings() when
  includeFilesWithoutFindings is true.

* Introduce ScannerRunTest.padNoneLicenseFindings()
  Verifies that a path without findings receives exactly one NONE entry.

This refactor keeps behaviour unchanged while isolating the padding logic in
the domain model, improving reuse and testability and keeping later PRs
focused.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
maennchen added a commit to maennchen/ort that referenced this pull request Jul 10, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

* Add ScanResult.padNoneLicenseFindings(paths)
  Creates a copy whose summary contains a single
  LicenseFinding(SpdxConstants.NONE, UNKNOWN_LINE) for every path in
  *paths* that currently lacks a licence finding.

* Add ScannerRun.padNoneLicenseFindings()
  Uses the run’s FileList map to obtain all paths per provenance and
  delegates to the ScanResult helper for every contained result.

* Remove the ad-hoc padding code from Scanner.kt and replace it with a
  single call to scannerRun.padNoneLicenseFindings() when
  includeFilesWithoutFindings is enabled.

* Introduce ScannerRunTest.padNoneLicenseFindings()
  Ensures a file without licence findings receives exactly one NONE entry.

This refactor keeps behaviour unchanged while centralising the padding
logic in the model layer, improving reuse and testability.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
maennchen added a commit to maennchen/ort that referenced this pull request Jul 10, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

* Add ScanResult.padNoneLicenseFindings(paths)
  Creates a copy whose summary contains a single
  LicenseFinding(SpdxConstants.NONE, UNKNOWN_LINE) for every path in
  *paths* that currently lacks a licence finding.

* Add ScannerRun.padNoneLicenseFindings()
  Uses the run’s FileList map to obtain all paths per provenance and
  delegates to the ScanResult helper for every contained result.

* Remove the ad-hoc padding code from Scanner.kt and replace it with a
  single call to scannerRun.padNoneLicenseFindings() when
  includeFilesWithoutFindings is enabled.

* Introduce ScannerRunTest.padNoneLicenseFindings()
  Ensures a file without licence findings receives exactly one NONE entry.

This refactor keeps behaviour unchanged while centralising the padding
logic in the model layer, improving reuse and testability.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
maennchen added a commit to maennchen/ort that referenced this pull request Jul 10, 2025
…ovenance

Prep work for upcoming PR oss-review-toolkit#10502.

* `ScannerRun.init`
  - Add a `require` check that no two `ScanResult`s share the same
    provenance.
  - Add a similar check that no two `FileList`s share the same provenance.
  The offending provenance is rendered via `toYaml()` to aid debugging.

* `ScannerRunTest`
  - Introduce two unit tests that assert an `IllegalArgumentException` is
    thrown when duplicates are present for scan results or file lists,
    respectively.

These runtime checks make improper scan runs surface early and provide clear
error messages, which will simplify the upcoming deduplication work.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
maennchen added a commit to maennchen/ort that referenced this pull request Jul 10, 2025
…ovenance

Prep work for upcoming PR oss-review-toolkit#10502.

* `ScannerRun.init`
  - Add a `require` check that no two `ScanResult`s share the same
    provenance and scanner.
  - Add a similar check that no two `FileList`s share the same provenance.
  The offending provenance is rendered via `toYaml()` to aid debugging.

* `ScannerRunTest`
  - Introduce two unit tests that assert an `IllegalArgumentException` is
    thrown when duplicates are present for scan results or file lists,
    respectively.

These runtime checks make improper scan runs surface early and provide clear
error messages, which will simplify the upcoming deduplication work.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
maennchen added a commit to maennchen/ort that referenced this pull request Jul 10, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

* Add ScanResult.padNoneLicenseFindings(paths)
  Creates a copy whose summary contains a single
  LicenseFinding(SpdxConstants.NONE, UNKNOWN_LINE) for every path in
  *paths* that currently lacks a licence finding.

* Add ScannerRun.padNoneLicenseFindings()
  Uses the run’s FileList map to obtain all paths per provenance and
  delegates to the ScanResult helper for every contained result.

* Remove the ad-hoc padding code from Scanner.kt and replace it with a
  single call to scannerRun.padNoneLicenseFindings() when
  includeFilesWithoutFindings is enabled.

* Introduce ScannerRunTest.padNoneLicenseFindings()
  Ensures a file without licence findings receives exactly one NONE entry.

This refactor keeps behaviour unchanged while centralising the padding
logic in the model layer, improving reuse and testability.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
maennchen added a commit to maennchen/ort that referenced this pull request Jul 10, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

* Add ScanResult.padNoneLicenseFindings(paths)
  Creates a copy whose summary contains a single
  LicenseFinding(SpdxConstants.NONE, UNKNOWN_LINE) for every path in
  *paths* that currently lacks a licence finding.

* Add ScannerRun.padNoneLicenseFindings()
  Uses the run’s FileList map to obtain all paths per provenance and
  delegates to the ScanResult helper for every contained result.

* Remove the ad-hoc padding code from Scanner.kt and replace it with a
  single call to scannerRun.padNoneLicenseFindings() when
  includeFilesWithoutFindings is enabled.

* Introduce ScannerRunTest.padNoneLicenseFindings()
  Ensures a file without licence findings receives exactly one NONE entry.

This refactor keeps behaviour unchanged while centralising the padding
logic in the model layer, improving reuse and testability.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
maennchen added a commit to maennchen/ort that referenced this pull request Jul 10, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

* `ScannerRun.init`
  - Add a `require` check that no two `ScanResult`s share the same
    provenance and scanner.
  - Add a similar check that no two `FileList`s share the same provenance.
  The offending provenance is rendered via `toYaml()` to aid debugging.

* `ScannerRunTest`
  - Introduce two unit tests that assert an `IllegalArgumentException` is
    thrown when duplicates are present for scan results or file lists,
    respectively.

These runtime checks make improper scan runs surface early and provide clear
error messages, which will simplify the upcoming deduplication work.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
maennchen added a commit to maennchen/ort that referenced this pull request Jul 10, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

* `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now
  implement `operator fun plus(...)`, enabling declarative, type-safe
  merging of scan data.
    * Each operator validates invariants (e.g. identical provenance,
      identical scanner / environment / config) and throws an
      `IllegalArgumentException` on mismatch.
    * For compatible objects, collections are union-merged; time ranges
      are widened (`startTime = min`, `endTime = max`).

* `ScannerRun.plus()`
    * Merges file lists and scan results by `provenance to scanner`
      using the new operators.
    * Combines issues / scanners maps by key union.
    * Carries forward earliest start and latest end timestamps.

* **Unit tests**
    * Extensive coverage in `ScannerRunTest` for all merge scenarios,
      including negative cases (different config / environment) and
      positive cases for times, provenances, issues, scanners, file
      lists and scan results.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
@maennchen maennchen force-pushed the jm/scan_deduplication branch from ed52821 to 56c12eb Compare July 10, 2025 14:51
@maennchen maennchen requested review from sschuberth and fviernau July 10, 2025 18:21
maennchen added a commit to maennchen/ort that referenced this pull request Jul 11, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

* Add ScanResult.padNoneLicenseFindings(paths)
  Creates a copy whose summary contains a single
  LicenseFinding(SpdxConstants.NONE, UNKNOWN_LINE) for every path in
  *paths* that currently lacks a licence finding.

* Add ScannerRun.padNoneLicenseFindings()
  Uses the run’s FileList map to obtain all paths per provenance and
  delegates to the ScanResult helper for every contained result.

* Remove the ad-hoc padding code from Scanner.kt and replace it with a
  single call to scannerRun.padNoneLicenseFindings() when
  includeFilesWithoutFindings is enabled.

* Introduce ScannerRunTest.padNoneLicenseFindings()
  Ensures a file without licence findings receives exactly one NONE entry.

This refactor keeps behaviour unchanged while centralising the padding
logic in the model layer, improving reuse and testability.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
maennchen added a commit to maennchen/ort that referenced this pull request Jul 11, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
maennchen added a commit to maennchen/ort that referenced this pull request Jul 11, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
maennchen added a commit to maennchen/ort that referenced this pull request Jul 11, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

* `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now
  implement `operator fun plus(...)`, enabling declarative, type-safe
  merging of scan data.
    * Each operator validates invariants (e.g. identical provenance,
      identical scanner / environment / config) and throws an
      `IllegalArgumentException` on mismatch.
    * For compatible objects, collections are union-merged; time ranges
      are widened (`startTime = min`, `endTime = max`).

* `ScannerRun.plus()`
    * Merges file lists and scan results by `provenance to scanner`
      using the new operators.
    * Combines issues / scanners maps by key union.
    * Carries forward earliest start and latest end timestamps.

* **Unit tests**
    * Extensive coverage in `ScannerRunTest` for all merge scenarios,
      including negative cases (different config / environment) and
      positive cases for times, provenances, issues, scanners, file
      lists and scan results.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
@maennchen maennchen force-pushed the jm/scan_deduplication branch from 56c12eb to 671db0d Compare July 11, 2025 11:28
maennchen added a commit to maennchen/ort that referenced this pull request Jul 11, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

* `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now
  implement `operator fun plus(...)`, enabling declarative, type-safe
  merging of scan data.
    * Each operator validates invariants (e.g. identical provenance,
      identical scanner / environment / config) and throws an
      `IllegalArgumentException` on mismatch.
    * For compatible objects, collections are union-merged; time ranges
      are widened (`startTime = min`, `endTime = max`).

* `ScannerRun.plus()`
    * Merges file lists and scan results by `provenance to scanner`
      using the new operators.
    * Combines issues / scanners maps by key union.
    * Carries forward earliest start and latest end timestamps.

* **Unit tests**
    * Extensive coverage in `ScannerRunTest` for all merge scenarios,
      including negative cases (different config / environment) and
      positive cases for times, provenances, issues, scanners, file
      lists and scan results.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
fviernau pushed a commit that referenced this pull request Jul 11, 2025
Prep work for upcoming PR #10502.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
maennchen added a commit to maennchen/ort that referenced this pull request Jul 11, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

* `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now
  implement `operator fun plus(...)`, enabling declarative, type-safe
  merging of scan data.
    * Each operator validates invariants (e.g. identical provenance,
      identical scanner / environment / config) and throws an
      `IllegalArgumentException` on mismatch.
    * For compatible objects, collections are union-merged; time ranges
      are widened (`startTime = min`, `endTime = max`).

* `ScannerRun.plus()`
    * Merges file lists and scan results by `provenance to scanner`
      using the new operators.
    * Combines issues / scanners maps by key union.
    * Carries forward earliest start and latest end timestamps.

* **Unit tests**
    * Extensive coverage in `ScannerRunTest` for all merge scenarios,
      including negative cases (different config / environment) and
      positive cases for times, provenances, issues, scanners, file
      lists and scan results.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
@maennchen maennchen force-pushed the jm/scan_deduplication branch from 671db0d to 6a583fb Compare July 11, 2025 13:14
maennchen added a commit to maennchen/ort that referenced this pull request Jul 29, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

* `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now
  implement `operator fun plus(...)`, enabling declarative, type-safe
  merging of scan data.
    * Each operator validates invariants (e.g. identical provenance,
      identical scanner / environment / config) and throws an
      `IllegalArgumentException` on mismatch.
    * For compatible objects, collections are union-merged; time ranges
      are widened (`startTime = min`, `endTime = max`).

* `ScannerRun.plus()`
    * Merges file lists and scan results by `provenance to scanner`
      using the new operators.
    * Combines issues / scanners maps by key union.
    * Carries forward earliest start and latest end timestamps.

* **Unit tests**
    * Extensive coverage in `ScannerRunTest` for all merge scenarios,
      including negative cases (different config / environment) and
      positive cases for times, provenances, issues, scanners, file
      lists and scan results.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
maennchen added a commit to maennchen/ort that referenced this pull request Jul 29, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

* `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now
  implement `operator fun plus(...)`, enabling declarative, type-safe
  merging of scan data.
    * Each operator validates invariants (e.g. identical provenance,
      identical scanner / environment / config) and throws an
      `IllegalArgumentException` on mismatch.
    * For compatible objects, collections are union-merged; time ranges
      are widened (`startTime = min`, `endTime = max`).

* `ScannerRun.plus()`
    * Merges file lists and scan results by `provenance to scanner`
      using the new operators.
    * Combines issues / scanners maps by key union.
    * Carries forward earliest start and latest end timestamps.

* **Unit tests**
    * Extensive coverage in `ScannerRunTest` for all merge scenarios,
      including negative cases (different config / environment) and
      positive cases for times, provenances, issues, scanners, file
      lists and scan results.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
maennchen added a commit to maennchen/ort that referenced this pull request Jul 29, 2025
Prep work for upcoming PR oss-review-toolkit#10502.

* `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now
  implement `operator fun plus(...)`, enabling declarative, type-safe
  merging of scan data.
    * Each operator validates invariants (e.g. identical provenance,
      identical scanner / environment / config) and throws an
      `IllegalArgumentException` on mismatch.
    * For compatible objects, collections are union-merged; time ranges
      are widened (`startTime = min`, `endTime = max`).

* `ScannerRun.plus()`
    * Merges file lists and scan results by `provenance to scanner`
      using the new operators.
    * Combines issues / scanners maps by key union.
    * Carries forward earliest start and latest end timestamps.

* **Unit tests**
    * Extensive coverage in `ScannerRunTest` for all merge scenarios,
      including negative cases (different config / environment) and
      positive cases for times, provenances, issues, scanners, file
      lists and scan results.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
@maennchen maennchen force-pushed the jm/scan_deduplication branch from 6a583fb to f6a6d27 Compare July 29, 2025 11:56
sschuberth pushed a commit that referenced this pull request Jul 29, 2025
Prep work for upcoming PR #10502.

* `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now
  implement `operator fun plus(...)`, enabling declarative, type-safe
  merging of scan data.
    * Each operator validates invariants (e.g. identical provenance,
      identical scanner / environment / config) and throws an
      `IllegalArgumentException` on mismatch.
    * For compatible objects, collections are union-merged; time ranges
      are widened (`startTime = min`, `endTime = max`).

* `ScannerRun.plus()`
    * Merges file lists and scan results by `provenance to scanner`
      using the new operators.
    * Combines issues / scanners maps by key union.
    * Carries forward earliest start and latest end timestamps.

* **Unit tests**
    * Extensive coverage in `ScannerRunTest` for all merge scenarios,
      including negative cases (different config / environment) and
      positive cases for times, provenances, issues, scanners, file
      lists and scan results.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
@maennchen maennchen force-pushed the jm/scan_deduplication branch from f6a6d27 to d780617 Compare July 29, 2025 12:11
When the SpdxDocumentFile package manager is used, the *project* and all
contained *packages* often resolve to the **same VCS provenance** (e.g. the
root of the Git repository).
Before this change ORT stored two separate `ScanResult`s for such a
provenance – one keyed to the project, one keyed to the package.

That caused two follow-on problems:

* Both results appeared in the `OrtResult`, so evaluators saw **duplicate
  findings** for the *same* source tree.
* Because projects and packages are handled by different rules the package
  result was additionally **padded with a `SpdxConstants.NONE` finding**
  whenever `includeFilesWithoutFindings` was enabled.
  The evaluator therefore compared *real* license findings from the project
  result with `NONE` from the package result and failed with a violation.

This patch

* merges `ScannerRun`, and
* performs the "pad with NONE" step only **after** deduplication, so every
  path is represented exactly once.

As a consequence the evaluator now receives one consistent set of license
findings per provenance / scanner, eliminating the false mismatch.

Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
@maennchen maennchen force-pushed the jm/scan_deduplication branch from d780617 to d536a3e Compare July 29, 2025 14:31
@sschuberth sschuberth dismissed fviernau’s stale review July 29, 2025 14:47

Comments addressed.

@sschuberth sschuberth enabled auto-merge (rebase) July 29, 2025 14:47
@sschuberth sschuberth merged commit 927e47c into oss-review-toolkit:main Jul 29, 2025
25 checks passed
@maennchen maennchen deleted the jm/scan_deduplication branch July 29, 2025 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants