-
Notifications
You must be signed in to change notification settings - Fork 347
feat(scanner): Merge duplicate scan results that share a provenance #10502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(scanner): Merge duplicate scan results that share a provenance #10502
Conversation
038c082
to
40190e3
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #10502 +/- ##
============================================
+ Coverage 57.28% 57.30% +0.01%
- Complexity 1644 1648 +4
============================================
Files 341 341
Lines 12722 12722
Branches 1206 1206
============================================
+ Hits 7288 7290 +2
+ Misses 4971 4969 -2
Partials 463 463
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
40190e3
to
ed1df5f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @maennchen,
thanks for your contribution!
Do you mind adding a test that demonstrates the issue with the duplicated scan results and that your changes fix these issues?
Edit: As there is currently no existing test for the scan function that use an ORT result, I won't insist on this change.
Edit 2: I Found these existing tests which already test the scan()
function:
ort/scanner/src/funTest/kotlin/scanners/ScannerIntegrationFunTest.kt
Lines 63 to 102 in c086b3b
"Scanning all packages corresponding to a single VCS" should { | |
val analyzerResult = createAnalyzerResult(pkg0, pkg1, pkg2, pkg3, pkg4) | |
val ortResult = createScanner().scan(analyzerResult, skipExcluded = false, emptyMap()) | |
"return the expected ORT result" { | |
val expectedResult = readResource("/scanner-integration-all-pkgs-expected-ort-result.yml") | |
patchActualResult(ortResult.toYaml(), patchStartAndEndTime = true) shouldBe | |
patchExpectedResult(expectedResult) | |
} | |
"return the expected (merged) scan results" { | |
val expectedResult = readResource("/scanner-integration-expected-scan-results.yml") | |
val scanResults = ortResult.getScanResults().toSortedMap() | |
patchActualResult(scanResults.toYaml(), patchStartAndEndTime = true) shouldBe | |
patchExpectedResult(expectedResult) | |
} | |
"return the expected (merged) file lists" { | |
val expectedResult = readResource("/scanner-integration-expected-file-lists.yml") | |
val fileLists = ortResult.getFileLists().toSortedMap() | |
fileLists.toYaml() shouldBe patchExpectedResult(expectedResult) | |
} | |
} | |
"Scanning a subset of the packages corresponding to a single VCS" should { | |
"return the expected ORT result" { | |
val analyzerResult = createAnalyzerResult(pkg1, pkg3) | |
val expectedResult = readResource("/scanner-integration-subset-pkgs-expected-ort-result.yml") | |
val ortResult = createScanner().scan(analyzerResult, skipExcluded = false, emptyMap()) | |
patchActualResult(ortResult.toYaml(), patchStartAndEndTime = true) shouldBe | |
patchExpectedResult(expectedResult) | |
} | |
} |
Can you have a look at them and extend them to test your use-case?
@MarcelBochtler I added a test to test the deduplication. Running the same test on
$ ./gradlew scanner:funTest --tests "org.ossreviewtoolkit.scanner.scanners.ScannerIntegrationFunTest"
|
4432de1
to
263be95
Compare
263be95
to
276bc2f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the test. This really helps to understand the necessity of this change.
The change looks good to me. Any other opinions @oss-review-toolkit/kotlin-devs ?
scanner/src/funTest/kotlin/scanners/ScannerIntegrationFunTest.kt
Outdated
Show resolved
Hide resolved
scanner/src/funTest/resources/scanner-integration-all-pkgs-expected-ort-result.yml
Outdated
Show resolved
Hide resolved
scanner/src/funTest/kotlin/scanners/ScannerIntegrationFunTest.kt
Outdated
Show resolved
Hide resolved
276bc2f
to
dfb5f87
Compare
The code can be seen in action here: https://github.com/elixir-lang/elixir/actions/runs/16086802339/job/45398947306 |
scanner/src/funTest/kotlin/scanners/ScannerIntegrationFunTest.kt
Outdated
Show resolved
Hide resolved
scanner/src/funTest/kotlin/scanners/ScannerIntegrationFunTest.kt
Outdated
Show resolved
Hide resolved
Prep work for PR oss-review-toolkit#10502. * Add ScanResult.padNoneLicenseFindings() Inserts one synthetic LicenseFinding(SpdxConstants.NONE, UNKNOWN_LINE) for every path that lacks a license finding. * Add ScannerRun.padNoneLicenseFindings() Calls the above helper for each contained ScanResult, using the run’s FileList data to obtain the full set of paths. * Replace the in-place code in Scanner with a single call to ScannerRun.padNoneLicenseFindings() when includeFilesWithoutFindings is true. * Introduce ScannerRunTest.padNoneLicenseFindings() Verifies that a path without findings receives exactly one NONE entry. This refactor keeps behaviour unchanged while isolating the padding logic in the domain model, improving reuse and testability and keeping later PRs focused. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
Prep work for upcoming PR oss-review-toolkit#10502. * Add ScanResult.padNoneLicenseFindings(paths) Creates a copy whose summary contains a single LicenseFinding(SpdxConstants.NONE, UNKNOWN_LINE) for every path in *paths* that currently lacks a licence finding. * Add ScannerRun.padNoneLicenseFindings() Uses the run’s FileList map to obtain all paths per provenance and delegates to the ScanResult helper for every contained result. * Remove the ad-hoc padding code from Scanner.kt and replace it with a single call to scannerRun.padNoneLicenseFindings() when includeFilesWithoutFindings is enabled. * Introduce ScannerRunTest.padNoneLicenseFindings() Ensures a file without licence findings receives exactly one NONE entry. This refactor keeps behaviour unchanged while centralising the padding logic in the model layer, improving reuse and testability. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
Prep work for upcoming PR oss-review-toolkit#10502. * Add ScanResult.padNoneLicenseFindings(paths) Creates a copy whose summary contains a single LicenseFinding(SpdxConstants.NONE, UNKNOWN_LINE) for every path in *paths* that currently lacks a licence finding. * Add ScannerRun.padNoneLicenseFindings() Uses the run’s FileList map to obtain all paths per provenance and delegates to the ScanResult helper for every contained result. * Remove the ad-hoc padding code from Scanner.kt and replace it with a single call to scannerRun.padNoneLicenseFindings() when includeFilesWithoutFindings is enabled. * Introduce ScannerRunTest.padNoneLicenseFindings() Ensures a file without licence findings receives exactly one NONE entry. This refactor keeps behaviour unchanged while centralising the padding logic in the model layer, improving reuse and testability. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
…ovenance Prep work for upcoming PR oss-review-toolkit#10502. * `ScannerRun.init` - Add a `require` check that no two `ScanResult`s share the same provenance. - Add a similar check that no two `FileList`s share the same provenance. The offending provenance is rendered via `toYaml()` to aid debugging. * `ScannerRunTest` - Introduce two unit tests that assert an `IllegalArgumentException` is thrown when duplicates are present for scan results or file lists, respectively. These runtime checks make improper scan runs surface early and provide clear error messages, which will simplify the upcoming deduplication work. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
…ovenance Prep work for upcoming PR oss-review-toolkit#10502. * `ScannerRun.init` - Add a `require` check that no two `ScanResult`s share the same provenance and scanner. - Add a similar check that no two `FileList`s share the same provenance. The offending provenance is rendered via `toYaml()` to aid debugging. * `ScannerRunTest` - Introduce two unit tests that assert an `IllegalArgumentException` is thrown when duplicates are present for scan results or file lists, respectively. These runtime checks make improper scan runs surface early and provide clear error messages, which will simplify the upcoming deduplication work. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
Prep work for upcoming PR oss-review-toolkit#10502. * Add ScanResult.padNoneLicenseFindings(paths) Creates a copy whose summary contains a single LicenseFinding(SpdxConstants.NONE, UNKNOWN_LINE) for every path in *paths* that currently lacks a licence finding. * Add ScannerRun.padNoneLicenseFindings() Uses the run’s FileList map to obtain all paths per provenance and delegates to the ScanResult helper for every contained result. * Remove the ad-hoc padding code from Scanner.kt and replace it with a single call to scannerRun.padNoneLicenseFindings() when includeFilesWithoutFindings is enabled. * Introduce ScannerRunTest.padNoneLicenseFindings() Ensures a file without licence findings receives exactly one NONE entry. This refactor keeps behaviour unchanged while centralising the padding logic in the model layer, improving reuse and testability. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
Prep work for upcoming PR oss-review-toolkit#10502. * Add ScanResult.padNoneLicenseFindings(paths) Creates a copy whose summary contains a single LicenseFinding(SpdxConstants.NONE, UNKNOWN_LINE) for every path in *paths* that currently lacks a licence finding. * Add ScannerRun.padNoneLicenseFindings() Uses the run’s FileList map to obtain all paths per provenance and delegates to the ScanResult helper for every contained result. * Remove the ad-hoc padding code from Scanner.kt and replace it with a single call to scannerRun.padNoneLicenseFindings() when includeFilesWithoutFindings is enabled. * Introduce ScannerRunTest.padNoneLicenseFindings() Ensures a file without licence findings receives exactly one NONE entry. This refactor keeps behaviour unchanged while centralising the padding logic in the model layer, improving reuse and testability. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
Prep work for upcoming PR oss-review-toolkit#10502. * `ScannerRun.init` - Add a `require` check that no two `ScanResult`s share the same provenance and scanner. - Add a similar check that no two `FileList`s share the same provenance. The offending provenance is rendered via `toYaml()` to aid debugging. * `ScannerRunTest` - Introduce two unit tests that assert an `IllegalArgumentException` is thrown when duplicates are present for scan results or file lists, respectively. These runtime checks make improper scan runs surface early and provide clear error messages, which will simplify the upcoming deduplication work. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
Prep work for upcoming PR oss-review-toolkit#10502. * `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now implement `operator fun plus(...)`, enabling declarative, type-safe merging of scan data. * Each operator validates invariants (e.g. identical provenance, identical scanner / environment / config) and throws an `IllegalArgumentException` on mismatch. * For compatible objects, collections are union-merged; time ranges are widened (`startTime = min`, `endTime = max`). * `ScannerRun.plus()` * Merges file lists and scan results by `provenance to scanner` using the new operators. * Combines issues / scanners maps by key union. * Carries forward earliest start and latest end timestamps. * **Unit tests** * Extensive coverage in `ScannerRunTest` for all merge scenarios, including negative cases (different config / environment) and positive cases for times, provenances, issues, scanners, file lists and scan results. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
ed52821
to
56c12eb
Compare
Prep work for upcoming PR oss-review-toolkit#10502. * Add ScanResult.padNoneLicenseFindings(paths) Creates a copy whose summary contains a single LicenseFinding(SpdxConstants.NONE, UNKNOWN_LINE) for every path in *paths* that currently lacks a licence finding. * Add ScannerRun.padNoneLicenseFindings() Uses the run’s FileList map to obtain all paths per provenance and delegates to the ScanResult helper for every contained result. * Remove the ad-hoc padding code from Scanner.kt and replace it with a single call to scannerRun.padNoneLicenseFindings() when includeFilesWithoutFindings is enabled. * Introduce ScannerRunTest.padNoneLicenseFindings() Ensures a file without licence findings receives exactly one NONE entry. This refactor keeps behaviour unchanged while centralising the padding logic in the model layer, improving reuse and testability. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
Prep work for upcoming PR oss-review-toolkit#10502. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
Prep work for upcoming PR oss-review-toolkit#10502. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
Prep work for upcoming PR oss-review-toolkit#10502. * `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now implement `operator fun plus(...)`, enabling declarative, type-safe merging of scan data. * Each operator validates invariants (e.g. identical provenance, identical scanner / environment / config) and throws an `IllegalArgumentException` on mismatch. * For compatible objects, collections are union-merged; time ranges are widened (`startTime = min`, `endTime = max`). * `ScannerRun.plus()` * Merges file lists and scan results by `provenance to scanner` using the new operators. * Combines issues / scanners maps by key union. * Carries forward earliest start and latest end timestamps. * **Unit tests** * Extensive coverage in `ScannerRunTest` for all merge scenarios, including negative cases (different config / environment) and positive cases for times, provenances, issues, scanners, file lists and scan results. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
56c12eb
to
671db0d
Compare
Prep work for upcoming PR oss-review-toolkit#10502. * `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now implement `operator fun plus(...)`, enabling declarative, type-safe merging of scan data. * Each operator validates invariants (e.g. identical provenance, identical scanner / environment / config) and throws an `IllegalArgumentException` on mismatch. * For compatible objects, collections are union-merged; time ranges are widened (`startTime = min`, `endTime = max`). * `ScannerRun.plus()` * Merges file lists and scan results by `provenance to scanner` using the new operators. * Combines issues / scanners maps by key union. * Carries forward earliest start and latest end timestamps. * **Unit tests** * Extensive coverage in `ScannerRunTest` for all merge scenarios, including negative cases (different config / environment) and positive cases for times, provenances, issues, scanners, file lists and scan results. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
Prep work for upcoming PR #10502. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
Prep work for upcoming PR oss-review-toolkit#10502. * `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now implement `operator fun plus(...)`, enabling declarative, type-safe merging of scan data. * Each operator validates invariants (e.g. identical provenance, identical scanner / environment / config) and throws an `IllegalArgumentException` on mismatch. * For compatible objects, collections are union-merged; time ranges are widened (`startTime = min`, `endTime = max`). * `ScannerRun.plus()` * Merges file lists and scan results by `provenance to scanner` using the new operators. * Combines issues / scanners maps by key union. * Carries forward earliest start and latest end timestamps. * **Unit tests** * Extensive coverage in `ScannerRunTest` for all merge scenarios, including negative cases (different config / environment) and positive cases for times, provenances, issues, scanners, file lists and scan results. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
671db0d
to
6a583fb
Compare
Prep work for upcoming PR oss-review-toolkit#10502. * `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now implement `operator fun plus(...)`, enabling declarative, type-safe merging of scan data. * Each operator validates invariants (e.g. identical provenance, identical scanner / environment / config) and throws an `IllegalArgumentException` on mismatch. * For compatible objects, collections are union-merged; time ranges are widened (`startTime = min`, `endTime = max`). * `ScannerRun.plus()` * Merges file lists and scan results by `provenance to scanner` using the new operators. * Combines issues / scanners maps by key union. * Carries forward earliest start and latest end timestamps. * **Unit tests** * Extensive coverage in `ScannerRunTest` for all merge scenarios, including negative cases (different config / environment) and positive cases for times, provenances, issues, scanners, file lists and scan results. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
Prep work for upcoming PR oss-review-toolkit#10502. * `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now implement `operator fun plus(...)`, enabling declarative, type-safe merging of scan data. * Each operator validates invariants (e.g. identical provenance, identical scanner / environment / config) and throws an `IllegalArgumentException` on mismatch. * For compatible objects, collections are union-merged; time ranges are widened (`startTime = min`, `endTime = max`). * `ScannerRun.plus()` * Merges file lists and scan results by `provenance to scanner` using the new operators. * Combines issues / scanners maps by key union. * Carries forward earliest start and latest end timestamps. * **Unit tests** * Extensive coverage in `ScannerRunTest` for all merge scenarios, including negative cases (different config / environment) and positive cases for times, provenances, issues, scanners, file lists and scan results. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
Prep work for upcoming PR oss-review-toolkit#10502. * `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now implement `operator fun plus(...)`, enabling declarative, type-safe merging of scan data. * Each operator validates invariants (e.g. identical provenance, identical scanner / environment / config) and throws an `IllegalArgumentException` on mismatch. * For compatible objects, collections are union-merged; time ranges are widened (`startTime = min`, `endTime = max`). * `ScannerRun.plus()` * Merges file lists and scan results by `provenance to scanner` using the new operators. * Combines issues / scanners maps by key union. * Carries forward earliest start and latest end timestamps. * **Unit tests** * Extensive coverage in `ScannerRunTest` for all merge scenarios, including negative cases (different config / environment) and positive cases for times, provenances, issues, scanners, file lists and scan results. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
6a583fb
to
f6a6d27
Compare
Prep work for upcoming PR #10502. * `FileList`, `ScanSummary`, `ScanResult` and `ScannerRun` now implement `operator fun plus(...)`, enabling declarative, type-safe merging of scan data. * Each operator validates invariants (e.g. identical provenance, identical scanner / environment / config) and throws an `IllegalArgumentException` on mismatch. * For compatible objects, collections are union-merged; time ranges are widened (`startTime = min`, `endTime = max`). * `ScannerRun.plus()` * Merges file lists and scan results by `provenance to scanner` using the new operators. * Combines issues / scanners maps by key union. * Carries forward earliest start and latest end timestamps. * **Unit tests** * Extensive coverage in `ScannerRunTest` for all merge scenarios, including negative cases (different config / environment) and positive cases for times, provenances, issues, scanners, file lists and scan results. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
f6a6d27
to
d780617
Compare
scanner/src/funTest/kotlin/scanners/ScannerIntegrationFunTest.kt
Outdated
Show resolved
Hide resolved
When the SpdxDocumentFile package manager is used, the *project* and all contained *packages* often resolve to the **same VCS provenance** (e.g. the root of the Git repository). Before this change ORT stored two separate `ScanResult`s for such a provenance – one keyed to the project, one keyed to the package. That caused two follow-on problems: * Both results appeared in the `OrtResult`, so evaluators saw **duplicate findings** for the *same* source tree. * Because projects and packages are handled by different rules the package result was additionally **padded with a `SpdxConstants.NONE` finding** whenever `includeFilesWithoutFindings` was enabled. The evaluator therefore compared *real* license findings from the project result with `NONE` from the package result and failed with a violation. This patch * merges `ScannerRun`, and * performs the "pad with NONE" step only **after** deduplication, so every path is represented exactly once. As a consequence the evaluator now receives one consistent set of license findings per provenance / scanner, eliminating the false mismatch. Signed-off-by: Jonatan Männchen <jonatan@maennchen.ch>
d780617
to
d536a3e
Compare
TODO
+
merge operators for scan-domain objects #10585 firstWhen the SpdxDocumentFile package manager is used, the project and all contained packages often resolve to the same VCS provenance (e.g. the root of the Git repository).
Before this change ORT stored two separate
ScanResult
s for such a provenance – one keyed to the project, one keyed to the package.That caused two follow-on problems:
OrtResult
, so evaluators saw duplicate findings for the same source tree.SpdxConstants.NONE
finding wheneverincludeFilesWithoutFindings
was enabled. The evaluator therefore compared real license findings from the project result withNONE
from the package result and failed with a violation.This patch
ScannerRun
, andAs a consequence the evaluator now receives one consistent set of license findings per provenance / scanner, eliminating the false mismatch.
This is the first time for me writing Kotlin. Sorry if the code is not up to the usual standards.