Fix inconsistent list parsing for multiline financial rows#484
Conversation
Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughListProcessor now pre-splits eligible multi-line ChangesMulti-Line List Node Expansion
Sequence Diagram(s)(Skipped) Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`:
- Around line 442-456: buildListFromMultiLineTextNode creates PDFList/ListItem
objects but never runs the numbering/interval pipeline, so you must populate the
same structural metadata calculateList produces: after building the PDFList,
obtain or create a TextListInterval for that generated list (reuse
ListLabelsUtils.getListItemsIntervals on the original lines or derive a
TextListInterval from the expansion result), then call
PDFList.setNumberingStyle(...) and PDFList.setCommonPrefix(...) with values from
that interval and set each ListItem.setLabelLength(...) using the
ListItemInfo/createListItemTextInfo-style label extraction on the first TextLine
of each item; finally run/processTextNodeListItemContent (so
DocumentProcessor.setIDs is invoked) on the substituted contents to ensure IDs
and other downstream metadata are assigned.
- Around line 423-429: buildListFromMultiLineTextNode currently collects
non-blank lines from all columns causing incorrect visual order and list
grouping; change it to mirror calculateList by iterating only the first column:
use textNode.getFirstColumn() (or getFirstColumn().getLines()) instead of
textNode.getColumns(), collect non-blank TextLine via getValue().isBlank(), and
preserve existing nonSpaceLines usage; add a null/empty check for
getFirstColumn() to avoid NPEs.
In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java`:
- Around line 70-110: Add a negative-unit test ensuring a SemanticTextNode with
only one labeled line is not converted to a PDFList: create a test (e.g.,
testProcessListsFromSingleLabeledLineIsNotExpanded) that sets StaticContainers
as in existing tests, builds a SemanticParagraph with one labeled TextLine and
an indented continuation TextLine, calls
ListProcessor.processListsFromTextNodes, and asserts no PDFList is produced
(e.g., assertFalse(contents.stream().anyMatch(o -> o instanceof PDFList))); this
guards the labeledLineCount < 2 check in buildListFromMultiLineTextNode and
references ListProcessor.processListsFromTextNodes and
buildListFromMultiLineTextNode to locate the logic to protect.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e8f15e69-e568-4055-b2d9-fd17335c9d35
📒 Files selected for processing (2)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.javajava/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java
Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`:
- Around line 407-424: expandMultiLineListTextNodes is creating irreversible
SemanticParagraph fragments because splitMultiLineListTextNode can produce
fragments even for cases that will later be rejected (e.g., numeric "doubles"
labels), causing orphaned paragraphs to escape list-detection; fix by adding a
pre-check inside splitMultiLineListTextNode that detects the known failure case
(labels matching ^\d+\.\d+$ or other conditions that make isCorrectList return
false) and immediately return null before allocating fragments so
expandMultiLineListTextNodes leaves the original SemanticTextNode intact;
additionally consider the broader fix: in processListsFromTextNodes track index
ranges produced by each split and after removeNullObjectsFromList restore the
original node for any range that yielded no PDFList, updating the mapping logic
accordingly.
In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java`:
- Around line 112-127: The test
testProcessListsFromSingleLabeledLineIsNotExpanded should not only assert no
PDFList was created but also verify the original node is preserved and
unchanged: after calling ListProcessor.processListsFromTextNodes(contents)
assert the returned list size equals the original size, assert the first element
is still a SemanticParagraph (not null) and assert it still contains the
expected TextLine children (e.g., two TextLine instances or that its child count
is unchanged); use the existing symbols ListProcessor.processListsFromTextNodes,
SemanticParagraph, TextLine and PDFList to locate and implement these additional
assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d7184af7-1332-4257-bff1-0540701358d9
📒 Files selected for processing (2)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.javajava/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java
Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`:
- Around line 472-489: The restoreExpandedTextNodes method can restore
candidates that were already consumed as non-leading items because calculateList
writes the shared PDFList at the first matched index and nulls out the later
matched indices; update restoreExpandedTextNodes (and its loop over contents for
each ExpandedTextNodeCandidate) to treat any null entry inside the candidate
range as evidence the candidate was consumed and skip restoration (in addition
to the existing PDFList check). Use the existing symbols:
restoreExpandedTextNodes, ExpandedTextNodeCandidate, contents, PDFList,
getStartIndex/getEndIndex, and getOriginalNode to locate and implement the
guard.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9724d720-07ec-4d21-bc90-b3ab4836d4eb
📒 Files selected for processing (2)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.javajava/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java
Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`:
- Around line 492-499: The pre-check isDoubles(List<TextLine> lines) currently
denies any list containing continuation/non-blank lines because it checks every
non-blank line; change it to only inspect labeled lines so it matches the
downstream isDoubles(TextListInterval) semantics: filter the input
List<TextLine> to only those lines that are labeled (e.g., lines where
TextLine.getLabel() is non-null/non-empty or an equivalent property on
TextLine), if there are no labeled lines return false, and then verify every
labeled line's value matches the decimal regex "^\\d+\\.\\d+$"; update the
isDoubles(List<TextLine>) implementation accordingly so expansion→detect→restore
cycles are avoided.
In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java`:
- Around line 70-165: Add a unit test that covers the isDoubles pre-check in
splitMultiLineListTextNode by creating a SemanticParagraph with multiple
TextLine/TextChunk entries whose non-space lines all match the doubles pattern
(e.g., "1.5", "2.3"), call ListProcessor.processListsFromTextNodes, and assert
that no PDFList is produced and the original SemanticParagraph remains (size ==
1 and instanceOf SemanticParagraph). Ensure the test sets StaticContainers
(setIsIgnoreCharactersWithoutUnicode, setIsDataLoader, setAccumulatedNodeMapper)
like the other tests and name it e.g.
testProcessListsFromDoublesNodeIsNotExpanded so it guards the
isDoubles(nonSpaceLines) branch in splitMultiLineListTextNode.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8c55b0ec-455f-409d-a0dd-e9c92da3f14c
📒 Files selected for processing (2)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.javajava/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java
Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java`:
- Around line 136-185: Add a regression test that exercises the
expand-then-restore path by creating a SemanticParagraph with adjacent text
lines that cause splitMultiLineListTextNode() to expand the paragraph but later
be rejected by restoreExpandedTextNodes() (e.g., a numeric list label like "1.
Revenue" followed by a different label style like "A. Other Income"); call
ListProcessor.processListsFromTextNodes() and assert that the original
SemanticParagraph is restored (no PDFList is produced) and contents size equals
the original, referencing splitMultiLineListTextNode, restoreExpandedTextNodes,
ListProcessor.processListsFromTextNodes, SemanticParagraph and PDFList to locate
the relevant code paths.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 23866b66-7573-41c4-a429-bb7952362f04
📒 Files selected for processing (2)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.javajava/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java
Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Summary
Validation
mvn -f java/pom.xml -pl opendataloader-pdf-core -Dtest=ListProcessorTest test./scripts/bench.sh(currently fails before benchmark evaluation because the repo has pre-existing Javadoc errors from unescaped>inXYCutPlusPlusSorter.javaandTriageProcessor.java)Fixes #463
Summary by CodeRabbit
New Features
Tests