Skip to content

Fix inconsistent list parsing for multiline financial rows#484

Open
pandego wants to merge 6 commits into
opendataloader-project:mainfrom
pandego:fix/463-financial-list-parsing
Open

Fix inconsistent list parsing for multiline financial rows#484
pandego wants to merge 6 commits into
opendataloader-project:mainfrom
pandego:fix/463-financial-list-parsing

Conversation

@pandego
Copy link
Copy Markdown

@pandego pandego commented May 4, 2026

Summary

  • detect lists when multiple labeled financial rows were merged into a single semantic text node
  • preserve continuation lines under the current list item when expanding those multiline nodes
  • add focused regression tests for single-node multiline list parsing

Validation

  • mvn -f java/pom.xml -pl opendataloader-pdf-core -Dtest=ListProcessorTest test
  • ./scripts/bench.sh (currently fails before benchmark evaluation because the repo has pre-existing Javadoc errors from unescaped > in XYCutPlusPlusSorter.java and TriageProcessor.java)

Fixes #463

Summary by CodeRabbit

  • New Features

    • Improved PDF list extraction: multi-line paragraphs with labeled list-like lines are split into distinct list items when appropriate; continuation lines are preserved inside their list items; single labeled lines are no longer misdetected as lists; adjacent expanded lists correctly merge and consumed source content is not restored.
  • Tests

    • Added tests covering multi-line expansion, preservation of continuation lines, prevention of false expansions (including decimal-like labels), and correct merging behavior.

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

ListProcessor now pre-splits eligible multi-line SemanticTextNodes into SemanticParagraph segments when they resemble labeled multi-line lists, runs existing list-detection on the expanded content, then restores the original SemanticTextNode only for expanded ranges that were not consumed during list construction. Five unit tests were added.

Changes

Multi-Line List Node Expansion

Layer / File(s) Summary
Preprocess callsite
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
processListsFromTextNodes(...) now calls expandMultiLineListTextNodes(...) before list detection and restoreExpandedTextNodes(...) after list construction; returns DocumentProcessor.removeNullObjectsFromList(contents).
Expansion iterator
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
Added expandMultiLineListTextNodes() that walks contents, replaces qualifying SemanticTextNodes with segmented SemanticParagraphs, and records expanded index ranges as candidates.
Splitter / segmentation
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
Added splitMultiLineListTextNode() which: collects first-column non-blank lines, requires ≥2 non-blank lines and ≥2 labeled lines, rejects expansion when labeled lines are all “doubles” (^\d+\.\d+$), and produces SemanticParagraph segments starting at labeled lines, grouping following unlabeled continuation lines.
Postprocess / restore
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
Added restoreExpandedTextNodes() plus helper types (ExpandedTextNodesResult, ExpandedTextNodeCandidate) to revert an expanded range back to its original SemanticTextNode only if none of the candidate indices were consumed (i.e., replaced by null or a PDFList).
Helper predicate
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
Added isDoubles(List<TextLine>) to detect and reject expansion when all labeled lines match ^\d+\.\d+$.
Existing list logic (unchanged flow)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
The existing list-interval detection and list construction run unchanged against the (possibly) expanded contents.
Tests
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java
Added five tests: testProcessListsFromSingleMultilineTextNode, testProcessListsFromSingleMultilineTextNodeKeepsContinuationLines, testProcessListsFromSingleLabeledLineIsNotExpanded, testProcessListsFromAdjacentExpandedNodesDoesNotRestoreConsumedNode, and testProcessListsFromDoublesNodeIsNotExpanded.

Sequence Diagram(s)

(Skipped)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • opendataloader-project/opendataloader-pdf#369: Also modifies ListProcessor.java and affects list-detection behavior (this PR adds multiline expansion/restoration while #369 adjusts backward-scan limits in list item processing).

Suggested reviewers

  • MaximPlusov
  • LonelyMidoriya
  • hnc-jglee
  • hyunhee-jo
  • bundolee
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Linked Issues check ❓ Inconclusive The PR partially addresses issue #463 by handling multiline list expansion, but does not demonstrate all required objectives including consistent cross-PDF detection, hierarchical nesting preservation, or layout heuristic improvements. Clarify how the multiline expansion ensures consistent detection across similar PDFs, verify hierarchical list reconstruction works end-to-end, and confirm regression tests cover both provided PDF examples.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main code change: expanding multiline financial rows into separate list items for consistent parsing.
Out of Scope Changes check ✅ Passed All changes focus on the list processing logic and its tests, directly supporting the multiline list expansion objective without introducing unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 4, 2026

CLA assistant check
All committers have signed the CLA.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`:
- Around line 442-456: buildListFromMultiLineTextNode creates PDFList/ListItem
objects but never runs the numbering/interval pipeline, so you must populate the
same structural metadata calculateList produces: after building the PDFList,
obtain or create a TextListInterval for that generated list (reuse
ListLabelsUtils.getListItemsIntervals on the original lines or derive a
TextListInterval from the expansion result), then call
PDFList.setNumberingStyle(...) and PDFList.setCommonPrefix(...) with values from
that interval and set each ListItem.setLabelLength(...) using the
ListItemInfo/createListItemTextInfo-style label extraction on the first TextLine
of each item; finally run/processTextNodeListItemContent (so
DocumentProcessor.setIDs is invoked) on the substituted contents to ensure IDs
and other downstream metadata are assigned.
- Around line 423-429: buildListFromMultiLineTextNode currently collects
non-blank lines from all columns causing incorrect visual order and list
grouping; change it to mirror calculateList by iterating only the first column:
use textNode.getFirstColumn() (or getFirstColumn().getLines()) instead of
textNode.getColumns(), collect non-blank TextLine via getValue().isBlank(), and
preserve existing nonSpaceLines usage; add a null/empty check for
getFirstColumn() to avoid NPEs.

In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java`:
- Around line 70-110: Add a negative-unit test ensuring a SemanticTextNode with
only one labeled line is not converted to a PDFList: create a test (e.g.,
testProcessListsFromSingleLabeledLineIsNotExpanded) that sets StaticContainers
as in existing tests, builds a SemanticParagraph with one labeled TextLine and
an indented continuation TextLine, calls
ListProcessor.processListsFromTextNodes, and asserts no PDFList is produced
(e.g., assertFalse(contents.stream().anyMatch(o -> o instanceof PDFList))); this
guards the labeledLineCount < 2 check in buildListFromMultiLineTextNode and
references ListProcessor.processListsFromTextNodes and
buildListFromMultiLineTextNode to locate the logic to protect.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e8f15e69-e568-4055-b2d9-fd17335c9d35

📥 Commits

Reviewing files that changed from the base of the PR and between a0c5e66 and b35e66e.

📒 Files selected for processing (2)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
  • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`:
- Around line 407-424: expandMultiLineListTextNodes is creating irreversible
SemanticParagraph fragments because splitMultiLineListTextNode can produce
fragments even for cases that will later be rejected (e.g., numeric "doubles"
labels), causing orphaned paragraphs to escape list-detection; fix by adding a
pre-check inside splitMultiLineListTextNode that detects the known failure case
(labels matching ^\d+\.\d+$ or other conditions that make isCorrectList return
false) and immediately return null before allocating fragments so
expandMultiLineListTextNodes leaves the original SemanticTextNode intact;
additionally consider the broader fix: in processListsFromTextNodes track index
ranges produced by each split and after removeNullObjectsFromList restore the
original node for any range that yielded no PDFList, updating the mapping logic
accordingly.

In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java`:
- Around line 112-127: The test
testProcessListsFromSingleLabeledLineIsNotExpanded should not only assert no
PDFList was created but also verify the original node is preserved and
unchanged: after calling ListProcessor.processListsFromTextNodes(contents)
assert the returned list size equals the original size, assert the first element
is still a SemanticParagraph (not null) and assert it still contains the
expected TextLine children (e.g., two TextLine instances or that its child count
is unchanged); use the existing symbols ListProcessor.processListsFromTextNodes,
SemanticParagraph, TextLine and PDFList to locate and implement these additional
assertions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d7184af7-1332-4257-bff1-0540701358d9

📥 Commits

Reviewing files that changed from the base of the PR and between b35e66e and 5429daf.

📒 Files selected for processing (2)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
  • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`:
- Around line 472-489: The restoreExpandedTextNodes method can restore
candidates that were already consumed as non-leading items because calculateList
writes the shared PDFList at the first matched index and nulls out the later
matched indices; update restoreExpandedTextNodes (and its loop over contents for
each ExpandedTextNodeCandidate) to treat any null entry inside the candidate
range as evidence the candidate was consumed and skip restoration (in addition
to the existing PDFList check). Use the existing symbols:
restoreExpandedTextNodes, ExpandedTextNodeCandidate, contents, PDFList,
getStartIndex/getEndIndex, and getOriginalNode to locate and implement the
guard.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9724d720-07ec-4d21-bc90-b3ab4836d4eb

📥 Commits

Reviewing files that changed from the base of the PR and between 5429daf and d99c170.

📒 Files selected for processing (2)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
  • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`:
- Around line 492-499: The pre-check isDoubles(List<TextLine> lines) currently
denies any list containing continuation/non-blank lines because it checks every
non-blank line; change it to only inspect labeled lines so it matches the
downstream isDoubles(TextListInterval) semantics: filter the input
List<TextLine> to only those lines that are labeled (e.g., lines where
TextLine.getLabel() is non-null/non-empty or an equivalent property on
TextLine), if there are no labeled lines return false, and then verify every
labeled line's value matches the decimal regex "^\\d+\\.\\d+$"; update the
isDoubles(List<TextLine>) implementation accordingly so expansion→detect→restore
cycles are avoided.

In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java`:
- Around line 70-165: Add a unit test that covers the isDoubles pre-check in
splitMultiLineListTextNode by creating a SemanticParagraph with multiple
TextLine/TextChunk entries whose non-space lines all match the doubles pattern
(e.g., "1.5", "2.3"), call ListProcessor.processListsFromTextNodes, and assert
that no PDFList is produced and the original SemanticParagraph remains (size ==
1 and instanceOf SemanticParagraph). Ensure the test sets StaticContainers
(setIsIgnoreCharactersWithoutUnicode, setIsDataLoader, setAccumulatedNodeMapper)
like the other tests and name it e.g.
testProcessListsFromDoublesNodeIsNotExpanded so it guards the
isDoubles(nonSpaceLines) branch in splitMultiLineListTextNode.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8c55b0ec-455f-409d-a0dd-e9c92da3f14c

📥 Commits

Reviewing files that changed from the base of the PR and between d99c170 and e3b0ad6.

📒 Files selected for processing (2)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
  • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java`:
- Around line 136-185: Add a regression test that exercises the
expand-then-restore path by creating a SemanticParagraph with adjacent text
lines that cause splitMultiLineListTextNode() to expand the paragraph but later
be rejected by restoreExpandedTextNodes() (e.g., a numeric list label like "1.
Revenue" followed by a different label style like "A. Other Income"); call
ListProcessor.processListsFromTextNodes() and assert that the original
SemanticParagraph is restored (no PDFList is produced) and contents size equals
the original, referencing splitMultiLineListTextNode, restoreExpandedTextNodes,
ListProcessor.processListsFromTextNodes, SemanticParagraph and PDFList to locate
the relevant code paths.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 23866b66-7573-41c4-a429-bb7952362f04

📥 Commits

Reviewing files that changed from the base of the PR and between e3b0ad6 and 7d54431.

📒 Files selected for processing (2)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
  • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 90.58824% with 8 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...g/opendataloader/pdf/processors/ListProcessor.java 90.58% 3 Missing and 5 partials ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Inconsistent structural parsing (list vs paragraph) for similar financial tables

2 participants