Fix inconsistent list parsing for multiline financial rows by pandego · Pull Request #484 · opendataloader-project/opendataloader-pdf

pandego · 2026-05-04T18:18:43Z

Summary

detect lists when multiple labeled financial rows were merged into a single semantic text node
preserve continuation lines under the current list item when expanding those multiline nodes
add focused regression tests for single-node multiline list parsing

Validation

mvn -f java/pom.xml -pl opendataloader-pdf-core -Dtest=ListProcessorTest test
./scripts/bench.sh (currently fails before benchmark evaluation because the repo has pre-existing Javadoc errors from unescaped > in XYCutPlusPlusSorter.java and TriageProcessor.java)

Fixes #463

Summary by CodeRabbit

New Features
- Improved PDF list extraction: multi-line paragraphs with labeled list-like lines are split into distinct list items when appropriate; continuation lines are preserved inside their list items; single labeled lines are no longer misdetected as lists; adjacent expanded lists correctly merge and consumed source content is not restored.
Tests
- Added tests covering multi-line expansion, preservation of continuation lines, prevention of false expansions (including decimal-like labels), and correct merging behavior.

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>

coderabbitai · 2026-05-04T18:19:10Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

ListProcessor now pre-splits eligible multi-line SemanticTextNodes into SemanticParagraph segments when they resemble labeled multi-line lists, runs existing list-detection on the expanded content, then restores the original SemanticTextNode only for expanded ranges that were not consumed during list construction. Five unit tests were added.

Changes

Multi-Line List Node Expansion

Layer / File(s)	Summary
Preprocess callsite `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`	`processListsFromTextNodes(...)` now calls `expandMultiLineListTextNodes(...)` before list detection and `restoreExpandedTextNodes(...)` after list construction; returns `DocumentProcessor.removeNullObjectsFromList(contents)`.
Expansion iterator `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`	Added `expandMultiLineListTextNodes()` that walks `contents`, replaces qualifying `SemanticTextNode`s with segmented `SemanticParagraph`s, and records expanded index ranges as candidates.
Splitter / segmentation `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`	Added `splitMultiLineListTextNode()` which: collects first-column non-blank lines, requires ≥2 non-blank lines and ≥2 labeled lines, rejects expansion when labeled lines are all “doubles” (`^\d+\.\d+$`), and produces `SemanticParagraph` segments starting at labeled lines, grouping following unlabeled continuation lines.
Postprocess / restore `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`	Added `restoreExpandedTextNodes()` plus helper types (`ExpandedTextNodesResult`, `ExpandedTextNodeCandidate`) to revert an expanded range back to its original `SemanticTextNode` only if none of the candidate indices were consumed (i.e., replaced by `null` or a `PDFList`).
Helper predicate `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`	Added `isDoubles(List<TextLine>)` to detect and reject expansion when all labeled lines match `^\d+\.\d+$`.
Existing list logic (unchanged flow) `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`	The existing list-interval detection and list construction run unchanged against the (possibly) expanded `contents`.
Tests `java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java`	Added five tests: `testProcessListsFromSingleMultilineTextNode`, `testProcessListsFromSingleMultilineTextNodeKeepsContinuationLines`, `testProcessListsFromSingleLabeledLineIsNotExpanded`, `testProcessListsFromAdjacentExpandedNodesDoesNotRestoreConsumedNode`, and `testProcessListsFromDoublesNodeIsNotExpanded`.

Sequence Diagram(s)

(Skipped)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

opendataloader-project/opendataloader-pdf#369: Also modifies ListProcessor.java and affects list-detection behavior (this PR adds multiline expansion/restoration while #369 adjusts backward-scan limits in list item processing).

Suggested reviewers

MaximPlusov
LonelyMidoriya
hnc-jglee
hyunhee-jo
bundolee

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Linked Issues check	❓ Inconclusive	The PR partially addresses issue `#463` by handling multiline list expansion, but does not demonstrate all required objectives including consistent cross-PDF detection, hierarchical nesting preservation, or layout heuristic improvements.	Clarify how the multiline expansion ensures consistent detection across similar PDFs, verify hierarchical list reconstruction works end-to-end, and confirm regression tests cover both provided PDF examples.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main code change: expanding multiline financial rows into separate list items for consistent parsing.
Out of Scope Changes check	✅ Passed	All changes focus on the list processing logic and its tests, directly supporting the multiline list expansion objective without introducing unrelated modifications.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

CLAassistant · 2026-05-04T18:19:20Z

All committers have signed the CLA.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`:
- Around line 442-456: buildListFromMultiLineTextNode creates PDFList/ListItem
objects but never runs the numbering/interval pipeline, so you must populate the
same structural metadata calculateList produces: after building the PDFList,
obtain or create a TextListInterval for that generated list (reuse
ListLabelsUtils.getListItemsIntervals on the original lines or derive a
TextListInterval from the expansion result), then call
PDFList.setNumberingStyle(...) and PDFList.setCommonPrefix(...) with values from
that interval and set each ListItem.setLabelLength(...) using the
ListItemInfo/createListItemTextInfo-style label extraction on the first TextLine
of each item; finally run/processTextNodeListItemContent (so
DocumentProcessor.setIDs is invoked) on the substituted contents to ensure IDs
and other downstream metadata are assigned.
- Around line 423-429: buildListFromMultiLineTextNode currently collects
non-blank lines from all columns causing incorrect visual order and list
grouping; change it to mirror calculateList by iterating only the first column:
use textNode.getFirstColumn() (or getFirstColumn().getLines()) instead of
textNode.getColumns(), collect non-blank TextLine via getValue().isBlank(), and
preserve existing nonSpaceLines usage; add a null/empty check for
getFirstColumn() to avoid NPEs.

In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java`:
- Around line 70-110: Add a negative-unit test ensuring a SemanticTextNode with
only one labeled line is not converted to a PDFList: create a test (e.g.,
testProcessListsFromSingleLabeledLineIsNotExpanded) that sets StaticContainers
as in existing tests, builds a SemanticParagraph with one labeled TextLine and
an indented continuation TextLine, calls
ListProcessor.processListsFromTextNodes, and asserts no PDFList is produced
(e.g., assertFalse(contents.stream().anyMatch(o -> o instanceof PDFList))); this
guards the labeledLineCount < 2 check in buildListFromMultiLineTextNode and
references ListProcessor.processListsFromTextNodes and
buildListFromMultiLineTextNode to locate the logic to protect.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e8f15e69-e568-4055-b2d9-fd17335c9d35

📥 Commits

Reviewing files that changed from the base of the PR and between a0c5e66 and b35e66e.

📒 Files selected for processing (2)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`:
- Around line 407-424: expandMultiLineListTextNodes is creating irreversible
SemanticParagraph fragments because splitMultiLineListTextNode can produce
fragments even for cases that will later be rejected (e.g., numeric "doubles"
labels), causing orphaned paragraphs to escape list-detection; fix by adding a
pre-check inside splitMultiLineListTextNode that detects the known failure case
(labels matching ^\d+\.\d+$ or other conditions that make isCorrectList return
false) and immediately return null before allocating fragments so
expandMultiLineListTextNodes leaves the original SemanticTextNode intact;
additionally consider the broader fix: in processListsFromTextNodes track index
ranges produced by each split and after removeNullObjectsFromList restore the
original node for any range that yielded no PDFList, updating the mapping logic
accordingly.

In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java`:
- Around line 112-127: The test
testProcessListsFromSingleLabeledLineIsNotExpanded should not only assert no
PDFList was created but also verify the original node is preserved and
unchanged: after calling ListProcessor.processListsFromTextNodes(contents)
assert the returned list size equals the original size, assert the first element
is still a SemanticParagraph (not null) and assert it still contains the
expected TextLine children (e.g., two TextLine instances or that its child count
is unchanged); use the existing symbols ListProcessor.processListsFromTextNodes,
SemanticParagraph, TextLine and PDFList to locate and implement these additional
assertions.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d7184af7-1332-4257-bff1-0540701358d9

📥 Commits

Reviewing files that changed from the base of the PR and between b35e66e and 5429daf.

📒 Files selected for processing (2)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`:
- Around line 472-489: The restoreExpandedTextNodes method can restore
candidates that were already consumed as non-leading items because calculateList
writes the shared PDFList at the first matched index and nulls out the later
matched indices; update restoreExpandedTextNodes (and its loop over contents for
each ExpandedTextNodeCandidate) to treat any null entry inside the candidate
range as evidence the candidate was consumed and skip restoration (in addition
to the existing PDFList check). Use the existing symbols:
restoreExpandedTextNodes, ExpandedTextNodeCandidate, contents, PDFList,
getStartIndex/getEndIndex, and getOriginalNode to locate and implement the
guard.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9724d720-07ec-4d21-bc90-b3ab4836d4eb

📥 Commits

Reviewing files that changed from the base of the PR and between 5429daf and d99c170.

📒 Files selected for processing (2)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java`:
- Around line 492-499: The pre-check isDoubles(List<TextLine> lines) currently
denies any list containing continuation/non-blank lines because it checks every
non-blank line; change it to only inspect labeled lines so it matches the
downstream isDoubles(TextListInterval) semantics: filter the input
List<TextLine> to only those lines that are labeled (e.g., lines where
TextLine.getLabel() is non-null/non-empty or an equivalent property on
TextLine), if there are no labeled lines return false, and then verify every
labeled line's value matches the decimal regex "^\\d+\\.\\d+$"; update the
isDoubles(List<TextLine>) implementation accordingly so expansion→detect→restore
cycles are avoided.

In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java`:
- Around line 70-165: Add a unit test that covers the isDoubles pre-check in
splitMultiLineListTextNode by creating a SemanticParagraph with multiple
TextLine/TextChunk entries whose non-space lines all match the doubles pattern
(e.g., "1.5", "2.3"), call ListProcessor.processListsFromTextNodes, and assert
that no PDFList is produced and the original SemanticParagraph remains (size ==
1 and instanceOf SemanticParagraph). Ensure the test sets StaticContainers
(setIsIgnoreCharactersWithoutUnicode, setIsDataLoader, setAccumulatedNodeMapper)
like the other tests and name it e.g.
testProcessListsFromDoublesNodeIsNotExpanded so it guards the
isDoubles(nonSpaceLines) branch in splitMultiLineListTextNode.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8c55b0ec-455f-409d-a0dd-e9c92da3f14c

📥 Commits

Reviewing files that changed from the base of the PR and between d99c170 and e3b0ad6.

📒 Files selected for processing (2)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java`:
- Around line 136-185: Add a regression test that exercises the
expand-then-restore path by creating a SemanticParagraph with adjacent text
lines that cause splitMultiLineListTextNode() to expand the paragraph but later
be rejected by restoreExpandedTextNodes() (e.g., a numeric list label like "1.
Revenue" followed by a different label style like "A. Other Income"); call
ListProcessor.processListsFromTextNodes() and assert that the original
SemanticParagraph is restored (no PDFList is produced) and contents size equals
the original, referencing splitMultiLineListTextNode, restoreExpandedTextNodes,
ListProcessor.processListsFromTextNodes, SemanticParagraph and PDFList to locate
the relevant code paths.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 23866b66-7573-41c4-a429-bb7952362f04

📥 Commits

Reviewing files that changed from the base of the PR and between e3b0ad6 and 7d54431.

📒 Files selected for processing (2)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>

codecov · 2026-05-08T09:03:20Z

Codecov Report

❌ Patch coverage is 90.58824% with 8 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...g/opendataloader/pdf/processors/ListProcessor.java	90.58%	3 Missing and 5 partials ⚠️

📢 Thoughts on this report? Let us know!

Fix detect multiline list items in single text nodes

b35e66e

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>

pandego requested review from LonelyMidoriya, MaximPlusov, bundolee, hnc-jglee and hyunhee-jo as code owners May 4, 2026 18:18

coderabbitai Bot reviewed May 4, 2026

View reviewed changes

Update multiline list expansion for review feedback

5429daf

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>

coderabbitai Bot reviewed May 4, 2026

View reviewed changes

Comment thread java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java Outdated

Comment thread ...endataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java

Fix preserve original nodes after list expansion

d99c170

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>

coderabbitai Bot reviewed May 4, 2026

View reviewed changes

Comment thread java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java

Fix restore guard for consumed expanded nodes

e3b0ad6

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>

coderabbitai Bot reviewed May 4, 2026

View reviewed changes

Comment thread java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ListProcessor.java

Comment thread ...endataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java

Fix doubles guard for multiline list expansion

7d54431

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>

coderabbitai Bot reviewed May 4, 2026

View reviewed changes

Comment thread ...endataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/ListProcessorTest.java

Add restore-path regression coverage

1197fcf

Signed-off-by: pandego <7780875+pandego@users.noreply.github.com>

Conversation

pandego commented May 4, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

CLAassistant commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov Bot commented May 8, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pandego commented May 4, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 4, 2026 •

edited

Loading

CLAassistant commented May 4, 2026 •

edited

Loading