Add merge_peers and always_emit_headings options to ChunkDoclingDocumentComponent #11684
base: main
…gDocumentComponent `pragma: allowlist secret`
Important: Review skipped
Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI.
Walkthrough
Two new boolean input parameters (
Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 3 warnings, 1 inconclusive)
✅ Passed checks (2 passed)
Codecov Report
✅ All modified and coverable lines are covered by tests.
❌ Your project check has failed because the head coverage (42.12%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #11684 +/- ##
==========================================
+ Coverage 35.20% 35.22% +0.01%
==========================================
Files 1521 1521
Lines 72922 72923 +1
Branches 10936 10936
==========================================
+ Hits 25674 25686 +12
+ Misses 45853 45843 -10
+ Partials 1395 1394 -1
Flags with carried forward coverage won't be shown.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/lfx/src/lfx/_assets/component_index.json`:
- Line 64090: Summary: remove the unsupported always_emit_headings parameter and
input. Fix: in ChunkDoclingDocumentComponent remove the Message/Bool Input
definition for "always_emit_headings" from the inputs list and remove any
build_config toggles referencing "always_emit_headings" in update_build_config;
also remove the argument always_emit_headings=bool(self.always_emit_headings)
passed into the HybridChunker() instantiation inside chunk_documents (and any
uses of self.always_emit_headings). References to change: the inputs list entry
named "always_emit_headings", the update_build_config branch that sets
build_config["always_emit_headings"][...] and the HybridChunker(...) call in
chunk_documents.
In `@src/lfx/src/lfx/components/docling/chunk_docling_document.py`:
- Around line 183-187: The instantiation of HybridChunker is passing an
unsupported parameter always_emit_headings which will raise a TypeError; remove
the always_emit_headings argument from the HybridChunker(...) call (leave
tokenizer=tokenizer and merge_peers=bool(self.merge_peers)), or if you intend to
control heading inclusion, replace it with the supported parameter
include_heading_hierarchy and pass the appropriate boolean (e.g.,
include_heading_hierarchy=bool(self.include_heading_hierarchy)) so the
HybridChunker call uses only valid kwargs.
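The failure mode described in the two comments above (a constructor rejecting a keyword it does not declare) can be reproduced and guarded against without docling installed. The following is a minimal sketch only: `StandInChunker` and `filter_supported_kwargs` are hypothetical names introduced for illustration, and the stand-in's signature is an assumption, not the real `HybridChunker` signature. Whether a given docling release accepts `always_emit_headings` has to be checked against the installed version.

```python
import inspect


class StandInChunker:
    """Hypothetical stand-in for a chunker class; NOT docling's HybridChunker."""

    def __init__(self, tokenizer: str, merge_peers: bool = True):
        self.tokenizer = tokenizer
        self.merge_peers = merge_peers


def filter_supported_kwargs(target, **kwargs):
    """Split kwargs into those the callable declares and those it would reject."""
    params = inspect.signature(target).parameters
    # A callable that declares **kwargs accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs), {}
    accepted = {k: v for k, v in kwargs.items() if k in params}
    rejected = {k: v for k, v in kwargs.items() if k not in params}
    return accepted, rejected


# Keyword arguments the component would like to pass (values are placeholders).
requested = {
    "tokenizer": "dummy-tokenizer",
    "merge_peers": True,
    "always_emit_headings": False,
}

# Passing everything directly raises TypeError for the undeclared keyword.
try:
    StandInChunker(**requested)
    direct_call_failed = False
except TypeError:
    direct_call_failed = True

# Filtering first lets the call succeed and reports what was dropped.
accepted, rejected = filter_supported_kwargs(StandInChunker, **requested)
chunker = StandInChunker(**accepted)
```

A check like this is one possible way to keep a component working across library versions; simply removing the unsupported argument, as the review suggests, is the more direct fix when only one version is targeted.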
🧹 Nitpick comments (1)
src/lfx/src/lfx/_assets/component_index.json (1)
72454-72454: Unrelated dependency version bumps included in this PR.
Hunks 5–14 update 2.5.0 and `vlmrun` to 0.5.4 across multiple components. These changes are unrelated to the stated PR objective (adding `merge_peers` and `always_emit_headings`). Consider whether these should be in a separate PR for cleaner change tracking, or confirm they were intentionally bundled (e.g., via an index regeneration script).
Introduce two new options, `merge_peers` and `always_emit_headings`, to enhance the functionality of the ChunkDoclingDocumentComponent. These options allow for merging undersized chunks with shared metadata and emitting headings for empty sections, respectively.
Summary by CodeRabbit