Discussion on `AutoMergingRetriever` and `HierarchicalDocumentSplitter` #78

TuanaCelik · 2024-09-05T12:43:35Z


This is the discussion board for the experimental `AutoMergingRetriever` and `HierarchicalDocumentSplitter` components.

These components split documents into smaller chunks that each keep a reference to their 'parent' document. At retrieval time, based on a threshold setting, the parent document is returned instead of its 'child' documents once enough of those children have been retrieved.

The rationale: if a paragraph is split into multiple chunks, represented as leaf documents, and several of those chunks match a given query, the whole paragraph is likely to be more informative than the individual chunks alone.
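The merging idea described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the components' actual implementation: the `Chunk` class, the `auto_merge` function, and the choice to compare the fraction of matched children strictly against the threshold are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Chunk:
    # Hypothetical stand-in for a document: an id, some text, and a parent link.
    id: str
    text: str
    parent_id: Optional[str] = None


def auto_merge(
    matched: List[Chunk],
    parents: Dict[str, Chunk],
    children_of: Dict[str, List[str]],
    threshold: float = 0.5,
) -> List[Chunk]:
    """Swap matched children for their parent when enough siblings matched."""
    hits_by_parent: Dict[str, List[Chunk]] = defaultdict(list)
    result: List[Chunk] = []
    for chunk in matched:
        if chunk.parent_id is None:
            result.append(chunk)  # no parent to merge into
        else:
            hits_by_parent[chunk.parent_id].append(chunk)

    for parent_id, hits in hits_by_parent.items():
        fraction = len(hits) / len(children_of[parent_id])
        if fraction > threshold:  # assumption: strictly above the threshold
            result.append(parents[parent_id])
        else:
            result.extend(hits)
    return result


# Two paragraphs, each split into four leaf chunks.
parents = {"p1": Chunk("p1", "paragraph one"), "p2": Chunk("p2", "paragraph two")}
children_of = {"p1": ["c1", "c2", "c3", "c4"], "p2": ["d1", "d2", "d3", "d4"]}
matched = [
    Chunk("c1", "chunk", "p1"),
    Chunk("d1", "chunk", "p2"),
    Chunk("c2", "chunk", "p1"),
    Chunk("c3", "chunk", "p1"),
]

merged = auto_merge(matched, parents, children_of, threshold=0.5)
print([chunk.id for chunk in merged])
```

Here three of the four children of `p1` matched, so the whole paragraph is returned, while the single match under `p2` stays an individual chunk.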

📚Full Documentation of the AutoMergingRetriever
📚Full Documentation of the HierarchicalDocumentSplitter

🧑‍🍳 Try the Cookbook here

Note: Experimental features live in this repository for a fixed period of time. We don't guarantee that we will continue maintaining them. But if they are successful and stable, and if you like them, we will move them to the core Haystack package.

PS: The first version of this experiment was implemented by @davidsbatista 🚀

TuanaCelik · 2024-09-06T10:10:36Z

Author

Ok y'all, I have been looking at the draft cookbook we are about to share for this component, so I decided: why don't I add the first comment already?

Imo, for the `HierarchicalDocumentSplitter` to be usable in pipelines, we should also have an output called `parent_documents` that outputs the `__level = 1` documents. Otherwise, to use the component in a pipeline, we probably have to use the conditional router, which becomes unnecessarily complex. The common use case will probably just be indexing parent documents, as shown by David already 🙏

1 reply

davidsbatista Sep 11, 2024
Maintainer

I don't think it makes much sense to hard-code the `HierarchicalDocumentSplitter` to have an extra output `parent_documents` that outputs the `__level = 1` documents. The `HierarchicalDocumentSplitter` aims to build a hierarchy based on the input documents and the number of levels/blocks the user specifies. It's then up to the user to decide which documents from which level they want and what to do with them.
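Picking documents from one level can be done by the user with a small filter. The sketch below is a minimal illustration, assuming each document stores its hierarchy level under a `__level` key in its metadata, as the `__level = 1` reference in this thread suggests. The `Doc` class and the `documents_at_level` helper are hypothetical stand-ins, not part of the library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    # Hypothetical stand-in for a document: content plus a metadata dict.
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def documents_at_level(docs: List[Doc], level: int) -> List[Doc]:
    """Return only the documents whose `__level` metadata equals `level`."""
    return [doc for doc in docs if doc.meta.get("__level") == level]


# A tiny hand-built hierarchy: one root, two parents, three leaves.
docs = [
    Doc("full text", {"__level": 0}),
    Doc("first half", {"__level": 1}),
    Doc("second half", {"__level": 1}),
    Doc("leaf a", {"__level": 2}),
    Doc("leaf b", {"__level": 2}),
    Doc("leaf c", {"__level": 2}),
]

parent_documents = documents_at_level(docs, level=1)
print([doc.content for doc in parent_documents])
```

Because the filter takes the level as an argument, the same helper works for any depth of hierarchy rather than only for level 1.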

GoXian · 2025-02-24T02:39:10Z


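This is the discussion board for the experimental `AutoMergingRetriever` and `HierarchicalDocumentSplitter` components.

These components are used to split documents with a reference to the 'parent' document, and then, based on a threshold setting, to return the parent documents when a certain number of 'child' documents are retrieved.

The rationale is that, if a paragraph is split into multiple chunks represented as leaf documents and multiple chunks match a given query, the whole paragraph may be more informative than the individual chunks alone.

📚 Full documentation of the AutoMergingRetriever
📚 Full documentation of the HierarchicalDocumentSplitter

🧑‍🍳 Try the Cookbook here

**Note:** Experimental features are kept in this repository for a fixed period of time. We don't guarantee that we will continue maintaining experimental features. But if they are successful and stable, and if you like them, we will move the feature to the core Haystack package.

PS: The first version of this experiment was implemented by 🚀

Please update the documentation links; the documentation page now returns a 404.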

2 replies

davidsbatista Feb 24, 2025
Maintainer

@GoXian thanks for letting us know - I've updated the doc links

GoXian Feb 24, 2025

Thank you

davidsbatista · 2025-04-03T09:20:01Z

Maintainer

This component was moved into haystack main.

0 replies
