I’ve implemented a DocumentSplitter that supports Chinese text #9368
mc112611 started this conversation in Show and tell
Replies: 1 comment
-
Hey @mc112611, thanks for your work on this! Would you be willing to open a PR with your changes? If you don't have the time to do so, that's totally fine, but I'd recommend opening this as a feature request so our team can track it and get to it when we have time!
-
Hi deepset-ai team,
I extended the DocumentSplitter class and overrode several of its methods. The original component uses NLTK for sentence splitting and whitespace for tokenization, which works well for English but not for Chinese, since Chinese text is written without spaces between words.
In my version, called chinese_DocumentSplitter, I integrated Chinese-specific sentence and word tokenizers. When instantiated with language="zh", it can split Chinese documents properly while remaining fully compatible with other Haystack components, as it preserves the same data flow format.
Example usage:
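A minimal sketch, assuming chinese_DocumentSplitter accepts the standard DocumentSplitter constructor parameters alongside the new language flag; the import path is hypothetical:

```python
# Minimal usage sketch; the import path is hypothetical, and the
# constructor is assumed to accept the standard DocumentSplitter
# parameters in addition to language="zh".
from haystack import Document

from chinese_document_splitter import chinese_DocumentSplitter  # hypothetical path

splitter = chinese_DocumentSplitter(
    language="zh",        # switches on the Chinese tokenizers
    split_by="sentence",  # standard DocumentSplitter parameter
    split_length=2,       # sentences per chunk
)

docs = [Document(content="今天天气很好。我们一起去公园散步吧。公园里的花开得很漂亮。")]
result = splitter.run(documents=docs)
for doc in result["documents"]:
    print(doc.content)
```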
I'd also like to offer a small suggestion. I believe one major reason Haystack doesn't fully support multilingual splitting is that tokenization and sentence boundary detection differ across languages.
Perhaps future versions could abstract sentence and word splitting into separately exposed functions. This would allow users working with other languages to easily plug in custom tokenizers by simply modifying those functions.
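As a rough sketch of what that extension point could look like (PluggableSplitter, english_sentences, and chinese_sentences are illustrative names, not Haystack API):

```python
# Illustrative sketch of the suggestion, not Haystack's API: expose the
# sentence and word splitters as injectable callables, so supporting a
# new language means swapping functions rather than subclassing.
import re
from typing import Callable, List


def english_sentences(text: str) -> List[str]:
    # stand-in for the current NLTK-based sentence splitter
    return re.split(r"(?<=[.!?])\s+", text)


def chinese_sentences(text: str) -> List[str]:
    # split after Chinese sentence-final punctuation
    return [s for s in re.split(r"(?<=[。！？])", text) if s]


class PluggableSplitter:
    def __init__(
        self,
        split_sentences: Callable[[str], List[str]] = english_sentences,
        split_words: Callable[[str], List[str]] = str.split,  # whitespace tokenization
    ):
        self.split_sentences = split_sentences
        self.split_words = split_words


# A Chinese user could then plug in jieba without subclassing anything:
#   import jieba
#   splitter = PluggableSplitter(split_sentences=chinese_sentences,
#                                split_words=jieba.lcut)
```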
Thank you all at deepset-ai for your great work — Haystack continues to be a powerful and user-friendly framework for working with LLMs.
I hope this can help users who primarily work with Chinese documents.
Below is my modified version of DocumentSplitter with added support for Chinese. It is based on haystack-ai==2.12.1 and includes a __main__ block to test the splitting functionality directly.
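A condensed, self-contained sketch of the core idea: jieba for word segmentation and a regex over Chinese sentence-final punctuation are illustrative stand-ins, and for brevity it implements the run() contract directly rather than subclassing DocumentSplitter as described above.

```python
# Illustrative sketch, not the full implementation: a component with the
# same run() contract as DocumentSplitter, using jieba for word
# segmentation and Chinese sentence-final punctuation for sentence
# boundaries. Assumes haystack-ai==2.12.1 and `pip install jieba`.
import re
from typing import Any, Dict, List

import jieba
from haystack import Document, component


@component
class chinese_DocumentSplitter:
    def __init__(self, language: str = "zh", split_length: int = 2):
        self.language = language
        self.split_length = split_length  # sentences per output chunk

    def _split_sentences(self, text: str) -> List[str]:
        # split after 。！？, keeping the punctuation attached to its sentence
        return [s for s in re.split(r"(?<=[。！？])", text) if s.strip()]

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> Dict[str, Any]:
        split_docs: List[Document] = []
        for doc in documents:
            sentences = self._split_sentences(doc.content or "")
            for i in range(0, len(sentences), self.split_length):
                chunk = "".join(sentences[i : i + self.split_length])
                meta = dict(doc.meta)
                # jieba word segmentation, so downstream components see a
                # real word count instead of a whitespace-token count
                meta["word_count"] = len(jieba.lcut(chunk))
                split_docs.append(Document(content=chunk, meta=meta))
        return {"documents": split_docs}


if __name__ == "__main__":
    docs = [Document(content="今天天气很好。我们一起去公园散步吧。公园里的花开得很漂亮。")]
    for d in chinese_DocumentSplitter(split_length=2).run(documents=docs)["documents"]:
        print(d.content, "| words:", d.meta["word_count"])
```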