@Weves commented Jun 6, 2025

Description

Fixes https://linear.app/danswer/issue/DAN-2055/switch-to-chonkie

How Has This Been Tested?

Covered by existing unit tests.

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@Weves requested a review from a team as a code owner June 6, 2025 02:23

vercel bot commented Jun 6, 2025

The latest updates on your projects.

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| internal-search | ✅ Ready | Visit Preview | 💬 Add feedback | Jun 11, 2025 4:42pm |

@greptile-apps bot left a comment

PR Summary

Switches document chunking from LlamaIndex to Chonkie, significantly reducing the dependency footprint (21MB vs 80-171MB) while improving chunking performance.

  • backend/onyx/indexing/chunker.py: Replaces LlamaIndex's SentenceSplitter with Chonkie's SentenceChunker, removing lazy imports and streamlining tokenization
  • backend/onyx/utils/timing.py: Adds a new timed context manager for simpler execution-time measurements using time.monotonic()
  • Reported chunking performance improvements: 33x faster token chunking and 2x faster sentence chunking, while maintaining all essential text processing capabilities
  • Note: Consider adding performance benchmarks specific to our RAG use case to validate the reported improvements
  • Double-check that the token counter wrapper maintains exact compatibility with previous tokenization behavior
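
For readers unfamiliar with the pattern, here is a minimal sketch of a timing context manager built on time.monotonic(), in the spirit of the timed helper mentioned in the summary above. It is not the code from backend/onyx/utils/timing.py: the signature, the `label` and `log` parameters, and the message format are assumptions made for illustration only.

```python
import time
from collections.abc import Callable, Iterator
from contextlib import contextmanager


@contextmanager
def timed(label: str, log: Callable[[str], None] = print) -> Iterator[None]:
    """Report how long the enclosed block took (hypothetical signature)."""
    # time.monotonic() never goes backwards, so the difference stays valid
    # even if the system wall clock is adjusted while the block runs.
    start = time.monotonic()
    try:
        yield
    finally:
        # Runs whether the block finished normally or raised.
        elapsed = time.monotonic() - start
        log(f"{label} took {elapsed:.3f}s")


# Usage: collect the timing message in a list instead of printing it.
messages: list[str] = []
with timed("chunking", log=messages.append):
    # Stand-in workload; a real caller would invoke the chunker here.
    chunks = [s.strip() for s in "First sentence. Second sentence.".split(".") if s.strip()]
```

Putting the report in a `finally` clause means a duration is still logged when the timed block raises, which is useful when a slow chunking call ends in a failure.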

3 file(s) reviewed, 2 comment(s)

@Weves force-pushed the switch-to-chonkie branch from 301ab58 to 7fecac5 June 11, 2025 16:39
@Weves merged commit c040b1c into main Jun 11, 2025
11 checks passed
@Weves deleted the switch-to-chonkie branch June 11, 2025 21:12
AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025
* Switch to chonkie from llamaindex chunker

* Remove un-intended changes

* Order requirements

* Upgrade chonkie version