@evan-onyx evan-onyx commented May 7, 2025

Description

Fixes https://linear.app/danswer/issue/DAN-1945/make-drive-fast-again

The Drive connector was slow, with occasional hangs that completely interrupted indexing. We believe the cause was a large number of duplicate API calls, which in some cases led to silent rate limiting by the Google APIs. Previously we ran 50 generator threads in parallel to fetch 16 documents, returned a checkpoint, then re-entered with that checkpoint and started 50 more generators, and so on. Now the number of threads is tied to the number of user emails we process per checkpoint, and we finish those users before returning a checkpoint. It remains to be seen whether this works well across the board, but it will certainly make API calls far more efficient (and therefore faster).
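
The new checkpointing scheme described above can be sketched roughly as follows. This is an illustration only, not the connector's actual code: `Checkpoint`, `run_checkpoint_step`, `fetch_files_for_user`, and `users_per_checkpoint` are all hypothetical names, and the real checkpoint presumably carries more state.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Iterator


@dataclass
class Checkpoint:
    # Hypothetical checkpoint: emails whose files have been fully retrieved.
    completed_emails: list[str] = field(default_factory=list)


def run_checkpoint_step(
    all_emails: list[str],
    checkpoint: Checkpoint,
    fetch_files_for_user: Callable[[str], Iterator[str]],
    users_per_checkpoint: int = 4,
) -> tuple[list[str], Checkpoint]:
    """Process one batch of users to completion, then return a new checkpoint."""
    done = set(checkpoint.completed_emails)
    batch = [e for e in all_emails if e not in done][:users_per_checkpoint]
    if not batch:
        return [], checkpoint

    # One worker per user in this batch, rather than a fixed pool of 50.
    with ThreadPoolExecutor(max_workers=len(batch)) as pool:
        # list(...) drains each user's generator completely, so no generator
        # has to be restarted (and its API calls repeated) on the next step.
        per_user = list(pool.map(lambda e: list(fetch_files_for_user(e)), batch))

    docs = [doc for user_docs in per_user for doc in user_docs]
    return docs, Checkpoint(completed_emails=checkpoint.completed_emails + batch)
```

A caller would invoke `run_checkpoint_step` repeatedly, passing back the returned checkpoint, until it returns an empty batch. Each user's generator is started exactly once in this sketch, which is the property the change is after.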

How Has This Been Tested?

Tested manually

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@evan-onyx evan-onyx requested a review from a team as a code owner May 7, 2025 03:04

@greptile-apps greptile-apps bot left a comment

PR Summary

This PR optimizes the Google Drive connector's performance by reducing API call overhead and improving thread management to prevent rate limiting and hangs during document indexing.

  • Reduced BATCHES_PER_CHECKPOINT from 10 to 1 in /backend/onyx/connectors/google_drive/connector.py to minimize redundant API calls
  • Added timeout and RefreshError handling to prevent connector hangs
  • Optimized thread management by matching thread count to active user emails instead of fixed 50 threads
  • Added per-user drive ID tracking to prevent duplicate processing
  • Improved logging for better debugging and monitoring of API interactions
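
As a rough illustration of the timeout, `RefreshError`, and per-user drive ID points in the list above, the sketch below guards a single drive listing call. It is not taken from the PR: `list_drive_with_guard`, `list_drive`, `processed`, and `timeout_s` are hypothetical names, and the local `RefreshError` class is a stand-in for `google.auth.exceptions.RefreshError` so that the example runs without the google-auth package installed.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

logger = logging.getLogger(__name__)


class RefreshError(Exception):
    """Stand-in for google.auth.exceptions.RefreshError (assumption for this sketch)."""


def list_drive_with_guard(
    list_drive: Callable[[str, str], list[str]],
    email: str,
    drive_id: str,
    processed: dict[str, set[str]],
    timeout_s: float = 5.0,
) -> list[str]:
    """List one drive for one user, skipping repeats and bounding the wait."""
    seen = processed.setdefault(email, set())
    if drive_id in seen:
        return []  # already handled for this user; no repeat API call

    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(list_drive, email, drive_id)
    try:
        files = future.result(timeout=timeout_s)
    except FutureTimeout:
        # The wait is abandoned; the underlying call is not cancelled.
        logger.warning("Timed out listing drive %s for %s", drive_id, email)
        return []
    except RefreshError as exc:
        # Log credential failures instead of failing silently.
        logger.warning("Credential refresh failed for %s: %s", email, exc)
        return []
    finally:
        pool.shutdown(wait=False)

    seen.add(drive_id)  # mark as processed only after a successful listing
    return files
```

Tracking processed drive IDs per user, rather than globally, means one user's failure or timeout does not prevent another user with access to the same drive from retrieving it.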

1 file(s) reviewed, no comment(s)

@evan-onyx evan-onyx added this pull request to the merge queue May 7, 2025
Merged via the queue into main with commit 0eab6ab May 7, 2025
11 checks passed
@evan-onyx evan-onyx deleted the perf/drive-checkpoint-speedup branch May 7, 2025 23:45
ferdinandl007 pushed a commit to ferdinandl007/onyx that referenced this pull request May 8, 2025
* fix slowness

* no more silent failing for users

* nits

* no silly info transfer
ferdinandl007 pushed a commit to ferdinandl007/onyx that referenced this pull request May 9, 2025
ZhipengHe pushed a commit to ZhipengHe/onyx that referenced this pull request Jun 6, 2025
AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025