fix drive slowness #4668
Conversation
PR Summary
This PR optimizes the Google Drive connector's performance by reducing API call overhead and improving thread management to prevent rate limiting and hangs during document indexing.
- Reduced BATCHES_PER_CHECKPOINT from 10 to 1 in /backend/onyx/connectors/google_drive/connector.py to minimize redundant API calls
- Added timeout and RefreshError handling to prevent connector hangs (see the sketch after this list)
- Optimized thread management by matching thread count to active user emails instead of a fixed 50 threads
- Added per-user drive ID tracking to prevent duplicate processing
- Improved logging for better debugging and monitoring of API interactions
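The timeout and RefreshError handling mentioned above could look roughly like the following minimal sketch. The helper name `execute_with_timeout`, the 60-second timeout, and the socket-based mechanism are illustrative assumptions, not the connector's actual code.

```python
import socket
from typing import Any

from google.auth.exceptions import RefreshError

# Illustrative timeout; the real connector may use a different value or mechanism.
DRIVE_API_TIMEOUT_SECONDS = 60


def execute_with_timeout(request: Any) -> Any:
    """Hypothetical wrapper around a googleapiclient request.

    A socket-level default timeout keeps a stalled HTTP call from hanging the
    connector indefinitely, and RefreshError is surfaced instead of being
    swallowed silently.
    """
    previous_timeout = socket.getdefaulttimeout()
    socket.setdefaulttimeout(DRIVE_API_TIMEOUT_SECONDS)
    try:
        return request.execute()
    except RefreshError:
        # Credentials could not be refreshed (e.g. access revoked for a user);
        # re-raise so the caller can skip that user explicitly rather than
        # failing silently.
        raise
    finally:
        socket.setdefaulttimeout(previous_timeout)
```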
1 file reviewed, no comments (Greptile)
d8c10ca to 6827dc3
* fix slowness
* no more silent failing for users
* nits
* no silly info transfer
Description
Fixes https://linear.app/danswer/issue/DAN-1945/make-drive-fast-again
We were seeing a lot of slowness with the Drive connector, along with occasional hangs that completely interrupted indexing. We believe this was due to many duplicate API calls, in some cases leading to silent rate limiting from the Google APIs. We were running 50 generator threads in parallel to fetch 16 documents, returning a checkpoint, then re-entering with that checkpoint and starting 50 more generators, and so on. Now we tie the number of threads to the number of user emails we process per checkpoint and finish those users before returning a checkpoint. It remains to be seen whether this will work well across the board, but it will certainly greatly improve API call efficiency (and therefore speed).
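As a rough sketch of the new per-checkpoint flow: the function and field names below (`retrieve_user_files`, `seen_drive_ids_by_user`, `process_checkpoint_batch`) are hypothetical and only illustrate the idea of one worker per user email plus per-user drive ID tracking; they are not the actual connector code.

```python
from concurrent.futures import ThreadPoolExecutor


def process_checkpoint_batch(user_emails, checkpoint, retrieve_user_files):
    """Illustrative only: one worker per user email in this batch, instead of
    a fixed pool of 50 generator threads. Every selected user is fully
    drained before the updated checkpoint is handed back to the caller.
    """
    documents = []
    # Hypothetical per-user record of drive IDs already crawled, so a shared
    # drive visible to several users is not fetched more than once.
    seen_drive_ids = checkpoint.setdefault("seen_drive_ids_by_user", {})

    with ThreadPoolExecutor(max_workers=max(1, len(user_emails))) as executor:
        future_to_email = {
            executor.submit(
                retrieve_user_files,
                email,
                seen_drive_ids.get(email, set()),
            ): email
            for email in user_emails
        }
        for future, email in future_to_email.items():
            user_docs, user_drive_ids = future.result()
            documents.extend(user_docs)
            seen_drive_ids[email] = user_drive_ids

    # The checkpoint is only returned once every selected user is finished,
    # so resuming never restarts a half-processed wave of 50 generators.
    return documents, checkpoint
```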
How Has This Been Tested?
Tested manually
Backporting (check the box to trigger backport action)
Note: Verify that the backport action passes; otherwise, resolve the conflicts manually and tag the patches.