Skip to content

Conversation

evan-onyx
Copy link
Contributor

@evan-onyx evan-onyx commented Aug 24, 2025

Description

see title

How Has This Been Tested?

tested in UI

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Summary by cubic

Enforces a 20 MB cap on Google Drive file downloads and exports. Files that exceed the limit are skipped early to avoid memory spikes and timeouts.

  • Bug Fixes
    • Add size_threshold to all Drive download/export paths and pass it through extraction.
    • Abort MediaIoBaseDownload when progress exceeds the threshold; log a warning and return empty bytes.
    • Consolidate download logic into _donwload_request for both get_media and export_media.

@evan-onyx evan-onyx requested a review from a team as a code owner August 24, 2025 03:49
Copy link

vercel bot commented Aug 24, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
internal-search Ready Ready Preview Comment Aug 25, 2025 1:04am

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR implements file size limiting for Google Drive downloads to prevent memory issues from large files. The changes add a size_threshold parameter that flows through the entire Google Drive download pipeline, implementing chunked downloading with progress monitoring that aborts downloads exceeding the threshold.

Key modifications include:

  • Adding size_threshold parameter to download_request() function signature
  • Creating a new _donwload_request() helper function that uses MediaIoBaseDownload with chunked downloading (chunk size set to 2x the threshold)
  • Implementing progress monitoring that stops downloads when resumable_progress > size_threshold
  • Threading the size_threshold parameter through _download_and_extract_sections_basic() and other related functions
  • Early termination returns empty bytes when threshold is exceeded

The implementation uses Google's chunked download API to monitor progress and prevent full download of oversized files, which improves system stability by avoiding memory exhaustion from large Google Drive files. This change integrates with the existing Google Drive connector architecture while adding size enforcement at the download level.

Confidence score: 2/5

  • This PR has critical implementation issues that could cause runtime failures
  • Score significantly lowered due to a typo in function name _donwload_request (missing 'w') which will cause NameError exceptions
  • Pay close attention to the new download functions and verify the typo is fixed before merging

1 file reviewed, 3 comments

Edit Code Review Bot Settings | Greptile

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 1 file

React with 👍 or 👎 to teach cubic. You can also tag @cubic-dev-ai to give feedback, ask questions, or re-run the review.

Copy link
Contributor

@justin-tahara justin-tahara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we can address the comments I think this should be good.

Maybe we think a bit more about the magic number portion and then clean this up

@justin-tahara
Copy link
Contributor

@greptileai

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This review covers only the changes made since the last review (commit 381269a), not the entire PR.

The most recent changes address previously identified issues in the Google Drive file download size enforcement implementation. The developer has made several important corrections:

  1. Fixed function name typo: Corrected _donwload_request to _download_request
  2. Fixed log message typo: Changed 'Skipping2.' to 'Skipping.'
  3. Improved chunk size calculation: Modified from size_threshold * 2 to size_threshold + CHUNK_SIZE_BUFFER (64 bytes)
  4. Added null safety: Implemented a check for download_progress before accessing its properties
  5. Added new constant: Introduced CHUNK_SIZE_BUFFER = 64 for more precise chunk size control

These changes refine the size threshold enforcement mechanism that prevents memory spikes and timeouts when downloading large Google Drive files. The chunk size adjustment is particularly important as it ensures the system only downloads slightly more than the threshold (20MB + 64 bytes) rather than potentially double the limit, making the size cap more effective. The null safety check prevents potential AttributeError exceptions during the download progress monitoring.

Confidence score: 4/5

  • This PR addresses all previously identified issues and is safe to merge with minimal risk
  • Score reflects the fixes to typos, logical improvements to chunk sizing, and addition of proper null safety checks
  • Pay close attention to the download logic in _download_request function to ensure the size threshold enforcement works as expected in production

2 files reviewed, no comments

Edit Code Review Bot Settings | Greptile

Copy link
Contributor

@justin-tahara justin-tahara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment about the import but besides that this looks good! Thank you Evan!! Let's get this in and look into the next set of OOM's

langgraph==0.2.72
langgraph-checkpoint==2.0.13
langgraph-sdk==0.1.44
lazy_imports==1.0.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

integration tests were failing bc we were missing this import

@evan-onyx evan-onyx enabled auto-merge August 25, 2025 17:18
@evan-onyx evan-onyx added this pull request to the merge queue Aug 25, 2025
Merged via the queue into main with commit c1e1aa9 Aug 25, 2025
15 of 16 checks passed
@evan-onyx evan-onyx deleted the fix/drive-file-size-download branch August 25, 2025 19:28
wenxi-onyx pushed a commit that referenced this pull request Aug 25, 2025
* fix: downloads are never larger than 20mb

* JT comments

* import to fix integration tests
wenxi-onyx pushed a commit that referenced this pull request Aug 25, 2025
* fix: downloads are never larger than 20mb

* JT comments

* import to fix integration tests
AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025
* fix: downloads are never larger than 20mb

* JT comments

* import to fix integration tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants