-
Notifications
You must be signed in to change notification settings - Fork 1.9k
fix: skip huge files on sdk fallback #5421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR addresses a critical memory optimization issue in the SharePoint connector's SDK fallback mechanism. The changes introduce two key improvements:
File Size Cap Enforcement: The SharePoint connector previously had a vulnerability where its SDK fallback path (get_content()
method) would download files of any size into memory without size checking. This PR adds a new _download_via_sdk_with_cap()
function that implements streaming download with a hard byte limit using the SDK's download_session()
method with chunk callbacks. When files exceed the configured size threshold, they are now properly skipped with appropriate error handling.
Memory-Optimized Image Processing: The embedded image extraction process was refactored to use a callback pattern instead of materializing all images into memory at once. Previously, documents with large embedded images could potentially double memory usage (e.g., a 19MB image document using ~38MB at peak). The new approach processes images one-by-one via an image_callback
parameter added to the file processing pipeline (extract_text_and_images
, read_pdf_file
, docx_to_text_and_images
).
These changes align the SDK fallback path with the same size protections that existed in the primary download path, and leverage existing callback infrastructure in the extract_text_and_images()
API. The modifications ensure consistent memory management across all SharePoint download scenarios while maintaining backward compatibility through optional parameters.
Confidence score: 4/5
- This PR addresses a real memory vulnerability with well-targeted fixes that follow existing patterns in the codebase
- Score reflects solid implementation using established callback patterns and proper error handling, though the complexity of the streaming logic warrants careful testing
- Pay close attention to the new
_download_via_sdk_with_cap()
function and the image callback integration points
2 files reviewed, 3 comments
return doc.markdown, list(extract_docx_images(to_bytesio(file))) | ||
# If a callback is provided, iterate and stream images without accumulating | ||
try: | ||
for img_file_bytes, file_name in extract_docx_images(to_bytesio(file)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
syntax: Variable name shadowing - file_name
parameter shadows the function parameter on line 330
for img_file_bytes, file_name in extract_docx_images(to_bytesio(file)): | |
for img_file_bytes, img_file_name in extract_docx_images(to_bytesio(file)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
raise RuntimeError("size_cap_exceeded") | ||
|
||
# modifies the driveitem to change its download behavior | ||
driveitem.download_session(buf, chunk_downloaded=on_chunk, chunk_size=chunk_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: The comment says 'modifies the driveitem' but this is potentially misleading - it configures the download session behavior rather than modifying the item itself.
try: | ||
# Use this to test the sdk size cap | ||
# raise requests.RequestException("test") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
syntax: Remove commented test code before merging to production.
try: | |
# Use this to test the sdk size cap | |
# raise requests.RequestException("test") | |
try: | |
content_bytes = _download_with_cap( |
Context Used: Rule - Remove temporary debugging code before merging to production, especially tenant-specific debugging logs. (link)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 issues found across 2 files
Prompt for AI agents (all 2 issues)
Understand the root cause of the following 2 issues and fix them.
<file name="backend/onyx/connectors/sharepoint/connector.py">
<violation number="1" location="backend/onyx/connectors/sharepoint/connector.py:411">
Embedded images are now appended before the text section due to streaming callback, reversing prior section order; this may affect consumers relying on text-first ordering.</violation>
</file>
<file name="backend/onyx/file_processing/extract_file_text.py">
<violation number="1" location="backend/onyx/file_processing/extract_file_text.py:300">
Guard image_callback in read_pdf_file with try/except to prevent a callback error from aborting the entire PDF extraction.</violation>
</file>
React with 👍 or 👎 to teach cubic. Mention @cubic-dev-ai
to give feedback, ask questions, or re-run the review.
image_section.link = driveitem.web_url | ||
sections.append(image_section) | ||
|
||
extraction_result = extract_text_and_images( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Embedded images are now appended before the text section due to streaming callback, reversing prior section order; this may affect consumers relying on text-first ordering.
Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 411:
<comment>Embedded images are now appended before the text section due to streaming callback, reversing prior section order; this may affect consumers relying on text-first ordering.</comment>
<file context>
@@ -370,24 +398,27 @@ def _convert_driveitem_to_document_with_permissions(
image_section.link = driveitem.web_url
sections.append(image_section)
+ extraction_result = extract_text_and_images(
+ file=io.BytesIO(content_bytes),
+ file_name=driveitem.name,
</file context>
extracted_images.append((img_bytes, image_name)) | ||
if image_callback is not None: | ||
# Stream image out immediately | ||
image_callback(img_bytes, image_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guard image_callback in read_pdf_file with try/except to prevent a callback error from aborting the entire PDF extraction.
Prompt for AI agents
Address the following comment on backend/onyx/file_processing/extract_file_text.py at line 300:
<comment>Guard image_callback in read_pdf_file with try/except to prevent a callback error from aborting the entire PDF extraction.</comment>
<file context>
@@ -292,7 +295,11 @@ def read_pdf_file(
- extracted_images.append((img_bytes, image_name))
+ if image_callback is not None:
+ # Stream image out immediately
+ image_callback(img_bytes, image_name)
+ else:
+ extracted_images.append((img_bytes, image_name))
</file context>
return doc.markdown, list(extract_docx_images(to_bytesio(file))) | ||
# If a callback is provided, iterate and stream images without accumulating | ||
try: | ||
for img_file_bytes, file_name in extract_docx_images(to_bytesio(file)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
5 Jobs Failed: Run Integration Tests v2 / build-backend-image failed on "Login to Private Registry"
Run Integration Tests v2 / build-integration-image failed on "Login to Private Registry"
Run Integration Tests v2 / build-model-server-image failed on "Login to Private Registry"
Run Integration Tests v2 / required failed on "Run actions/github-script@v7"
1 job failed running on non-Blacksmith runners. Summary: 6 successful workflows, 2 failed workflows
Last updated: 2025-09-16 01:21:21 UTC |
Description
Fixes https://linear.app/danswer/issue/DAN-2531/sharepoint-sdk-no-size-limit
When we were falling back to using the sharepoint SDK, we were not limiting file size download.
Also, we were materializing all images in a doc into a separate list. Worst case this could double the memory used, so processing them one by one reduces the chance of this happening (a doc with one 19MB image will still use double memory but there isn't much we can do about that)
How Has This Been Tested?
tested in UI, will verify that automated tests still pass
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.
Summary by cubic
Add a hard size cap to SharePoint SDK fallback downloads and skip files that exceed it. Stream embedded images during extraction to reduce peak memory and avoid duplicating large buffers.
Bug Fixes
Refactors