
@evan-onyx evan-onyx (Contributor) commented Sep 15, 2025

Description

Fixes https://linear.app/danswer/issue/DAN-2531/sharepoint-sdk-no-size-limit

When we fell back to the SharePoint SDK, we were not limiting the size of file downloads.

We were also materializing all the images in a doc into a separate list, which in the worst case could double the memory used. Processing them one by one reduces the chance of that happening (a doc with a single 19 MB image will still use double the memory, but there isn't much we can do about that).
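
For context, a minimal sketch of the capped fallback download, assuming the office365 SDK's download_session API that appears in the diff below; the helper name and the "size_cap_exceeded" sentinel match the diff, while the max_bytes/chunk_size parameters, the callback signature, and the trailing execute_query() are assumptions:

    import io

    def _download_via_sdk_with_cap(driveitem, max_bytes: int, chunk_size: int = 1024 * 1024) -> bytes:
        buf = io.BytesIO()

        def on_chunk(offset: int) -> None:
            # offset is the cumulative byte count reported by the SDK; abort
            # the transfer as soon as the cap is exceeded instead of buffering
            # the whole file in memory.
            if offset > max_bytes:
                raise RuntimeError("size_cap_exceeded")

        # Configure the download session to stream in chunks, invoking
        # on_chunk after each one (this does not modify the item itself).
        driveitem.download_session(
            buf, chunk_downloaded=on_chunk, chunk_size=chunk_size
        ).execute_query()
        return buf.getvalue()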

How Has This Been Tested?

Tested in the UI; will verify that automated tests still pass.

Backporting (check the box to trigger backport action)

Note: check that the backport action passes; otherwise, resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Summary by cubic

Add a hard size cap to SharePoint SDK fallback downloads and skip files that exceed it. Stream embedded images during extraction to reduce peak memory and avoid duplicating large buffers.

  • Bug Fixes

    • Enforce the existing size limit during SDK streaming downloads; skip oversized files instead of fetching them fully.
    • Align SDK fallback behavior with the HTTP download path’s size cap.
  • Refactors

    • Introduce an image_callback for PDF/DOCX extraction and use it in the SharePoint connector to store images incrementally (see the sketch after this list).
    • Stop materializing all embedded images in memory to lower peak usage on large documents.
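
A sketch of the callback-based flow under those refactors, assuming the extract_text_and_images call shape shown later in this review; save_image_section is a hypothetical stand-in for however the connector stores each image:

    import io

    def process_file(content_bytes: bytes, file_name: str):
        def store_image(img_bytes: bytes, img_name: str) -> None:
            # Invoked once per embedded image as it is extracted, so only the
            # current image is held in memory instead of a list of all of them.
            save_image_section(img_bytes, img_name)  # hypothetical helper

        # Text still comes back in the result; images were streamed out above.
        return extract_text_and_images(
            file=io.BytesIO(content_bytes),
            file_name=file_name,
            image_callback=store_image,
        )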

@evan-onyx evan-onyx requested a review from a team as a code owner September 15, 2025 20:38

vercel bot commented Sep 15, 2025

The latest updates on your projects.

Project | Deployment | Preview | Comments | Updated (UTC)
internal-search | Ready | Preview | Comment | Sep 16, 2025 0:57am

@greptile-apps greptile-apps bot (Contributor) left a comment

Greptile Summary

This PR addresses a critical memory issue in the SharePoint connector's SDK fallback mechanism. The changes introduce two key improvements:

File Size Cap Enforcement: The SharePoint connector previously had a vulnerability where its SDK fallback path (get_content() method) would download files of any size into memory without size checking. This PR adds a new _download_via_sdk_with_cap() function that implements streaming download with a hard byte limit using the SDK's download_session() method with chunk callbacks. When files exceed the configured size threshold, they are now properly skipped with appropriate error handling.
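
A caller-side sketch of that skip behavior inside the connector's fetch path, reusing the "size_cap_exceeded" sentinel from the diff; MAX_FILE_SIZE_BYTES and logger are assumed names:

    try:
        content_bytes = _download_via_sdk_with_cap(driveitem, max_bytes=MAX_FILE_SIZE_BYTES)
    except RuntimeError as e:
        if str(e) != "size_cap_exceeded":
            raise
        # Skip the oversized file rather than fetching it fully.
        logger.warning(f"Skipping {driveitem.name}: exceeds size cap")
        return None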

Memory-Optimized Image Processing: The embedded image extraction process was refactored to use a callback pattern instead of materializing all images into memory at once. Previously, documents with large embedded images could potentially double memory usage (e.g., a 19MB image document using ~38MB at peak). The new approach processes images one-by-one via an image_callback parameter added to the file processing pipeline (extract_text_and_images, read_pdf_file, docx_to_text_and_images).

These changes align the SDK fallback path with the same size protections that existed in the primary download path, and leverage existing callback infrastructure in the extract_text_and_images() API. The modifications ensure consistent memory management across all SharePoint download scenarios while maintaining backward compatibility through optional parameters.

Confidence score: 4/5

  • This PR addresses a real memory vulnerability with well-targeted fixes that follow existing patterns in the codebase
  • Score reflects solid implementation using established callback patterns and proper error handling, though the complexity of the streaming logic warrants careful testing
  • Pay close attention to the new _download_via_sdk_with_cap() function and the image callback integration points

2 files reviewed, 3 comments


- return doc.markdown, list(extract_docx_images(to_bytesio(file)))
+ # If a callback is provided, iterate and stream images without accumulating
+ try:
+     for img_file_bytes, file_name in extract_docx_images(to_bytesio(file)):
syntax: Variable name shadowing: the loop variable file_name shadows the function parameter on line 330

Suggested change
for img_file_bytes, file_name in extract_docx_images(to_bytesio(file)):
for img_file_bytes, img_file_name in extract_docx_images(to_bytesio(file)):

+1

raise RuntimeError("size_cap_exceeded")

# modifies the driveitem to change its download behavior
driveitem.download_session(buf, chunk_downloaded=on_chunk, chunk_size=chunk_size)
style: The comment says 'modifies the driveitem', but this is potentially misleading: it configures the download session's behavior rather than modifying the item itself.

Comment on lines 345 to +347
try:
# Use this to test the sdk size cap
# raise requests.RequestException("test")
syntax: Remove commented test code before merging to production.

Suggested change
- try:
-     # Use this to test the sdk size cap
-     # raise requests.RequestException("test")
+ try:
+     content_bytes = _download_with_cap(

Context Used: Rule - Remove temporary debugging code before merging to production, especially tenant-specific debugging logs. (link)

@cubic-dev-ai cubic-dev-ai bot (Contributor) left a comment

2 issues found across 2 files

Prompt for AI agents (all 2 issues)

Understand the root cause of the following 2 issues and fix them.


<file name="backend/onyx/connectors/sharepoint/connector.py">

<violation number="1" location="backend/onyx/connectors/sharepoint/connector.py:411">
Embedded images are now appended before the text section due to streaming callback, reversing prior section order; this may affect consumers relying on text-first ordering.</violation>
</file>

<file name="backend/onyx/file_processing/extract_file_text.py">

<violation number="1" location="backend/onyx/file_processing/extract_file_text.py:300">
Guard image_callback in read_pdf_file with try/except to prevent a callback error from aborting the entire PDF extraction.</violation>
</file>


image_section.link = driveitem.web_url
sections.append(image_section)

extraction_result = extract_text_and_images(
@cubic-dev-ai cubic-dev-ai bot commented Sep 15, 2025

Embedded images are now appended before the text section due to streaming callback, reversing prior section order; this may affect consumers relying on text-first ordering.

Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 411:

<comment>Embedded images are now appended before the text section due to streaming callback, reversing prior section order; this may affect consumers relying on text-first ordering.</comment>

<file context>
@@ -370,24 +398,27 @@ def _convert_driveitem_to_document_with_permissions(
             image_section.link = driveitem.web_url
             sections.append(image_section)
 
+        extraction_result = extract_text_and_images(
+            file=io.BytesIO(content_bytes),
+            file_name=driveitem.name,
</file context>
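
One way to restore text-first ordering inside _convert_driveitem_to_document_with_permissions, sketched with hypothetical helpers (build_image_section, make_text_section): buffer image sections inside the callback and append them only after the text section:

    image_sections = []

    def on_image(img_bytes: bytes, img_name: str) -> None:
        # Buffer rather than appending to `sections` directly, so the text
        # section can still be added first.
        image_section = build_image_section(img_bytes, img_name)  # hypothetical
        image_section.link = driveitem.web_url
        image_sections.append(image_section)

    extraction_result = extract_text_and_images(
        file=io.BytesIO(content_bytes),
        file_name=driveitem.name,
        image_callback=on_image,
    )
    sections.append(make_text_section(extraction_result))  # hypothetical
    sections.extend(image_sections)  # images follow the text, as before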

- extracted_images.append((img_bytes, image_name))
+ if image_callback is not None:
+     # Stream image out immediately
+     image_callback(img_bytes, image_name)
@cubic-dev-ai cubic-dev-ai bot commented Sep 15, 2025

Guard image_callback in read_pdf_file with try/except to prevent a callback error from aborting the entire PDF extraction.

Prompt for AI agents
Address the following comment on backend/onyx/file_processing/extract_file_text.py at line 300:

<comment>Guard image_callback in read_pdf_file with try/except to prevent a callback error from aborting the entire PDF extraction.</comment>

<file context>
@@ -292,7 +295,11 @@ def read_pdf_file(
-                    extracted_images.append((img_bytes, image_name))
+                    if image_callback is not None:
+                        # Stream image out immediately
+                        image_callback(img_bytes, image_name)
+                    else:
+                        extracted_images.append((img_bytes, image_name))
</file context>
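
A minimal sketch of the suggested guard, assuming a module-level logger:

    if image_callback is not None:
        try:
            # Stream the image out immediately; one failing callback should
            # not abort extraction of the remaining pages.
            image_callback(img_bytes, image_name)
        except Exception:
            logger.exception(f"image_callback failed for {image_name}")
    else:
        extracted_images.append((img_bytes, image_name))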


blacksmith-sh bot commented Sep 16, 2025

5 Jobs Failed:

Run Integration Tests v2 / build-backend-image failed on "Login to Private Registry"
[...]
env:
  PRIVATE_REGISTRY: experimental-registry.blacksmith.sh:5000
  PRIVATE_REGISTRY_USERNAME: ***
  PRIVATE_REGISTRY_PASSWORD: ***
  OPENAI_API_KEY: ***
  SLACK_BOT_TOKEN: ***
  CONFLUENCE_TEST_SPACE_URL: ***
  CONFLUENCE_USER_NAME: ***
  CONFLUENCE_ACCESS_TOKEN: ***
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: ***/onyx
Logging into experimental-registry.blacksmith.sh:5000...
Error: Error response from daemon: login attempt to https://experimental-registry.blacksmith.sh:5000/v2/ failed with status: 401 Unauthorized
Run Integration Tests v2 / build-integration-image failed on "Login to Private Registry"
[...]
Logging into experimental-registry.blacksmith.sh:5000...
Error: Error response from daemon: login attempt to https://experimental-registry.blacksmith.sh:5000/v2/ failed with status: 401 Unauthorized
Run Integration Tests v2 / build-model-server-image failed on "Login to Private Registry"
[...]
Logging into experimental-registry.blacksmith.sh:5000...
Error: Error response from daemon: login attempt to https://experimental-registry.blacksmith.sh:5000/v2/ failed with status: 401 Unauthorized
Run Integration Tests v2 / required failed on "Run actions/github-script@v7"
[...]
  retry-exempt-status-codes: 400,401,403,404,422
Error: One or more upstream jobs failed or were cancelled.

1 job failed running on non-Blacksmith runners.


Summary: 6 successful workflows, 2 failed workflows

Last updated: 2025-09-16 01:21:21 UTC

@evan-onyx evan-onyx merged commit 4c7a2e4 into main Sep 16, 2025
25 of 34 checks passed
@evan-onyx evan-onyx deleted the fix/sharepoint-docfetching-sdk-mem branch September 16, 2025 01:24