
Conversation

evan-onyx
Contributor

@evan-onyx evan-onyx commented Sep 17, 2025

Description

Fixes https://linear.app/danswer/issue/DAN-2574/excrel-file-size-blowup
This branch was originally used to add memory logging to the SharePoint connector to track down OOMs. The utility function is left in (commented out), and for convenience this PR now also includes the fix for our recent SharePoint OOMs.

There is a known issue with MarkitDown (microsoft/markitdown#183) that causes Excel file sizes to blow up during parsing. Our previous code handled this, and for now the best fix is to revert to that approach. We should check back on the issue in a few months to see whether we can return to the unified approach for Office files that we attempted before.
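
For illustration only: the extraction helpers this PR reverts to are not shown in this thread, but the "previous approach" amounts to routing Excel files to a dedicated reader instead of MarkItDown. A minimal sketch, assuming openpyxl in read-only mode (the function and parameter names here are hypothetical):

from io import BytesIO

from openpyxl import load_workbook


def extract_xlsx_text(file_bytes: bytes) -> str:
    """Pull cell text from an .xlsx file without going through MarkItDown."""
    # read_only avoids materializing full cell objects; data_only resolves formulas to cached values
    workbook = load_workbook(BytesIO(file_bytes), read_only=True, data_only=True)
    lines: list[str] = []
    for sheet in workbook.worksheets:
        for row in sheet.iter_rows(values_only=True):
            cells = [str(cell) for cell in row if cell is not None]
            if cells:
                lines.append("\t".join(cells))
    return "\n".join(lines)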

How Has This Been Tested?

tested in UI

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Summary by cubic

Adds detailed memory usage logging across the SharePoint connector and docprocessing pipeline to diagnose OOM issues. Introduces a memory_logger utility to record RSS/VMS and deep object sizes at key points.

  • New Features

    • Added onyx.utils.memory_logger.log_memory_usage to log RSS/VMS and optional deep object size (a usage sketch follows this list).
    • Instrumented SharePoint connector: HTTP/SDK downloads, text/image extraction, drive/page discovery and pagination, document conversion, and checkpoint processing.
    • Instrumented docprocessing tasks around batch load and indexing pipeline execution.
    • Throttled logs in large loops (e.g., every N items) to limit noise. No functional behavior changes.
  • Dependencies

    • Added pympler for deep object size measurement.
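
For readers unfamiliar with the utility, here is a minimal sketch of what log_memory_usage does, reconstructed from the snippets quoted in the review comments below; parameter names are approximations since the full module is not reproduced in this thread.

from typing import Any

import psutil
from pympler import asizeof

from onyx.utils.logger import setup_logger

logger = setup_logger()


def log_memory_usage(
    checkpoint: str,
    specific_object: Any = None,
    object_name: str = "object",
) -> None:
    """Log process RSS/VMS and, optionally, the deep size of one object."""
    mem = psutil.Process().memory_info()
    message = (
        f"[memory] {checkpoint}: "
        f"RSS={mem.rss / (1024 * 1024):.1f}MB VMS={mem.vms / (1024 * 1024):.1f}MB"
    )

    if specific_object is not None:
        try:
            # recursively calculate the size of the object (can be expensive)
            obj_size = asizeof.asizeof(specific_object)
            message += f" {object_name}={obj_size / (1024 * 1024):.2f}MB"
        except Exception:
            logger.exception(f"Failed to measure size of {object_name}")

    logger.info(message)

Throttled call sites in large loops then look roughly like the following, where items and the label are placeholders:

for i, item in enumerate(items):
    if i % 10 == 0:  # log every 10th iteration to limit noise
        log_memory_usage(f"process_items:iteration_{i}")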

@evan-onyx evan-onyx requested a review from a team as a code owner September 17, 2025 20:31

vercel bot commented Sep 17, 2025

The latest updates on your projects.

Project Deployment Preview Comments Updated (UTC)
internal-search Ready Ready Preview Comment Sep 20, 2025 0:04am

@evan-onyx evan-onyx force-pushed the debug/sharepoint-mem-usage branch from 3c5f80d to d6786f1 Compare September 17, 2025 20:32
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR adds extensive memory logging infrastructure to debug out-of-memory (OOM) issues specifically affecting the SharePoint connector. The changes introduce a new utility module memory_logger.py that provides memory monitoring capabilities using psutil for process-level memory tracking and pympler for object-level size measurement.

The implementation adds the pympler==1.1 dependency to enable deep object size analysis and creates a log_memory_usage() function that logs both RSS/VMS memory statistics and optionally measures specific object sizes. This utility is then extensively instrumented throughout the SharePoint connector (sharepoint/connector.py) at critical execution points including file downloads, text extraction, document conversion, site page processing, and pagination operations.

Additionally, the document processing pipeline (docprocessing/tasks.py) receives memory logging at key stages like batch loading, indexing pipeline execution, and exception handling. The logging frequency varies by operation type - every 5-10 iterations for loops, and at start/end points for major operations - providing granular visibility into memory consumption patterns.

This debugging infrastructure integrates with the existing codebase by leveraging the current logging framework and adding monitoring points at logical boundaries where memory usage could spike. The implementation follows the fail-loud principle by including proper error handling for both memory measurement operations.

Confidence score: 2/5

  • This PR introduces temporary debugging code that should be removed before production deployment
  • Score reflects the extensive debugging instrumentation that could impact performance and the temporary nature of the changes
  • Pay close attention to the memory_logger.py utility and all SharePoint connector instrumentation points

Context used:

Rule - Remove temporary debugging code before merging to production, especially tenant-specific debugging logs. (link)

4 files reviewed, 2 comments


if specific_object is not None:
    try:
        # recursively calculate the size of the object
        obj_size = asizeof.asizeof(specific_object)
Contributor


style: asizeof.asizeof() can be expensive for large nested objects as it recursively traverses all references. Consider adding a size limit or timeout for very large objects to prevent the debugging itself from causing performance issues.
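
One way to act on this, sketched here as an assumption rather than something the PR implements: gate the deep measurement behind a cheap length check so oversized containers are never traversed (safe_deep_size and the threshold are hypothetical names).

from typing import Any

from pympler import asizeof

MAX_DEEP_SIZE_ITEMS = 10_000  # illustrative cutoff


def safe_deep_size(obj: Any) -> int | None:
    """Deep-size an object in bytes, skipping containers that look too large."""
    try:
        if hasattr(obj, "__len__") and len(obj) > MAX_DEEP_SIZE_ITEMS:
            return None  # too many elements; recursive traversal would be slow
        return asizeof.asizeof(obj)
    except Exception:
        return None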

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


6 issues found across 4 files

Prompt for AI agents (all 6 issues)

Understand the root cause of the following 6 issues and fix them.


<file name="backend/onyx/connectors/sharepoint/connector.py">

<violation number="1" location="backend/onyx/connectors/sharepoint/connector.py:850">
Avoid deep-size logging of driveitems during checkpoint processing; it adds significant overhead.</violation>

<violation number="2" location="backend/onyx/connectors/sharepoint/connector.py:899">
Repeated deep-size logging of the growing accumulator list inside the loop is costly; log only counts or omit the object.</violation>

<violation number="3" location="backend/onyx/connectors/sharepoint/connector.py:1188">
Deep-size logging of accumulated pages during pagination is expensive; avoid passing all_pages to log_memory_usage.</violation>

<violation number="4" location="backend/onyx/connectors/sharepoint/connector.py:1252">
Client-secret auth never sets sp_tenant_domain, causing RuntimeError when retrieving slim documents.</violation>
</file>

<file name="backend/onyx/utils/memory_logger.py">

<violation number="1" location="backend/onyx/utils/memory_logger.py:5">
Avoid importing pympler at module import time; lazily import inside the conditional to prevent hard dependency and reduce overhead.</violation>
</file>

<file name="backend/onyx/background/celery/tasks/docprocessing/tasks.py">

<violation number="1" location="backend/onyx/background/celery/tasks/docprocessing/tasks.py:1406">
Recursive size logging on index_pipeline_result traverses Exception tracebacks and can massively degrade or hang processing</violation>
</file>


driveitems = query.execute_query()
log_memory_usage(
    f"_get_drive_items_for_drive_name:after_query:{drive_name}",
    driveitems,
Contributor

@cubic-dev-ai cubic-dev-ai bot Sep 17, 2025


Avoid deep-size logging of driveitems during checkpoint processing; it adds significant overhead.

Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 850:

<comment>Avoid deep-size logging of driveitems during checkpoint processing; it adds significant overhead.</comment>

<file context>
@@ -757,11 +837,19 @@ def _get_drive_items_for_drive_name(
                 driveitems = query.execute_query()
+                log_memory_usage(
+                    f"_get_drive_items_for_drive_name:after_query:{drive_name}",
+                    driveitems,
+                    "driveitems_list",
+                )
</file context>
Suggested change:
- driveitems,
+ None,

✅ Addressed in d5990d5

# We should process each page incrementally to avoid unbounded growth.
while "@odata.nextLink" in pages_data:
    log_memory_usage(
        "_fetch_site_pages:before_pagination", all_pages, "accumulated_pages"
Contributor

@cubic-dev-ai cubic-dev-ai bot Sep 17, 2025


Deep-size logging of accumulated pages during pagination is expensive; avoid passing all_pages to log_memory_usage.

Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 1188:

<comment>Deep-size logging of accumulated pages during pagination is expensive; avoid passing all_pages to log_memory_usage.</comment>

<file context>
@@ -1012,25 +1178,41 @@ def _fetch_site_pages(
         #       We should process each page incrementally to avoid unbounded growth.
         while "@odata.nextLink" in pages_data:
+            log_memory_usage(
+                "_fetch_site_pages:before_pagination", all_pages, "accumulated_pages"
+            )
             next_url = pages_data["@odata.nextLink"]
</file context>
Suggested change:
- "_fetch_site_pages:before_pagination", all_pages, "accumulated_pages"
+ "_fetch_site_pages:before_pagination"

✅ Addressed in d5990d5
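
The count-only alternative described above could look like the sketch below; all_pages and the checkpoint label come from the diff, while folding the length into the label is illustrative rather than what the follow-up commit necessarily does.

# log only the page count; nothing is passed in for deep sizing
log_memory_usage(
    f"_fetch_site_pages:before_pagination:page_count={len(all_pages)}"
)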

final_driveitems = list(driveitems)
log_memory_usage(
    f"_get_drive_items_for_drive_name:end:{drive_name}",
    final_driveitems,
Contributor

@cubic-dev-ai cubic-dev-ai bot Sep 17, 2025


Repeated deep-size logging of the growing accumulator list inside the loop is costly; log only counts or omit the object.

Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 899:

<comment>Repeated deep-size logging of the growing accumulator list inside the loop is costly; log only counts or omit the object.</comment>

<file context>
@@ -805,7 +893,13 @@ def _get_drive_items_for_drive_name(
+                final_driveitems = list(driveitems)
+                log_memory_usage(
+                    f"_get_drive_items_for_drive_name:end:{drive_name}",
+                    final_driveitems,
+                    "final_driveitems",
+                )
</file context>

✅ Addressed in d5990d5

from typing import Any

import psutil
from pympler import asizeof
Contributor

@cubic-dev-ai cubic-dev-ai bot Sep 17, 2025


Avoid importing pympler at module import time; lazily import inside the conditional to prevent hard dependency and reduce overhead.

Prompt for AI agents
Address the following comment on backend/onyx/utils/memory_logger.py at line 5:

<comment>Avoid importing pympler at module import time; lazily import inside the conditional to prevent hard dependency and reduce overhead.</comment>

<file context>
@@ -0,0 +1,49 @@
+from typing import Any
+
+import psutil
+from pympler import asizeof
+
+from onyx.utils.logger import setup_logger
</file context>

✅ Addressed in d5990d5
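
A sketch of the lazy-import shape suggested here (the follow-up commit may differ): pympler is only imported when deep sizing is actually requested, so module import stays cheap and the dependency becomes optional. _deep_size_mb is a hypothetical helper name.

from typing import Any

from onyx.utils.logger import setup_logger

logger = setup_logger()


def _deep_size_mb(obj: Any) -> float | None:
    """Deep-size an object in MB, or None if pympler is unavailable."""
    try:
        from pympler import asizeof  # deferred: avoids a hard dependency at import time
    except ImportError:
        logger.warning("pympler not installed; skipping deep object size")
        return None
    return asizeof.asizeof(obj) / (1024 * 1024)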


log_memory_usage(
    f"docprocessing_task:after_indexing_pipeline:batch_{batch_num}:attempt_{index_attempt_id}",
    index_pipeline_result,
Contributor

@cubic-dev-ai cubic-dev-ai bot Sep 17, 2025


Recursive size logging on index_pipeline_result traverses Exception tracebacks and can massively degrade or hang processing

Prompt for AI agents
Address the following comment on backend/onyx/background/celery/tasks/docprocessing/tasks.py at line 1406:

<comment>Recursive size logging on index_pipeline_result traverses Exception tracebacks and can massively degrade or hang processing</comment>

<file context>
@@ -1381,6 +1401,12 @@ def _docprocessing_task(

+            log_memory_usage(
+                f"docprocessing_task:after_indexing_pipeline:batch_{batch_num}:attempt_{index_attempt_id}",
+                index_pipeline_result,
+                f"batch_{batch_num}_pipeline_result",
+            )
</file context>
Suggested change:
- index_pipeline_result,
+ None,

✅ Addressed in d5990d5

        return token

    def _fetch_slim_documents_from_sharepoint(self) -> GenerateSlimDocumentOutput:
        log_memory_usage("_fetch_slim_documents_from_sharepoint:start")
Contributor

@cubic-dev-ai cubic-dev-ai bot Sep 17, 2025


Client-secret auth never sets sp_tenant_domain, causing RuntimeError when retrieving slim documents.

Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 1252:

<comment>Client-secret auth never sets sp_tenant_domain, causing RuntimeError when retrieving slim documents.</comment>

<file context>
@@ -1061,11 +1249,22 @@ def _acquire_token(self) -> dict[str, Any]:
         return token

     def _fetch_slim_documents_from_sharepoint(self) -> GenerateSlimDocumentOutput:
+        log_memory_usage("_fetch_slim_documents_from_sharepoint:start")
         site_descriptors = self.site_descriptors or self.fetch_sites()
 
</file context>

@evan-onyx evan-onyx force-pushed the debug/sharepoint-mem-usage branch from d5990d5 to 68e8a6e Compare September 19, 2025 21:00
@evan-onyx evan-onyx changed the title from "sharepoint memory issue debugging" to "fix: sharepoint memory via excel parsing" on Sep 19, 2025
Contributor

@justin-tahara justin-tahara left a comment


Some small nits that I would like to have resolved before we merge.

Contributor Author

evan-onyx commented Sep 19, 2025

3 Jobs Failed:

Run Integration Tests v2 / integration-tests (tests/connector, tests-connector) failed on "Run Integration Tests for tests-connector"
[...]
------------------------------ Captured log setup ------------------------------
INFO     onyx.utils.logger:reset.py:414 Resetting Postgres...
INFO     onyx.utils.logger:reset.py:260 Downgrading Postgres... (1/10)
INFO     onyx.utils.logger:reset.py:286 Upgrading Postgres...
INFO     onyx.utils.logger:reset.py:289 Setting up Postgres...
NOTICE   onyx.utils.logger:setup.py:289 Verifying default connector/credential exist.
NOTICE   onyx.utils.logger:setup.py:295 Loading input prompts and user folders
INFO     onyx.utils.logger:reset.py:416 Resetting Vespa...
NOTICE   onyx.utils.logger:swap_index.py:85 Vespa index swap (attempt 1/10)...
NOTICE   onyx.utils.logger:index.py:239 Deploying Vespa application package to http://index:19071/application/v2/tenant/default/prepareandactivate
NOTICE   onyx.utils.logger:swap_index.py:96 Vespa index swap complete.
NOTICE   onyx.utils.logger:setup.py:258 Setting up Vespa (attempt 1/10)...
NOTICE   onyx.utils.logger:index.py:239 Deploying Vespa application package to http://index:19071/application/v2/tenant/default/prepareandactivate
NOTICE   onyx.utils.logger:setup.py:274 Vespa setup complete.
INFO     onyx.utils.logger:reset.py:418 Resetting FileStore...
=========================== short test summary info ============================
FAILED tests/integration/tests/connector/test_connector_creation.py::test_connector_pause_while_indexing
=================== 1 failed, 4 passed in 304.05s (0:05:04) ====================
Error: Final attempt failed. Child_process exited with error code 1
Run Integration Tests v2 / required failed on "Run actions/github-script@v7"
[...]
  retry-exempt-status-codes: 400,401,403,404,422
env:
  PRIVATE_REGISTRY: experimental-registry.blacksmith.sh:5000
  PRIVATE_REGISTRY_USERNAME: ***
  PRIVATE_REGISTRY_PASSWORD: ***
  OPENAI_API_KEY: ***
  SLACK_BOT_TOKEN: ***
  CONFLUENCE_TEST_SPACE_URL: ***
  CONFLUENCE_USER_NAME: ***
  CONFLUENCE_ACCESS_TOKEN: ***
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: ***/onyx
Error: One or more upstream jobs failed or were cancelled.

1 job failed running on non-Blacksmith runners.


Summary: 7 successful workflows, 2 failed workflows

Last updated: 2025-09-20 00:43:57 UTC

@evan-onyx evan-onyx merged commit f4d135d into main Sep 20, 2025
51 of 53 checks passed
@evan-onyx evan-onyx deleted the debug/sharepoint-mem-usage branch September 20, 2025 00:10