
Conversation

evan-onyx
Contributor

@evan-onyx evan-onyx commented Sep 17, 2025

Description

Fixes https://linear.app/danswer/issue/DAN-2574/excrel-file-size-blowup
This branch was originally used to add memory logging to the SharePoint connector to track down OOMs. The utility function is left in (commented out), and for convenience this PR now also includes the fix for our recent SharePoint OOMs.

There is a known issue with MarkitDown (microsoft/markitdown#183) that causes Excel file sizes to blow up during parsing. Our previous code handled this, and for now the best fix is to revert to that approach. We should check back on the issue in a few months to see whether we can return to the unified approach for Office files that we attempted before.
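
For illustration only: the extraction helpers this PR reverts to are not shown in this thread, but the "previous approach" amounts to routing Excel files to a dedicated reader instead of MarkItDown. A minimal sketch, assuming openpyxl in read-only mode (the function and parameter names here are hypothetical):

from io import BytesIO

from openpyxl import load_workbook


def extract_xlsx_text(file_bytes: bytes) -> str:
    """Pull cell text from an .xlsx file without going through MarkItDown."""
    # read_only avoids materializing full cell objects; data_only resolves formulas to cached values
    workbook = load_workbook(BytesIO(file_bytes), read_only=True, data_only=True)
    lines: list[str] = []
    for sheet in workbook.worksheets:
        for row in sheet.iter_rows(values_only=True):
            cells = [str(cell) for cell in row if cell is not None]
            if cells:
                lines.append("\t".join(cells))
    return "\n".join(lines)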

How Has This Been Tested?

tested in UI

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Summary by cubic

Adds detailed memory usage logging across the SharePoint connector and docprocessing pipeline to diagnose OOM issues. Introduces a memory_logger utility to record RSS/VMS and deep object sizes at key points.

  • New Features

    • Added onyx.utils.memory_logger.log_memory_usage to log RSS/VMS and optional deep object size (a usage sketch follows this list).
    • Instrumented SharePoint connector: HTTP/SDK downloads, text/image extraction, drive/page discovery and pagination, document conversion, and checkpoint processing.
    • Instrumented docprocessing tasks around batch load and indexing pipeline execution.
    • Throttled logs in large loops (e.g., every N items) to limit noise. No functional behavior changes.
  • Dependencies

    • Added pympler for deep object size measurement.
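
For readers unfamiliar with the utility, here is a minimal sketch of what log_memory_usage does, reconstructed from the snippets quoted in the review comments below; parameter names are approximations since the full module is not reproduced in this thread.

from typing import Any

import psutil
from pympler import asizeof

from onyx.utils.logger import setup_logger

logger = setup_logger()


def log_memory_usage(
    checkpoint: str,
    specific_object: Any = None,
    object_name: str = "object",
) -> None:
    """Log process RSS/VMS and, optionally, the deep size of one object."""
    mem = psutil.Process().memory_info()
    message = (
        f"[memory] {checkpoint}: "
        f"RSS={mem.rss / (1024 * 1024):.1f}MB VMS={mem.vms / (1024 * 1024):.1f}MB"
    )

    if specific_object is not None:
        try:
            # recursively calculate the size of the object (can be expensive)
            obj_size = asizeof.asizeof(specific_object)
            message += f" {object_name}={obj_size / (1024 * 1024):.2f}MB"
        except Exception:
            logger.exception(f"Failed to measure size of {object_name}")

    logger.info(message)

Throttled call sites in large loops then look roughly like the following, where items and the label are placeholders:

for i, item in enumerate(items):
    if i % 10 == 0:  # log every 10th iteration to limit noise
        log_memory_usage(f"process_items:iteration_{i}")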

@evan-onyx evan-onyx requested a review from a team as a code owner September 17, 2025 20:31

vercel bot commented Sep 17, 2025

The latest updates on your projects.

Project Deployment Preview Comments Updated (UTC)
internal-search Ready Ready Preview Comment Sep 20, 2025 0:04am

@evan-onyx evan-onyx force-pushed the debug/sharepoint-mem-usage branch from 3c5f80d to d6786f1 Compare September 17, 2025 20:32
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR adds extensive memory logging infrastructure to debug out-of-memory (OOM) issues specifically affecting the SharePoint connector. The changes introduce a new utility module memory_logger.py that provides memory monitoring capabilities using psutil for process-level memory tracking and pympler for object-level size measurement.

The implementation adds the pympler==1.1 dependency to enable deep object size analysis and creates a log_memory_usage() function that logs both RSS/VMS memory statistics and optionally measures specific object sizes. This utility is then extensively instrumented throughout the SharePoint connector (sharepoint/connector.py) at critical execution points including file downloads, text extraction, document conversion, site page processing, and pagination operations.

Additionally, the document processing pipeline (docprocessing/tasks.py) receives memory logging at key stages like batch loading, indexing pipeline execution, and exception handling. The logging frequency varies by operation type - every 5-10 iterations for loops, and at start/end points for major operations - providing granular visibility into memory consumption patterns.

This debugging infrastructure integrates with the existing codebase by leveraging the current logging framework and adding monitoring points at logical boundaries where memory usage could spike. The implementation follows the fail-loud principle by including proper error handling for both memory measurement operations.

Confidence score: 2/5

  • This PR introduces temporary debugging code that should be removed before production deployment
  • Score reflects the extensive debugging instrumentation that could impact performance and the temporary nature of the changes
  • Pay close attention to the memory_logger.py utility and all SharePoint connector instrumentation points

Context used:

Rule - Remove temporary debugging code before merging to production, especially tenant-specific debugging logs. (link)

4 files reviewed, 2 comments


if specific_object is not None:
    try:
        # recursively calculate the size of the object
        obj_size = asizeof.asizeof(specific_object)
Contributor


style: asizeof.asizeof() can be expensive for large nested objects as it recursively traverses all references. Consider adding a size limit or timeout for very large objects to prevent the debugging itself from causing performance issues.
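
One way to act on this, sketched here as an assumption rather than something the PR implements: gate the deep measurement behind a cheap length check so oversized containers are never traversed (safe_deep_size and the threshold are hypothetical names).

from typing import Any

from pympler import asizeof

MAX_DEEP_SIZE_ITEMS = 10_000  # illustrative cutoff


def safe_deep_size(obj: Any) -> int | None:
    """Deep-size an object in bytes, skipping containers that look too large."""
    try:
        if hasattr(obj, "__len__") and len(obj) > MAX_DEEP_SIZE_ITEMS:
            return None  # too many elements; recursive traversal would be slow
        return asizeof.asizeof(obj)
    except Exception:
        return None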

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


6 issues found across 4 files

Prompt for AI agents (all 6 issues)

Understand the root cause of the following 6 issues and fix them.


<file name="backend/onyx/connectors/sharepoint/connector.py">

<violation number="1" location="backend/onyx/connectors/sharepoint/connector.py:850">
Avoid deep-size logging of driveitems during checkpoint processing; it adds significant overhead.</violation>

<violation number="2" location="backend/onyx/connectors/sharepoint/connector.py:899">
Repeated deep-size logging of the growing accumulator list inside the loop is costly; log only counts or omit the object.</violation>

<violation number="3" location="backend/onyx/connectors/sharepoint/connector.py:1188">
Deep-size logging of accumulated pages during pagination is expensive; avoid passing all_pages to log_memory_usage.</violation>

<violation number="4" location="backend/onyx/connectors/sharepoint/connector.py:1252">
Client-secret auth never sets sp_tenant_domain, causing RuntimeError when retrieving slim documents.</violation>
</file>

<file name="backend/onyx/utils/memory_logger.py">

<violation number="1" location="backend/onyx/utils/memory_logger.py:5">
Avoid importing pympler at module import time; lazily import inside the conditional to prevent hard dependency and reduce overhead.</violation>
</file>

<file name="backend/onyx/background/celery/tasks/docprocessing/tasks.py">

<violation number="1" location="backend/onyx/background/celery/tasks/docprocessing/tasks.py:1406">
Recursive size logging on index_pipeline_result traverses Exception tracebacks and can massively degrade or hang processing</violation>
</file>


driveitems = query.execute_query()
log_memory_usage(
    f"_get_drive_items_for_drive_name:after_query:{drive_name}",
    driveitems,
Contributor

@cubic-dev-ai cubic-dev-ai bot Sep 17, 2025


Avoid deep-size logging of driveitems during checkpoint processing; it adds significant overhead.

Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 850:

<comment>Avoid deep-size logging of driveitems during checkpoint processing; it adds significant overhead.</comment>

<file context>
@@ -757,11 +837,19 @@ def _get_drive_items_for_drive_name(
                 driveitems = query.execute_query()
+                log_memory_usage(
+                    f"_get_drive_items_for_drive_name:after_query:{drive_name}",
+                    driveitems,
+                    "driveitems_list",
+                )
</file context>
Suggested change:
- driveitems,
+ None,

✅ Addressed in d5990d5

# We should process each page incrementally to avoid unbounded growth.
while "@odata.nextLink" in pages_data:
    log_memory_usage(
        "_fetch_site_pages:before_pagination", all_pages, "accumulated_pages"
Contributor

@cubic-dev-ai cubic-dev-ai bot Sep 17, 2025


Deep-size logging of accumulated pages during pagination is expensive; avoid passing all_pages to log_memory_usage.

Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 1188:

<comment>Deep-size logging of accumulated pages during pagination is expensive; avoid passing all_pages to log_memory_usage.</comment>

<file context>
@@ -1012,25 +1178,41 @@ def _fetch_site_pages(
         #       We should process each page incrementally to avoid unbounded growth.
         while "@odata.nextLink" in pages_data:
+            log_memory_usage(
+                "_fetch_site_pages:before_pagination", all_pages, "accumulated_pages"
+            )
             next_url = pages_data["@odata.nextLink"]
</file context>
Suggested change:
- "_fetch_site_pages:before_pagination", all_pages, "accumulated_pages"
+ "_fetch_site_pages:before_pagination"

✅ Addressed in d5990d5
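
The count-only alternative described above could look like the sketch below; all_pages and the checkpoint label come from the diff, while folding the length into the label is illustrative rather than what the follow-up commit necessarily does.

# log only the page count; nothing is passed in for deep sizing
log_memory_usage(
    f"_fetch_site_pages:before_pagination:page_count={len(all_pages)}"
)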

final_driveitems = list(driveitems)
log_memory_usage(
    f"_get_drive_items_for_drive_name:end:{drive_name}",
    final_driveitems,
Contributor

@cubic-dev-ai cubic-dev-ai bot Sep 17, 2025


Repeated deep-size logging of the growing accumulator list inside the loop is costly; log only counts or omit the object.

Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 899:

<comment>Repeated deep-size logging of the growing accumulator list inside the loop is costly; log only counts or omit the object.</comment>

<file context>
@@ -805,7 +893,13 @@ def _get_drive_items_for_drive_name(
+                final_driveitems = list(driveitems)
+                log_memory_usage(
+                    f"_get_drive_items_for_drive_name:end:{drive_name}",
+                    final_driveitems,
+                    "final_driveitems",
+                )
</file context>

✅ Addressed in d5990d5

from typing import Any

import psutil
from pympler import asizeof
Contributor

@cubic-dev-ai cubic-dev-ai bot Sep 17, 2025


Avoid importing pympler at module import time; lazily import inside the conditional to prevent hard dependency and reduce overhead.

Prompt for AI agents
Address the following comment on backend/onyx/utils/memory_logger.py at line 5:

<comment>Avoid importing pympler at module import time; lazily import inside the conditional to prevent hard dependency and reduce overhead.</comment>

<file context>
@@ -0,0 +1,49 @@
+from typing import Any
+
+import psutil
+from pympler import asizeof
+
+from onyx.utils.logger import setup_logger
</file context>

✅ Addressed in d5990d5
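
A sketch of the lazy-import shape suggested here (the follow-up commit may differ): pympler is only imported when deep sizing is actually requested, so module import stays cheap and the dependency becomes optional. _deep_size_mb is a hypothetical helper name.

from typing import Any

from onyx.utils.logger import setup_logger

logger = setup_logger()


def _deep_size_mb(obj: Any) -> float | None:
    """Deep-size an object in MB, or None if pympler is unavailable."""
    try:
        from pympler import asizeof  # deferred: avoids a hard dependency at import time
    except ImportError:
        logger.warning("pympler not installed; skipping deep object size")
        return None
    return asizeof.asizeof(obj) / (1024 * 1024)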


log_memory_usage(
    f"docprocessing_task:after_indexing_pipeline:batch_{batch_num}:attempt_{index_attempt_id}",
    index_pipeline_result,
Contributor

@cubic-dev-ai cubic-dev-ai bot Sep 17, 2025


Recursive size logging on index_pipeline_result traverses Exception tracebacks and can massively degrade or hang processing

Prompt for AI agents
Address the following comment on backend/onyx/background/celery/tasks/docprocessing/tasks.py at line 1406:

<comment>Recursive size logging on index_pipeline_result traverses Exception tracebacks and can massively degrade or hang processing</comment>

<file context>
@@ -1381,6 +1401,12 @@ def _docprocessing_task(

+            log_memory_usage(
+                f"docprocessing_task:after_indexing_pipeline:batch_{batch_num}:attempt_{index_attempt_id}",
+                index_pipeline_result,
+                f"batch_{batch_num}_pipeline_result",
+            )
</file context>
Suggested change:
- index_pipeline_result,
+ None,

✅ Addressed in d5990d5

        return token

    def _fetch_slim_documents_from_sharepoint(self) -> GenerateSlimDocumentOutput:
        log_memory_usage("_fetch_slim_documents_from_sharepoint:start")
Contributor

@cubic-dev-ai cubic-dev-ai bot Sep 17, 2025


Client-secret auth never sets sp_tenant_domain, causing RuntimeError when retrieving slim documents.

Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 1252:

<comment>Client-secret auth never sets sp_tenant_domain, causing RuntimeError when retrieving slim documents.</comment>

<file context>
@@ -1061,11 +1249,22 @@ def _acquire_token(self) -> dict[str, Any]:
         return token

     def _fetch_slim_documents_from_sharepoint(self) -> GenerateSlimDocumentOutput:
+        log_memory_usage("_fetch_slim_documents_from_sharepoint:start")
         site_descriptors = self.site_descriptors or self.fetch_sites()
 
</file context>

@evan-onyx evan-onyx force-pushed the debug/sharepoint-mem-usage branch from d5990d5 to 68e8a6e Compare September 19, 2025 21:00
@evan-onyx evan-onyx changed the title from "sharepoint memory issue debugging" to "fix: sharepoint memory via excel parsing" on Sep 19, 2025
Contributor

@justin-tahara justin-tahara left a comment


Some small nits that I would like to have resolved before we merge.

Contributor Author

evan-onyx commented Sep 19, 2025

3 Jobs Failed:

Run Integration Tests v2 / integration-tests (tests/connector, tests-connector) failed on "Run Integration Tests for tests-connector"
[...]
------------------------------ Captured log setup ------------------------------
INFO     onyx.utils.logger:reset.py:414 Resetting Postgres...
INFO     onyx.utils.logger:reset.py:260 Downgrading Postgres... (1/10)
INFO     onyx.utils.logger:reset.py:286 Upgrading Postgres...
INFO     onyx.utils.logger:reset.py:289 Setting up Postgres...
NOTICE   onyx.utils.logger:setup.py:289 Verifying default connector/credential exist.
NOTICE   onyx.utils.logger:setup.py:295 Loading input prompts and user folders
INFO     onyx.utils.logger:reset.py:416 Resetting Vespa...
NOTICE   onyx.utils.logger:swap_index.py:85 Vespa index swap (attempt 1/10)...
NOTICE   onyx.utils.logger:index.py:239 Deploying Vespa application package to http://index:19071/application/v2/tenant/default/prepareandactivate
NOTICE   onyx.utils.logger:swap_index.py:96 Vespa index swap complete.
NOTICE   onyx.utils.logger:setup.py:258 Setting up Vespa (attempt 1/10)...
NOTICE   onyx.utils.logger:index.py:239 Deploying Vespa application package to http://index:19071/application/v2/tenant/default/prepareandactivate
NOTICE   onyx.utils.logger:setup.py:274 Vespa setup complete.
INFO     onyx.utils.logger:reset.py:418 Resetting FileStore...
=========================== short test summary info ============================
FAILED tests/integration/tests/connector/test_connector_creation.py::test_connector_pause_while_indexing
=================== 1 failed, 4 passed in 304.05s (0:05:04) ====================
Error: Final attempt failed. Child_process exited with error code 1
Run Integration Tests v2 / required failed on "Run actions/github-script@v7"
[...]
  retry-exempt-status-codes: 400,401,403,404,422
env:
  PRIVATE_REGISTRY: experimental-registry.blacksmith.sh:5000
  PRIVATE_REGISTRY_USERNAME: ***
  PRIVATE_REGISTRY_PASSWORD: ***
  OPENAI_API_KEY: ***
  SLACK_BOT_TOKEN: ***
  CONFLUENCE_TEST_SPACE_URL: ***
  CONFLUENCE_USER_NAME: ***
  CONFLUENCE_ACCESS_TOKEN: ***
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: ***/onyx
Error: One or more upstream jobs failed or were cancelled.

1 job failed running on non-Blacksmith runners.


Summary: 7 successful workflows, 2 failed workflows

Last updated: 2025-09-20 00:43:57 UTC

@evan-onyx evan-onyx merged commit f4d135d into main Sep 20, 2025
51 of 53 checks passed
@evan-onyx evan-onyx deleted the debug/sharepoint-mem-usage branch September 20, 2025 00:10