fix: sharepoint memory via excel parsing #5444
Conversation
Force-pushed from 3c5f80d to d6786f1 (compare)
Greptile Summary
This PR adds extensive memory logging infrastructure to debug out-of-memory (OOM) issues specifically affecting the SharePoint connector. The changes introduce a new utility module, `memory_logger.py`, which provides memory monitoring using `psutil` for process-level memory tracking and `pympler` for object-level size measurement.
The implementation adds the `pympler==1.1` dependency to enable deep object size analysis and creates a `log_memory_usage()` function that logs RSS/VMS memory statistics and optionally measures the size of a specific object. This utility is then used to instrument the SharePoint connector (`sharepoint/connector.py`) extensively at critical execution points, including file downloads, text extraction, document conversion, site page processing, and pagination.
Additionally, the document processing pipeline (`docprocessing/tasks.py`) receives memory logging at key stages such as batch loading, indexing pipeline execution, and exception handling. Logging frequency varies by operation type (every 5-10 iterations for loops, and at the start and end of major operations), providing granular visibility into memory consumption patterns.
This debugging infrastructure integrates with the existing codebase by leveraging the current logging framework and adding monitoring points at logical boundaries where memory usage could spike. The implementation follows the fail-loud principle by including error handling around both memory measurement operations.
Confidence score: 2/5
- This PR introduces temporary debugging code that should be removed before production deployment
- Score reflects the extensive debugging instrumentation that could impact performance and the temporary nature of the changes
- Pay close attention to the memory_logger.py utility and all SharePoint connector instrumentation points
Context used:
Rule - Remove temporary debugging code before merging to production, especially tenant-specific debugging logs. (link)
4 files reviewed, 2 comments
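For orientation, here is a minimal sketch of what a `log_memory_usage()` helper of this shape might look like, assuming `psutil` and `pympler` as described above; the exact signature and log format are assumptions, not the PR's actual code.

```python
from typing import Any

import psutil
from pympler import asizeof

from onyx.utils.logger import setup_logger

logger = setup_logger()


def log_memory_usage(
    label: str, specific_object: Any = None, object_name: str | None = None
) -> None:
    """Log process RSS/VMS and, optionally, the deep size of one object."""
    try:
        mem = psutil.Process().memory_info()
        logger.info(
            f"[memory] {label}: rss={mem.rss / 1024**2:.1f} MiB, "
            f"vms={mem.vms / 1024**2:.1f} MiB"
        )
    except Exception:
        logger.exception(f"[memory] {label}: failed to read process memory")
        return

    if specific_object is not None:
        try:
            # asizeof recursively sums the object plus everything it references
            obj_size = asizeof.asizeof(specific_object)
            logger.info(
                f"[memory] {label}: {object_name or 'object'} "
                f"deep size={obj_size / 1024**2:.1f} MiB"
            )
        except Exception:
            logger.exception(f"[memory] {label}: failed to measure object size")
```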
backend/onyx/utils/memory_logger.py (Outdated)

```python
if specific_object is not None:
    try:
        # recursively calculate the size of the object
        obj_size = asizeof.asizeof(specific_object)
```
style: asizeof.asizeof() can be expensive for large nested objects as it recursively traverses all references. Consider adding a size limit or timeout for very large objects to prevent the debugging itself from causing performance issues.
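One way to bound that cost (a sketch under the assumption that a simple length check is acceptable; not part of this PR): skip deep sizing for very large containers and let the caller log an element count instead.

```python
from collections.abc import Sized

from pympler import asizeof

# Hypothetical threshold; tune to taste.
MAX_ITEMS_FOR_DEEP_SIZE = 10_000


def safe_deep_size(obj: object) -> int | None:
    """Deep-size an object, skipping huge containers so the debugging
    instrumentation itself does not become the bottleneck."""
    if isinstance(obj, Sized) and len(obj) > MAX_ITEMS_FOR_DEEP_SIZE:
        return None  # caller can log len(obj) instead
    return asizeof.asizeof(obj)
```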
6 issues found across 4 files
Prompt for AI agents (all 6 issues)
Understand the root cause of the following 6 issues and fix them.
<file name="backend/onyx/connectors/sharepoint/connector.py">
<violation number="1" location="backend/onyx/connectors/sharepoint/connector.py:850">
Avoid deep-size logging of driveitems during checkpoint processing; it adds significant overhead.</violation>
<violation number="2" location="backend/onyx/connectors/sharepoint/connector.py:899">
Repeated deep-size logging of the growing accumulator list inside the loop is costly; log only counts or omit the object.</violation>
<violation number="3" location="backend/onyx/connectors/sharepoint/connector.py:1188">
Deep-size logging of accumulated pages during pagination is expensive; avoid passing all_pages to log_memory_usage.</violation>
<violation number="4" location="backend/onyx/connectors/sharepoint/connector.py:1252">
Client-secret auth never sets sp_tenant_domain, causing RuntimeError when retrieving slim documents.</violation>
</file>
<file name="backend/onyx/utils/memory_logger.py">
<violation number="1" location="backend/onyx/utils/memory_logger.py:5">
Avoid importing pympler at module import time; lazily import inside the conditional to prevent hard dependency and reduce overhead.</violation>
</file>
<file name="backend/onyx/background/celery/tasks/docprocessing/tasks.py">
<violation number="1" location="backend/onyx/background/celery/tasks/docprocessing/tasks.py:1406">
Recursive size logging on index_pipeline_result traverses Exception tracebacks and can massively degrade or hang processing</violation>
</file>
```python
driveitems = query.execute_query()
log_memory_usage(
    f"_get_drive_items_for_drive_name:after_query:{drive_name}",
    driveitems,
```
Avoid deep-size logging of driveitems during checkpoint processing; it adds significant overhead.
Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 850:
<comment>Avoid deep-size logging of driveitems during checkpoint processing; it adds significant overhead.</comment>
<file context>
@@ -757,11 +837,19 @@ def _get_drive_items_for_drive_name(
driveitems = query.execute_query()
+ log_memory_usage(
+ f"_get_drive_items_for_drive_name:after_query:{drive_name}",
+ driveitems,
+ "driveitems_list",
+ )
</file context>
Suggested change:
```diff
-    driveitems,
+    None,
```
✅ Addressed in d5990d5
```python
# We should process each page incrementally to avoid unbounded growth.
while "@odata.nextLink" in pages_data:
    log_memory_usage(
        "_fetch_site_pages:before_pagination", all_pages, "accumulated_pages"
```
Deep-size logging of accumulated pages during pagination is expensive; avoid passing all_pages to log_memory_usage.
Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 1188:
<comment>Deep-size logging of accumulated pages during pagination is expensive; avoid passing all_pages to log_memory_usage.</comment>
<file context>
@@ -1012,25 +1178,41 @@ def _fetch_site_pages(
# We should process each page incrementally to avoid unbounded growth.
while "@odata.nextLink" in pages_data:
+ log_memory_usage(
+ "_fetch_site_pages:before_pagination", all_pages, "accumulated_pages"
+ )
next_url = pages_data["@odata.nextLink"]
</file context>
"_fetch_site_pages:before_pagination", all_pages, "accumulated_pages" | |
"_fetch_site_pages:before_pagination" |
✅ Addressed in d5990d5
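The connector's own comment ("We should process each page incrementally to avoid unbounded growth") points at the bigger win here; a generator over `@odata.nextLink`, sketched below with hypothetical helper names rather than the connector's real ones, avoids keeping every page in memory at once.

```python
from collections.abc import Iterator
from typing import Any

import requests


def iter_site_pages(first_url: str, headers: dict[str, str]) -> Iterator[dict[str, Any]]:
    """Yield SharePoint site pages one at a time instead of accumulating them."""
    url: str | None = first_url
    while url:
        pages_data = requests.get(url, headers=headers, timeout=30).json()
        for page in pages_data.get("value", []):
            yield page  # caller processes and drops each page immediately
        url = pages_data.get("@odata.nextLink")
```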
```python
final_driveitems = list(driveitems)
log_memory_usage(
    f"_get_drive_items_for_drive_name:end:{drive_name}",
    final_driveitems,
```
Repeated deep-size logging of the growing accumulator list inside the loop is costly; log only counts or omit the object.
Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 899:
<comment>Repeated deep-size logging of the growing accumulator list inside the loop is costly; log only counts or omit the object.</comment>
<file context>
@@ -805,7 +893,13 @@ def _get_drive_items_for_drive_name(
+ final_driveitems = list(driveitems)
+ log_memory_usage(
+ f"_get_drive_items_for_drive_name:end:{drive_name}",
+ final_driveitems,
+ "final_driveitems",
+ )
</file context>
✅ Addressed in d5990d5
backend/onyx/utils/memory_logger.py (Outdated)

```python
from typing import Any

import psutil
from pympler import asizeof
```
Avoid importing pympler at module import time; lazily import inside the conditional to prevent hard dependency and reduce overhead.
Prompt for AI agents
Address the following comment on backend/onyx/utils/memory_logger.py at line 5:
<comment>Avoid importing pympler at module import time; lazily import inside the conditional to prevent hard dependency and reduce overhead.</comment>
<file context>
@@ -0,0 +1,49 @@
+from typing import Any
+
+import psutil
+from pympler import asizeof
+
+from onyx.utils.logger import setup_logger
</file context>
✅ Addressed in d5990d5
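A minimal sketch of the lazy-import pattern being requested here (assumed shape, not the merged code): the `pympler` import only happens on the code path that actually needs a deep size.

```python
def _deep_size(obj: object) -> int:
    # Imported lazily so that merely importing memory_logger (or calling it
    # without a specific_object) never requires pympler to be installed.
    from pympler import asizeof

    return asizeof.asizeof(obj)
```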
```python
log_memory_usage(
    f"docprocessing_task:after_indexing_pipeline:batch_{batch_num}:attempt_{index_attempt_id}",
    index_pipeline_result,
```
Recursive size logging on index_pipeline_result traverses Exception tracebacks and can massively degrade or hang processing
Prompt for AI agents
Address the following comment on backend/onyx/background/celery/tasks/docprocessing/tasks.py at line 1406:
<comment>Recursive size logging on index_pipeline_result traverses Exception tracebacks and can massively degrade or hang processing</comment>
<file context>
@@ -1381,6 +1401,12 @@ def _docprocessing_task(
+ log_memory_usage(
+ f"docprocessing_task:after_indexing_pipeline:batch_{batch_num}:attempt_{index_attempt_id}",
+ index_pipeline_result,
+ f"batch_{batch_num}_pipeline_result",
+ )
</file context>
Suggested change:
```diff
-    index_pipeline_result,
+    None,
```
✅ Addressed in d5990d5
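The underlying hazard: an exception's `__traceback__` keeps whole call frames, including their locals, alive, so recursively sizing a result that wraps an exception can walk far more than the result object itself. A defensive guard, sketched with assumed names:

```python
from pympler import asizeof


def deep_size_or_note(obj: object) -> str:
    """Describe an object's deep size, refusing to traverse exceptions."""
    if isinstance(obj, BaseException):
        # __traceback__ -> frames -> locals can reference huge object graphs
        return f"<exception {type(obj).__name__}: deep size skipped>"
    return f"{asizeof.asizeof(obj)} bytes"
```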
```python
return token


def _fetch_slim_documents_from_sharepoint(self) -> GenerateSlimDocumentOutput:
    log_memory_usage("_fetch_slim_documents_from_sharepoint:start")
```
Client-secret auth never sets sp_tenant_domain, causing RuntimeError when retrieving slim documents.
Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 1252:
<comment>Client-secret auth never sets sp_tenant_domain, causing RuntimeError when retrieving slim documents.</comment>
<file context>
@@ -1061,11 +1249,22 @@ def _acquire_token(self) -> dict[str, Any]:
return token
def _fetch_slim_documents_from_sharepoint(self) -> GenerateSlimDocumentOutput:
+ log_memory_usage("_fetch_slim_documents_from_sharepoint:start")
site_descriptors = self.site_descriptors or self.fetch_sites()
</file context>
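For context on the bug rather than the actual fix: the tenant domain can in principle be derived from any SharePoint site hostname (for example, contoso.sharepoint.com yields contoso). A hypothetical helper for the client-secret path, with assumed naming:

```python
from urllib.parse import urlparse


def tenant_domain_from_site_url(site_url: str) -> str | None:
    """Best-effort tenant extraction, e.g. 'https://contoso.sharepoint.com/sites/x' -> 'contoso'."""
    host = urlparse(site_url).hostname or ""
    if host.endswith(".sharepoint.com"):
        return host.split(".", 1)[0]
    return None
```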
Force-pushed from d5990d5 to 68e8a6e (compare)
Force-pushed from 68e8a6e to beb5aac (compare)
Some small nits that I would like to have resolved before we merge.
3 Jobs Failed:
- Run Integration Tests v2 / integration-tests (tests/connector, tests-connector) failed on "Run Integration Tests for tests-connector"
- Run Integration Tests v2 / required failed on "Run actions/github-script@v7"
- 1 job failed running on non-Blacksmith runners.
Summary: 7 successful workflows, 2 failed workflows
Last updated: 2025-09-20 00:43:57 UTC
Description
Fixes https://linear.app/danswer/issue/DAN-2574/excrel-file-size-blowup
This branch was originally used to add memory logging to the SharePoint connector to find OOMs. I left the utility function in (commented out), but for convenience this PR now also includes the fix for our recent SharePoint OOMs.
There is a known issue with MarkitDown (microsoft/markitdown#183) that causes Excel file sizes to explode during parsing. Our previous code handled this, and at the moment the best fix is to revert to the previous approach. We should check back on the issue in a few months to see whether we can return to the unified approach, since it would be ideal to handle Office files uniformly, as we tried to do before.
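For illustration only, since the actual revert lives in the diff: a streaming approach to Excel extraction, assuming openpyxl in read-only mode, keeps memory roughly proportional to a single row rather than to the fully expanded sheet that the MarkitDown issue produces.

```python
from io import BytesIO

from openpyxl import load_workbook


def xlsx_to_text(file_bytes: bytes) -> str:
    """Extract cell text from an xlsx file by streaming rows in read-only mode."""
    workbook = load_workbook(BytesIO(file_bytes), read_only=True, data_only=True)
    lines: list[str] = []
    for sheet in workbook.worksheets:
        for row in sheet.iter_rows(values_only=True):
            cells = [str(cell) for cell in row if cell is not None]
            if cells:
                lines.append(", ".join(cells))
    return "\n".join(lines)
```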
How Has This Been Tested?
tested in UI
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.
Summary by cubic
Adds detailed memory usage logging across the SharePoint connector and docprocessing pipeline to diagnose OOM issues. Introduces a memory_logger utility to record RSS/VMS and deep object sizes at key points.
New Features
Dependencies