don't fail on fake files #4735

evan-onyx · 2025-05-19T21:30:30Z

Description

https://linear.app/danswer/issue/DAN-1991/drive-pptx-and-xlsx-invalid-zipfile
Some .xlsx and .pptx files are invalid because they are actually recovery files rather than real documents. The connector does not need to fail in these cases, so this PR adds limited error catching that lets us skip these files.

How Has This Been Tested?

n/a, should be strictly an improvement as it's just error handling

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

This PR should be backported (make sure to check that the backport attempt succeeds)
[Optional] Override Linear Check

vercel · 2025-05-19T21:30:34Z

The latest updates on your projects.

Name	Status	Preview	Comments	Updated (UTC)
internal-search	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	May 19, 2025 9:40pm

greptile-apps

PR Summary

Added error handling for invalid Excel (.xlsx) and PowerPoint (.pptx) files, particularly recovery files, to prevent connector failures while maintaining proper error logging.

Added BadZipFile exception handling in /backend/onyx/file_processing/extract_file_text.py for both pptx_to_text and xlsx_to_text functions
Implemented special debug-level logging for Excel recovery files (starting with '~') in /backend/onyx/file_processing/extract_file_text.py
Added file_name parameter propagation through extract_text_and_images for more informative error messages
Modified /backend/onyx/connectors/google_drive/doc_conversion.py to only return text sections when extraction succeeds
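The handling summarized above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `list_ooxml_parts` is a hypothetical stand-in for `pptx_to_text` / `xlsx_to_text`, and it uses only the standard-library `zipfile` module in place of the real Office parsers. It shows the shape of the change: catch `BadZipFile`, log at debug level for names starting with `~`, log a warning otherwise, and return an empty result so the caller can skip the file.

```python
import io
import logging
import zipfile
from typing import IO

logger = logging.getLogger(__name__)


def list_ooxml_parts(file: IO[bytes], file_name: str = "") -> list[str]:
    """Hypothetical stand-in for pptx_to_text / xlsx_to_text.

    Returns the archive member names, or an empty list when the bytes
    are not a valid zip container (for example, a recovery file).
    """
    try:
        with zipfile.ZipFile(file) as archive:
            return archive.namelist()
    except zipfile.BadZipFile as e:
        error_str = f"Failed to open {file_name or 'file'} as a zip archive: {e}"
        if file_name.startswith("~"):
            # Assumption from the PR: a leading "~" marks a recovery file.
            logger.debug(error_str + " (this is expected for files with ~)")
        else:
            logger.warning(error_str)
        return []
```

A caller can then treat an empty result as "nothing extracted" and return no text sections for that file, which is the behavior the summary describes for doc_conversion.py.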

2 file(s) reviewed, 1 comment(s)

greptile-apps · 2025-05-19T21:31:09Z

backend/onyx/file_processing/extract_file_text.py

+        if file_name.startswith("~"):
+            logger.debug(error_str + " (this is expected for files with ~)")
+        else:
+            logger.warning(error_str)


style: Consider using a constant for the '~' prefix to make the code more maintainable and document its significance (recovery file indicator)
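One way the suggestion could look, as a sketch only: the constant name `RECOVERY_FILE_PREFIX` and the helpers `is_recovery_file` and `log_invalid_file` are hypothetical names chosen for illustration and do not appear in the PR.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical constant name. "~" is the prefix the diff above treats as
# the marker of a recovery file.
RECOVERY_FILE_PREFIX = "~"


def is_recovery_file(file_name: str) -> bool:
    """Return True when the name carries the recovery-file prefix."""
    return file_name.startswith(RECOVERY_FILE_PREFIX)


def log_invalid_file(file_name: str, error_str: str) -> int:
    """Log at debug level for recovery files, warning otherwise.

    Returns the logging level that was used.
    """
    if is_recovery_file(file_name):
        level = logging.DEBUG
        error_str += f" (this is expected for files with {RECOVERY_FILE_PREFIX})"
    else:
        level = logging.WARNING
    logger.log(level, error_str)
    return level
```

Naming the prefix keeps the check and the log message in sync if the marker ever changes, and the comment on the constant documents why `~` is special.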

* don't fail on fake files * solve at the source * oops * oops2

evan-onyx added 2 commits May 19, 2025 14:21

don't fail on fake files

f4a8a0f

solve at the source

33b8140

evan-onyx marked this pull request as ready for review May 19, 2025 21:30

evan-onyx requested a review from a team as a code owner May 19, 2025 21:30

greptile-apps bot reviewed May 19, 2025

evan-onyx added 2 commits May 19, 2025 14:32

oops

bab7f3f

oops2

62937a6

vercel bot deployed to Preview May 19, 2025 21:40 View deployment

Weves approved these changes May 19, 2025

Weves added this pull request to the merge queue May 19, 2025

Merged via the queue into main with commit b60884d May 20, 2025
11 of 12 checks passed

Weves deleted the bugfix/drive-excel-recovery branch May 20, 2025 00:09

ferdinandl007 pushed a commit to ferdinandl007/onyx that referenced this pull request May 27, 2025

don't fail on fake files (onyx-dot-app#4735)

dbe51e2

aronszanto pushed a commit to aronszanto/onyx that referenced this pull request May 27, 2025

don't fail on fake files (onyx-dot-app#4735)

fd87d44

ZhipengHe pushed a commit to ZhipengHe/onyx that referenced this pull request Jun 6, 2025

don't fail on fake files (onyx-dot-app#4735)

6b0f802

AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025

don't fail on fake files (onyx-dot-app#4735)

eecc2da
