evan-onyx
Contributor

@evan-onyx evan-onyx commented May 19, 2025

Description

https://linear.app/danswer/issue/DAN-1991/drive-pptx-and-xlsx-invalid-zipfile
Some .xlsx and .pptx files are not valid zip archives because they are actually recovery files rather than real documents. The connector doesn't need to fail on these, so this PR adds limited error catching that lets us skip them.
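
To illustrate the failure mode, here is a minimal sketch that uses only the standard-library `zipfile` module as a stand-in for the real extractors. The helper name `is_zip_container` and the sample bytes are hypothetical and not taken from this PR; the only assumption carried over is that a valid .xlsx or .pptx is a zip container, while a recovery file is not.

```python
import io
import zipfile


def is_zip_container(data: bytes) -> bool:
    """Return True if the bytes can be opened as a zip archive."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)):
            return True
    except zipfile.BadZipFile:
        # Raised when the bytes do not contain a zip central directory.
        return False


# Build a tiny but real zip archive in memory.
_buf = io.BytesIO()
with zipfile.ZipFile(_buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
valid_bytes = _buf.getvalue()

# Hypothetical contents of a file that only carries an Office extension.
fake_bytes = b"not a zip archive"
```

Here `is_zip_container(valid_bytes)` succeeds, while `is_zip_container(fake_bytes)` hits the `BadZipFile` branch, which is the case this PR chooses to skip rather than fail on.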

How Has This Been Tested?

n/a, should be strictly an improvement as it's just error handling

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check


vercel bot commented May 19, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name | Status | Updated (UTC)
internal-search | ✅ Ready | May 19, 2025 9:40pm

@evan-onyx evan-onyx marked this pull request as ready for review May 19, 2025 21:30
@evan-onyx evan-onyx requested a review from a team as a code owner May 19, 2025 21:30
Contributor

@greptile-apps greptile-apps bot left a comment

PR Summary

Added error handling for invalid Excel (.xlsx) and PowerPoint (.pptx) files, particularly recovery files, to prevent connector failures while maintaining proper error logging.

  • Added BadZipFile exception handling in /backend/onyx/file_processing/extract_file_text.py for both pptx_to_text and xlsx_to_text functions
  • Implemented special debug-level logging for Excel recovery files (starting with '~') in /backend/onyx/file_processing/extract_file_text.py
  • Added file_name parameter propagation through extract_text_and_images for more informative error messages
  • Modified /backend/onyx/connectors/google_drive/doc_conversion.py to only return text sections when extraction succeeds
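
The changes summarized above could look roughly like the following sketch. It is not the code from this PR: `extract_text_or_none` and `build_sections` are hypothetical names, and the standard-library `zipfile` module stands in for the real .pptx/.xlsx parsers. It only shows the shape of the idea, namely passing `file_name` down for the log message and returning sections only when extraction succeeds.

```python
import io
import logging
import zipfile
from typing import List, Optional

logger = logging.getLogger(__name__)


def extract_text_or_none(data: bytes, file_name: str) -> Optional[str]:
    """Hypothetical stand-in for an extractor such as xlsx_to_text."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # Placeholder "text": the member names of the archive.
            return "\n".join(archive.namelist())
    except zipfile.BadZipFile as e:
        # file_name is passed in only so the log message is informative.
        logger.warning(f"Failed to extract text from {file_name}: {e}")
        return None


def build_sections(data: bytes, file_name: str) -> List[str]:
    """Return text sections only when extraction produced something."""
    text = extract_text_or_none(data, file_name)
    return [text] if text else []
```

With this shape, an invalid file yields an empty list of sections instead of an exception that stops the whole connector run.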

2 file(s) reviewed, 1 comment(s)

Comment on lines +358 to +361
if file_name.startswith("~"):
    logger.debug(error_str + " (this is expected for files with ~)")
else:
    logger.warning(error_str)

style: Consider using a constant for the '~' prefix to make the code more maintainable and document its significance (recovery file indicator)
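
One way the suggestion could be applied, as a sketch only: the constant name `RECOVERY_FILE_PREFIX` and the helpers `log_level_for` and `log_extraction_failure` are hypothetical and do not appear in the PR. The message suffix is copied from the snippet under review.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical constant: files whose names start with this prefix are
# treated as recovery files, so a failed extraction is expected.
RECOVERY_FILE_PREFIX = "~"


def log_level_for(file_name: str) -> int:
    """Pick DEBUG for recovery files and WARNING for everything else."""
    if file_name.startswith(RECOVERY_FILE_PREFIX):
        return logging.DEBUG
    return logging.WARNING


def log_extraction_failure(file_name: str, error_str: str) -> None:
    """Log an extraction failure at the level chosen for this file name."""
    level = log_level_for(file_name)
    if level == logging.DEBUG:
        error_str += " (this is expected for files with ~)"
    logger.log(level, error_str)
```

Keeping the level selection in its own function makes the recovery-file rule easy to test without inspecting log output.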

@Weves Weves added this pull request to the merge queue May 19, 2025
Merged via the queue into main with commit b60884d May 20, 2025
11 of 12 checks passed
@Weves Weves deleted the bugfix/drive-excel-recovery branch May 20, 2025 00:09
ferdinandl007 pushed a commit to ferdinandl007/onyx that referenced this pull request May 27, 2025
* don't fail on fake files

* solve at the source

* oops

* oops2
aronszanto pushed a commit to aronszanto/onyx that referenced this pull request May 27, 2025
ZhipengHe pushed a commit to ZhipengHe/onyx that referenced this pull request Jun 6, 2025
AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025