don't fail on fake files #4735
PR Summary
Added error handling for invalid Excel (.xlsx) and PowerPoint (.pptx) files, particularly recovery files, to prevent connector failures while maintaining proper error logging.
- Added `BadZipFile` exception handling in `/backend/onyx/file_processing/extract_file_text.py` for both `pptx_to_text` and `xlsx_to_text` functions
- Implemented special debug-level logging for Excel recovery files (starting with '~') in `/backend/onyx/file_processing/extract_file_text.py`
- Added `file_name` parameter propagation through `extract_text_and_images` for more informative error messages
- Modified `/backend/onyx/connectors/google_drive/doc_conversion.py` to only return text sections when extraction succeeds
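As a rough illustration of the first bullet, here is a minimal sketch of catching `BadZipFile` around an archive-based extractor. It uses only the standard library; `safe_ooxml_to_text` is a hypothetical stand-in for `pptx_to_text` / `xlsx_to_text`, not the code in this PR, and listing archive members stands in for real text extraction.

```python
import io
import logging
import zipfile

logger = logging.getLogger(__name__)


def safe_ooxml_to_text(file: io.BytesIO, file_name: str = "") -> str:
    """Hypothetical stand-in for pptx_to_text / xlsx_to_text.

    .xlsx and .pptx files are zip archives, so a file that is not a
    valid archive raises zipfile.BadZipFile when opened.
    """
    try:
        with zipfile.ZipFile(file) as archive:
            # A real extractor would parse slides or sheets here;
            # this sketch only lists the archive members.
            return "\n".join(archive.namelist())
    except zipfile.BadZipFile as e:
        # Skip the file instead of failing the whole connector run.
        logger.warning(f"Failed to extract text from {file_name or 'file'}: {e}")
        return ""
```

Returning an empty string lets the caller decide whether to emit a text section at all, which is in line with the last bullet above.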
2 file(s) reviewed, 1 comment(s)
    if file_name.startswith("~"):
        logger.debug(error_str + " (this is expected for files with ~)")
    else:
        logger.warning(error_str)
style: Consider using a constant for the '~' prefix to make the code more maintainable and document its significance (recovery file indicator)
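One possible shape for that suggestion, as a sketch only: the names `RECOVERY_FILE_PREFIX`, `is_recovery_file`, and `log_extraction_failure` are hypothetical and do not appear in the PR, and `file_name` is assumed to be a bare file name rather than a full path.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical constant: the PR treats files starting with "~" as
# recovery files, which are expected to fail extraction.
RECOVERY_FILE_PREFIX = "~"


def is_recovery_file(file_name: str) -> bool:
    """Return True if the name carries the recovery-file prefix."""
    return file_name.startswith(RECOVERY_FILE_PREFIX)


def log_extraction_failure(file_name: str, error_str: str) -> int:
    """Log at debug level for recovery files, warning otherwise.

    Returns the logging level that was used.
    """
    if is_recovery_file(file_name):
        level = logging.DEBUG
        error_str += " (this is expected for files with ~)"
    else:
        level = logging.WARNING
    logger.log(level, error_str)
    return level
```

Naming the prefix keeps the check and its rationale in one place if the rule for recognizing recovery files ever changes.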
* don't fail on fake files * solve at the source * oops * oops2
Description
https://linear.app/danswer/issue/DAN-1991/drive-pptx-and-xlsx-invalid-zipfile
Some .xlsx and .pptx files are invalid because they are actually recovery files rather than real documents. The connector does not need to fail in these cases, so we add limited error catching that lets us skip these files.
How Has This Been Tested?
n/a. This should be strictly an improvement, since it only adds error handling.
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.