add docx fallback #4983
PR Summary
Added a fallback mechanism to docx file processing that attempts to read files as plain text when DOCX parsing fails, improving handling of corrupted or mislabeled files.
- Implemented fallback text extraction in backend/onyx/file_processing/extract_file_text.py when DOCX parsing fails
- Change aligns with fail-loudly principle but provides graceful degradation for potentially recoverable content
- Security consideration: Added handling of untrusted files that could be maliciously mislabeled as DOCX
1 file reviewed, 1 comment
  try:
      doc = docx.Document(file)
  except BadZipFile as e:
-     logger.warning(f"Failed to extract text from {file_name or 'docx file'}: {e}")
-     return "", []
+     logger.warning(
+         f"Failed to extract docx {file_name or 'docx file'}: {e}. Attempting to read as text file."
+     )
+
+     # May be an invalid docx, but still a valid text file
+     file.seek(0)
+     encoding = detect_encoding(file)
+     text_content_raw, _ = read_text_file(
+         file, encoding=encoding, ignore_onyx_metadata=False
+     )
+     return text_content_raw or "", []
logic: Verify the file type before falling back. A file that fails DOCX parsing could be malicious, so validate that it is actually a text file with is_text_file() before attempting to read it.
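A minimal, self-contained sketch of the guard suggested above. It does not use the Onyx helpers: `looks_like_text` is a hypothetical stand-in for `is_text_file()`, and `zipfile.ZipFile` stands in for `docx.Document` (a DOCX file is a ZIP container, so a non-ZIP input fails with `BadZipFile` in the same way). The sample size and the UTF-8-only heuristic are assumptions for illustration, not the project's actual behavior.

```python
import codecs
import io
import zipfile
from typing import IO


def looks_like_text(file: IO[bytes], sample_size: int = 1024) -> bool:
    """Hypothetical stand-in for is_text_file(): heuristic check on a leading sample."""
    sample = file.read(sample_size)
    file.seek(0)
    if b"\x00" in sample:
        # NUL bytes almost never appear in plain text.
        return False
    try:
        # Incremental decoding tolerates a multi-byte character cut off at the sample edge.
        codecs.getincrementaldecoder("utf-8")().decode(sample)
    except UnicodeDecodeError:
        return False
    return True


def extract_with_fallback(file: IO[bytes]) -> str:
    """Try the ZIP-based parse first; fall back to plain text only if the guard passes."""
    try:
        with zipfile.ZipFile(file) as archive:
            # Stand-in for real DOCX parsing: returns the raw document XML.
            return archive.read("word/document.xml").decode("utf-8")
    except zipfile.BadZipFile:
        file.seek(0)
        if not looks_like_text(file):
            # Not a ZIP and not text: refuse to read it.
            return ""
        return file.read().decode("utf-8", errors="replace")
```

With this shape, a mislabeled plain-text upload is still recovered, while a binary blob that merely carries a .docx extension yields an empty result instead of being decoded.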