wenxi-onyx
Member

@wenxi-onyx wenxi-onyx commented Jul 2, 2025

Description

[Provide a brief description of the changes in this PR]

How Has This Been Tested?

[Describe the tests you ran to verify your changes]

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@wenxi-onyx wenxi-onyx requested a review from a team as a code owner July 2, 2025 22:12

vercel bot commented Jul 2, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name | Status | Preview | Comments | Updated (UTC)
internal-search | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jul 2, 2025 10:15pm

Contributor

@greptile-apps greptile-apps bot left a comment

PR Summary

Added a fallback mechanism to docx file processing that attempts to read files as plain text when DOCX parsing fails, improving handling of corrupted or mislabeled files.

  • Implemented fallback text extraction in backend/onyx/file_processing/extract_file_text.py when DOCX parsing fails
  • Change aligns with fail-loudly principle but provides graceful degradation for potentially recoverable content
  • Security consideration: Added handling of untrusted files that could be maliciously mislabeled as DOCX

1 file reviewed, 1 comment

Comment on lines 312 to +325
      try:
          doc = docx.Document(file)
      except BadZipFile as e:
-         logger.warning(f"Failed to extract text from {file_name or 'docx file'}: {e}")
-         return "", []
+         logger.warning(
+             f"Failed to extract docx {file_name or 'docx file'}: {e}. Attempting to read as text file."
+         )
+
+         # May be an invalid docx, but still a valid text file
+         file.seek(0)
+         encoding = detect_encoding(file)
+         text_content_raw, _ = read_text_file(
+             file, encoding=encoding, ignore_onyx_metadata=False
+         )
+         return text_content_raw or "", []

logic: Verify file type before fallback. An invalid docx file could be malicious - should validate it's actually a text file using is_text_file() before attempting to read it.
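
A minimal sketch of what the suggested guard could look like, assuming UTF-8 input. The helper names `looks_like_text` and `read_text_fallback` are hypothetical and exist only for illustration; the real `is_text_file()` in the Onyx codebase may have a different signature and different rules, and the merged code uses `detect_encoding` and `read_text_file` rather than a fixed encoding.

```python
import codecs
import io
from typing import IO


def looks_like_text(file: IO[bytes], sample_size: int = 4096) -> bool:
    """Hypothetical heuristic (not Onyx's is_text_file): inspect a leading sample."""
    start = file.tell()
    sample = file.read(sample_size)
    file.seek(start)  # leave the stream position unchanged
    if b"\x00" in sample:
        return False  # NUL bytes rarely appear in UTF-8 plain text
    try:
        # final=False tolerates a multi-byte character cut off at the sample edge
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def read_text_fallback(file: IO[bytes]) -> str:
    """Return the decoded content only if the stream passes the text check."""
    file.seek(0)
    if not looks_like_text(file):
        return ""  # refuse to treat binary data as text
    return file.read().decode("utf-8", errors="replace")
```

Under these assumptions, the `except BadZipFile` branch would call something like `read_text_fallback(file)` instead of reading the stream unconditionally. Note that this heuristic rejects UTF-16 text, which contains NUL bytes, so a production check would need to account for other encodings.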

@Weves Weves merged commit d3c5a4f into main Jul 10, 2025
12 of 15 checks passed
@Weves Weves deleted the bugfix/invalid_docx_fallback branch July 10, 2025 00:46