add docx fallback #4983

wenxi-onyx · 2025-07-02T22:12:26Z

Description

[Provide a brief description of the changes in this PR]

How Has This Been Tested?

[Describe the tests you ran to verify your changes]

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

This PR should be backported (make sure to check that the backport attempt succeeds)
[Optional] Override Linear Check

vercel · 2025-07-02T22:12:32Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
internal-search	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jul 2, 2025 10:15pm

greptile-apps

PR Summary

Added a fallback mechanism to docx file processing that attempts to read files as plain text when DOCX parsing fails, improving handling of corrupted or mislabeled files.

Implemented fallback text extraction in backend/onyx/file_processing/extract_file_text.py when DOCX parsing fails
Change aligns with fail-loudly principle but provides graceful degradation for potentially recoverable content
Security consideration: Added handling of untrusted files that could be maliciously mislabeled as DOCX

1 file reviewed, 1 comment

greptile-apps · 2025-07-02T22:13:03Z

backend/onyx/file_processing/extract_file_text.py

    try:
        doc = docx.Document(file)
    except BadZipFile as e:
-        logger.warning(f"Failed to extract text from {file_name or 'docx file'}: {e}")
-        return "", []
+        logger.warning(
+            f"Failed to extract docx {file_name or 'docx file'}: {e}. Attempting to read as text file."
+        )
+
+        # May be an invalid docx, but still a valid text file
+        file.seek(0)
+        encoding = detect_encoding(file)
+        text_content_raw, _ = read_text_file(
+            file, encoding=encoding, ignore_onyx_metadata=False
+        )
+        return text_content_raw or "", []


logic: Verify the file type before falling back. An invalid docx file could be malicious; validate that it is actually a text file with is_text_file() before attempting to read it.
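
The check suggested in this comment could look roughly like the sketch below. It is a standalone illustration, not the code in extract_file_text.py: `looks_like_text` is a hypothetical stand-in for `is_text_file()` (whose actual signature is not shown in this PR), `read_fallback_text` and `extract_text` are made-up names, and the sketch assumes UTF-8 input rather than calling `detect_encoding`.

```python
import codecs
from typing import IO, Callable
from zipfile import BadZipFile


def looks_like_text(data: bytes, sample_size: int = 4096) -> bool:
    """Hypothetical heuristic: treat data as text if a leading sample
    has no NUL bytes and decodes as UTF-8."""
    sample = data[:sample_size]
    if b"\x00" in sample:
        return False
    try:
        # final=False tolerates a multi-byte character cut off by the sample boundary.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def read_fallback_text(file: IO[bytes]) -> str:
    """Rewind the file and return its contents only if they look like text."""
    file.seek(0)
    data = file.read()
    if not looks_like_text(data):
        return ""  # not plausibly text: keep the old "return nothing" behavior
    return data.decode("utf-8", errors="replace")


def extract_text(file: IO[bytes], parse_docx: Callable[[IO[bytes]], str]) -> str:
    """Try the DOCX parser first; on BadZipFile, fall back to guarded text reading."""
    try:
        return parse_docx(file)
    except BadZipFile:
        return read_fallback_text(file)
```

One limitation of this heuristic: UTF-16 text contains NUL bytes and would be rejected, so a real implementation would probably combine the check with encoding detection.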

add docx fallback

b55ebd4

wenxi-onyx requested a review from a team as a code owner July 2, 2025 22:12

greptile-apps bot reviewed Jul 2, 2025

vercel bot deployed to Preview July 2, 2025 22:15

Weves approved these changes Jul 10, 2025

Weves merged commit d3c5a4f into main Jul 10, 2025
12 of 15 checks passed

Weves deleted the bugfix/invalid_docx_fallback branch July 10, 2025 00:46
