evan-onyx
Contributor

@evan-onyx evan-onyx commented Aug 5, 2025

Description

switching to markitdown

How Has This Been Tested?

[Describe the tests you ran to verify your changes]

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@evan-onyx evan-onyx requested a review from a team as a code owner August 5, 2025 04:30
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR enhances error handling in the DOCX file processing functionality by adding ValueError to the existing exception handling in the docx_to_text_and_images function. The change expands the exception catching from just BadZipFile to include ValueError as well, implementing graceful degradation when processing problematic DOCX files.

The modification occurs in the file processing pipeline where DOCX files are parsed using the python-docx library's Document() constructor. When either a BadZipFile or ValueError exception is encountered during parsing, the system now falls back to treating the file as plain text instead of failing completely. This fallback mechanism uses the existing detect_encoding() and read_text_file() functions to extract whatever content is possible from the problematic file.
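A minimal sketch of the fallback pattern described above, not the PR's actual code. To stay runnable with the standard library only, `stand_in_parser` is a hypothetical substitute for python-docx's `Document()` constructor, and `read_as_plain_text` is a hypothetical substitute for the `detect_encoding()` and `read_text_file()` helpers, whose real signatures are not shown in this conversation.

```python
import io
import zipfile
from typing import IO, Callable


def read_as_plain_text(file: IO[bytes]) -> str:
    """Hypothetical stand-in for detect_encoding() + read_text_file()."""
    file.seek(0)  # the failed parse attempt may have moved the cursor
    raw = file.read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this cannot fail
        return raw.decode("latin-1")


def stand_in_parser(file: IO[bytes]) -> str:
    """Hypothetical stand-in for the DOCX parser (assumption, not python-docx)."""
    with zipfile.ZipFile(file) as archive:  # raises BadZipFile if not a zip
        if "word/document.xml" not in archive.namelist():
            raise ValueError("archive is not a Word file")
        return archive.read("word/document.xml").decode("utf-8")


def docx_to_text(file: IO[bytes], parse_docx: Callable[[IO[bytes]], str]) -> str:
    """Try the DOCX parser first; on either error, degrade to plain text."""
    try:
        return parse_docx(file)
    except (zipfile.BadZipFile, ValueError):
        return read_as_plain_text(file)


# A text file that was saved with a .docx extension takes the fallback path.
broken = io.BytesIO(b"plain text saved with a .docx extension")
print(docx_to_text(broken, stand_in_parser))
```

Catching both exception types in a single `except` clause keeps one recovery path for files that are not zip archives and for archives that are not Word documents.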

This change fits into the broader file processing architecture by maintaining the existing error recovery pattern already established for BadZipFile exceptions. The function remains part of the comprehensive file extraction system that handles various document formats, and the graceful degradation ensures that the overall document processing pipeline continues to function even when encountering malformed or corrupted DOCX files.

PR Description Notes:

  • The PR description is incomplete, containing only placeholder text without actual details about the changes or testing performed

Confidence score: 4/5

  • This PR is safe to merge with minimal risk as it only adds defensive error handling
  • Score reflects the simple nature of the change and proven error handling pattern already in use
  • Pay close attention to the extract_file_text.py file to ensure the fallback behavior works as intended

1 file reviewed, no comments

@evan-onyx evan-onyx closed this Aug 5, 2025
@evan-onyx evan-onyx deleted the fix/tricky-template-parsing branch August 5, 2025 04:31
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

cubic analysis

No issues found across 1 file.
