Skip to content

Conversation

justin-tahara
Copy link
Contributor

@justin-tahara justin-tahara commented Sep 23, 2025

Description

[Provide a brief description of the changes in this PR]
We saw a ton of logs within our docprocessing pods that looked like this:

ERROR:    ERROR    09/22/2025 10:44:33 PM indexing_pipeline.py:517 : onyx.utils.logger [t:9f76e33b] [Index Attempt: 5711] [CC Pair: 43] Error processing image section: Summarization failed. Messages: [SystemMessage(content='\nYou are an assistant for summarizing images for retrieval.\nSummarize the content of the following image and be as precise as possible.\nThe summary will be embedded and used to retrieve the original image.\nTherefore, write a concise summary of the image that is optimized for retrieval.\n', additional_kwargs={}, response_metadata={}), HumanMessage(content=[{'type': 'text', 'text': "The image has the file name 'page_29_image_Im1189.png.png'.\n\nDescribe precisely and concisely what the image shows.\n"}, {'type': 'image_url', 'image_url': {'url': 'data:image/jpeg;base64,iVBORw0KGgoAAAANSUhEUgAABeEAAAABCAIAAADWwvLtAAAAl0lEQVR4nO2SsQ2DQBAE92zsPqiE9EuglW/grwy38VW4DzIyEvQ4ILEIHFn2SNxEF1wwWo2VUiTlnBV8A3ePMd+ptaaUwoGjAUmUsAbBYQdiQtCA9CnGGgSH0DgASRSyxsk1pnnJj+c0LybrriZpbU2b3W+X1rbD8zj049D/wIqQ6MnDCI0PEPoUZg2CBsGBw9/7dPf9eAF/aWiUfwtADAAAAABJRU5ErkJggg=='}}], additional_kwargs={}, response_metadata={})]

The issue with these logs is that the way we format and sent this message to the LLM states that this is a jpeg file but in reality it is a png file which can be confirmed with the name of the file.

This PR aims to ensure that the file type that is passed through is correctly set in the message that is sent over to the LLM to ensure that the LLM can properly summarize and extract information out of the image if there is any context that is there.

How Has This Been Tested?

[Describe the tests you ran to verify your changes]
Ran locally and tested with files that were not jpeg and validated that they worked.

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@justin-tahara justin-tahara requested a review from a team as a code owner September 23, 2025 00:32
Copy link

vercel bot commented Sep 23, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
internal-search Ready Ready Preview Comment Sep 23, 2025 0:37am

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR fixes a critical bug in the image summarization pipeline where all images were being mislabeled with incorrect MIME types when sent to the LLM. Previously, the _encode_image_for_llm_prompt function hardcoded all images as 'image/jpeg' regardless of their actual format (PNG, GIF, WEBP, etc.), causing LLMs to fail at processing non-JPEG images since they couldn't properly decode the mismatched format.

The solution introduces proper MIME type detection using magic number analysis through the get_image_type_from_bytes utility function. The fix includes:

  1. Dynamic MIME type detection: Replacing the hardcoded 'image/jpeg' with actual format detection based on file headers
  2. Custom exception handling: Adding UnsupportedImageFormatError for graceful handling of unsupported formats
  3. Graceful degradation: When unsupported formats are encountered, the system now logs the issue and skips summarization instead of crashing

This change integrates seamlessly with the existing image summarization pipeline in summarize_image_with_error_handling, where the new error handling allows the system to continue processing other images even when encountering unsupported formats. The fix addresses the specific error logs mentioned in the PR description where PNG files labeled as JPEG were causing LLM processing failures.

Confidence score: 4/5

  • This PR is safe to merge with minimal risk as it fixes a clear bug with proper error handling
  • Score reflects a straightforward bug fix with appropriate error handling, though lacks comprehensive test coverage
  • Pay close attention to the image processing pipeline and ensure the new MIME type detection works across all supported formats

1 file reviewed, 1 comment

Edit Code Review Bot Settings | Greptile

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file

Prompt for AI agents (all 1 issues)

Understand the root cause of the following 1 issues and fix them.


<file name="backend/onyx/file_processing/image_summarization.py">

<violation number="1" location="backend/onyx/file_processing/image_summarization.py:84">
Catching only UnsupportedImageFormatError lets other summarization failures bubble up, contradicting the wrapper’s contract and producing noisy error logs. Include ValueError so typical LLM failures return None instead of raising.

(Based on your team&#39;s feedback about avoiding logging raw exceptions that may contain sensitive URLs/tokens; broadening the catch prevents leaking exception strings into logs.) [FEEDBACK_USED]</violation>
</file>

React with 👍 or 👎 to teach cubic. Mention @cubic-dev-ai to give feedback, ask questions, or re-run the review.

return summarize_image_pipeline(llm, image_data, user_prompt, system_prompt)
try:
return summarize_image_pipeline(llm, image_data, user_prompt, system_prompt)
except UnsupportedImageFormatError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching only UnsupportedImageFormatError lets other summarization failures bubble up, contradicting the wrapper’s contract and producing noisy error logs. Include ValueError so typical LLM failures return None instead of raising. (Based on your team's feedback about avoiding logging raw exceptions that may contain sensitive URLs/tokens; broadening the catch prevents leaking exception strings into logs.)
Prompt for AI agents ~~~ Address the following comment on backend/onyx/file_processing/image_summarization.py at line 84: Catching only UnsupportedImageFormatError lets other summarization failures bubble up, contradicting the wrapper’s contract and producing noisy error logs. Include ValueError so typical LLM failures return None instead of raising. (Based on your team's feedback about avoiding logging raw exceptions that may contain sensitive URLs/tokens; broadening the catch prevents leaking exception strings into logs.) @@ -74,7 +79,14 @@ def summarize_image_with_error_handling( - return summarize_image_pipeline(llm, image_data, user_prompt, system_prompt) + try: + return summarize_image_pipeline(llm, image_data, user_prompt, system_prompt) + except UnsupportedImageFormatError: + logger.info( + "Skipping image summarization due to unsupported MIME type for %s", ~~~
```suggestion except (UnsupportedImageFormatError, ValueError): ```

@evan-onyx evan-onyx enabled auto-merge September 23, 2025 01:12
@justin-tahara justin-tahara merged commit a049835 into main Sep 23, 2025
56 of 58 checks passed
@justin-tahara justin-tahara deleted the jtahara/fix-image-summarization-mime-types branch September 23, 2025 01:48
brijsiyag-meesho pushed a commit to brijsiyag-meesho/onyx that referenced this pull request Sep 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants