
Conversation

evan-onyx
Contributor

@evan-onyx commented on Jul 30, 2025

Description

Fixes https://linear.app/danswer/issue/DAN-2301/unify-file-processing

Unify file processing under the creation of user files; there is no longer a separate, unnecessary processing and storage path.

How Has This Been Tested?

Tested manually in the UI.

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes; otherwise, resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Summary by cubic

Unified file processing so all uploaded files are handled through user file creation, removing separate processing and storage paths.

  • Refactors
    • Removed special-case file handling in chat upload and connector endpoints.
    • All files, including extracted text for docs, are now processed and stored via user file creation.

@evan-onyx requested a review from a team as a code owner on July 30, 2025 21:37

vercel bot commented Jul 30, 2025

The latest updates on your projects:

Name: internal-search
Status: ✅ Ready
Updated (UTC): Aug 7, 2025 10:49pm

Contributor

@greptile-apps bot left a comment


Greptile Summary

This PR consolidates file processing across the Onyx codebase by unifying all file handling under the create_user_files function. The refactor eliminates duplicate file processing logic that was scattered across different endpoints and modules.

Key changes include:

  1. File Validation Refactor: The is_valid_image_type function in file_validation.py was condensed from multiple return statements to a single boolean expression, maintaining identical logic while adopting a more functional style (a short sketch of this style follows this summary).

  2. Indexing Pipeline Cleanup: Token count calculation logic in indexing_pipeline.py was restructured to use early continue statements instead of nested conditionals, improving readability and following better code practices.

  3. Connector Support Enhancement: Added support for CheckpointedConnector in the celery utilities, enabling document ID extraction from checkpoint-based connectors through a new batched_docs utility function.

  4. Chat Upload Unification: The most significant change removes custom file processing logic from upload_files_for_chat in favor of calling the centralized create_user_files function, eliminating code duplication between different upload endpoints.

  5. Document Processing Standardization: Updated the connector module to use generic extract_file_text instead of the specialized convert_docx_to_txt function, and introduced ChatFileType enum for consistent file type detection.

The refactor creates a single source of truth for file processing operations, reducing maintenance burden and ensuring consistent behavior across all file upload pathways in the application.
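
As a plain illustration of the style change described in item 1, here is a minimal sketch of collapsing multiple early returns into a single boolean expression. The constant and the exact checks below are assumptions for illustration, not the actual contents of file_validation.py:

    # Hypothetical sketch only; the real is_valid_image_type may use different
    # constants and additional checks.
    EXCLUDED_IMAGE_TYPES = {"image/svg+xml", "image/x-icon"}

    def is_valid_image_type(mime_type: str) -> bool:
        # Before: a chain of early returns (if not mime_type: return False, ...).
        # After: the same logic expressed as one boolean expression.
        return (
            bool(mime_type)
            and mime_type.startswith("image/")
            and mime_type not in EXCLUDED_IMAGE_TYPES
        )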

Confidence score: 3/5

  • This PR introduces several significant changes that may cause integration issues, particularly around the chat upload flow and connector processing
  • The breaking change in return structure from upload_files_for_chat and inconsistent FileOrigin usage could cause downstream problems
  • Files backend/onyx/server/query_and_chat/chat_backend.py and backend/onyx/server/documents/connector.py need more attention due to major logic changes

7 files reviewed, 3 comments


Comment on lines 719 to 763
# file_store = get_default_file_store()

file_content = file.file.read() # Read the file content
# file_info: list[tuple[str, str | None, ChatFileType]] = []
# for file in files:
# file_type = mime_type_to_chat_file_type(file.content_type)

# NOTE: Image conversion to JPEG used to be enforced here.
# This was removed to:
# 1. Preserve original file content for downloads
# 2. Maintain transparency in formats like PNG
# 3. Ameliorate issue with file conversion
file_content_io = io.BytesIO(file_content)
# file_content = file.file.read() # Read the file content

new_content_type = file.content_type
# # NOTE: Image conversion to JPEG used to be enforced here.
# # This was removed to:
# # 1. Preserve original file content for downloads
# # 2. Maintain transparency in formats like PNG
# # 3. Ameliorate issue with file conversion
# file_content_io = io.BytesIO(file_content)

# Store the file normally
file_id = file_store.save_file(
content=file_content_io,
display_name=file.filename,
file_origin=FileOrigin.CHAT_UPLOAD,
file_type=new_content_type or file_type.value,
)
# new_content_type = file.content_type

# 4) If the file is a doc, extract text and store that separately
if file_type == ChatFileType.DOC:
# Re-wrap bytes in a fresh BytesIO so we start at position 0
extracted_text_io = io.BytesIO(file_content)
extracted_text = extract_file_text(
file=extracted_text_io, # use the bytes we already read
file_name=file.filename or "",
)
# # Store the file normally
# file_id = file_store.save_file(
# content=file_content_io,
# display_name=file.filename,
# file_origin=FileOrigin.CHAT_UPLOAD,
# file_type=new_content_type or file_type.value,
# )

text_file_id = file_store.save_file(
content=io.BytesIO(extracted_text.encode()),
display_name=file.filename,
file_origin=FileOrigin.CHAT_UPLOAD,
file_type="text/plain",
)
# Return the text file as the "main" file descriptor for doc types
file_info.append((text_file_id, file.filename, ChatFileType.PLAIN_TEXT))
else:
file_info.append((file_id, file.filename, file_type))

# 5) Create a user file for each uploaded file
user_files = create_user_files([file], RECENT_DOCS_FOLDER_ID, user, db_session)
for user_file in user_files:
# 6) Create connector
connector_base = ConnectorBase(
name=f"UserFile-{int(time.time())}",
source=DocumentSource.FILE,
input_type=InputType.LOAD_STATE,
connector_specific_config={
"file_locations": [user_file.file_id],
"zip_metadata": {},
},
refresh_freq=None,
prune_freq=None,
indexing_start=None,
)
connector = create_connector(
db_session=db_session,
connector_data=connector_base,
)
# # 4) If the file is a doc, extract text and store that separately
# if file_type == ChatFileType.DOC:
# # Re-wrap bytes in a fresh BytesIO so we start at position 0
# extracted_text_io = io.BytesIO(file_content)
# extracted_text = extract_file_text(
# file=extracted_text_io, # use the bytes we already read
# file_name=file.filename or "",
# )

# text_file_id = file_store.save_file(
# content=io.BytesIO(extracted_text.encode()),
# display_name=file.filename,
# file_origin=FileOrigin.CHAT_UPLOAD,
# file_type="text/plain",
# )
# # Return the text file as the "main" file descriptor for doc types
# file_info.append((text_file_id, file.filename, ChatFileType.PLAIN_TEXT))
# else:
# file_info.append((file_id, file.filename, file_type))

Contributor

style: Large block of commented code should be removed before merging to production rather than left as comments

Context Used: Rule - Remove temporary debugging code before merging to production, especially tenant-specific debugging logs. (link)

Contributor

@cubic-dev-ai bot left a comment


cubic analysis

2 issues found across 7 files.


user_file_id_to_token_count[user_file_id] = None
continue

user_file_id = doc_id_to_user_file_id[document_id]
Contributor

Removal of the earlier if not user_file_id: continue means a user_file_id of 0 will now be processed even though 0 is often treated as an invalid identifier; this changes previous behaviour and could insert bad data under key 0. (Based on your team's feedback about preserving existing behaviour unless intentionally changed.)

File context (backend/onyx/indexing/indexing_pipeline.py, line 875):
@@ -867,30 +867,31 @@ def index_doc_batch(
         for document_id in updatable_ids:
             # Only calculate token counts for documents that have a user file ID
             if (
-                document_id in doc_id_to_user_file_id
-                and doc_id_to_user_file_id[document_id] is not None
+                document_id not in doc_id_to_user_file_id
+                or doc_id_to_user_file_id[document_id] is None
             ):
-                user_file_id = doc_id_to_user_file_id[document_id]
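
To make the concern above concrete, here is a small standalone illustration (not code from the PR) of how the old truthiness check and the new is None check treat a user_file_id of 0 differently:

    # Hypothetical mapping for illustration only.
    doc_id_to_user_file_id = {"doc-a": 3, "doc-b": None, "doc-c": 0}

    for document_id, user_file_id in doc_id_to_user_file_id.items():
        if user_file_id is None:  # new check: skips only missing/None values
            continue
        # With the old `if not user_file_id: continue`, "doc-c" (id 0) would
        # also have been skipped; with the `is None` check it is now processed.
        print(document_id, user_file_id)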

document_id in doc_id_to_user_file_id
and doc_id_to_user_file_id[document_id] is not None
document_id not in doc_id_to_user_file_id
or doc_id_to_user_file_id[document_id] is None
Contributor

prefer removing this one and keeping the check below

Contributor

as per greptile comment

file_info: list[tuple[str, str | None, ChatFileType]] = []
for file in files:
file_type = mime_type_to_chat_file_type(file.content_type)
# file_store = get_default_file_store()
Contributor

what's going on here?

Contributor Author

Everything that was being done here is already handled by create_user_files, except for the doc-specific text extraction, which was moved to upload_files.

@evan-onyx changed the title from "file processing refactor" to "refactor: file processing" on Aug 7, 2025
checkpoint_connector_generator: CheckpointOutput[CT],
batch_size: int,
) -> Generator[list[Document], None, None]:
batch: list[Document] = []
Contributor

nit: can we add a comment letting people know we are discarding failures
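
For readers following the thread, here is a hedged sketch of what a batching helper along the lines of batched_docs could look like; the import path, signature, and failure handling are assumptions rather than the PR's actual implementation, and the silent drop of non-Document items is exactly what this nit (and the comment below) are about:

    from collections.abc import Generator, Iterator

    # Assumed import path for the connector Document model; may differ in the repo.
    from onyx.connectors.models import Document

    def batched_docs(
        checkpoint_connector_generator: Iterator[object],
        batch_size: int,
    ) -> Generator[list[Document], None, None]:
        """Drain a checkpointed connector's output into fixed-size Document batches."""
        batch: list[Document] = []
        for item in checkpoint_connector_generator:
            # NOTE: failures (non-Document items) are discarded here, which is
            # what the review asks to either document or preserve.
            if not isinstance(item, Document):
                continue
            batch.append(item)
            if len(batch) >= batch_size:
                yield batch
                batch = []
        if batch:
            yield batch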

checkpoint_generator = runnable_connector.load_from_checkpoint(
start=start, end=end, checkpoint=checkpoint
)
doc_batch_generator = batched_docs(
Contributor

Keep failed documents

@evan-onyx force-pushed the fix/file-upload-refactor branch from 3dc34dd to 6f53c43 on August 7, 2025 22:45
@evan-onyx added this pull request to the merge queue on Aug 8, 2025
Merged via the queue into main with commit 297720c Aug 8, 2025
14 of 15 checks passed
@evan-onyx deleted the fix/file-upload-refactor branch on August 8, 2025 01:47
wenxi-onyx pushed a commit that referenced this pull request Aug 11, 2025
* file processing refactor

* mypy

* CW comments

* address CW
AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025
* file processing refactor

* mypy

* CW comments

* address CW