refactor: file processing #5136
Conversation
Greptile Summary
This PR consolidates file processing across the Onyx codebase by unifying all file handling under the create_user_files
function. The refactor eliminates duplicate file processing logic that was scattered across different endpoints and modules.
Key changes include:
- File Validation Refactor: The `is_valid_image_type` function in `file_validation.py` was condensed from multiple return statements to a single boolean expression, maintaining identical logic while adopting a more functional style (see the sketch after this list).
- Indexing Pipeline Cleanup: Token count calculation logic in `indexing_pipeline.py` was restructured to use early `continue` statements instead of nested conditionals, improving readability and following better code practices.
- Connector Support Enhancement: Added support for `CheckpointedConnector` in the celery utilities, enabling document ID extraction from checkpoint-based connectors through a new `batched_docs` utility function.
- Chat Upload Unification: The most significant change removes custom file processing logic from `upload_files_for_chat` in favor of calling the centralized `create_user_files` function, eliminating code duplication between different upload endpoints.
- Document Processing Standardization: Updated the connector module to use the generic `extract_file_text` instead of the specialized `convert_docx_to_txt` function, and introduced the `ChatFileType` enum for consistent file type detection.
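For illustration only, a minimal sketch of the kind of condensation described for `is_valid_image_type`; the function bodies and the allowed-type set below are assumptions, not the actual contents of `file_validation.py`:

```python
# Hypothetical allowed set; the real values in file_validation.py may differ.
ALLOWED_IMAGE_TYPES = {"image/png", "image/jpeg", "image/webp", "image/gif"}


# Before: multiple return statements
def is_valid_image_type_before(mime_type: str) -> bool:
    if not mime_type:
        return False
    if not mime_type.startswith("image/"):
        return False
    return mime_type in ALLOWED_IMAGE_TYPES


# After: one boolean expression with identical logic
def is_valid_image_type(mime_type: str) -> bool:
    return (
        bool(mime_type)
        and mime_type.startswith("image/")
        and mime_type in ALLOWED_IMAGE_TYPES
    )
```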
The refactor creates a single source of truth for file processing operations, reducing maintenance burden and ensuring consistent behavior across all file upload pathways in the application.
Confidence score: 3/5
- This PR introduces several significant changes that may cause integration issues, particularly around the chat upload flow and connector processing
- The breaking change in the return structure of `upload_files_for_chat` and inconsistent `FileOrigin` usage could cause downstream problems
- Files `backend/onyx/server/query_and_chat/chat_backend.py` and `backend/onyx/server/documents/connector.py` need more attention due to major logic changes
7 files reviewed, 3 comments
The diff under review (a side-by-side view flattened by this page) replaces the inline per-file handling with the unified path. Previous inline handling, which survives in this revision only as a large commented-out block (see the style comment below):

```python
file_content = file.file.read()  # Read the file content

# NOTE: Image conversion to JPEG used to be enforced here.
# This was removed to:
# 1. Preserve original file content for downloads
# 2. Maintain transparency in formats like PNG
# 3. Ameliorate issue with file conversion
file_content_io = io.BytesIO(file_content)

new_content_type = file.content_type

# Store the file normally
file_id = file_store.save_file(
    content=file_content_io,
    display_name=file.filename,
    file_origin=FileOrigin.CHAT_UPLOAD,
    file_type=new_content_type or file_type.value,
)

# 4) If the file is a doc, extract text and store that separately
if file_type == ChatFileType.DOC:
    # Re-wrap bytes in a fresh BytesIO so we start at position 0
    extracted_text_io = io.BytesIO(file_content)
    extracted_text = extract_file_text(
        file=extracted_text_io,  # use the bytes we already read
        file_name=file.filename or "",
    )

    text_file_id = file_store.save_file(
        content=io.BytesIO(extracted_text.encode()),
        display_name=file.filename,
        file_origin=FileOrigin.CHAT_UPLOAD,
        file_type="text/plain",
    )
    # Return the text file as the "main" file descriptor for doc types
    file_info.append((text_file_id, file.filename, ChatFileType.PLAIN_TEXT))
else:
    file_info.append((file_id, file.filename, file_type))
```

Replacement in this revision (the previous logic above is retained as a commented-out block ahead of it):

```python
# 5) Create a user file for each uploaded file
user_files = create_user_files([file], RECENT_DOCS_FOLDER_ID, user, db_session)
for user_file in user_files:
    # 6) Create connector
    connector_base = ConnectorBase(
        name=f"UserFile-{int(time.time())}",
        source=DocumentSource.FILE,
        input_type=InputType.LOAD_STATE,
        connector_specific_config={
            "file_locations": [user_file.file_id],
            "zip_metadata": {},
        },
        refresh_freq=None,
        prune_freq=None,
        indexing_start=None,
    )
    connector = create_connector(
        db_session=db_session,
        connector_data=connector_base,
    )
```
style: Large block of commented code should be removed before merging to production rather than left as comments
Context Used: Rule - Remove temporary debugging code before merging to production, especially tenant-specific debugging logs. (link)
cubic analysis
2 issues found across 7 files • Review in cubic
From backend/onyx/indexing/indexing_pipeline.py:

```python
            user_file_id_to_token_count[user_file_id] = None
            continue

        user_file_id = doc_id_to_user_file_id[document_id]
```
Removal of the earlier `if not user_file_id: continue` means a user_file_id of 0 will now be processed even though 0 is often treated as an invalid identifier; this changes previous behaviour and could insert bad data under key `0`. (Based on your team's feedback about preserving existing behaviour unless intentionally changed.)
Prompt for AI agents
Address the following comment on backend/onyx/indexing/indexing_pipeline.py at line 875:
<comment>Removal of the earlier `if not user_file_id: continue` means a user_file_id of 0 will now be processed even though 0 is often treated as an invalid identifier; this changes previous behaviour and could insert bad data under key `0`. (Based on your team's feedback about preserving existing behaviour unless intentionally changed.)</comment>
<file context>
```diff
@@ -867,30 +867,31 @@ def index_doc_batch(
 for document_id in updatable_ids:
     # Only calculate token counts for documents that have a user file ID
     if (
-        document_id in doc_id_to_user_file_id
-        and doc_id_to_user_file_id[document_id] is not None
+        document_id not in doc_id_to_user_file_id
+        or doc_id_to_user_file_id[document_id] is None
     ):
-        user_file_id = doc_id_to_user_file_id[document_id]
```
</file context>
Force-pushed from bca4d28 to 1a6ff9e (compare)
```diff
-            document_id in doc_id_to_user_file_id
-            and doc_id_to_user_file_id[document_id] is not None
+            document_id not in doc_id_to_user_file_id
+            or doc_id_to_user_file_id[document_id] is None
```
prefer removing this one and keeping the check below
as per greptile comment
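To make the 0-versus-None concern above concrete, a small standalone illustration (the sample mapping below is invented, not taken from the PR):

```python
# How the old falsy check and the new explicit None check diverge for an ID of 0.
doc_id_to_user_file_id = {"doc-a": 3, "doc-b": None, "doc-c": 0}

for document_id, user_file_id in doc_id_to_user_file_id.items():
    skipped_by_old_check = not user_file_id      # skips None *and* 0
    skipped_by_new_check = user_file_id is None  # skips only None
    print(document_id, skipped_by_old_check, skipped_by_new_check)

# doc-c -> old check: skipped, new check: processed. A user_file_id of 0 would
# now be written into user_file_id_to_token_count, the behaviour change flagged above.
```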
```python
file_info: list[tuple[str, str | None, ChatFileType]] = []
for file in files:
    file_type = mime_type_to_chat_file_type(file.content_type)
# file_store = get_default_file_store()
```
what's going on here?
everything that was being done here is done by create_user_files already except for the doc-specific stuff that was moved to upload_files
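For context on the snippet above, a rough sketch of what a MIME-type-to-ChatFileType helper like `mime_type_to_chat_file_type` could look like. The enum members and the mapping are assumptions, not Onyx's actual implementation (only `DOC` and `PLAIN_TEXT` appear in the diff above):

```python
from enum import Enum


class ChatFileType(str, Enum):
    # Assumed members for this sketch; Onyx's actual enum may differ.
    IMAGE = "image"
    DOC = "document"
    PLAIN_TEXT = "plain_text"


def mime_type_to_chat_file_type(mime_type: str | None) -> ChatFileType:
    # Hypothetical mapping: images and plain text are detected directly,
    # everything else falls back to the generic document type.
    if mime_type is None:
        return ChatFileType.DOC
    if mime_type.startswith("image/"):
        return ChatFileType.IMAGE
    if mime_type == "text/plain":
        return ChatFileType.PLAIN_TEXT
    return ChatFileType.DOC
```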
```python
    checkpoint_connector_generator: CheckpointOutput[CT],
    batch_size: int,
) -> Generator[list[Document], None, None]:
    batch: list[Document] = []
```
nit: can we add a comment letting people know we are discarding failures
```python
    checkpoint_generator = runnable_connector.load_from_checkpoint(
        start=start, end=end, checkpoint=checkpoint
    )
    doc_batch_generator = batched_docs(
```
Keep failed documents
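For reference, a minimal sketch of a `batched_docs`-style helper consistent with the signature shown above; the `Document` stand-in and the failure handling are assumptions based on this thread, not the merged implementation:

```python
from collections.abc import Generator, Iterator
from dataclasses import dataclass


@dataclass
class Document:
    """Stand-in for Onyx's Document model, only for this sketch."""
    id: str


def batched_docs(
    checkpoint_connector_generator: Iterator[object],  # CheckpointOutput[CT] in the real code
    batch_size: int,
) -> Generator[list[Document], None, None]:
    """Group Documents from a checkpointed connector into fixed-size batches.

    NOTE: non-Document items (connector failures, the trailing checkpoint) are
    dropped here; the review comments above ask for this to be called out and
    for failed documents to be kept instead.
    """
    batch: list[Document] = []
    for item in checkpoint_connector_generator:
        if not isinstance(item, Document):
            continue  # discard failures / checkpoint entries (see review note)
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

The celery utility would then consume it roughly as the snippet above shows: `doc_batch_generator = batched_docs(checkpoint_generator, batch_size)`.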
Force-pushed from 3dc34dd to 6f53c43 (compare)
* file processing refactor
* mypy
* CW comments
* address CW
Description
Fixes https://linear.app/danswer/issue/DAN-2301/unify-file-processing
unify file processing under the creation of user files; no more separate unnecessary processing/storing
How Has This Been Tested?
in UI
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.
Summary by cubic
Unified file processing so all uploaded files are handled through user file creation, removing separate processing and storage paths.