-
Notifications
You must be signed in to change notification settings - Fork 2k
fix(tagging): avoiding tag duplication #5146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR introduces a tag type classification system to solve tag value duplication issues in the document indexing system. The changes add a new TagType
enum with 'SINGLE' and 'LIST' values to distinguish between tags that should have only one value per key versus tags that can legitimately have multiple values for the same key.
The implementation includes four key components: (1) A new TagType
enum in constants.py defining SINGLE and LIST types, (2) A database migration adding a nullable tag_type
column to the tag table using PostgreSQL enum type, (3) Updates to the Tag model in models.py to include the new tag_type field, and (4) Enhanced tag management logic in tag.py that performs cleanup operations during tag creation.
The tag management functions now collect existing tags for a given key-source combination and remove outdated values that are no longer present in the current metadata extraction. For SINGLE type tags, only the latest value is preserved, while LIST type tags maintain all current values but remove any that are no longer relevant. The system handles backward compatibility by defaulting existing tags with null tag_type to SINGLE during migration.
This fits into the broader document indexing architecture where metadata extraction can change over time, and the system needs to maintain data consistency by cleaning up stale tag associations while preserving legitimate multi-value metadata like lists of authors or categories.
Confidence score: 3/5
- This PR introduces complex logic changes that could affect data integrity during tag operations and re-indexing processes
- Score reflects concerns about performance implications of O(n²) tag removal operations and potential database consistency issues from frequent commits
- Pay close attention to backend/onyx/db/tag.py for the tag cleanup logic and migration handling
4 files reviewed, 2 comments
backend/onyx/db/tag.py
Outdated
for previous_tag in previous_tags_for_key_in_document: | ||
if previous_tag.tag_value != tag.tag_value: | ||
document.tags.remove(previous_tag) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: document.tags.remove() has O(n) complexity, making this loop O(n²). Consider collecting tags to remove first, then removing them in batch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed by collecting all of the relevant document-tag connections and removing them on one SQL.
backend/onyx/db/tag.py
Outdated
for previous_tag in previous_tags_for_key_in_document: | ||
if previous_tag.tag_value not in tag_values: | ||
document.tags.remove(previous_tag) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Same O(n²) performance issue as above - document.tags.remove() in loop has quadratic complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed by collecting all of the relevant document-tag connections and removing them on one SQL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cubic analysis
No issues found across 4 files. Review in cubic
3c59dc2
to
b884c3c
Compare
Description
This PR addresses the problem of duplicated tag values. Tag values that are not in extracted metadata anymore are removed.
In addition, a new column is added that let's other functions distinguish whether a given tag value should be part of a list (in which case the same key can appear twice for the document), or whether it is a single.
Tag data is cleaned up upon re-indexing.
Here is the linear ticket: https://linear.app/danswer/issue/DAN-2263/avoiding-tag-duplication
How Has This Been Tested?
Locally, single-tenant
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.
Summary by cubic
Removed duplicate tag values by cleaning up outdated tags during re-indexing and added a tag_type column to distinguish between single and list tag values.