Skip to content

Conversation

joachim-danswer
Copy link
Contributor

@joachim-danswer joachim-danswer commented Aug 1, 2025

Description

This PR addresses the problem of duplicated tag values. Tag values that are not in extracted metadata anymore are removed.

In addition, a new column is added that let's other functions distinguish whether a given tag value should be part of a list (in which case the same key can appear twice for the document), or whether it is a single.

Tag data is cleaned up upon re-indexing.

Here is the linear ticket: https://linear.app/danswer/issue/DAN-2263/avoiding-tag-duplication

How Has This Been Tested?

Locally, single-tenant

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Summary by cubic

Removed duplicate tag values by cleaning up outdated tags during re-indexing and added a tag_type column to distinguish between single and list tag values.

  • Migration
    • Adds a new tag_type column to the tag table to track if a tag is single or a list.

@joachim-danswer joachim-danswer requested a review from a team as a code owner August 1, 2025 23:07
Copy link

vercel bot commented Aug 1, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
internal-search ✅ Ready (Inspect) Visit Preview 💬 Add feedback Aug 2, 2025 0:27am

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR introduces a tag type classification system to solve tag value duplication issues in the document indexing system. The changes add a new TagType enum with 'SINGLE' and 'LIST' values to distinguish between tags that should have only one value per key versus tags that can legitimately have multiple values for the same key.

The implementation includes four key components: (1) A new TagType enum in constants.py defining SINGLE and LIST types, (2) A database migration adding a nullable tag_type column to the tag table using PostgreSQL enum type, (3) Updates to the Tag model in models.py to include the new tag_type field, and (4) Enhanced tag management logic in tag.py that performs cleanup operations during tag creation.

The tag management functions now collect existing tags for a given key-source combination and remove outdated values that are no longer present in the current metadata extraction. For SINGLE type tags, only the latest value is preserved, while LIST type tags maintain all current values but remove any that are no longer relevant. The system handles backward compatibility by defaulting existing tags with null tag_type to SINGLE during migration.

This fits into the broader document indexing architecture where metadata extraction can change over time, and the system needs to maintain data consistency by cleaning up stale tag associations while preserving legitimate multi-value metadata like lists of authors or categories.

Confidence score: 3/5

  • This PR introduces complex logic changes that could affect data integrity during tag operations and re-indexing processes
  • Score reflects concerns about performance implications of O(n²) tag removal operations and potential database consistency issues from frequent commits
  • Pay close attention to backend/onyx/db/tag.py for the tag cleanup logic and migration handling

4 files reviewed, 2 comments

Edit Code Review Bot Settings | Greptile

Comment on lines 78 to 81
for previous_tag in previous_tags_for_key_in_document:
if previous_tag.tag_value != tag.tag_value:
document.tags.remove(previous_tag)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: document.tags.remove() has O(n) complexity, making this loop O(n²). Consider collecting tags to remove first, then removing them in batch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed by collecting all of the relevant document-tag connections and removing them on one SQL.

Comment on lines 149 to 152
for previous_tag in previous_tags_for_key_in_document:
if previous_tag.tag_value not in tag_values:
document.tags.remove(previous_tag)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Same O(n²) performance issue as above - document.tags.remove() in loop has quadratic complexity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed by collecting all of the relevant document-tag connections and removing them on one SQL.

@joachim-danswer joachim-danswer changed the title bug: avoiding tag value duplication fix(tagging): avoiding tag value duplication Aug 1, 2025
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cubic analysis

No issues found across 4 files. Review in cubic

@joachim-danswer joachim-danswer changed the title fix(tagging): avoiding tag value duplication fix(tagging): avoiding tag duplication Aug 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant