Skip to content

Conversation

evan-onyx
Copy link
Contributor

@evan-onyx evan-onyx commented Jul 4, 2025

Description

Fixes https://linear.app/danswer/issue/DAN-2162/drive-doc-deduplication
dupe drive ids fixed

How Has This Been Tested?

tested in ui + multitenant

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@evan-onyx evan-onyx requested a review from a team as a code owner July 4, 2025 05:14
Copy link

vercel bot commented Jul 4, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
internal-search ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 6, 2025 0:42am

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Summary

Fixes duplicate document issues in Google Drive integration by normalizing document IDs across the system, with supporting database migrations and UI improvements.

  • New migration backend/alembic/versions/12635f6655b7_drive_canonical_ids.py implements document ID normalization by removing URL parameters and standardizing paths
  • Modified backend/onyx/connectors/google_drive/doc_conversion.py to clean up Google Drive URLs by removing '/edit', '/view', and query parameters
  • Enhanced web/src/app/admin/connector/[ccPairId]/IndexAttemptErrorsModal.tsx with improved scrollable error messages and consistent table layouts
  • Added defensive 'DROP TABLE IF EXISTS CASCADE' statements in multiple migration files for safer schema updates

7 files reviewed, 1 comment
Edit PR Review Bot Settings | Greptile

Comment on lines +533 to +534
print(f"Failed to update document {current_doc_id}: {e}")
from httpx import HTTPStatusError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: HTTPStatusError import should be at top of file with other imports

Suggested change
print(f"Failed to update document {current_doc_id}: {e}")
from httpx import HTTPStatusError
print(f"Failed to update document {current_doc_id}: {e}")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment on lines +533 to +534
print(f"Failed to update document {current_doc_id}: {e}")
from httpx import HTTPStatusError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@evan-onyx evan-onyx force-pushed the bugfix/drive-doc-ids branch from 2509bd1 to 90aae5f Compare July 6, 2025 00:38
@evan-onyx evan-onyx enabled auto-merge July 6, 2025 00:59
@evan-onyx evan-onyx added this pull request to the merge queue Jul 6, 2025
Merged via the queue into main with commit 8349d6f Jul 6, 2025
13 of 14 checks passed
@evan-onyx evan-onyx deleted the bugfix/drive-doc-ids branch July 6, 2025 03:07
AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025
* fixed id extraction in drive connector

* WIP migration

* full migration script

* migration works single tenant without duplicates

* tested single tenant with duplicate docs

* migrations and frontend

* tested mutlitenant

* fix connector tests

* make tests pass
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants