@evan-onyx evan-onyx commented Apr 16, 2025

Description

Fixes https://linear.app/danswer/issue/DAN-1830/confluence-duplicate-docs

The Confluence API's time granularity is only to the minute, so each run was re-processing docs that had already been handled in the last checkpoint. This change skips docs seen in the previous checkpoint so they are not processed twice.
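To illustrate why minute-level granularity causes re-processing, here is a minimal toy model. It is not the connector's code: `truncate_to_minute`, `fetch_updated_since`, and the document IDs are hypothetical, and the assumption that the cutoff is rounded down to the minute is only one plausible reading of the behavior described above.

```python
from datetime import datetime, timezone


def truncate_to_minute(ts: datetime) -> datetime:
    """Drop seconds and microseconds, mimicking a minute-granular time filter."""
    return ts.replace(second=0, microsecond=0)


def fetch_updated_since(docs: dict[str, datetime], since: datetime) -> list[str]:
    """Hypothetical stand-in for an 'updated since' query (assumed behavior)."""
    cutoff = truncate_to_minute(since)
    return sorted(doc_id for doc_id, modified in docs.items() if modified >= cutoff)


# Made-up documents and last-modified times.
docs = {
    "doc-a": datetime(2025, 4, 16, 22, 5, 10, tzinfo=timezone.utc),
    "doc-b": datetime(2025, 4, 16, 22, 5, 50, tzinfo=timezone.utc),
    "doc-c": datetime(2025, 4, 16, 22, 7, 0, tzinfo=timezone.utc),
}

# The previous run stopped at 22:05:30, so doc-a was already processed.
checkpoint_time = datetime(2025, 4, 16, 22, 5, 30, tzinfo=timezone.utc)

# The filter only sees 22:05, so doc-a comes back again.
refetched = fetch_updated_since(docs, checkpoint_time)
```

In this model `doc-a` is returned even though it predates the checkpoint, which is the kind of duplicate the change is meant to skip.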

How Has This Been Tested?

Tested in UI + unit tests

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@evan-onyx evan-onyx requested a review from a team as a code owner April 16, 2025 22:06
@greptile-apps greptile-apps bot left a comment

PR Summary

This PR addresses duplicate document processing in the Confluence connector by implementing checkpoint-based tracking of previously seen documents.

  • Added last_seen_doc_ids to ConfluenceCheckpoint to track and skip previously processed documents, preventing duplicates caused by the API's minute-level time granularity
  • Refactored load_everything_from_checkpoint_connector into two functions to support custom initial checkpoints in /backend/tests/unit/onyx/connectors/utils.py
  • Modified _fetch_document_batches to check document IDs against previous checkpoint before processing in /backend/onyx/connectors/confluence/connector.py
  • Added comprehensive test coverage in test_confluence_checkpointing.py to verify documents aren't reprocessed when using checkpoints

The changes follow best practices with clear boundaries, explicit typing, and fail-fast error handling while maintaining simplicity in the implementation.
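The skip-previously-seen idea summarized above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from this PR: `SketchCheckpoint` and `process_batch` are hypothetical names, and only the `last_seen_doc_ids` field is taken from the summary.

```python
from dataclasses import dataclass, field


@dataclass
class SketchCheckpoint:
    """Hypothetical stand-in for ConfluenceCheckpoint (only last_seen_doc_ids is from the PR)."""

    last_seen_doc_ids: list[str] = field(default_factory=list)


def process_batch(
    fetched_ids: list[str], checkpoint: SketchCheckpoint
) -> tuple[list[str], SketchCheckpoint]:
    """Return the IDs that still need processing, plus the next checkpoint."""
    previously_seen = set(checkpoint.last_seen_doc_ids)
    # Skip anything the previous checkpoint already recorded.
    new_ids = [doc_id for doc_id in fetched_ids if doc_id not in previously_seen]
    # Record everything fetched in this batch, including skipped docs,
    # since they may fall inside the next overlapping time window too.
    next_checkpoint = SketchCheckpoint(last_seen_doc_ids=list(fetched_ids))
    return new_ids, next_checkpoint


# First run: nothing seen yet, so both docs are processed.
first_new, cp1 = process_batch(["doc-a", "doc-b"], SketchCheckpoint())

# Second run: the overlapping window returns doc-b again; only doc-c is new.
second_new, cp2 = process_batch(["doc-b", "doc-c"], cp1)
```

Converting the ID list to a set before filtering keeps each membership check constant-time, which matters when batches are large.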

3 file(s) reviewed, no comment(s)

@evan-onyx evan-onyx added this pull request to the merge queue Apr 16, 2025
Merged via the queue into main with commit 5acae2d Apr 16, 2025
10 of 13 checks passed
@evan-onyx evan-onyx deleted the confluence-duplicate-docs branch April 16, 2025 23:58
aronszanto pushed a commit to aronszanto/onyx that referenced this pull request Apr 26, 2025
* fix re-processing of previously seen docs

* performance
AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025
* fix re-processing of previously seen docs

* performance