-
Notifications
You must be signed in to change notification settings - Fork 1.9k
fix: don't skip ccpairs if embedding swap in progress #5189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR fixes a critical bug in the embedding model swap logic within the document processing pipeline. The issue was that during embedding model transitions, the system was incorrectly applying normal indexing limits (such as the USER_FILE_INDEXING_LIMIT
of 100 user files) even when all connector-credential pairs needed to be re-indexed with the new embedding model.
The fix introduces logic to detect when an embedding swap is in progress by checking if any search settings have a FUTURE
status. When this condition is detected, the system bypasses normal filtering and collection limits to ensure all indexable connector-credential pairs are gathered for re-indexing. This prevents the problematic scenario where some documents would retain old embeddings while others use new ones, which would lead to inconsistent search results.
The changes span two files:
backend/onyx/db/enums.py
- Adds anis_future()
helper method to theIndexModelStatus
enum, following the same pattern as the existingis_current()
methodbackend/onyx/background/celery/tasks/docprocessing/tasks.py
- Implements the core detection logic that checks for embedding swaps in progress and conditionally fetches all CC pairs instead of applying normal limits
The solution integrates well with the existing codebase architecture, particularly the search settings management system and the connector-credential pair indexing pipeline. The IndexModelStatus
enum already had infrastructure for tracking different states (PAST, PRESENT, FUTURE), and this fix leverages that existing pattern to properly handle the transition states during embedding model swaps.
Confidence score: 4/5
- This PR addresses a clear functional bug with a targeted fix that should resolve embedding swap issues
- The logic is straightforward and well-contained, reducing risk of introducing new problems
- Pay close attention to the docprocessing tasks file to ensure the embedding swap detection logic works correctly in all scenarios
2 files reviewed, no comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cubic analysis
No issues found across 2 files. Review in cubic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For each search setting:
- Fetch "all" relevant cc pairs (only fetch some max # of user files)
- Apply should_index to filter out irrelevant ones
- Package them as a tuple (search_settings, all_cc_pairs_to_index)
Next step just takes this in and executes the indexing, doesn't do any filtering
1f2878a
to
960d44d
Compare
@cubic-dev-ai please review this |
@Weves I've started the AI code review. It'll take a few minutes to complete. |
2 similar comments
@Weves I've started the AI code review. It'll take a few minutes to complete. |
@Weves I've started the AI code review. It'll take a few minutes to complete. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No issues found across 4 files
f195c38
to
418639d
Compare
418639d
to
ef17f16
Compare
Description
title
How Has This Been Tested?
locally
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.
Summary by cubic
Fixed a bug where connector credential pairs were skipped during indexing if an embedding swap was in progress. Now, all relevant pairs are collected and re-indexed when a swap is detected.