
Conversation

@wenxi-onyx (Member) commented Aug 12, 2025

Description

title

How Has This Been Tested?

locally

Backporting (check the box to trigger backport action)

Note: Check that the backport action passes; otherwise, resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Summary by cubic

Fixed a bug where connector credential pairs were skipped during indexing if an embedding swap was in progress. Now, all relevant pairs are collected and re-indexed when a swap is detected.

@wenxi-onyx requested a review from a team as a code owner August 12, 2025 23:03

vercel bot commented Aug 12, 2025

The latest updates on your projects:

Project          Deployment   Preview   Comments   Updated (UTC)
internal-search  Ready        Preview   Comment    Aug 29, 2025 11:03pm

@greptile-apps bot (Contributor) left a comment

Greptile Summary

This PR fixes a critical bug in the embedding model swap logic within the document processing pipeline. The issue was that during embedding model transitions, the system was incorrectly applying normal indexing limits (such as the USER_FILE_INDEXING_LIMIT of 100 user files) even when all connector-credential pairs needed to be re-indexed with the new embedding model.

The fix introduces logic to detect when an embedding swap is in progress by checking if any search settings have a FUTURE status. When this condition is detected, the system bypasses normal filtering and collection limits to ensure all indexable connector-credential pairs are gathered for re-indexing. This prevents the problematic scenario where some documents would retain old embeddings while others use new ones, which would lead to inconsistent search results.
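
For illustration, here is a minimal, self-contained sketch of the bypass decision described above. The names (SearchSettings, CCPair, select_cc_pairs_to_index) and the plain-string status field are stand-ins, not the actual onyx schema:

```python
from dataclasses import dataclass


@dataclass
class SearchSettings:
    status: str  # "PAST", "PRESENT", or "FUTURE"; see the IndexModelStatus sketch below


@dataclass
class CCPair:
    id: int
    is_user_file: bool


USER_FILE_INDEXING_LIMIT = 100  # the normal per-cycle cap on user-file cc pairs


def select_cc_pairs_to_index(
    all_cc_pairs: list[CCPair], all_search_settings: list[SearchSettings]
) -> list[CCPair]:
    # An embedding swap is in progress when any search settings are still FUTURE,
    # i.e. a new embedding model has been configured but not yet promoted.
    swap_in_progress = any(s.status == "FUTURE" for s in all_search_settings)
    if swap_in_progress:
        # Everything must be re-embedded with the new model, so skip the limits.
        return list(all_cc_pairs)

    # Normal path: cap how many user-file cc pairs are picked up this cycle.
    user_files = [p for p in all_cc_pairs if p.is_user_file][:USER_FILE_INDEXING_LIMIT]
    others = [p for p in all_cc_pairs if not p.is_user_file]
    return others + user_files
```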

The changes span two files:

  1. backend/onyx/db/enums.py - Adds an is_future() helper method to the IndexModelStatus enum, following the same pattern as the existing is_current() method (see the sketch after this list)
  2. backend/onyx/background/celery/tasks/docprocessing/tasks.py - Implements the core detection logic that checks for embedding swaps in progress and conditionally fetches all CC pairs instead of applying normal limits
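
Item 1 is likely a one-method addition mirroring is_current(). Here is a sketch of what the enum might look like after the change; the member values and base classes are assumptions rather than a copy of backend/onyx/db/enums.py:

```python
from enum import Enum


class IndexModelStatus(str, Enum):
    # Assumed members; the real enum tracks past, current, and upcoming
    # embedding model configurations.
    PAST = "PAST"
    PRESENT = "PRESENT"
    FUTURE = "FUTURE"

    def is_current(self) -> bool:
        # Existing helper: the settings currently serving search traffic.
        return self == IndexModelStatus.PRESENT

    def is_future(self) -> bool:
        # New helper in this PR: settings for a model that is configured but
        # not yet promoted, i.e. an embedding swap is in progress.
        return self == IndexModelStatus.FUTURE
```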

The solution integrates well with the existing codebase architecture, particularly the search settings management system and the connector-credential pair indexing pipeline. The IndexModelStatus enum already had infrastructure for tracking different states (PAST, PRESENT, FUTURE), and this fix leverages that existing pattern to properly handle the transition states during embedding model swaps.

Confidence score: 4/5

  • This PR addresses a clear functional bug with a targeted fix that should resolve embedding swap issues
  • The logic is straightforward and well-contained, reducing risk of introducing new problems
  • Pay close attention to the docprocessing tasks file to ensure the embedding swap detection logic works correctly in all scenarios

2 files reviewed, no comments


@cubic-dev-ai bot (Contributor) left a comment

cubic analysis

No issues found across 2 files.

@Weves (Contributor) left a comment

For each search setting:

  1. Fetch "all" relevant cc pairs (only fetch some max # of user files)
  2. Apply should_index to filter out irrelevant ones
  3. Package them as a tuple (search_settings, all_cc_pairs_to_index)

The next step just takes this in and executes the indexing; it doesn't do any filtering (rough sketch below).
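
A rough, purely illustrative sketch of that flow; should_index, SearchSettings, CCPair, and USER_FILE_FETCH_LIMIT are simplified stand-ins for the real onyx types and helpers:

```python
from dataclasses import dataclass

USER_FILE_FETCH_LIMIT = 100  # assumed cap when fetching user-file cc pairs


@dataclass
class SearchSettings:
    id: int


@dataclass
class CCPair:
    id: int
    is_user_file: bool


def should_index(cc_pair: CCPair, settings: SearchSettings) -> bool:
    # Placeholder predicate; the real check considers pause state,
    # in-progress index attempts, etc.
    return True


def plan_indexing(
    search_settings_list: list[SearchSettings], all_cc_pairs: list[CCPair]
) -> list[tuple[SearchSettings, list[CCPair]]]:
    plans: list[tuple[SearchSettings, list[CCPair]]] = []
    for settings in search_settings_list:
        # 1. Fetch "all" relevant cc pairs (user files capped at some max).
        user_files = [p for p in all_cc_pairs if p.is_user_file][:USER_FILE_FETCH_LIMIT]
        others = [p for p in all_cc_pairs if not p.is_user_file]
        candidates = others + user_files

        # 2. Apply should_index to drop irrelevant pairs.
        to_index = [p for p in candidates if should_index(p, settings)]

        # 3. Package as (search_settings, all_cc_pairs_to_index).
        plans.append((settings, to_index))

    # The next step consumes these tuples and runs indexing without any
    # further filtering.
    return plans
```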

@wenxi-onyx force-pushed the bugfix/embedding_swap_logic branch from 1f2878a to 960d44d on August 19, 2025 22:25
@Weves (Contributor) commented Aug 20, 2025

@cubic-dev-ai please review this


cubic-dev-ai bot commented Aug 20, 2025

> @cubic-dev-ai please review this

@Weves I've started the AI code review. It'll take a few minutes to complete.


@cubic-dev-ai bot (Contributor) left a comment

No issues found across 4 files


@Weves disabled auto-merge August 29, 2025 23:39
@Weves merged commit e0fef50 into main Aug 30, 2025
14 of 16 checks passed
@Weves deleted the bugfix/embedding_swap_logic branch August 30, 2025 00:17
AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025

* don't skip ccpairs if embedding swap in progress

* refactor check_for_indexing to properly handle search setting swaps

* mypy

* mypy

* comment debugging log

* nits and more efficient active index attempt check