
Conversation

rguan72
Contributor

@rguan72 rguan72 commented Sep 17, 2025

Description

The RoPE cache is the root cause of the HF issues. It is part of the attention calculation (the attention KV cache), and sometimes the cache needs to be resized upward.

Under concurrency, if execution is interleaved so that one thread resizes the cache up while another thread is reading it, the reading thread will fail.

The max size is 2048, which is not problematic -- it's fine to warm the cache up to max size at startup. The RoPE cache only needs to be warmed once per server restart and seems to take about 4 s locally. If warmup fails for any reason, the pod won't crash (a try/except block handles it).
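The warm-once-at-startup behavior with a non-fatal failure path can be sketched as follows. This is a minimal illustration, not the actual Onyx code: `RopeCacheStub`, `warm_up_rope_cache`, and the lock-guarded flag are hypothetical stand-ins for the real HF model internals.

```python
import threading

class RopeCacheStub:
    """Hypothetical stand-in for the model's RoPE cache."""
    def __init__(self) -> None:
        self.size = 0

    def resize(self, n: int) -> None:
        # In the real model this rebuilds the cos/sin position tables;
        # resizing while another thread reads them is the race being avoided.
        self.size = max(self.size, n)

MAX_SEQ_LENGTH = 2048  # the max size mentioned in the PR description

_warmup_done = False
_warmup_lock = threading.Lock()

def warm_up_rope_cache(cache: RopeCacheStub) -> bool:
    """Warm the cache to max size once per process; never crash on failure."""
    global _warmup_done
    with _warmup_lock:
        if _warmup_done:
            return True  # already warmed this server lifetime
        try:
            cache.resize(MAX_SEQ_LENGTH)  # pay the one-time cost up front
            _warmup_done = True
        except Exception:
            # Warmup failure must not take the pod down; serve anyway and
            # let the cache grow on demand as before.
            pass
        return _warmup_done
```

Because the warmup happens under a lock before any request traffic, later readers never observe a mid-resize cache.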

Also added environment variables for managing the SQLAlchemy queue pool.

How Has This Been Tested?

Tested locally and on test.onyx.app

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes; otherwise, resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Summary by cubic

Pre-warms SentenceTransformer (HF) caches and pins the device to remove first-request stalls and cache thrash. Adds env-driven controls for Celery primary worker concurrency and DB pool overflow for steadier connection usage.

  • Bug Fixes

    • Pre-warm RoPE caches on model load and when max_seq_length increases.
    • Pin device (cuda/mps/cpu) before warmup; respect max_position_embeddings.
  • New Features

    • CELERY_WORKER_PRIMARY_CONCURRENCY sets primary worker_concurrency.
    • CELERY_WORKER_PRIMARY_POOL_OVERFLOW sets SQLAlchemy max_overflow for primary workers.
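The two new env-driven settings above could be centralized roughly like this. This is a sketch, not the actual app_configs.py code; the defaults of 4 are assumed from the review's note that these variables replace previously hardcoded values of 4.

```python
import os

# Defaults of 4 match the hardcoded values these settings replace
# (per the review summary); both are overridable per deployment.
CELERY_WORKER_PRIMARY_CONCURRENCY = int(
    os.environ.get("CELERY_WORKER_PRIMARY_CONCURRENCY") or 4
)
CELERY_WORKER_PRIMARY_POOL_OVERFLOW = int(
    os.environ.get("CELERY_WORKER_PRIMARY_POOL_OVERFLOW") or 4
)

# Downstream, these would feed the Celery app config
# (worker_concurrency=...) and the SQLAlchemy engine
# (create_engine(..., max_overflow=...)).
```

Using `or 4` also treats an empty-string env value as unset, which a bare `int(os.environ.get(...))` would not.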

@rguan72 rguan72 requested a review from a team as a code owner September 17, 2025 00:00

vercel bot commented Sep 17, 2025

The latest updates on your projects:

internal-search: Ready (Preview available), updated Sep 17, 2025 0:34am (UTC)

Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR introduces two main changes: improved configurability for Celery worker pool management and HuggingFace embedding model cache pre-warming optimizations.

The Celery pool management improvements add two new environment variables (CELERY_WORKER_PRIMARY_CONCURRENCY and CELERY_WORKER_PRIMARY_POOL_OVERFLOW) to make the primary worker's concurrency and database connection pool overflow configurable. These replace hardcoded values of 4 that were previously used in both the worker configuration and SQLAlchemy engine initialization. This change follows the codebase's pattern of centralizing configuration in app_configs.py and allows for environment-specific tuning of worker resources.

The HuggingFace cache warmup changes attempt to address RoPE (Rotary Position Embedding) cache initialization race conditions in the embedding model system. The implementation adds device selection logic and a pre-warming mechanism that tries to build embedding model caches upfront during model initialization rather than on-demand during inference. This includes direct manipulation of HuggingFace model internals to access RoPE configurations and set custom attributes for tracking cache state.
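The device selection and the `max_position_embeddings` cap mentioned above might look like the following pure-function sketch. `pick_device` and `warmup_seq_length` are hypothetical names for illustration; the real code queries torch for cuda/mps availability rather than taking flags.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Pin the device once before warmup: prefer cuda, then mps, then cpu.

    In the real code the flags would come from
    torch.cuda.is_available() and torch.backends.mps.is_available().
    """
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

def warmup_seq_length(requested: int, max_position_embeddings: int) -> int:
    """Never warm the cache beyond what the model's position embeddings support."""
    return min(requested, max_position_embeddings)
```

Pinning the device before warmup ensures the pre-built caches live on the same device the inference pass will use, so they are not rebuilt on the first request.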

Both changes aim to improve system performance and resource management - the Celery changes provide deployment flexibility while the embedding changes target concurrency issues in high-load scenarios.

Confidence score: 2/5

  • This PR contains potentially risky changes that could cause production issues, particularly in the embedding model modifications
  • Score lowered due to complex embedding model logic that directly manipulates third-party library internals and incomplete threading protection
  • Pay close attention to backend/model_server/encoders.py which contains brittle code that makes assumptions about HuggingFace model architecture

4 files reviewed, no comments


@rguan72 rguan72 changed the title from "HF Cache Warmup Fix and Celery Pool Management" to "fix: HF Cache Warmup Fix and Celery Pool Management" Sep 17, 2025
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 4 files

Prompt for AI agents (all 2 issues)

Understand the root cause of the following 2 issues and fix them.


<file name="backend/onyx/configs/app_configs.py">

<violation number="1" location="backend/onyx/configs/app_configs.py:358">
Env-derived int cast lacks ValueError handling; a non-numeric value will crash import. Wrap parsing in try/except with a default.</violation>

<violation number="2" location="backend/onyx/configs/app_configs.py:362">
Env-derived int cast lacks ValueError handling; use try/except with a default to prevent startup failure.</violation>
</file>
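A defensive parser addressing both violations could look like this. `env_int` is a hypothetical helper, not code from the PR; it falls back to the default instead of crashing the module import on a malformed value.

```python
import os

def env_int(name: str, default: int) -> int:
    """Parse an integer env var; fall back to default if missing or malformed.

    A bare int(os.environ.get(name)) raises ValueError at import time for
    values like "abc" or "", which is the startup failure flagged above.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default
```

Usage would then be, e.g., `CELERY_WORKER_PRIMARY_CONCURRENCY = env_int("CELERY_WORKER_PRIMARY_CONCURRENCY", 4)`.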


@rguan72 rguan72 merged commit 32d5e40 into main Sep 17, 2025
58 of 60 checks passed
@rguan72 rguan72 deleted the richard/celery-worker-pool-and-hf-cache-warmup branch September 17, 2025 01:57