fix: HF Cache Warmup Fix and Celery Pool Management #5435
Conversation
Greptile Summary
This PR introduces two main changes: improved configurability for Celery worker pool management and HuggingFace embedding model cache pre-warming optimizations.
The Celery pool management improvements add two new environment variables (CELERY_WORKER_PRIMARY_CONCURRENCY and CELERY_WORKER_PRIMARY_POOL_OVERFLOW) to make the primary worker's concurrency and database connection pool overflow configurable. These replace hardcoded values of 4 that were previously used in both the worker configuration and SQLAlchemy engine initialization. This change follows the codebase's pattern of centralizing configuration in app_configs.py and allows for environment-specific tuning of worker resources.
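A minimal sketch of the pattern described above. The variable names come from the PR; the exact wiring into Celery and SQLAlchemy is assumed, and the default of 4 mirrors the previously hardcoded value.

```python
import os

# Centralized in app_configs.py per the codebase's convention (assumed layout).
# Defaults of 4 mirror the values that were previously hardcoded.
CELERY_WORKER_PRIMARY_CONCURRENCY = int(
    os.environ.get("CELERY_WORKER_PRIMARY_CONCURRENCY", "4")
)
CELERY_WORKER_PRIMARY_POOL_OVERFLOW = int(
    os.environ.get("CELERY_WORKER_PRIMARY_POOL_OVERFLOW", "4")
)

# Consumers (illustrative, not the PR's exact code):
#   app.conf.worker_concurrency = CELERY_WORKER_PRIMARY_CONCURRENCY
#   engine = create_engine(url, max_overflow=CELERY_WORKER_PRIMARY_POOL_OVERFLOW)
```

Note that the bare `int(...)` cast is exactly what the review below flags: a non-numeric environment value would raise `ValueError` at import time.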
The HuggingFace cache warmup changes attempt to address RoPE (Rotary Position Embedding) cache initialization race conditions in the embedding model system. The implementation adds device selection logic and a pre-warming mechanism that tries to build embedding model caches upfront during model initialization rather than on-demand during inference. This includes direct manipulation of HuggingFace model internals to access RoPE configurations and set custom attributes for tracking cache state.
Both changes aim to improve system performance and resource management - the Celery changes provide deployment flexibility while the embedding changes target concurrency issues in high-load scenarios.
Confidence score: 2/5
- This PR contains potentially risky changes that could cause production issues, particularly in the embedding model modifications
- Score lowered due to complex embedding model logic that directly manipulates third-party library internals and incomplete threading protection
- Pay close attention to backend/model_server/encoders.py, which contains brittle code that makes assumptions about HuggingFace model architecture
4 files reviewed, no comments
2 issues found across 4 files
Prompt for AI agents (all 2 issues)
Understand the root cause of the following 2 issues and fix them.
<file name="backend/onyx/configs/app_configs.py">
<violation number="1" location="backend/onyx/configs/app_configs.py:358">
Env-derived int cast lacks ValueError handling; a non-numeric value will crash import. Wrap parsing in try/except with a default.</violation>
<violation number="2" location="backend/onyx/configs/app_configs.py:362">
Env-derived int cast lacks ValueError handling; use try/except with a default to prevent startup failure.</violation>
</file>
Description
The RoPE (Rotary Position Embedding) cache is the root cause of the HF issues. It is part of the attention calculation (alongside the attention KV cache), and sometimes the cache needs to be resized upward.
Under concurrency, if execution is interleaved so that one thread resizes the cache while another is using it, the second thread will fail.
The maximum size is 2048, which is not problematic: it is fine to warm the cache up to max size at the beginning. The RoPE cache only needs to be warmed up once per server restart and takes about 4 seconds locally. If it fails for whatever reason, the pod won't crash (a try/except block handles it).
Also added environment variables for managing the SQLAlchemy connection queue pool.
How Has This Been Tested?
Tested locally and on test.onyx.app
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.
Summary by cubic
Pre-warms SentenceTransformer (HF) caches and pins the device to remove first-request stalls and cache thrash. Adds env-driven controls for Celery primary worker concurrency and DB pool overflow for steadier connection usage.
Bug Fixes
New Features