fix(infra): lazy load and don't warm up model server models #5527
Conversation
Greptile Overview
Summary
This PR implements a comprehensive memory optimization strategy for the Onyx model server by introducing lazy loading for heavy ML libraries and disabling model warm-up by default. The changes address a significant memory usage issue where model server containers were consuming 900 MB (index) to 1.6 GB (inference) of RAM, reducing this to approximately 220 MB each when RAG functionality isn't actively used. The optimization works through two main mechanisms:
- Lazy Loading: Heavy ML libraries (`litellm`, `transformers`, `sentence_transformers`, `setfit`) are no longer imported at module level. Instead, they use `TYPE_CHECKING` imports for type hints and runtime imports only when the functionality is actually needed. This prevents the memory overhead of loading these libraries during application startup (see the sketch just after this list).
- Skip Model Warm-up: A new `SKIP_WARM_UP` configuration flag (defaulting to `true`) prevents models from being pre-loaded and warmed up at startup unless explicitly enabled. This eliminates the upfront memory cost of model initialization when users don't immediately need ML capabilities.
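A minimal sketch of the `TYPE_CHECKING` lazy-import pattern described above; the names (`get_embedding_model`, `_model`) are illustrative, not the exact Onyx code:

```python
# Type-only import plus deferred runtime import: the heavy library is never
# loaded at module import time, only on first actual use.
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Visible to type checkers only; never executed at runtime.
    from sentence_transformers import SentenceTransformer

_model: SentenceTransformer | None = None


def get_embedding_model(model_name: str) -> SentenceTransformer:
    """Construct the heavy model on first use, paying the import cost lazily."""
    global _model
    if _model is None:
        # Runtime import happens only when an embedding is actually requested.
        from sentence_transformers import SentenceTransformer

        _model = SentenceTransformer(model_name)
    return _model
```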
Key architectural changes include:
- Moving `SKIP_WARM_UP` from `app_configs` to `shared_configs` for broader accessibility
- Creating a new `configure_litellm()` singleton pattern in `onyx/llm/get_litellm.py` for lazy litellm initialization (sketched below)
- Implementing lazy imports across multiple modules, including chat LLM, image generation, search NLP models, and model server components
- Enhancing the lazy import checking script with more sophisticated per-module ignore patterns
- Updating tests to accommodate the new lazy loading patterns
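A hedged sketch of what a `configure_litellm()` singleton could look like; the actual `onyx/llm/get_litellm.py` may differ in both settings and structure:

```python
# Configure litellm exactly once, on first use rather than at import time.
import threading

_configured = False
_lock = threading.Lock()


def configure_litellm() -> None:
    """Idempotently apply litellm settings the first time any caller needs them."""
    global _configured
    if _configured:
        return
    with _lock:
        if _configured:  # double-checked locking for thread safety
            return
        import litellm  # heavy import deferred until a caller actually needs it

        litellm.drop_params = True  # illustrative setting, not the PR's exact config
        _configured = True
```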
The changes maintain full backward compatibility while making memory-efficient behavior the default. This is particularly beneficial for containerized deployments where not all instances may need immediate ML capabilities, allowing for more efficient resource utilization and better scalability.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| backend/shared_configs/configs.py | 5/5 | Added SKIP_WARM_UP configuration flag defaulting to true to control model warm-up behavior (parsing sketched below) |
| backend/onyx/llm/get_litellm.py | 4/5 | New file implementing a singleton lazy-loading pattern for litellm configuration |
| backend/onyx/llm/chat_llm.py | 4/5 | Implemented comprehensive lazy loading for litellm with the configure_litellm() pattern |
| backend/model_server/main.py | 4/5 | Added conditional model warm-up logic based on the SKIP_WARM_UP flag, with proper logging |
| backend/model_server/encoders.py | 5/5 | Moved heavy ML library imports to a lazy-loading pattern using TYPE_CHECKING |
| backend/model_server/custom_models.py | 4/5 | Implemented lazy loading for the transformers and setfit libraries |
| backend/model_server/onyx_torch_model.py | 4/5 | Added lazy loading for DistilBERT model components |
| backend/scripts/check_lazy_imports.py | 4/5 | Enhanced the lazy import checker with per-module ignore patterns and better configuration |
| backend/onyx/natural_language_processing/search_nlp_models.py | 5/5 | Added a SKIP_WARM_UP guard to the cross-encoder warm-up function |
| backend/onyx/configs/app_configs.py | 5/5 | Removed SKIP_WARM_UP configuration (moved to shared_configs) |
| backend/onyx/tools/tool_implementations/images/image_generation_tool.py | 5/5 | Moved the litellm image generation import to a lazy-loading pattern |
| backend/onyx/secondary_llm_flows/starter_message_creation.py | 5/5 | Implemented lazy loading for litellm utilities |
| backend/onyx/agents/agent_search/shared_graph_utils/llm.py | 5/5 | Added lazy loading for litellm utility functions |
| backend/tests/unit/scripts/test_check_lazy_imports.py | 5/5 | Updated tests to match the new lazy import checker API with LazyImportSettings |
| backend/tests/unit/onyx/llm/test_chat_llm.py | 5/5 | Updated mock patch paths to reflect the new lazy-loading implementation |
| backend/tests/external_dependency_unit/full_setup.py | 5/5 | Updated to use shared_configs for the SKIP_WARM_UP configuration |
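For reference, one plausible shape for the SKIP_WARM_UP flag added in shared_configs/configs.py (a sketch; the PR's exact parsing may differ):

```python
import os

# Defaults to true: warm-up is skipped unless the operator explicitly disables it.
SKIP_WARM_UP = os.environ.get("SKIP_WARM_UP", "true").lower() == "true"
```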
Confidence score: 4/5
- This PR is safe to merge with good confidence in the memory optimization benefits and architectural improvements
- Score reflects thorough implementation of lazy loading patterns with proper error handling and backward compatibility
- Pay close attention to backend/scripts/check_lazy_imports.py for potential path matching issues and backend/onyx/llm/chat_llm.py for litellm configuration complexity
Sequence Diagram
```mermaid
sequenceDiagram
    participant User
    participant MainServer as Model Server Main
    participant CustomModels as Custom Models Module
    participant Encoders as Encoders Module
    participant Libraries as ML Libraries (transformers, setfit, etc.)

    User->>MainServer: Start Model Server
    MainServer->>MainServer: Initialize FastAPI app
    MainServer->>MainServer: Check SKIP_WARM_UP=true (new default)
    Note over MainServer: Skip warm-up to save memory
    MainServer->>MainServer: Set lifespan handlers
    MainServer-->>User: Server ready (~220MB RAM)

    User->>CustomModels: Request connector classification
    CustomModels->>CustomModels: get_connector_classifier_tokenizer()
    Note over CustomModels: Lazy import: from transformers import AutoTokenizer
    CustomModels->>Libraries: Import transformers only when needed
    Libraries-->>CustomModels: Tokenizer instance
    CustomModels->>CustomModels: get_local_connector_classifier()
    CustomModels-->>User: Classification response

    User->>CustomModels: Request content classification
    CustomModels->>CustomModels: get_local_information_content_model()
    Note over CustomModels: Lazy import: from setfit import SetFitModel
    CustomModels->>Libraries: Import setfit only when needed
    Libraries-->>CustomModels: SetFitModel instance
    CustomModels-->>User: Content classification response

    User->>Encoders: Request embedding
    Encoders->>Encoders: get_embedding_model()
    Note over Encoders: Lazy import: from sentence_transformers import SentenceTransformer
    Encoders->>Libraries: Import sentence_transformers only when needed
    Libraries-->>Encoders: Model instance
    Encoders-->>User: Embedding response
```
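The conditional warm-up step in the diagram could look roughly like the following, assuming a FastAPI lifespan handler; `warm_up_models` and the inlined flag are illustrative placeholders, not the PR's actual `model_server/main.py`:

```python
# Skip model warm-up by default so startup memory stays low; models load
# lazily on the first request instead.
from contextlib import asynccontextmanager

from fastapi import FastAPI

SKIP_WARM_UP = True  # would come from shared_configs.configs in the real code


def warm_up_models() -> None:
    """Placeholder: would load and exercise embedding/reranking models once."""


@asynccontextmanager
async def lifespan(app: FastAPI):
    if SKIP_WARM_UP:
        # New default: no upfront model loading, keeping the container small.
        print("SKIP_WARM_UP is set; models will load lazily on first request")
    else:
        warm_up_models()
    yield


app = FastAPI(lifespan=lifespan)
```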
16 files reviewed, 3 comments
5 issues found across 16 files
Prompt for AI agents (all 5 issues)

```xml
Understand the root cause of the following 5 issues and fix them.

<file name="backend/tests/external_dependency_unit/full_setup.py">
  <violation number="1" location="backend/tests/external_dependency_unit/full_setup.py:51">
    Mutating SKIP_WARM_UP at runtime may not affect modules that imported it by value,
    leading to inconsistent skip-warmup behavior based on import order.
  </violation>
</file>

<file name="backend/onyx/llm/chat_llm.py">
  <violation number="1" location="backend/onyx/llm/chat_llm.py:455">
    Importing exceptions inside the except block can raise ImportError and mask the
    original error; import these once before the try or guard the import separately.
  </violation>
</file>

<file name="backend/shared_configs/configs.py">
  <violation number="1" location="backend/shared_configs/configs.py:11">
    SKIP_WARM_UP appears unused across the codebase, so this change has no effect.
    According to linked Linear issue DAN-2656, skipping model warmup is required;
    ensure this config is actually consumed by the model server startup logic.
  </violation>
</file>

<file name="backend/scripts/check_lazy_imports.py">
  <violation number="1" location="backend/scripts/check_lazy_imports.py:20">
    Rule violated: Use Pydantic over dataclasses. Use Pydantic BaseModel instead of
    @dataclass for new models. Replace LazyImportSettings with a Pydantic model and
    remove @dataclass.
  </violation>
  <violation number="2" location="backend/scripts/check_lazy_imports.py:163">
    Path comparison uses OS-specific separators; forward-slash ignore patterns won't
    match on Windows. Normalize to POSIX before comparing.
  </violation>
</file>
```
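Two of the flagged issues have conventional fixes; a hedged sketch of each (illustrative code assuming the repo's module layout, not the PR's actual implementation):

```python
from pathlib import Path

import shared_configs.configs as shared_configs  # assumes the repo's module layout


# full_setup.py issue: `from shared_configs.configs import SKIP_WARM_UP` copies the
# value at import time, so later runtime mutation is invisible to that module.
# Reading the attribute through the module object always sees the current value.
def should_warm_up() -> bool:
    return not shared_configs.SKIP_WARM_UP


# check_lazy_imports.py issue: forward-slash ignore patterns fail against Windows
# paths; normalizing to POSIX separators before comparing makes them portable.
def matches_ignore_pattern(file_path: str, pattern: str) -> bool:
    return Path(file_path).as_posix().startswith(pattern)
```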
React with 👍 or 👎 to teach cubic. Mention @cubic-dev-ai to give feedback, ask questions, or re-run the review.
Description
Lazy load heavy model server libraries and skip model warm-up so that memory doesn't increase if the user never uses RAG.
Container RAM sizes go from:
- index model server: 900 MB
- inference model server: 1.6 GB
to ~220 MB each (roughly 700 MB saved by lazy-loading packages; the additional savings on the inference model server come from not warming up).
https://linear.app/danswer/issue/DAN-2656/lazy-load-ml-libraries-model-server
How Has This Been Tested?
[Describe the tests you ran to verify your changes]
Additional Options