Conversation

edwin-onyx
Contributor

@edwin-onyx edwin-onyx commented Sep 28, 2025

Description

Lazy-load heavy model server libraries and skip model warm-up so that memory usage doesn't increase if the user never runs RAG.

RAM usage per container goes from:
index model server: 900 MB
inference model server: 1.6 GB

to ~220 MB each (~700 MB saved via lazy package loading; the additional savings for the inference model server come from skipping warm-up).

https://linear.app/danswer/issue/DAN-2656/lazy-load-ml-libraries-model-server

How Has This Been Tested?

[Describe the tests you ran to verify your changes]

Additional Options

  • [Optional] Override Linear Check

@edwin-onyx edwin-onyx requested a review from Weves September 30, 2025 00:37
@edwin-onyx edwin-onyx marked this pull request as ready for review September 30, 2025 03:05
@edwin-onyx edwin-onyx requested a review from a team as a code owner September 30, 2025 03:05

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Summary

This PR implements a comprehensive memory optimization strategy for the Onyx model server by introducing lazy loading for heavy ML libraries and disabling model warm-up by default. The changes address a significant memory usage issue where model server containers were consuming 900MB (index) to 1.6GB (inference) of RAM, reducing this to approximately 220MB each when RAG functionality isn't actively used.

The optimization works through two main mechanisms:

  1. Lazy Loading: Heavy ML libraries (litellm, transformers, sentence_transformers, setfit) are no longer imported at module level. Instead, they use TYPE_CHECKING imports for type hints and runtime imports only when the functionality is actually needed. This prevents the memory overhead of loading these libraries during application startup.

  2. Skip Model Warm-up: A new SKIP_WARM_UP configuration flag (defaulting to true) prevents models from being pre-loaded and warmed up at startup unless explicitly enabled. This eliminates the upfront memory cost of model initialization when users don't immediately need ML capabilities.

Key architectural changes include:

  • Moving SKIP_WARM_UP from app_configs to shared_configs for broader accessibility
  • Creating a new configure_litellm() singleton pattern in onyx/llm/get_litellm.py for lazy litellm initialization
  • Implementing lazy imports across multiple modules including chat LLM, image generation, search NLP models, and model server components
  • Enhancing the lazy import checking script with more sophisticated per-module ignore patterns
  • Updating tests to accommodate the new lazy loading patterns

The changes maintain full backward compatibility while making memory-efficient behavior the default. This is particularly beneficial for containerized deployments where not all instances may need immediate ML capabilities, allowing for more efficient resource utilization and better scalability.

Important Files Changed

Changed Files
Filename Score Overview
backend/shared_configs/configs.py 5/5 Added SKIP_WARM_UP configuration flag defaulting to true to control model warmup behavior
backend/onyx/llm/get_litellm.py 4/5 New file implementing singleton lazy loading pattern for litellm configuration
backend/onyx/llm/chat_llm.py 4/5 Implemented comprehensive lazy loading for litellm with configure_litellm() pattern
backend/model_server/main.py 4/5 Added conditional model warmup logic based on SKIP_WARM_UP flag with proper logging
backend/model_server/encoders.py 5/5 Moved heavy ML library imports to lazy loading pattern using TYPE_CHECKING
backend/model_server/custom_models.py 4/5 Implemented lazy loading for transformers and setfit libraries
backend/model_server/onyx_torch_model.py 4/5 Added lazy loading for DistilBERT model components
backend/scripts/check_lazy_imports.py 4/5 Enhanced lazy import checker with per-module ignore patterns and better configuration
backend/onyx/natural_language_processing/search_nlp_models.py 5/5 Added SKIP_WARM_UP guard to cross-encoder warmup function
backend/onyx/configs/app_configs.py 5/5 Removed SKIP_WARM_UP configuration (moved to shared_configs)
backend/onyx/tools/tool_implementations/images/image_generation_tool.py 5/5 Moved litellm image generation import to lazy loading pattern
backend/onyx/secondary_llm_flows/starter_message_creation.py 5/5 Implemented lazy loading for litellm utilities
backend/onyx/agents/agent_search/shared_graph_utils/llm.py 5/5 Added lazy loading for litellm utility functions
backend/tests/unit/scripts/test_check_lazy_imports.py 5/5 Updated tests to match new lazy import checker API with LazyImportSettings
backend/tests/unit/onyx/llm/test_chat_llm.py 5/5 Updated mock patch paths to reflect new lazy loading implementation
backend/tests/external_dependency_unit/full_setup.py 5/5 Updated to use shared_configs for SKIP_WARM_UP configuration

Confidence score: 4/5

  • This PR is safe to merge with good confidence in the memory optimization benefits and architectural improvements
  • Score reflects thorough implementation of lazy loading patterns with proper error handling and backward compatibility
  • Pay close attention to backend/scripts/check_lazy_imports.py for potential path matching issues and backend/onyx/llm/chat_llm.py for litellm configuration complexity

Sequence Diagram

sequenceDiagram
    participant User
    participant MainServer as Model Server Main
    participant CustomModels as Custom Models Module
    participant Encoders as Encoders Module
    participant Libraries as ML Libraries (transformers, setfit, etc.)
    
    User->>MainServer: Start Model Server
    MainServer->>MainServer: Initialize FastAPI app
    MainServer->>MainServer: Check SKIP_WARM_UP=true (new default)
    Note over MainServer: Skip warm-up to save memory
    MainServer->>MainServer: Set lifespan handlers
    MainServer-->>User: Server ready (~220MB RAM)
    
    User->>CustomModels: Request connector classification
    CustomModels->>CustomModels: get_connector_classifier_tokenizer()
    Note over CustomModels: Lazy import: from transformers import AutoTokenizer
    CustomModels->>Libraries: Import transformers only when needed
    Libraries-->>CustomModels: Tokenizer instance
    CustomModels->>CustomModels: get_local_connector_classifier()
    CustomModels-->>User: Classification response
    
    User->>CustomModels: Request content classification
    CustomModels->>CustomModels: get_local_information_content_model()
    Note over CustomModels: Lazy import: from setfit import SetFitModel
    CustomModels->>Libraries: Import setfit only when needed
    Libraries-->>CustomModels: SetFitModel instance
    CustomModels-->>User: Content classification response
    
    User->>Encoders: Request embedding
    Encoders->>Encoders: get_embedding_model()
    Note over Encoders: Lazy import: from sentence_transformers import SentenceTransformer
    Encoders->>Libraries: Import sentence_transformers only when needed
    Libraries-->>Encoders: Model instance
    Encoders-->>User: Embedding response

16 files reviewed, 3 comments


Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

@cubic-dev-ai cubic-dev-ai bot left a comment


5 issues found across 16 files

Prompt for AI agents (all 5 issues)

Understand the root cause of the following 5 issues and fix them.


<file name="backend/tests/external_dependency_unit/full_setup.py">

<violation number="1" location="backend/tests/external_dependency_unit/full_setup.py:51">
Mutating SKIP_WARM_UP at runtime may not affect modules that imported it by value, leading to inconsistent skip-warmup behavior based on import order.</violation>
</file>
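The pitfall cubic flags here — an importing module capturing the value rather than the attribute — can be reproduced in a few lines (module names are illustrative stand-ins):

```python
import types

# Stand-in for shared_configs.configs:
config = types.ModuleType("config")
config.SKIP_WARM_UP = True

# Simulates `from config import SKIP_WARM_UP` in another module:
# the local name is bound to the *value* at import time.
skip_warm_up = config.SKIP_WARM_UP

config.SKIP_WARM_UP = False  # runtime mutation, as in the test setup

# The by-value binding is stale; only attribute access sees the update.
assert skip_warm_up is True
assert config.SKIP_WARM_UP is False
```

Reading the flag via `config.SKIP_WARM_UP` at the call site (or monkeypatching with `unittest.mock.patch`) avoids the stale copy.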

<file name="backend/onyx/llm/chat_llm.py">

<violation number="1" location="backend/onyx/llm/chat_llm.py:455">
Importing exceptions inside the except block can raise ImportError and mask the original error; import these once before the try or guard the import separately.</violation>
</file>
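The fix cubic suggests is to resolve exception types before the `try` so a failed import cannot shadow the original error. A runnable sketch, using `json.JSONDecodeError` as a stand-in for litellm's exception classes:

```python
# Anti-pattern: an import inside `except` can itself raise
# ModuleNotFoundError, masking the error being handled.
def handle_bad():
    try:
        raise ValueError("original error")
    except Exception:
        from nonexistent_pkg import SpecialError  # noqa: F401 -- may fail!
        raise

# Fix: bind the exception type up front (or guard the import separately).
from json import JSONDecodeError  # resolved once, before any try block

def parse_or_none(payload: str):
    import json
    try:
        return json.loads(payload)
    except JSONDecodeError:
        return None  # the intended handler always matches
```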

<file name="backend/shared_configs/configs.py">

<violation number="1" location="backend/shared_configs/configs.py:11">
SKIP_WARM_UP appears unused across the codebase, so this change has no effect. According to linked Linear issue DAN-2656, skipping model warmup is required—ensure this config is actually consumed by the model server startup logic.</violation>
</file>

<file name="backend/scripts/check_lazy_imports.py">

<violation number="1" location="backend/scripts/check_lazy_imports.py:20">
Rule violated: **Use Pydantic over dataclasses**

Use Pydantic BaseModel instead of @dataclass for new models. Replace LazyImportSettings with a Pydantic model and remove @dataclass.</violation>

<violation number="2" location="backend/scripts/check_lazy_imports.py:163">
Path comparison uses OS-specific separators; forward-slash ignore patterns won’t match on Windows. Normalize to POSIX before comparing.</violation>
</file>
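The separator issue in the last violation can be addressed by normalizing paths to POSIX form before comparing. `PureWindowsPath` is used below so the sketch exercises both separators on any OS; in the script itself, `pathlib.Path(p).as_posix()` would give the same result on Windows:

```python
from pathlib import PureWindowsPath

IGNORE_PATTERNS = {"backend/onyx/llm/chat_llm.py"}

def is_ignored(path_str: str) -> bool:
    """Match against forward-slash ignore patterns regardless of which
    separator the OS produced."""
    # PureWindowsPath treats both '/' and '\\' as separators;
    # as_posix() always emits '/'.
    return PureWindowsPath(path_str).as_posix() in IGNORE_PATTERNS
```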

