[Doc][Serve][LLM] Add doc for deploying DeepSeek #52592

Merged on Apr 27, 2025 · 12 commits
49 changes: 48 additions & 1 deletion doc/source/serve/llm/serving-llms.rst
@@ -60,7 +60,6 @@ Quickstart Examples
-------------------



Deployment through :class:`LLMRouter <ray.serve.llm.LLMRouter>`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -249,6 +248,54 @@ For deploying multiple models, you can pass a list of :class:`LLMConfig <ray.serve.llm.LLMConfig>` objects
serve.run(llm_app, blocking=True)


Example: Deploying DeepSeek
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following example shows how to deploy DeepSeek-R1 or DeepSeek-V3:

.. tab-set::

    .. tab-item:: Builder Pattern
        :sync: builder

        .. code-block:: python

            from ray import serve
            from ray.serve.llm import LLMConfig, LLMRouter, LLMServer

            llm_config = LLMConfig(
                model_loading_config=dict(
                    model_id="deepseek",
                    # Change this to the local path where you downloaded the model.
                    model_source="/path/to/the/model",
                ),
                deployment_config=dict(
                    autoscaling_config=dict(
                        min_replicas=1,
                        max_replicas=1,
                    )
                ),
                # Change this to match the accelerator type of your nodes.
                accelerator_type="H100",
                runtime_env=dict(env_vars=dict(VLLM_USE_V1="1")),
                # Customize engine arguments as needed (e.g., vLLM engine kwargs).
                # tensor_parallel_size * pipeline_parallel_size = 16 GPUs total,
                # e.g., two nodes with 8 H100 GPUs each.
                engine_kwargs=dict(
                    tensor_parallel_size=8,
                    pipeline_parallel_size=2,
                    gpu_memory_utilization=0.92,
                    dtype="auto",
                    max_num_seqs=40,
                    max_model_len=16384,
                    enable_chunked_prefill=True,
                    enable_prefix_caching=True,
                    trust_remote_code=True,
                ),
            )

            # Deploy the application.
            deployment = LLMServer.as_deployment(
                llm_config.get_serve_options(name_prefix="vLLM:")
            ).bind(llm_config)
            llm_app = LLMRouter.as_deployment().bind([deployment])
            serve.run(llm_app)
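
The example above loads the model weights from local storage (``model_source``). One way to fetch them beforehand is with ``huggingface_hub``; the following is a minimal sketch, assuming the Hugging Face repo id ``deepseek-ai/DeepSeek-R1`` and an illustrative target directory:

.. code-block:: python

    from huggingface_hub import snapshot_download

    # Download the DeepSeek-R1 weights to local storage. The repo id and
    # target directory are illustrative; adjust them to your environment.
    snapshot_download(
        repo_id="deepseek-ai/DeepSeek-R1",
        local_dir="/path/to/the/model",
    )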

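Once the application is running, you can query it with any OpenAI-compatible client. A minimal sketch, assuming Serve's default HTTP address ``http://localhost:8000`` and the ``model_id`` of ``deepseek`` configured above:

.. code-block:: python

    from openai import OpenAI

    # The router exposes an OpenAI-compatible API; the API key is unused
    # here but required by the client.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")
    response = client.chat.completions.create(
        model="deepseek",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)
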
Production Deployment
---------------------
