
Commit 1dab325

Multi-LoRA docs follow-up (#40)
1 parent f064eed

1 file changed

docs/llama_multi_lora_tutorial.md

Lines changed: 11 additions & 51 deletions
@@ -42,7 +42,7 @@ The following tutorial demonstrates how to deploy **a LLaMa model** with **multi

 ## Step 1: Start a docker container for triton-vllm serving

-**A docker container is strongly recommended for serving**, and this tutorial will only demonstrate how to launch triton in docker env.
+**A docker container is strongly recommended for serving**, and this tutorial will only demonstrate how to launch triton in the docker environment.

 First, start a docker container using the tritonserver image with vLLM backend from [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags):

@@ -56,16 +56,16 @@ sudo docker run --gpus all -it --net=host -p 8001:8001 --shm-size=12G \
 /bin/bash
 ```

-**NOTICE:** the version of triton docker image should be configurated, here we use `<xx.yy>` to symbolize.
+**NOTICE:** the version of the Triton docker image should be configured; here and throughout this tutorial we use `<xx.yy>` to denote the version.

 Triton's vLLM container has been introduced starting from 23.10 release, and `multi-lora` experimental support was added in vLLM v0.3.0 release.

 > Docker image version `nvcr.io/nvidia/tritonserver:24.05-vllm-python-py3` or higher version is strongly recommended.

+> [!IMPORTANT]
+> The 24.05 release is still under active development, and the relevant NGC containers are not available at this time.
 ---

-<!-- TODO: check for the specific correct version, currently we set it to 24.05 -->
-
 For **pre-24.05 containers**, the docker images didn't support multi-lora feature, so you need to replace that provided in the container `/opt/tritonserver/backends/vllm/model.py` with the most up to date version. Just follow this command:

 Download the `model.py` script from github:
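
As a concrete reading of the placeholder convention noted above: the same `<xx.yy>` appears in the container tag here and, in the next hunk, as the `r<xx.yy>` branch of the `model.py` URL. Below is a minimal sketch with the tag filled in; the tag value and the mounted workspace path are assumptions, not values taken from this commit:

```bash
# Illustration only: substitute the Triton release you actually use for <xx.yy>;
# the mounted workspace path is an assumption for this sketch.
sudo docker run --gpus all -it --net=host -p 8001:8001 --shm-size=12G \
  -v ${PWD}/vllm_workspace:/vllm_workspace -w /vllm_workspace \
  nvcr.io/nvidia/tritonserver:24.05-vllm-python-py3 \
  /bin/bash
```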
@@ -74,50 +74,10 @@ Download the `model.py` script from github:
 wget -P /opt/tritonserver/backends/vllm/ https://raw.githubusercontent.com/triton-inference-server/vllm_backend/r<xx.yy>/src/model.py
 ```

-**Notice:** `r<xx.yy>` is the triton version you need to configure to r24.04 or later release.
-
 This command will download the `model.py` script to the Triton vllm backend directory which will enable multi-lora feature.

-## Step 2: Install vLLM with multi-lora feature
-
-We are now in the docker container, and **the following operations will be done in container environment.**
-
-```bash
-cd /vllm_workspace
-```
-
-**NOTICE**: To enable multi-lora feature and speed up the inference, vLLM has integrated punica kernels. To compile the punica kernels, you need to turn the `VLLM_INSTALL_PUNICA_KERNELS` env variable on to allow punica kernels compilation.
-
-By default, the punica kernels will **NOT** be compiled when installing the vLLM.
-
-__2.1 install with pip__
-
-For Triton version before 24.05, you need the following command:
-
-```bash
-VLLM_INSTALL_PUNICA_KERNELS=1 pip install vllm==0.4.0.post1
-```
-
-__2.2 build from source__
-
-As alternative, you can build vLLM from source code:
-
-git clone vllm repository:
-
-```bash
-git clone https://github.com/vllm-project/vllm.git
-```
-
-All you need to do is to follow the simple step:
-
-```bash
-cd vllm
-VLLM_INSTALL_PUNICA_KERNELS=1 pip install .
-```
-
-This may take you 5-10 mins.

-## Step 3: Prepare your weights
+## Step 2: Prepare your weights

 To support multi-lora on Triton, you need to manage your file path for **model backbone** and **lora weights** separately.
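
The `weights` tree referenced by the next hunk lives in unchanged lines of the tutorial and is not reproduced here. Purely to illustrate the backbone-versus-adapter separation described above, a workspace could be prepared like this (all directory and adapter names are assumptions):

```bash
# Illustration only: directory and adapter names are assumptions, not taken from this commit.
mkdir -p /vllm_workspace/weights/backbone/llama-7b-hf       # model backbone weights
mkdir -p /vllm_workspace/weights/loras/alpaca-lora-7b       # LoRA adapter 1
mkdir -p /vllm_workspace/weights/loras/bactrian-x-llama-7b  # LoRA adapter 2
```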

@@ -135,9 +95,9 @@ weights
 + A workspace for `vllm`, and `model backbone weights`, `LoRA adapter weights` is strongly recommended.
 + You should expand the storage of these weight files to ensure they are logically organized in the workspace.

-## Step 4: Prepare `model repository` for Triton Server
+## Step 3: Prepare `model repository` for Triton Server

-__4.1 Download the model repository files__
+__3.1 Download the model repository files__

 To use Triton, a model repository is needed, for *model path* , *backend configuration* and other information. The vllm backend is implemented based on python backend, and `sampling_params` of vllm are sampled from `model.json`.

@@ -182,7 +142,7 @@ vllm_workspace
 └── config.pbtxt
 ```

-__4.2 Populate `model.json`__
+__3.2 Populate `model.json`__

 For this tutorial we will use the following set of parameters, specified in the `model.json`.
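
The parameter block itself sits in unchanged lines of the tutorial and is not part of this diff. Below is a minimal sketch of what a multi-lora-enabled `model.json` can look like; the destination path and every value are assumptions, apart from `enable_lora`, which the warning later in this diff depends on:

```bash
# Sketch only: path and parameter values are assumptions; the tutorial's unchanged
# text holds the real set. enable_lora must be true for multi-lora serving.
cat > model_repository/vllm_model/1/model.json <<'EOF'
{
    "model": "/vllm_workspace/weights/backbone/llama-7b-hf",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.8,
    "enable_lora": true,
    "max_lora_rank": 16
}
EOF
```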

@@ -209,7 +169,7 @@ For this tutorial we will use the following set of parameters, specified in the

 The full set of parameters can be found [here](https://github.com/Yard1/vllm/blob/multi_lora/vllm/engine/arg_utils.py#L11).

-__4.3 Specify local lora path__
+__3.3 Specify local lora path__

 vLLM v0.4.0.post1 supported the inference of **local lora weights applying**, which means that the vllm cannot pull any lora adapter from huggingface. So triton should know where the local lora weights are.
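
The mapping file discussed in the next hunk, `multi_lora.json`, also lives in unchanged lines. Below is a minimal sketch of the name-to-path mapping it describes; adapter names, paths, and the destination location are all assumptions:

```bash
# Sketch only: keys are the lora names a request can select, values are local
# adapter directories; names, paths, and destination are assumptions.
cat > model_repository/vllm_model/1/multi_lora.json <<'EOF'
{
    "alpaca": "/vllm_workspace/weights/loras/alpaca-lora-7b",
    "bactrian": "/vllm_workspace/weights/loras/bactrian-x-llama-7b"
}
EOF
```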

@@ -233,7 +193,7 @@ The **key** should be the supported lora name, and the **value** should be the s

 > **Warning**: if you set `enable_lora` to `true` in `model.json` without creating a `multi_lora.json` file, the server will throw `FileNotFoundError` when initializing.

-## Step 5: Launch Triton
+## Step 4: Launch Triton

 ```bash
 # NOTICE: you must first cd to your vllm_workspace path.
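# Aside, a sketch rather than part of the diff: the tritonserver invocation itself
# sits in unchanged lines between these hunks; the repository path is an assumption.
cd /vllm_workspace
tritonserver --model-repository ./model_repository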
@@ -249,7 +209,7 @@ I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
 I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
 ```

-## Step 6: Send a request
+## Step 5: Send a request

 A client request script for multi-lora was prepared, downloading the client script from source:
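
The client script URL is in unchanged lines and not shown in this hunk. As a rough illustration of exercising the server once it reports the endpoints above, a plain request against Triton's HTTP generate endpoint might look like the sketch below; the model name and prompt are assumptions, and LoRA adapter selection is handled by the tutorial's client script rather than shown here:

```bash
# Sketch only: assumes the model is served under the name "vllm_model";
# LoRA selection is done by the tutorial's client script and is not shown.
curl -X POST localhost:8000/v2/models/vllm_model/generate \
  -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
```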
