docs/llama_multi_lora_tutorial.md (11 additions, 51 deletions)
@@ -42,7 +42,7 @@ The following tutorial demonstrates how to deploy **a LLaMa model** with **multi

## Step 1: Start a docker container for triton-vllm serving

-**A docker container is strongly recommended for serving**, and this tutorial will only demonstrate how to launch triton in docker env.
+**A docker container is strongly recommended for serving**, and this tutorial will only demonstrate how to launch triton in the docker environment.

First, start a docker container using the tritonserver image with vLLM backend from [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags):

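For reference, a plausible full form of that launch command, assuming a local `./vllm_workspace` directory is mounted into the container (the mount path is illustrative; `<xx.yy>` is the Triton release):

```bash
# Mount a local workspace so model and lora weights are visible inside the container.
# The mount path is illustrative; <xx.yy> is the Triton release, e.g. 24.05.
sudo docker run --gpus all -it --net=host -p 8001:8001 --shm-size=12G \
  -v ${PWD}/vllm_workspace:/vllm_workspace -w /vllm_workspace \
  nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3 \
  /bin/bash
```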
@@ -56,16 +56,16 @@ sudo docker run --gpus all -it --net=host -p 8001:8001 --shm-size=12G \
/bin/bash
```

-**NOTICE:** the version of triton docker image should be configurated, here we use `<xx.yy>` to symbolize.
+**NOTICE:** the version of the triton docker image should be configured; here and throughout this tutorial we use `<xx.yy>` to denote the version.

Triton's vLLM container has been introduced starting from 23.10 release, and `multi-lora` experimental support was added in vLLM v0.3.0 release.

> Docker image version `nvcr.io/nvidia/tritonserver:24.05-vllm-python-py3` or higher version is strongly recommended.

+> [!IMPORTANT]
+> The 24.05 release is still under active development, and the corresponding NGC containers are not available at this time.
---

-<!-- TODO: check for the specific correct version, currently we set it to 24.05 -->
-
For **pre-24.05 containers**, the docker images didn't support the multi-lora feature, so you need to replace the `/opt/tritonserver/backends/vllm/model.py` provided in the container with the most up-to-date version. Just follow this command:

Download the `model.py` script from GitHub:
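One way to fetch it, assuming the script lives at `src/model.py` on the `r<xx.yy>` branch of the `triton-inference-server/vllm_backend` repository:

```bash
# Replace the model.py shipped in the container with the one from the vllm_backend repo.
# r<xx.yy> should be r24.04 or a later release branch.
wget -O /opt/tritonserver/backends/vllm/model.py \
  https://raw.githubusercontent.com/triton-inference-server/vllm_backend/r<xx.yy>/src/model.py
```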
@@ -74,50 +74,10 @@ Download the `model.py` script from github:
**Notice:** `r<xx.yy>` is the triton version, which you need to configure to r24.04 or a later release.
-
This command will download the `model.py` script to the Triton vllm backend directory, which enables the multi-lora feature.

-## Step 2: Install vLLM with multi-lora feature
-
-We are now in the docker container, and **the following operations will be done in container environment.**
-
-```bash
-cd /vllm_workspace
-```
-
-**NOTICE**: To enable multi-lora feature and speed up the inference, vLLM has integrated punica kernels. To compile the punica kernels, you need to turn the `VLLM_INSTALL_PUNICA_KERNELS` env variable on to allow punica kernels compilation.
-
-By default, the punica kernels will **NOT** be compiled when installing the vLLM.
-
-__2.1 install with pip__
-
-For Triton version before 24.05, you need the following command:
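A minimal sketch of installing vLLM with the punica kernels compiled in, assuming a pre-0.5 vLLM source tree where `VLLM_INSTALL_PUNICA_KERNELS` is still honored (the pinned tag is illustrative):

```bash
# Build vLLM from source with the punica kernels enabled (needed for multi-lora).
# The v0.4.0.post1 tag is illustrative; newer vLLM releases ignore this variable.
git clone --branch v0.4.0.post1 https://github.com/vllm-project/vllm.git
cd vllm
VLLM_INSTALL_PUNICA_KERNELS=1 pip install .
```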
To support multi-lora on Triton, you need to manage the file paths for the **model backbone** and **lora weights** separately.
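An illustrative layout for the workspace; the backbone and adapter names below are placeholders:

```bash
# Separate directories for the backbone weights, the lora adapters,
# and the Triton model repository; all names below are placeholders.
mkdir -p /vllm_workspace/weights/backbone/llama-7b-hf
mkdir -p /vllm_workspace/weights/loras/alpaca-lora-7b
mkdir -p /vllm_workspace/weights/loras/wizard-lora-7b
mkdir -p /vllm_workspace/model_repository/vllm_model/1
```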
@@ -135,9 +95,9 @@ weights
+ A workspace for `vllm`, `model backbone weights`, and `LoRA adapter weights` is strongly recommended.
+ You should arrange these weight files so that they are logically organized in the workspace.

-## Step 4: Prepare `model repository` for Triton Server
+## Step 3: Prepare `model repository` for Triton Server

-__4.1 Download the model repository files__
+__3.1 Download the model repository files__

To use Triton, a model repository is needed, holding the *model path*, *backend configuration* and other information. The vllm backend is implemented on top of the python backend, and vLLM's `sampling_params` are read from `model.json`.
@@ -182,7 +142,7 @@ vllm_workspace
└── config.pbtxt
```

-__4.2 Populate `model.json`__
+__3.2 Populate `model.json`__

For this tutorial we will use the following set of parameters, specified in the `model.json`.
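A plausible `model.json` with lora enabled; the path and values below are illustrative, and the keys follow vLLM's engine arguments:

```bash
# Write an illustrative model.json into the model repository version directory.
# All paths and values are placeholders; adjust them to your own setup.
cat > /vllm_workspace/model_repository/vllm_model/1/model.json <<'EOF'
{
    "model": "/vllm_workspace/weights/backbone/llama-7b-hf",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.8,
    "enable_lora": true,
    "max_lora_rank": 16
}
EOF
```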
@@ -209,7 +169,7 @@ For this tutorial we will use the following set of parameters, specified in the

The full set of parameters can be found [here](https://github.com/Yard1/vllm/blob/multi_lora/vllm/engine/arg_utils.py#L11).

-__4.3 Specify local lora path__
+__3.3 Specify local lora path__

vLLM v0.4.0.post1 supports inference with **locally applied lora weights**, which means that vLLM cannot pull any lora adapter from huggingface, so Triton needs to know where the local lora weights are.
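A sketch of a `multi_lora.json` mapping each supported lora name to its local path; the adapter names and paths are illustrative, and placing the file next to `model.json` is an assumption:

```bash
# Illustrative multi_lora.json; keys are lora names, values are local paths.
# Placing it next to model.json is an assumption; adjust to your backend version.
cat > /vllm_workspace/model_repository/vllm_model/1/multi_lora.json <<'EOF'
{
    "alpaca": "/vllm_workspace/weights/loras/alpaca-lora-7b",
    "wizard": "/vllm_workspace/weights/loras/wizard-lora-7b"
}
EOF
```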
@@ -233,7 +193,7 @@ The **key** should be the supported lora name, and the **value** should be the s

> **Warning**: if you set `enable_lora` to `true` in `model.json` without creating a `multi_lora.json` file, the server will throw `FileNotFoundError` when initializing.

-## Step 5: Launch Triton
+## Step 4: Launch Triton

```bash
# NOTICE: you must first cd to your vllm_workspace path.
@@ -249,7 +209,7 @@ I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
```

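For reference, a minimal launch command that produces log output like the above (the model repository path is illustrative):

```bash
# Launch Triton from the workspace root; the repository path is illustrative.
cd /vllm_workspace
tritonserver --model-repository ./model_repository
```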
-## Step 6: Send a request
+## Step 5: Send a request

A client request script for multi-lora is provided; download the client script from source:
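A minimal smoke test against Triton's generate endpoint, assuming `vllm_model` is the model name used above (this request exercises the base model rather than a specific lora adapter):

```bash
# Smoke test via the HTTP generate endpoint; "vllm_model" is an assumed model name.
# Selecting a specific lora adapter is handled by the tutorial's client script.
curl -X POST localhost:8000/v2/models/vllm_model/generate \
  -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
```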