@@ -38,7 +38,7 @@ pip3 install --pre "optimum-intel[nncf,openvino]"@git+https://github.com/hugging
Run optimum-cli to download and quantize the model:
``` bash
cd demos/continuous_batching
- optimum-cli export openvino --disable-convert-tokenizer --model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int8 Meta-Llama-3-8B-Instruct
+ optimum-cli export openvino --disable-convert-tokenizer --model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 Meta-Llama-3-8B-Instruct
convert_tokenizer -o Meta-Llama-3-8B-Instruct --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens meta-llama/Meta-Llama-3-8B-Instruct
```
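As an optional sanity check (not part of the original demo steps, and the exact file names can differ between optimum-intel and openvino-tokenizers versions), the export directory can be listed to confirm that both the model IR and the converted tokenizer/detokenizer were produced:

``` bash
# Optional check of the export output; file names below are the typical defaults and may vary by version
ls Meta-Llama-3-8B-Instruct
# expect the OpenVINO IR (openvino_model.xml / openvino_model.bin) together with
# openvino_tokenizer.xml and openvino_detokenizer.xml produced by convert_tokenizer
```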
@@ -287,36 +287,31 @@ It can be demonstrated using benchmarking app from vLLM repository:
``` bash
git clone https://github.com/vllm-project/vllm
cd vllm
- pip3 install wheel packaging ninja "setuptools>=49.4.0" numpy
pip3 install -r requirements-cpu.txt
- export VLLM_TARGET_DEVICE=cpu
- python setup.py install
cd benchmarks
- sed -i -e 's|v1/chat/completions|v3/chat/completions|g' backend_request_func.py # allows calls to endpoint with v3 instead of v1 like in vLLM
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset
- python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 1
+ python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate inf
- Namespace(backend='openai-chat', version='N/A', base_url=None, host='localhost', port=8000, endpoint='/v3/chat/completions', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, request_rate=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False)
- Traffic request rate: 1.0
+ Namespace(backend='openai-chat', version='N/A', base_url=None, host='localhost', port=8000, endpoint='/v3/chat/completions', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False)
+ Traffic request rate: inf
100%|██████████████████████████████████████████████████| 1000/1000 [17:17<00:00, 1.04s/it]
============ Serving Benchmark Result ============
Successful requests: 1000
- Benchmark duration (s): 1037.78
- Total input tokens: 245995
- Total generated tokens: 195504
- Request throughput (req/s): 0.96
- Input token throughput (tok/s): 237.04
- Output token throughput (tok/s): 188.39
+ Benchmark duration (s): 447.62
+ Total input tokens: 215201
+ Total generated tokens: 198588
+ Request throughput (req/s): 2.23
+ Input token throughput (tok/s): 480.76
+ Output token throughput (tok/s): 443.65
---------------Time to First Token----------------
- Mean TTFT (ms): 693.63
- Median TTFT (ms): 570.60
- P99 TTFT (ms): 2187.77
+ Mean TTFT (ms): 171999.94
+ Median TTFT (ms): 170699.21
+ P99 TTFT (ms): 360941.40
-----Time per Output Token (excl. 1st token)------
- Mean TPOT (ms): 132.96
- Median TPOT (ms): 143.28
- P99 TPOT (ms): 234.14
+ Mean TPOT (ms): 211.31
+ Median TPOT (ms): 223.79
+ P99 TPOT (ms): 246.48
==================================================
-
```
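For a quick manual test of the same OpenAI-compatible endpoint that the benchmark exercises, a single request can be sent with `curl`. This is only a sketch, assuming the model server is still running on `localhost:8000` and exposes the model under the same name used in the benchmark command above:

``` bash
# One-off request to the /v3/chat/completions endpoint targeted by the benchmark
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "max_tokens": 64,
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```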
## RAG with Model Server