
Commit cf3a3ba

dtrawins and bstrzele authored
Simplified text generation demo based on continuous batching (#2584) (#2597)
Co-authored-by: Bartosz Strzelecki <bartosz.strzelecki@intel.com>
1 parent 19102c0 commit cf3a3ba

File tree: 1 file changed (+16, -21 lines)

demos/continuous_batching/README.md: 16 additions & 21 deletions
@@ -38,7 +38,7 @@ pip3 install --pre "optimum-intel[nncf,openvino]"@git+https://github.com/hugging
 Run optimum-cli to download and quantize the model:
 ```bash
 cd demos/continuous_batching
-optimum-cli export openvino --disable-convert-tokenizer --model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int8 Meta-Llama-3-8B-Instruct
+optimum-cli export openvino --disable-convert-tokenizer --model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 Meta-Llama-3-8B-Instruct
 convert_tokenizer -o Meta-Llama-3-8B-Instruct --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens meta-llama/Meta-Llama-3-8B-Instruct
 ```
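The hunk above switches the exported weights from int8 to fp16. As a quick way to verify the exported model once it is being served, a minimal sketch (an editorial illustration, not part of this commit): it assumes the demo's model server is already running and exposes the OpenAI-compatible `/v3/chat/completions` endpoint on `localhost:8000`, exactly as targeted by the benchmark hunk below.

```bash
# Smoke test, assuming the demo's server is running on localhost:8000 and
# serves the OpenAI-compatible /v3/chat/completions endpoint; host, port,
# endpoint, and model name are taken from the benchmark commands below.
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 32
      }'
```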

@@ -287,36 +287,31 @@ It can be demonstrated using benchmarking app from vLLM repository:
 ```bash
 git clone https://github.com/vllm-project/vllm
 cd vllm
-pip3 install wheel packaging ninja "setuptools>=49.4.0" numpy
 pip3 install -r requirements-cpu.txt
-export VLLM_TARGET_DEVICE=cpu
-python setup.py install
 cd benchmarks
-sed -i -e 's|v1/chat/completions|v3/chat/completions|g' backend_request_func.py # allows calls to endpoint with v3 instead of v1 like in vLLM
 wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset
-python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 1
+python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate inf
 
-Namespace(backend='openai-chat', version='N/A', base_url=None, host='localhost', port=8000, endpoint='/v3/chat/completions', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, request_rate=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False)
-Traffic request rate: 1.0
+Namespace(backend='openai-chat', version='N/A', base_url=None, host='localhost', port=8000, endpoint='/v3/chat/completions', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False)
+Traffic request rate: inf
 100%|██████████████████████████████████████████████████| 1000/1000 [17:17<00:00, 1.04s/it]
 ============ Serving Benchmark Result ============
 Successful requests: 1000
-Benchmark duration (s): 1037.78
-Total input tokens: 245995
-Total generated tokens: 195504
-Request throughput (req/s): 0.96
-Input token throughput (tok/s): 237.04
-Output token throughput (tok/s): 188.39
+Benchmark duration (s): 447.62
+Total input tokens: 215201
+Total generated tokens: 198588
+Request throughput (req/s): 2.23
+Input token throughput (tok/s): 480.76
+Output token throughput (tok/s): 443.65
 ---------------Time to First Token----------------
-Mean TTFT (ms): 693.63
-Median TTFT (ms): 570.60
-P99 TTFT (ms): 2187.77
+Mean TTFT (ms): 171999.94
+Median TTFT (ms): 170699.21
+P99 TTFT (ms): 360941.40
 -----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 132.96
-Median TPOT (ms): 143.28
-P99 TPOT (ms): 234.14
+Mean TPOT (ms): 211.31
+Median TPOT (ms): 223.79
+P99 TPOT (ms): 246.48
 ==================================================
-
 ```
 
 ## RAG with Model Server
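This hunk replaces `--request-rate 1` with `--request-rate inf`, so all 1000 prompts are submitted at once instead of being paced at one per second: queuing pushes mean TTFT from under a second to roughly 172 seconds, while aggregate throughput more than doubles. As a sketch (values copied from the new summary above), the reported throughput lines follow directly from the token counts and the duration, up to rounding of the printed duration:

```bash
# Recompute the reported throughput figures from the duration and token
# counts in the benchmark summary above.
python3 - <<'EOF'
duration_s = 447.62  # Benchmark duration (s)
print(f"Request throughput:      {1000 / duration_s:.2f} req/s")    # reported 2.23
print(f"Input token throughput:  {215201 / duration_s:.2f} tok/s")  # reported 480.76
print(f"Output token throughput: {198588 / duration_s:.2f} tok/s")  # reported 443.65
EOF
```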
