Commit 23405af

dtrawins and ngrozae authored

improve llm demo visibility (#3268)

Co-authored-by: ngrozae <104074686+ngrozae@users.noreply.github.com>

1 parent 6aa801d commit 23405af

2 files changed: +69 -11 lines changed


demos/continuous_batching/README.md

Lines changed: 67 additions & 9 deletions
@@ -147,8 +147,12 @@ A single servable exposes both `chat/completions` and `completions` endpoints wi
 Chat endpoint is expected to be used for scenarios where conversation context should be passed by the client and the model prompt is created by the server based on the jinja model template.
 Completion endpoint should be used to pass the prompt directly by the client and for models without the jinja template.
 
-:::{dropdown} **Unary call with cURL**
-```console
+### Unary calls to chat/completions endpoint using cURL
+
+::::{tab-set}
+
+:::{tab-item} Linux
+```bash
 curl http://localhost:8000/v3/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
@@ -167,6 +171,26 @@ curl http://localhost:8000/v3/chat/completions \
 ]
 }'| jq .
 ```
+:::
+
+:::{tab-item} Windows
+Windows PowerShell
+```powershell
+(Invoke-WebRequest -Uri "http://localhost:8000/v3/chat/completions" `
+-Method POST `
+-Headers @{ "Content-Type" = "application/json" } `
+-Body '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What are the 3 main tourist attractions in Paris?"}]}').Content
+```
+
+Windows Command Prompt
+```bat
+curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What are the 3 main tourist attractions in Paris?\"}]}"
+```
+:::
+
+::::
+
+:::{dropdown} Expected Response
 ```json
 {
 "choices": [
@@ -190,9 +214,13 @@ curl http://localhost:8000/v3/chat/completions \
 }
 }
 ```
-
+:::
+### Unary calls to completions endpoint using cURL
 A similar call can be made with a `completion` endpoint:
-```console
+::::{tab-set}
+
+:::{tab-item} Linux
+```bash
 curl http://localhost:8000/v3/completions \
 -H "Content-Type: application/json" \
 -d '{
@@ -202,6 +230,26 @@ curl http://localhost:8000/v3/completions \
 "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is OpenVINO?<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
 }'| jq .
 ```
+:::
+
+:::{tab-item} Windows
+Windows PowerShell
+```powershell
+(Invoke-WebRequest -Uri "http://localhost:8000/v3/completions" `
+-Method POST `
+-Headers @{ "Content-Type" = "application/json" } `
+-Body '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "temperature": 0, "stream": false, "prompt":"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is OpenVINO?<|eot_id|><|start_header_id|>assistant<|end_header_id|>"}').Content
+```
+
+Windows Command Prompt
+```bat
+curl -s http://localhost:8000/v3/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"prompt\":\"^<^|begin_of_text^|^>^<^|start_header_id^|^>system^<^|end_header_id^|^>\n\nYou are assistant^<^|eot_id^|^>^<^|start_header_id^|^>user^<^|end_header_id^|^>\n\nWhat is OpenVINO?^<^|eot_id^|^>^<^|start_header_id^|^>assistant^<^|end_header_id^|^>\"}"
+```
+:::
+
+::::
+
+:::{dropdown} Expected Response
 ```json
 {
 "choices": [
@@ -224,14 +272,18 @@ curl http://localhost:8000/v3/completions \
 ```
 :::
 
-:::{dropdown} **Streaming call with OpenAI Python package**
+### Streaming call with OpenAI Python package
 
-The endpoints `chat/completions` are compatible with OpenAI client so it can be easily used to generate code also in streaming mode:
+The `chat/completions` and `completions` endpoints are compatible with the OpenAI client, so it can be easily used to generate text, also in streaming mode:
 
 Install the client library:
 ```console
 pip3 install openai
 ```
+
+::::{tab-set}
+
+:::{tab-item} Chat completions
 ```python
 from openai import OpenAI
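Note: the hunk cuts the Python snippet off right after the import, so the streaming body itself is not visible in this diff. For reference, a minimal streaming sketch with the OpenAI client could look like the code below; the `base_url` with the `/v3` prefix, the placeholder `api_key`, and the prompt text are assumptions based on the rest of this demo, not lines taken from the commit.

```python
from openai import OpenAI

# Assumed server address from this demo; OVMS exposes the OpenAI-style API under /v3.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_tokens=100,
    temperature=0,
    stream=True,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say this is a test"},  # illustrative prompt
    ],
)
for chunk in stream:
    # Each streamed chunk carries a delta with the next fragment of generated text.
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```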

@@ -255,7 +307,10 @@ Output:
 It looks like you're testing me!
 ```
 
-A similar code can be applied for the completion endpoint:
+:::
+
+:::{tab-item} Completions
+
 ```console
 pip3 install openai
 ```
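The body of the new `Completions` tab also falls outside the hunk. Under the same assumptions as the chat sketch above (base_url and placeholder API key), an analogous streaming call on the `completions` endpoint would pass a prompt string instead of messages; the prompt below follows the Llama 3 format shown in the cURL example and is illustrative only.

```python
from openai import OpenAI

# Same assumed endpoint as in the chat sketch above.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

# Illustrative prompt in the Llama 3 instruct format used by the cURL example.
prompt = ("<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
          "Say this is a test<|eot_id|><|start_header_id|>assistant<|end_header_id|>")

stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_tokens=100,
    temperature=0,
    stream=True,
    prompt=prompt,
)
for chunk in stream:
    # Completion chunks expose the generated text directly rather than a message delta.
    if chunk.choices and chunk.choices[0].text:
        print(chunk.choices[0].text, end="", flush=True)
print()
```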
@@ -283,19 +338,22 @@ It looks like you're testing me!
 ```
 :::
 
+::::
+
 ## Benchmarking text generation with high concurrency
 
 OpenVINO Model Server employs efficient parallelization for text generation. It can be used to generate text even under high concurrency, in an environment shared by multiple clients.
 This can be demonstrated using the benchmarking app from the vLLM repository:
 ```console
-git clone --branch v0.6.0 --depth 1 https://github.com/vllm-project/vllm
+git clone --branch v0.7.3 --depth 1 https://github.com/vllm-project/vllm
 cd vllm
 pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
 cd benchmarks
 curl -L https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -o ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset
 python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate inf
 
-Namespace(backend='openai-chat', base_url=None, host='localhost', port=8000, endpoint='/v3/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, sharegpt_output_len=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, percentile_metrics='ttft,tpot,itl', metric_percentiles='99')
+Namespace(backend='openai-chat', base_url=None, host='localhost', port=8000, endpoint='/v3/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
+
 Traffic request rate: inf
 100%|██████████████████████████████████████████████████| 1000/1000 [17:17<00:00, 1.04s/it]
 ============ Serving Benchmark Result ============

docs/deploying_server_baremetal.md

Lines changed: 2 additions & 2 deletions
@@ -18,7 +18,7 @@ tar -xzvf ovms_ubuntu22_python_on.tar.gz
 ```
 Install required libraries:
 ```{code} sh
-sudo apt update -y && apt install -y libxml2 curl
+sudo apt update -y && sudo apt install -y libxml2 curl
 ```
 Set path to the libraries and add binary to the `PATH`
 ```{code} sh
@@ -46,7 +46,7 @@ tar -xzvf ovms_ubuntu24_python_on.tar.gz
 ```
 Install required libraries:
 ```{code} sh
-sudo apt update -y && apt install -y libxml2 curl
+sudo apt update -y && sudo apt install -y libxml2 curl
 ```
 Set path to the libraries and add binary to the `PATH`
 ```{code} sh
