Commit 23405af

dtrawins and ngrozae authored

improve llm demo visibility (#3268)

Co-authored-by: ngrozae <104074686+ngrozae@users.noreply.github.com>

1 parent 6aa801d commit 23405af

2 files changed: +69 -11 lines changed


demos/continuous_batching/README.md

Lines changed: 67 additions & 9 deletions
@@ -147,8 +147,12 @@ A single servable exposes both `chat/completions` and `completions` endpoints wi
 Chat endpoint is expected to be used for scenarios where conversation context should be passed by the client and the model prompt is created by the server based on the jinja model template.
 Completion endpoint should be used to pass the prompt directly by the client and for models without the jinja template.
 
-:::{dropdown} **Unary call with cURL**
-```console
+### Unary calls to chat/completions endpoint using cURL
+
+::::{tab-set}
+
+:::{tab-item} Linux
+```bash
 curl http://localhost:8000/v3/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
@@ -167,6 +171,26 @@ curl http://localhost:8000/v3/chat/completions \
 ]
 }'| jq .
 ```
+:::
+
+:::{tab-item} Windows
+Windows PowerShell
+```powershell
+(Invoke-WebRequest -Uri "http://localhost:8000/v3/chat/completions" `
+-Method POST `
+-Headers @{ "Content-Type" = "application/json" } `
+-Body '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What are the 3 main tourist attractions in Paris?"}]}').Content
+```
+
+Windows Command Prompt
+```bat
+curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What are the 3 main tourist attractions in Paris?\"}]}"
+```
+:::
+
+::::
+
+:::{dropdown} Expected Response
 ```json
 {
 "choices": [
@@ -190,9 +214,13 @@ curl http://localhost:8000/v3/chat/completions \
 }
 }
 ```
-
+:::
+### Unary calls to completions endpoint using cURL
 A similar call can be made with a `completion` endpoint:
-```console
+::::{tab-set}
+
+:::{tab-item} Linux
+```bash
 curl http://localhost:8000/v3/completions \
 -H "Content-Type: application/json" \
 -d '{
@@ -202,6 +230,26 @@ curl http://localhost:8000/v3/completions \
 "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is OpenVINO?<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
 }'| jq .
 ```
+:::
+
+:::{tab-item} Windows
+Windows PowerShell
+```powershell
+(Invoke-WebRequest -Uri "http://localhost:8000/v3/completions" `
+-Method POST `
+-Headers @{ "Content-Type" = "application/json" } `
+-Body '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "temperature": 0, "stream": false, "prompt":"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is OpenVINO?<|eot_id|><|start_header_id|>assistant<|end_header_id|>"}').Content
+```
+
+Windows Command Prompt
+```bat
+curl -s http://localhost:8000/v3/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"prompt\":\"^<^|begin_of_text^|^>^<^|start_header_id^|^>system^<^|end_header_id^|^>\n\nYou are assistant^<^|eot_id^|^>^<^|start_header_id^|^>user^<^|end_header_id^|^>\n\nWhat is OpenVINO?^<^|eot_id^|^>^<^|start_header_id^|^>assistant^<^|end_header_id^|^>\"}"
+```
+:::
+
+::::
+
+:::{dropdown} Expected Response
 ```json
 {
 "choices": [
@@ -224,14 +272,18 @@ curl http://localhost:8000/v3/completions \
 ```
 :::
 
-:::{dropdown} **Streaming call with OpenAI Python package**
+### Streaming call with OpenAI Python package
 
-The endpoints `chat/completions` are compatible with OpenAI client so it can be easily used to generate code also in streaming mode:
+The `chat/completions` and `completions` endpoints are compatible with the OpenAI client, so it can be easily used to generate text, also in streaming mode:
 
 Install the client library:
 ```console
 pip3 install openai
 ```
+
+::::{tab-set}
+
+:::{tab-item} Chat completions
 ```python
 from openai import OpenAI
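Note: the hunk cuts the Python snippet off right after the import, so the streaming body itself is not visible in this diff. For reference, a minimal streaming sketch with the OpenAI client could look like the code below; the `base_url` with the `/v3` prefix, the placeholder `api_key`, and the prompt text are assumptions based on the rest of this demo, not lines taken from the commit.

```python
from openai import OpenAI

# Assumed server address from this demo; OVMS exposes the OpenAI-style API under /v3.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_tokens=100,
    temperature=0,
    stream=True,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say this is a test"},  # illustrative prompt
    ],
)
for chunk in stream:
    # Each streamed chunk carries a delta with the next fragment of generated text.
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```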

@@ -255,7 +307,10 @@ Output:
 It looks like you're testing me!
 ```
 
-A similar code can be applied for the completion endpoint:
+:::
+
+:::{tab-item} Completions
+
 ```console
 pip3 install openai
 ```
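The body of the new `Completions` tab also falls outside the hunk. Under the same assumptions as the chat sketch above (base_url and placeholder API key), an analogous streaming call on the `completions` endpoint would pass a prompt string instead of messages; the prompt below follows the Llama 3 format shown in the cURL example and is illustrative only.

```python
from openai import OpenAI

# Same assumed endpoint as in the chat sketch above.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

# Illustrative prompt in the Llama 3 instruct format used by the cURL example.
prompt = ("<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
          "Say this is a test<|eot_id|><|start_header_id|>assistant<|end_header_id|>")

stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_tokens=100,
    temperature=0,
    stream=True,
    prompt=prompt,
)
for chunk in stream:
    # Completion chunks expose the generated text directly rather than a message delta.
    if chunk.choices and chunk.choices[0].text:
        print(chunk.choices[0].text, end="", flush=True)
print()
```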
@@ -283,19 +338,22 @@ It looks like you're testing me!
 ```
 :::
 
+::::
+
 ## Benchmarking text generation with high concurrency
 
 OpenVINO Model Server employs efficient parallelization for text generation. It can be used to generate text even under high concurrency, in an environment shared by multiple clients.
 This can be demonstrated using the benchmarking app from the vLLM repository:
 ```console
-git clone --branch v0.6.0 --depth 1 https://github.com/vllm-project/vllm
+git clone --branch v0.7.3 --depth 1 https://github.com/vllm-project/vllm
 cd vllm
 pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
 cd benchmarks
 curl -L https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -o ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset
 python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate inf
 
-Namespace(backend='openai-chat', base_url=None, host='localhost', port=8000, endpoint='/v3/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, sharegpt_output_len=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, percentile_metrics='ttft,tpot,itl', metric_percentiles='99')
+Namespace(backend='openai-chat', base_url=None, host='localhost', port=8000, endpoint='/v3/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
+
 Traffic request rate: inf
 100%|██████████████████████████████████████████████████| 1000/1000 [17:17<00:00, 1.04s/it]
 ============ Serving Benchmark Result ============

docs/deploying_server_baremetal.md

Lines changed: 2 additions & 2 deletions
@@ -18,7 +18,7 @@ tar -xzvf ovms_ubuntu22_python_on.tar.gz
 ```
 Install required libraries:
 ```{code} sh
-sudo apt update -y && apt install -y libxml2 curl
+sudo apt update -y && sudo apt install -y libxml2 curl
 ```
 Set path to the libraries and add binary to the `PATH`
 ```{code} sh
@@ -46,7 +46,7 @@ tar -xzvf ovms_ubuntu24_python_on.tar.gz
 ```
 Install required libraries:
 ```{code} sh
-sudo apt update -y && apt install -y libxml2 curl
+sudo apt update -y && sudo apt install -y libxml2 curl
 ```
 Set path to the libraries and add binary to the `PATH`
 ```{code} sh
