Chat endpoint is expected to be used for scenarios where the conversation context is passed by the client and the model prompt is created by the server based on the jinja model template.
Completion endpoint should be used when the client passes the prompt directly and for models without the jinja template.
### Unary calls to chat/completions endpoint using cURL
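A minimal unary call can be sent with cURL as shown below; the server address and port are assumed to match the deployment from earlier in this demo:

```console
curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What are the 3 main tourist attractions in Paris?"}]}'
```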
Windows PowerShell
```powershell
# The -Uri, -Method and -ContentType values are assumed to mirror the cURL example above
(Invoke-WebRequest -Uri http://localhost:8000/v3/chat/completions -Method POST -ContentType "application/json" -Body '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What are the 3 main tourist attractions in Paris?"}]}').Content
```
Windows Command Prompt
```bat
curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What are the 3 main tourist attractions in Paris?\"}]}"
"prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is OpenVINO?<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
### Streaming call with OpenAI Python package
The endpoints `chat/completions` and `completions` are compatible with the OpenAI client, so it can be easily used to generate text also in streaming mode:
Install the client library:
```console
pip3 install openai
```
::::{tab-set}
:::{tab-item} Chat completions
```python
from openai import OpenAI
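
# What follows is a minimal streaming sketch; base_url, api_key and the model
# name are assumptions matching the deployment used earlier in this demo.
client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused"  # the model server does not verify the key, but the client requires a value
)

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)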
```

Output:
```
It looks like you're testing me!
```
:::
:::{tab-item} Completions
```console
pip3 install openai
```
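
A similar call can be made to the `completions` endpoint, passing the prompt directly (a minimal sketch under the same assumptions as the chat example):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused"  # the model server does not verify the key, but the client requires a value
)

stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Say this is a test",
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].text is not None:
        print(chunk.choices[0].text, end="", flush=True)
```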
Output:
```
It looks like you're testing me!
```
:::
::::
## Benchmarking text generation with high concurrency
OpenVINO Model Server employs efficient parallelization for text generation. It can generate text with high concurrency, also in an environment shared by multiple clients.
This can be demonstrated using the benchmarking app from the vLLM repository:
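For example, a run with `benchmark_serving.py` could look as follows (a sketch; the exact flags vary between vLLM versions, and the dataset path and request count are placeholders):

```console
git clone https://github.com/vllm-project/vllm
cd vllm/benchmarks
python benchmark_serving.py --backend openai-chat --host localhost --port 8000 \
  --endpoint /v3/chat/completions --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000
```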