
Commit ad491b7

LLM calculator documentation updates (#2692)
* update LLM documentation
* LLM gold support
* Apply suggestions from code review
  Co-authored-by: Damian Kalinowski <damian.kalinowski@intel.com>
* fixes
* top_k update

---------

Co-authored-by: Damian Kalinowski <damian.kalinowski@intel.com>
1 parent 866c575 commit ad491b7

10 files changed: +54 -34 lines changed


README.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ Start using OpenVINO Model Server with a fast-forward serving example from the [
 Read [release notes](https://github.com/openvinotoolkit/model_server/releases) to find out what’s new.
 
 ### Key features:
-- **[NEW]** [Efficient Text Generation via OpenAI API - preview](https://docs.openvino.ai/nightly/ovms_docs_llm_reference.html)
+- **[NEW]** [Efficient Text Generation via OpenAI API](https://docs.openvino.ai/nightly/ovms_docs_llm_reference.html)
 - [Python code execution](https://docs.openvino.ai/nightly/ovms_docs_python_support_reference.html)
 - [gRPC streaming](https://docs.openvino.ai/nightly/ovms_docs_streaming_endpoints.html)
 - [MediaPipe graphs serving](https://docs.openvino.ai/nightly/ovms_docs_mediapipe.html)

demos/continuous_batching/README.md

Lines changed: 8 additions & 5 deletions
@@ -7,8 +7,10 @@ That makes it easy to use and efficient especially on on Intel® Xeon® processo
 
 ## Get the docker image
 
+Pull the public image with CPU-only support or the one that also includes GPU support.
 ```bash
-docker pull openvino/model_server:2024.4-gpu
+docker pull openvino/model_server:latest-gpu
+docker pull openvino/model_server:latest
 ```
 or build the image from source to try the latest enhancements in this feature.
 ```bash
@@ -88,10 +90,11 @@ cat config.json
 
 ## Start-up
 ```bash
-docker run -d --rm -p 8000:8000 -v $(pwd)/:/workspace:ro openvino/model_server --port 9000 --rest_port 8000 --config_path /workspace/config.json
+docker run -d --rm -p 8000:8000 -v $(pwd)/:/workspace:ro openvino/model_server:latest --port 9000 --rest_port 8000 --config_path /workspace/config.json
 ```
 In case you want to use GPU device to run the generation, add extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)`
-to `docker run` command and make sure you copy the graph.pbtxt tuned for GPU device. Also make sure the export model quantization level and cache size fit to the GPU memory.
+to the `docker run` command, use the image with GPU support, and make sure you copy the graph.pbtxt tuned for the GPU device.
+Also make sure the exported model quantization level and cache size fit in the GPU memory.
 ```
 
 
@@ -299,9 +302,9 @@ Check the example in the [RAG notebook](https://github.com/openvinotoolkit/model
 
 ## Scaling the Model Server
 
-Check this simple [text generation scaling demo](./scaling/README.md).
+Check this simple [text generation scaling demo](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/scaling/README.md).
 
 
 ## Testing the model accuracy over serving API
 
-Check the [guide of using lm-evaluation-harness](./accuracy/README.md)
+Check the [guide of using lm-evaluation-harness](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/accuracy/README.md)
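For readers following this demo end to end, the snippet below is a minimal sketch of a streaming chat client against the server started above. It assumes the REST port 8000 from the `docker run` command and a servable name matching the one registered in `config.json` (shown here as a placeholder).

```python
# Minimal streaming chat client sketch for the demo above.
# Assumptions: REST port 8000 (from the docker run command) and a servable
# named "meta-llama/Meta-Llama-3-8B-Instruct" -- replace it with the name
# used in your config.json.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    max_tokens=50,
    stream=True,
)
for chunk in stream:
    # chat streaming responses carry the next text fragment in delta.content
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```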

demos/continuous_batching/accuracy/README.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ pip3 install lm_eval[api]
 
 ## Exporting the model and starting the model server
 
-Following the procedure to export the model and start the model server from [text generatino demo](../README.md)
+Follow the procedure to export the model and start the model server from [text generation demo](../README.md)
 
 ## Running the tests
 
docs/clients_openai.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ LLM calculator <ovms_docs_llm_caclulator>
 ## Introduction
 Beside Tensorflow Serving API and KServe API frontends, the model server has now option to delegate the REST input deserialization and output serialization to a MediaPipe graph. A custom calculator can implement any form of REST API including streaming based on [Server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events).
 
-That way we are introducing a preview of OpenAI compatible endpoint [chat/completions](./model_server_rest_api_chat.md). More endpoints are planned for the implementation.
+We are introducing OpenAI-compatible endpoints [chat/completions](./model_server_rest_api_chat.md) and [completions](./model_server_rest_api_completions.md).
 
 
 ## Python Client
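As a rough illustration of the endpoints mentioned above, the sketch below calls the `completions` endpoint with the official OpenAI Python client; the base URL, port, and servable name `model` are assumptions about a local deployment rather than fixed values.

```python
# Sketch: non-streaming call to the OpenAI-compatible completions endpoint.
# The /v3 prefix follows the model server REST API; the servable name "model"
# is an assumption and must match the deployed graph name.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

response = client.completions.create(
    model="model",
    prompt="OpenVINO Model Server is",
    max_tokens=30,
)
print(response.choices[0].text)
```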

docs/home.md

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ The models used by the server need to be stored locally or hosted remotely by ob
 Start using OpenVINO Model Server with a fast-forward serving example from the [Quickstart guide](ovms_quickstart.md) or explore [Model Server features](features.md).
 
 ### Key features:
-- **[NEW]** [Efficient Text Generation - preview](llm/reference.md)
+- **[NEW]** [Efficient Text Generation](llm/reference.md)
 - [Python code execution](python_support/reference.md)
 - [gRPC streaming](streaming_endpoints.md)
 - [MediaPipe graphs serving](mediapipe.md)

docs/llm/quickstart.md

Lines changed: 6 additions & 4 deletions
@@ -4,8 +4,8 @@ Let's deploy [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLla
 
 1. Install python dependencies for the conversion script:
 ```bash
-export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
-pip3 install "optimum-intel[nncf,openvino]"@git+https://github.com/huggingface/optimum-intel.git@xeon openvino-tokenizers transformers==4.41.2
+export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/pre-release"
+pip3 install --pre "optimum-intel[nncf,openvino]"@git+https://github.com/huggingface/optimum-intel.git openvino_tokenizers==2024.4.* openvino==2024.4.*
 ```
 
 2. Run optimum-cli to download and quantize the model:
@@ -14,7 +14,7 @@ mkdir workspace && cd workspace
 
 optimum-cli export openvino --disable-convert-tokenizer --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int8 TinyLlama-1.1B-Chat-v1.0
 
-convert_tokenizer -o TinyLlama-1.1B-Chat-v1.0 --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens TinyLlama/TinyLlama-1.1B-Chat-v1.0
+convert_tokenizer -o TinyLlama-1.1B-Chat-v1.0 --utf8_replace_mode replace --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens TinyLlama/TinyLlama-1.1B-Chat-v1.0
 ```
 
 3. Create `graph.pbtxt` file in a model directory:
@@ -37,7 +37,9 @@ node: {
 }
 node_options: {
   [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
-    models_path: "./"
+    models_path: "./",
+    plugin_config: '{"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"}',
+    cache_size: 4
   }
 }
 input_stream_handler {
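Once the remaining quickstart steps are done and the server is running, a request to the deployed graph could be sketched as below; the servable name, the REST port 8000, and the response handling are assumptions based on the chat completions API described in this change.

```python
# Sketch of a raw REST request to the TinyLlama servable deployed in the
# quickstart. Assumptions: servable registered as "TinyLlama-1.1B-Chat-v1.0"
# in config.json and REST port 8000.
import json
import urllib.request

payload = {
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "messages": [{"role": "user", "content": "What is OpenVINO?"}],
    "max_tokens": 50,
}
request = urllib.request.Request(
    "http://localhost:8000/v3/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    body = json.load(response)
# a non-streaming chat response carries the generated text in message.content
print(body["choices"][0]["message"]["content"])
```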

docs/llm/reference.md

Lines changed: 31 additions & 12 deletions
@@ -1,7 +1,5 @@
 # Efficient LLM Serving {#ovms_docs_llm_reference}
 
-**THIS IS A PREVIEW FEATURE**
-
 ## Overview
 
 With rapid development of generative AI, new techniques and algorithms for performance optimization and better resource utilization are introduced to make best use of the hardware and provide best generation performance. OpenVINO implements those state of the art methods in it's [GenAI Library](https://github.com/ilya-lavrenov/openvino.genai/tree/ct-beam-search/text_generation/causal_lm/cpp/continuous_batching/library) like:
@@ -15,7 +13,7 @@ It is now integrated into OpenVINO Model Server providing efficient way to run g
 Check out the [quickstart guide](quickstart.md) for a simple example that shows how to use this feature.
 
 ## LLM Calculator
-As you can see in the quickstart above, big part of the configuration resides in `graph.pbtxt` file. That's because model server text generation servables are deployed as MediaPipe graphs with dedicated LLM calculator that works with latest [OpenVINO GenAI](https://github.com/ilya-lavrenov/openvino.genai/tree/ct-beam-search/text_generation/causal_lm/cpp/continuous_batching/library) solutions. The calculator is designed to run in cycles and return the chunks of responses to the client.
+As you can see in the quickstart above, a big part of the configuration resides in the `graph.pbtxt` file. That's because model server text generation servables are deployed as MediaPipe graphs with a dedicated LLM calculator that works with the latest [OpenVINO GenAI](https://github.com/openvinotoolkit/openvino.genai/tree/master/src/cpp/include/openvino/genai) library. The calculator is designed to run in cycles and return the chunks of responses to the client.
 
 On the input it expects a HttpPayload struct passed by the Model Server frontend:
 ```cpp
@@ -82,8 +80,8 @@ The calculator supports the following `node_options` for tuning the pipeline con
 - `optional uint64 block_size` - number of tokens which KV is stored in a single block (Paged Attention related) [default = 32];
 - `optional uint64 max_num_seqs` - max number of sequences actively processed by the engine [default = 256];
 - `optional bool dynamic_split_fuse` - use Dynamic Split Fuse token scheduling [default = true];
-- `optional string device` - device to load models to. Supported values: "CPU" [default = "CPU"]
-- `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). Should be provided in the same format for regular [models configuration](../parameters.md#model-configuration-options) [default = ""]
+- `optional string device` - device to load models to. Supported values: "CPU", "GPU" [default = "CPU"]
+- `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). Should be provided in the same format as for regular [models configuration](../parameters.md#model-configuration-options) [default = "{}"]
 - `optional uint32 best_of_limit` - max value of best_of parameter accepted by endpoint [default = 20];
 - `optional uint32 max_tokens_limit` - max value of max_tokens parameter accepted by endpoint [default = 4096];
 - `optional bool enable_prefix_caching` - enable caching of KV-blocks [default = false];
@@ -96,8 +94,29 @@ You can track the actual usage of the cache in the server logs. You can observe
 ```
 Consider increasing the `cache_size` parameter in case the logs report the usage getting close to 100%. When the cache is consumed, some of the running requests might be preempted to free cache for other requests to finish their generations (preemption will likely have negative impact on performance since preempted request cache will need to be recomputed when it gets processed again). When preemption is not possible i.e. `cache size` is very small and there is a single, long running request that consumes it all, then the request gets terminated when no more cache can be assigned to it, even before reaching stopping criteria.
 
+`enable_prefix_caching` can improve generation performance when the initial prompt content is repeated. That is the case with chat applications which resend the history of the conversation. Thanks to prefix caching, there is no need to reevaluate the same sequence of tokens, so the first token will be generated much quicker and the overall
+utilization of resources will be lower. Old cache will be cleared automatically, but it is recommended to increase `cache_size` to get a bigger performance advantage.
+
+`plugin_config` accepts a JSON dictionary of tuning parameters for the OpenVINO plugin. It can tune the behavior of the inference runtime. For example, you can set there KV cache compression or the dynamic quantization group size: '{"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"}'.
+
 The LLM calculator config can also restrict the range of sampling parameters in the client requests. If needed change the default values for `max_tokens_limit` and `best_of_limit`. It is meant to avoid the result of memory overconsumption by invalid requests.
 
+
+## Canceling the generation
+
+In order to optimize the usage of compute resources, it is important to stop the text generation when it becomes irrelevant for the client or when the client gets disconnected for any reason. Such capability is implemented via a tight integration between the LLM calculator and the model server frontend. The calculator gets notified about the client session disconnection. When the client application stops or deliberately breaks the session, the generation cycle gets broken and all resources are released. Below is a simple example of how the client can initiate stopping the generation:
+```python
+from openai import OpenAI
+client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")
+stream = client.completions.create(model="model", prompt="Say this is a test", stream=True)
+for chunk in stream:
+    if chunk.choices[0].text is not None:
+        print(chunk.choices[0].text, end="", flush=True)
+    if some_condition:  # any client-side condition deciding the output is no longer needed
+        stream.close()  # closing the stream disconnects the session and stops the generation
+        break
+```
+
 ## Models Directory
 
 In node configuration we set `models_path` indicating location of the directory with files loaded by LLM engine. It loads following files:
@@ -119,13 +138,13 @@ This model directory can be created based on the models from Hugging Face Hub or
 
 In your python environment install required dependencies:
 ```
-pip3 install "optimum-intel[nncf,openvino]"@git+https://github.com/huggingface/optimum-intel.git@7a224c2419240d5fb58f2f75c2e29f179ed6da28 openvino-tokenizers
+pip3 install "optimum-intel[nncf,openvino]"
 ```
 
 Because there is very dynamic development in optimum-intel and openvino, it is recommended to use the latest versions of the dependencies:
 ```
-export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/nightly"
-pip3 install --pre "optimum-intel[nncf,openvino]"@git+https://github.com/huggingface/optimum-intel.git openvino-tokenizers
+export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/pre-release"
+pip3 install --pre "optimum-intel[nncf,openvino]"@git+https://github.com/huggingface/optimum-intel.git openvino_tokenizers openvino
 ```
 
 LLM model can be exported with a command:
@@ -136,7 +155,7 @@ Precision parameter is important and can influence performance, accuracy and mem
 
 Export the tokenizer model with a command:
 ```
-convert_tokenizer -o {target folder name} --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens {tokenizer model in HF hub or Pytorch model folder}
+convert_tokenizer -o {target folder name} --utf8_replace_mode replace --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens {tokenizer model in HF hub or Pytorch model folder}
 ```
 
 Check [tested models](https://github.com/openvinotoolkit/openvino.genai/blob/master/tests/python_tests/models/real_models).
@@ -164,11 +183,11 @@ When default template is loaded, servable accepts `/chat/completions` calls when
 
 ## Limitations
 
-LLM calculator is a preview feature. It runs a set of accuracy, stability and performance tests, but the next releases targets production grade quality. It has now a set of known issues:
+There are several known limitations which are expected to be addressed in the coming releases:
 
-- Metrics related to text generation are not exposed via `metrics` endpoint. Key metrics from LLM calculators are included in the server logs with information about active requests, scheduled for text generation and KV Cache usage. It is possible to track in the metrics the number of active generation requests using metric called `ovms_graphs_running`. Also tracking statistics for request and responses is possible. [Learn more](../metrics.md)
+- Metrics related to text generation are not exposed via the `metrics` endpoint. Key metrics from LLM calculators are included in the server logs with information about active requests scheduled for text generation and KV Cache usage. It is possible to track the number of active generation requests in the metrics using the metric called `ovms_current_graphs`. Tracking statistics for requests and responses is also possible. [Learn more](../metrics.md)
 - Multi modal models are not supported yet. Images can't be sent now as the context.
-- GPU device is not supported yet. It is planned for version 2024.5.
+- The `logprobs` parameter is currently not supported in greedy search (temperature=0) and in streaming mode. It includes only a single logprob and does not include values for input tokens.
 
 ## References
 - [Chat Completions API](../model_server_rest_api_chat.md)
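Since the limitations above point to the `ovms_current_graphs` metric as the way to track active generation requests, a quick check of that metric could look like the sketch below; enabling metrics and exposing them on REST port 8000 are assumptions about the deployment rather than part of this change.

```python
# Sketch: read the Prometheus-style metrics endpoint and print the metric that
# reflects currently processed graphs (including active text generation requests).
# Assumes metrics are enabled and served on REST port 8000.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as response:
    metrics_text = response.read().decode("utf-8")

for line in metrics_text.splitlines():
    if line.startswith("ovms_current_graphs"):
        print(line)
```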

docs/model_server_rest_api_chat.md

Lines changed: 1 addition & 5 deletions
@@ -87,18 +87,16 @@ curl http://localhost/v3/chat/completions \
 |-------|----------|----------|----------|---------|-----|
 | temperature |||| float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
 | top_p |||| float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
-| top_k |||| int (default: `0`) | Controls the number of top tokens to consider. Set to 0 to consider all tokens. |
+| top_k |||| int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
 | repetition_penalty |||| float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
 | frequency_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
 | seed |||| integer (default: `0`) | Random seed to use for the generation. |
 
 #### Unsupported params from OpenAI service:
 - logit_bias
-- logprobs
 - top_logprobs
 - response_format
-- seed
 - tools
 - tool_choice
 - user
@@ -109,10 +107,8 @@ curl http://localhost/v3/chat/completions \
 - min_p
 - use_beam_search (**In OpenVINO Model Server just simply increase _best_of_ param to enable beam search**)
 - early_stopping
-- stop
 - stop_token_ids
 - min_tokens
-- logprobs
 - prompt_logprobs
 - detokenize
 - skip_special_tokens
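To make the sampling parameters from the table above concrete, the sketch below assembles a chat request body that sets several of them; the values are arbitrary examples and the servable name `model` is an assumption.

```python
# Illustrative chat/completions request body using the documented sampling
# parameters; values are examples only, not recommended defaults.
import json

request_body = {
    "model": "model",           # assumption: the deployed servable name
    "messages": [{"role": "user", "content": "Write a haiku about inference."}],
    "max_tokens": 60,
    "temperature": 0.7,         # > 0.0 enables multinomial sampling
    "top_p": 0.9,               # cumulative probability cutoff in (0, 1]
    "top_k": 40,                # consider only the 40 most likely tokens
    "repetition_penalty": 1.1,  # > 1.0 discourages repeating tokens
    "seed": 42,                 # fixed seed for repeatable sampling
}
print(json.dumps(request_body, indent=2))
```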

docs/model_server_rest_api_completions.md

Lines changed: 1 addition & 4 deletions
@@ -75,7 +75,7 @@ curl http://localhost/v3/completions \
 |-------|----------|----------|----------|---------|-----|
 | temperature |||| float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
 | top_p |||| float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
-| top_k |||| int (default: `0`) | Controls the number of top tokens to consider. Set to 0 to consider all tokens. |
+| top_k |||| int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
 | repetition_penalty |||| float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
 | frequency_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
@@ -84,18 +84,15 @@ curl http://localhost/v3/completions \
 #### Unsupported params from OpenAI service:
 - echo
 - logit_bias
-- logprobs
 - suffix
 
 
 #### Unsupported params from vLLM:
 - min_p
 - use_beam_search (**In OpenVINO Model Server just simply increase _best_of_ param to enable beam search**)
 - early_stopping
-- stop
 - stop_token_ids
 - min_tokens
-- logprobs
 - prompt_logprobs
 - detokenize
 - skip_special_tokens
