If you want to use a GPU device to run the generation, add the extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)`
- to `docker run` command and make sure you copy the graph.pbtxt tuned for GPU device. Also make sure the export model quantization level and cache size fit to the GPU memory.
+ to the `docker run` command, use the image with GPU support, and make sure you copy the graph.pbtxt tuned for the GPU device.
+ Also make sure the exported model quantization level and cache size fit into the GPU memory.
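For illustration, a complete command could look like the sketch below; the image tag, mounted model repository, configuration path and ports are assumptions to adapt to your deployment.

```bash
# Hypothetical deployment sketch - adjust the image tag, model repository path and ports to your setup
docker run -d --rm -p 8000:8000 \
  --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) \
  -v $(pwd)/models:/workspace:ro \
  openvino/model_server:latest-gpu \
  --rest_port 8000 --config_path /workspace/config.json
```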
@@ -299,9 +302,9 @@ Check the example in the [RAG notebook](https://github.com/openvinotoolkit/model
## Scaling the Model Server
- Check this simple [text generation scaling demo](./scaling/README.md).
+ Check this simple [text generation scaling demo](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/scaling/README.md).
## Testing the model accuracy over serving API
- Check the [guide of using lm-evaluation-harness](./accuracy/README.md)
+ Check the [guide on using lm-evaluation-harness](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/accuracy/README.md)
Besides the TensorFlow Serving API and KServe API frontends, the model server now has an option to delegate REST input deserialization and output serialization to a MediaPipe graph. A custom calculator can implement any form of REST API, including streaming based on [Server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events).
- That way we are introducing a preview of OpenAI compatible endpoint [chat/completions](./model_server_rest_api_chat.md). More endpoints are planned for the implementation.
+ We are introducing the OpenAI compatible endpoints [chat/completions](./model_server_rest_api_chat.md) and [completions](./model_server_rest_api_completions.md).
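A minimal sketch of calling the chat/completions endpoint with curl; the REST port (8000) and the served model name are assumptions that depend on how the server was started.

```bash
# Assumes the OpenAI-compatible API on port 8000 and a servable named "meta-llama/Meta-Llama-3-8B-Instruct"
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "What is OpenVINO Model Server?"}],
    "max_tokens": 100
  }'
```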
docs/home.md (1 addition, 1 deletion)
@@ -37,7 +37,7 @@ The models used by the server need to be stored locally or hosted remotely by ob
Start using OpenVINO Model Server with a fast-forward serving example from the [Quickstart guide](ovms_quickstart.md) or explore [Model Server features](features.md).
### Key features:
- - **[NEW]** [Efficient Text Generation - preview](llm/reference.md)
+ - **[NEW]** [Efficient Text Generation](llm/reference.md)
With the rapid development of generative AI, new techniques and algorithms for performance optimization and better resource utilization are being introduced to make the best use of the hardware and provide the best generation performance. OpenVINO implements these state-of-the-art methods in its [GenAI Library](https://github.com/ilya-lavrenov/openvino.genai/tree/ct-beam-search/text_generation/causal_lm/cpp/continuous_batching/library), such as:
@@ -15,7 +13,7 @@ It is now integrated into OpenVINO Model Server providing efficient way to run g
Check out the [quickstart guide](quickstart.md) for a simple example that shows how to use this feature.
## LLM Calculator
- As you can see in the quickstart above, big part of the configuration resides in `graph.pbtxt` file. That's because model server text generation servables are deployed as MediaPipe graphs with dedicated LLM calculator that works with latest [OpenVINO GenAI](https://github.com/ilya-lavrenov/openvino.genai/tree/ct-beam-search/text_generation/causal_lm/cpp/continuous_batching/library) solutions. The calculator is designed to run in cycles and return the chunks of responses to the client.
+ As you can see in the quickstart above, a big part of the configuration resides in the `graph.pbtxt` file. That's because model server text generation servables are deployed as MediaPipe graphs with a dedicated LLM calculator that works with the latest [OpenVINO GenAI](https://github.com/openvinotoolkit/openvino.genai/tree/master/src/cpp/include/openvino/genai) library. The calculator is designed to run in cycles and return chunks of the response to the client.
On input, it expects an HttpPayload struct passed by the Model Server frontend:
@@ -82,8 +80,8 @@ The calculator supports the following `node_options` for tuning the pipeline con
- `optional uint64 block_size` - number of tokens for which the KV cache is stored in a single block (Paged Attention related) [default = 32];
- `optional uint64 max_num_seqs` - max number of sequences actively processed by the engine [default = 256];
- - `optional string device` - device to load models to. Supported values: "CPU" [default = "CPU"]
- - `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). Should be provided in the same format as for regular [models configuration](../parameters.md#model-configuration-options) [default = ""]
+ - `optional string device` - device to load models to. Supported values: "CPU", "GPU" [default = "CPU"]
+ - `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). Should be provided in the same format as for regular [models configuration](../parameters.md#model-configuration-options) [default = "{}"]
- `optional uint32 best_of_limit` - max value of the best_of parameter accepted by the endpoint [default = 20];
- `optional uint32 max_tokens_limit` - max value of the max_tokens parameter accepted by the endpoint [default = 4096];
@@ -96,8 +94,29 @@ You can track the actual usage of the cache in the server logs. You can observe
Consider increasing the `cache_size` parameter in case the logs report the usage getting close to 100%. When the cache is consumed, some of the running requests might be preempted to free cache for other requests to finish their generations (preemption will likely have a negative impact on performance, since the preempted request's cache will need to be recomputed when it gets processed again). When preemption is not possible, i.e. `cache_size` is very small and there is a single, long-running request that consumes it all, the request gets terminated when no more cache can be assigned to it, even before reaching its stopping criteria.
+ `enable_prefix_caching` can improve generation performance when the initial prompt content is repeated. That is the case with chat applications, which resend the history of the conversation. Thanks to prefix caching, there is no need to reevaluate the same sequence of tokens, so the first token is generated much quicker and the overall utilization of resources is lower. Old cache entries are cleared automatically, but it is recommended to increase `cache_size` to get a bigger performance advantage.
+ `plugin_config` accepts a JSON dictionary of tuning parameters for the OpenVINO plugin. It can tune the behavior of the inference runtime. For example, you can enable KV cache compression or set the dynamic quantization group size: '{"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"}'.

The LLM calculator config can also restrict the range of sampling parameters in the client requests. If needed, change the default values of `max_tokens_limit` and `best_of_limit`. This is meant to avoid memory overconsumption caused by invalid requests.
+ ## Canceling the generation
+ In order to optimize the usage of compute resources, it is important to stop the text generation when it becomes irrelevant for the client or when the client gets disconnected for any reason. Such capability is implemented via a tight integration between the LLM calculator and the model server frontend. The calculator gets notified about the client session disconnection. When the client application stops or deliberately breaks the session, the generation cycle gets interrupted and all resources are released. Below is a simple example of how the client can stop the generation:
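One way to exercise this from the shell is sketched below; the port, URL path and model name are assumptions. Starting a streamed request and then terminating the client closes the connection, which cancels the generation on the server.

```bash
# Start a streamed generation request in the background; port, URL path and model name are assumptions
curl -sN http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "stream": true, "max_tokens": 1000, "messages": [{"role": "user", "content": "Write a very long story."}]}' &
CURL_PID=$!
sleep 2
# Closing the client connection notifies the server frontend and the generation cycle is stopped
kill $CURL_PID
```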
@@ -136,7 +155,7 @@ Precision parameter is important and can influence performance, accuracy and mem
Export the tokenizer model with a command:
- convert_tokenizer -o {target folder name} --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens {tokenizer model in HF hub or Pytorch model folder}
+ convert_tokenizer -o {target folder name} --utf8_replace_mode replace --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens {tokenizer model in HF hub or Pytorch model folder}
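For instance, filling in the placeholders (the output folder and the Hugging Face model id below are only examples):

```bash
# Placeholder output folder and model id - substitute your own
convert_tokenizer -o Meta-Llama-3-8B-Instruct --utf8_replace_mode replace --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens meta-llama/Meta-Llama-3-8B-Instruct
```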
@@ -164,11 +183,11 @@ When default template is loaded, servable accepts `/chat/completions` calls when
## Limitations
- LLM calculator is a preview feature. It runs a set of accuracy, stability and performance tests, but the next releases targets production grade quality. It has now a set of known issues:
+ There are several known limitations which are expected to be addressed in the coming releases:
- - Metrics related to text generation are not exposed via `metrics` endpoint. Key metrics from LLM calculators are included in the server logs with information about active requests, scheduled for text generation and KV Cache usage. It is possible to track in the metrics the number of active generation requests using metric called `ovms_graphs_running`. Also tracking statistics for request and responses is possible. [Learn more](../metrics.md)
+ - Metrics related to text generation are not exposed via the `metrics` endpoint. Key metrics from the LLM calculator are included in the server logs, with information about active requests, requests scheduled for text generation, and KV cache usage. It is possible to track the number of active generation requests using the metric called `ovms_current_graphs` (see the example query after this list). Tracking statistics for requests and responses is also possible. [Learn more](../metrics.md)
- Multi-modal models are not supported yet. Images can't currently be sent as part of the context.
- - GPU device is not supported yet. It is planned for version 2024.5.
+ - `logprobs` parameter is currently not supported in greedy search (temperature=0) and in streaming mode. It includes only a single logprob and does not include values for input tokens.
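For example, assuming metrics are enabled and the REST interface listens on port 8000, the current number of running generation graphs can be checked with:

```bash
# Assumes metrics are enabled and the REST port is 8000
curl -s http://localhost:8000/metrics | grep ovms_current_graphs
```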
| temperature | ✅ | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
| top_p | ✅ | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
- | top_k | ✅ | ❌ | ✅ | int (default: `0`) | Controls the number of top tokens to consider. Set to 0 to consider all tokens. |
+ | top_k | ✅ | ❌ | ✅ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
| repetition_penalty | ✅ | ❌ | ✅ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
| frequency_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| presence_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| seed | ✅ | ✅ | ✅ | integer (default: `0`) | Random seed to use for the generation. |
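For illustration, the sampling parameters above can be combined in a single chat/completions request as sketched below; the port and model name are assumptions.

```bash
# Sampling parameters in a chat/completions request - port and model name are assumptions
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Suggest three names for a bakery."}],
    "temperature": 0.8,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 60
  }'
```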
| temperature | ✅ | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
| top_p | ✅ | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
- | top_k | ✅ | ❌ | ✅ | int (default: `0`) | Controls the number of top tokens to consider. Set to 0 to consider all tokens. |
+ | top_k | ✅ | ❌ | ✅ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
| repetition_penalty | ✅ | ❌ | ✅ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
| frequency_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| presence_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
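Similarly, a sketch of a completions request using these parameters (again, the port and model name are assumptions):

```bash
# Sampling parameters in a completions request - port and model name are assumptions
curl -s http://localhost:8000/v3/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "prompt": "OpenVINO Model Server is",
    "temperature": 0.7,
    "top_p": 0.9,
    "frequency_penalty": 0.5,
    "max_tokens": 40
  }'
```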