If you want to use a GPU device to run the generation, add the extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)`
- to `docker run` command and make sure you copy the graph.pbtxt tuned for GPU device. Also make sure the export model quantization level and cache size fit to the GPU memory.
+ to the `docker run` command, use the image with GPU support, and make sure you copy the graph.pbtxt tuned for the GPU device.
+ Also make sure the exported model quantization level and cache size fit into the GPU memory.
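For illustration, a complete command could look like the sketch below; the image tag, mounted model repository, configuration path and ports are assumptions to adapt to your deployment.

```bash
# Hypothetical deployment sketch - adjust the image tag, model repository path and ports to your setup
docker run -d --rm -p 8000:8000 \
  --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) \
  -v $(pwd)/models:/workspace:ro \
  openvino/model_server:latest-gpu \
  --rest_port 8000 --config_path /workspace/config.json
```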
@@ -299,9 +302,9 @@ Check the example in the [RAG notebook](https://github.com/openvinotoolkit/model
## Scaling the Model Server
- Check this simple [text generation scaling demo](./scaling/README.md).
+ Check this simple [text generation scaling demo](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/scaling/README.md).
## Testing the model accuracy over serving API
- Check the [guide of using lm-evaluation-harness](./accuracy/README.md)
+ Check the [guide on using lm-evaluation-harness](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/accuracy/README.md)
Besides the TensorFlow Serving API and KServe API frontends, the model server now has an option to delegate REST input deserialization and output serialization to a MediaPipe graph. A custom calculator can implement any form of REST API, including streaming based on [Server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events).
- That way we are introducing a preview of OpenAI compatible endpoint [chat/completions](./model_server_rest_api_chat.md). More endpoints are planned for the implementation.
+ We are introducing the OpenAI compatible endpoints [chat/completions](./model_server_rest_api_chat.md) and [completions](./model_server_rest_api_completions.md).
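A minimal sketch of calling the chat/completions endpoint with curl; the REST port (8000) and the served model name are assumptions that depend on how the server was started.

```bash
# Assumes the OpenAI-compatible API on port 8000 and a servable named "meta-llama/Meta-Llama-3-8B-Instruct"
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "What is OpenVINO Model Server?"}],
    "max_tokens": 100
  }'
```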
docs/home.md (1 addition, 1 deletion)
@@ -37,7 +37,7 @@ The models used by the server need to be stored locally or hosted remotely by ob
Start using OpenVINO Model Server with a fast-forward serving example from the [Quickstart guide](ovms_quickstart.md) or explore [Model Server features](features.md).
### Key features:
- - **[NEW]** [Efficient Text Generation - preview](llm/reference.md)
+ - **[NEW]** [Efficient Text Generation](llm/reference.md)
With the rapid development of generative AI, new techniques and algorithms for performance optimization and better resource utilization are being introduced to make the best use of the hardware and provide the best generation performance. OpenVINO implements these state-of-the-art methods in its [GenAI Library](https://github.com/ilya-lavrenov/openvino.genai/tree/ct-beam-search/text_generation/causal_lm/cpp/continuous_batching/library), such as:
@@ -15,7 +13,7 @@ It is now integrated into OpenVINO Model Server providing efficient way to run g
Check out the [quickstart guide](quickstart.md) for a simple example that shows how to use this feature.
## LLM Calculator
- As you can see in the quickstart above, big part of the configuration resides in `graph.pbtxt` file. That's because model server text generation servables are deployed as MediaPipe graphs with dedicated LLM calculator that works with latest [OpenVINO GenAI](https://github.com/ilya-lavrenov/openvino.genai/tree/ct-beam-search/text_generation/causal_lm/cpp/continuous_batching/library) solutions. The calculator is designed to run in cycles and return the chunks of responses to the client.
+ As you can see in the quickstart above, a big part of the configuration resides in the `graph.pbtxt` file. That's because model server text generation servables are deployed as MediaPipe graphs with a dedicated LLM calculator that works with the latest [OpenVINO GenAI](https://github.com/openvinotoolkit/openvino.genai/tree/master/src/cpp/include/openvino/genai) library. The calculator is designed to run in cycles and return chunks of the response to the client.
On input, it expects an HttpPayload struct passed by the Model Server frontend:
@@ -82,8 +80,8 @@ The calculator supports the following `node_options` for tuning the pipeline con
- `optional uint64 block_size` - number of tokens for which the KV cache is stored in a single block (Paged Attention related) [default = 32];
- `optional uint64 max_num_seqs` - max number of sequences actively processed by the engine [default = 256];
- - `optional string device` - device to load models to. Supported values: "CPU" [default = "CPU"]
- - `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). Should be provided in the same format as for regular [models configuration](../parameters.md#model-configuration-options) [default = ""]
+ - `optional string device` - device to load models to. Supported values: "CPU", "GPU" [default = "CPU"]
+ - `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). Should be provided in the same format as for regular [models configuration](../parameters.md#model-configuration-options) [default = "{}"]
- `optional uint32 best_of_limit` - max value of the best_of parameter accepted by the endpoint [default = 20];
- `optional uint32 max_tokens_limit` - max value of the max_tokens parameter accepted by the endpoint [default = 4096];
@@ -96,8 +94,29 @@ You can track the actual usage of the cache in the server logs. You can observe
Consider increasing the `cache_size` parameter in case the logs report the usage getting close to 100%. When the cache is consumed, some of the running requests might be preempted to free cache for other requests to finish their generations (preemption will likely have a negative impact on performance, since the preempted request's cache will need to be recomputed when it gets processed again). When preemption is not possible, i.e. `cache_size` is very small and there is a single, long-running request that consumes it all, the request gets terminated when no more cache can be assigned to it, even before reaching its stopping criteria.
+ `enable_prefix_caching` can improve generation performance when the initial prompt content is repeated. That is the case with chat applications, which resend the history of the conversation. Thanks to prefix caching, there is no need to reevaluate the same sequence of tokens, so the first token is generated much quicker and the overall utilization of resources is lower. Old cache entries are cleared automatically, but it is recommended to increase `cache_size` to get a bigger performance advantage.
+ `plugin_config` accepts a JSON dictionary of tuning parameters for the OpenVINO plugin. It can tune the behavior of the inference runtime. For example, you can enable KV cache compression or set the dynamic quantization group size: '{"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"}'.

The LLM calculator config can also restrict the range of sampling parameters in the client requests. If needed, change the default values of `max_tokens_limit` and `best_of_limit`. This is meant to avoid memory overconsumption caused by invalid requests.
+ ## Canceling the generation
+ In order to optimize the usage of compute resources, it is important to stop the text generation when it becomes irrelevant for the client or when the client gets disconnected for any reason. Such capability is implemented via a tight integration between the LLM calculator and the model server frontend. The calculator gets notified about the client session disconnection. When the client application stops or deliberately breaks the session, the generation cycle gets interrupted and all resources are released. Below is a simple example of how the client can stop the generation:
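One way to exercise this from the shell is sketched below; the port, URL path and model name are assumptions. Starting a streamed request and then terminating the client closes the connection, which cancels the generation on the server.

```bash
# Start a streamed generation request in the background; port, URL path and model name are assumptions
curl -sN http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "stream": true, "max_tokens": 1000, "messages": [{"role": "user", "content": "Write a very long story."}]}' &
CURL_PID=$!
sleep 2
# Closing the client connection notifies the server frontend and the generation cycle is stopped
kill $CURL_PID
```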
@@ -136,7 +155,7 @@ Precision parameter is important and can influence performance, accuracy and mem
Export the tokenizer model with a command:
- convert_tokenizer -o {target folder name} --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens {tokenizer model in HF hub or Pytorch model folder}
+ convert_tokenizer -o {target folder name} --utf8_replace_mode replace --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens {tokenizer model in HF hub or Pytorch model folder}
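For instance, filling in the placeholders (the output folder and the Hugging Face model id below are only examples):

```bash
# Placeholder output folder and model id - substitute your own
convert_tokenizer -o Meta-Llama-3-8B-Instruct --utf8_replace_mode replace --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens meta-llama/Meta-Llama-3-8B-Instruct
```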
@@ -164,11 +183,11 @@ When default template is loaded, servable accepts `/chat/completions` calls when
## Limitations
- LLM calculator is a preview feature. It runs a set of accuracy, stability and performance tests, but the next releases targets production grade quality. It has now a set of known issues:
+ There are several known limitations which are expected to be addressed in the coming releases:
- - Metrics related to text generation are not exposed via `metrics` endpoint. Key metrics from LLM calculators are included in the server logs with information about active requests, scheduled for text generation and KV Cache usage. It is possible to track in the metrics the number of active generation requests using metric called `ovms_graphs_running`. Also tracking statistics for request and responses is possible. [Learn more](../metrics.md)
+ - Metrics related to text generation are not exposed via the `metrics` endpoint. Key metrics from the LLM calculator are included in the server logs, with information about active requests, requests scheduled for text generation, and KV cache usage. It is possible to track the number of active generation requests using the metric called `ovms_current_graphs` (see the example query after this list). Tracking statistics for requests and responses is also possible. [Learn more](../metrics.md)
- Multi-modal models are not supported yet. Images can't currently be sent as part of the context.
- - GPU device is not supported yet. It is planned for version 2024.5.
+ - `logprobs` parameter is currently not supported in greedy search (temperature=0) and in streaming mode. It includes only a single logprob and does not include values for input tokens.
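For example, assuming metrics are enabled and the REST interface listens on port 8000, the current number of running generation graphs can be checked with:

```bash
# Assumes metrics are enabled and the REST port is 8000
curl -s http://localhost:8000/metrics | grep ovms_current_graphs
```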
| temperature | ✅ | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
| top_p | ✅ | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
- | top_k | ✅ | ❌ | ✅ | int (default: `0`) | Controls the number of top tokens to consider. Set to 0 to consider all tokens. |
+ | top_k | ✅ | ❌ | ✅ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
| repetition_penalty | ✅ | ❌ | ✅ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
| frequency_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| presence_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| seed | ✅ | ✅ | ✅ | integer (default: `0`) | Random seed to use for the generation. |
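For illustration, the sampling parameters above can be combined in a single chat/completions request as sketched below; the port and model name are assumptions.

```bash
# Sampling parameters in a chat/completions request - port and model name are assumptions
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Suggest three names for a bakery."}],
    "temperature": 0.8,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 60
  }'
```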
| temperature | ✅ | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
| top_p | ✅ | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
- | top_k | ✅ | ❌ | ✅ | int (default: `0`) | Controls the number of top tokens to consider. Set to 0 to consider all tokens. |
+ | top_k | ✅ | ❌ | ✅ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
| repetition_penalty | ✅ | ❌ | ✅ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
| frequency_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| presence_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
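Similarly, a sketch of a completions request using these parameters (again, the port and model name are assumptions):

```bash
# Sampling parameters in a completions request - port and model name are assumptions
curl -s http://localhost:8000/v3/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "prompt": "OpenVINO Model Server is",
    "temperature": 0.7,
    "top_p": 0.9,
    "frequency_penalty": 0.5,
    "max_tokens": 40
  }'
```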