Commit f99d997

Prompt lookup decoding docs update (#3280)
1 parent e9ba878 commit f99d997

4 files changed (+35, -4 lines)

demos/common/export_models/export_model.py

Lines changed: 4 additions & 0 deletions
@@ -50,6 +50,7 @@ def add_common_arguments(parser):
         'Equal to draft_source_model if HF model name is used. Available only in draft_source_model has been specified.', dest='draft_model_name')
     parser_text.add_argument('--max_prompt_len', required=False, type=int, default=None, help='Sets NPU specific property for maximum number of tokens in the prompt. '
         'Not effective if target device is not NPU', dest='max_prompt_len')
+    parser_text.add_argument('--prompt_lookup_decoding', action='store_true', help='Set pipeline to use prompt lookup decoding', dest='prompt_lookup_decoding')

     parser_embeddings = subparsers.add_parser('embeddings', help='export model for embeddings endpoint')
     add_common_arguments(parser_embeddings)
@@ -330,6 +331,9 @@ def export_text_generation_model(model_repository_path, source_model, model_name
         plugin_config['MAX_PROMPT_LEN'] = task_parameters['max_prompt_len']
     if task_parameters['ov_cache_dir'] is not None:
         plugin_config['CACHE_DIR'] = task_parameters['ov_cache_dir']
+
+    if task_parameters['prompt_lookup_decoding']:
+        plugin_config['prompt_lookup'] = True

     # Additional plugin properties for HETERO
     if "HETERO" in task_parameters['target_device']:

docs/llm/reference.md

Lines changed: 9 additions & 4 deletions
@@ -100,7 +100,7 @@ The calculator supports the following `node_options` for tuning the pipeline con
 - `optional uint64 max_num_seqs` - max number of sequences actively processed by the engine [default = 256];
 - `optional bool dynamic_split_fuse` - use Dynamic Split Fuse token scheduling [default = true];
 - `optional string device` - device to load models to. Supported values: "CPU", "GPU" [default = "CPU"]
-- `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes.html). Should be provided in the same format for regular [models configuration](../parameters.md#model-configuration-options) [default = "{}"]
+- `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes.html) and additional pipeline options. Should be provided in the same format as for regular [models configuration](../parameters.md#model-configuration-options) [default = "{}"]
 - `optional uint32 best_of_limit` - max value of best_of parameter accepted by endpoint [default = 20];
 - `optional uint32 max_tokens_limit` - max value of max_tokens parameter accepted by endpoint;
 - `optional bool enable_prefix_caching` - enable caching of KV-blocks [default = false];
@@ -118,14 +118,19 @@ utilization of resource will be lower. Old cache will be cleared automatically b

 `dynamic_split_fuse` [algorithm](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen#b-dynamic-splitfuse-) is enabled by default to boost the throughput by splitting the tokens to even chunks. In some conditions like with very low concurrency or with very short prompts, it might be beneficial to disable this algorithm. When it is disabled, there should be set also the parameter `max_num_batched_tokens` to match the model max context length.

-`plugin_config` accepts a json dictionary of tuning parameters for the OpenVINO plugin. It can tune the behavior of the inference runtime. For example you can include there kv cache compression or the group size `{"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"}`.
-
-**Important for NPU users**: NPU plugin sets a limitation on prompt (1024 tokens by default) that can be modified by setting `MAX_PROMPT_LEN` in `plugin_config`, for example to double that limit set: `{"MAX_PROMPT_LEN": 2048}`
+`plugin_config` accepts a json dictionary of tuning parameters for the OpenVINO plugin. It can tune the behavior of the inference runtime. For example you can include there kv cache compression or the group size `{"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"}`. It also holds additional options that are described below.

 The LLM calculator config can also restrict the range of sampling parameters in the client requests. If needed change the default values for `best_of_limit` or set `max_tokens_limit`. It is meant to avoid the result of memory overconsumption by invalid requests.

 **Note that the following options are ignored in Stateful servables (so in deployments on NPU): cache_size, dynamic_split_fuse, max_num_batched_tokens, max_num_seq, enable_prefix_caching**

+### Additional configuration in plugin_config
+
+As mentioned above, in LLM pipelines the `plugin_config` map holds not only OpenVINO device plugin options but also additional pipeline configuration. These additional options are:
+
+- `prompt_lookup` - if set to `true`, the pipeline will use the [prompt lookup decoding](https://github.com/apoorvumang/prompt-lookup-decoding) technique for sampling new tokens. Example: `plugin_config: '{"prompt_lookup": true}'`
+- `MAX_PROMPT_LEN` (**important for NPU users**) - the NPU plugin sets a limit on prompt length (1024 tokens by default); this option allows modifying that value. Example: `plugin_config: '{"MAX_PROMPT_LEN": 2048}'`
+

 ## Canceling the generation

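To make the combined map concrete, here is a small sketch, not taken from the commit, that builds a plugin_config value mixing an OpenVINO device plugin option with the two pipeline options documented above; the graph config expects the result as a JSON string.

import json

# Sketch only: one OpenVINO device plugin option plus the two pipeline options
# described in the section above. Which keys are needed depends on the
# deployment; MAX_PROMPT_LEN is only meaningful when the target device is NPU.
plugin_config = {
    "KV_CACHE_PRECISION": "u8",  # device plugin option from the example above
    "prompt_lookup": True,       # pipeline option: enable prompt lookup decoding
    "MAX_PROMPT_LEN": 2048,      # raises the default 1024-token NPU prompt limit
}

# node_options take plugin_config as a JSON string, e.g. plugin_config: '{"prompt_lookup": true}'
print(json.dumps(plugin_config))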
docs/model_server_rest_api_chat.md

Lines changed: 11 additions & 0 deletions
@@ -129,6 +129,17 @@ Note that below parameters are valid only for speculative pipeline. See [specula
 | num_assistant_tokens ||| ⚠️ | int | This value defines how many tokens should a draft model generate before main model validates them. Equivalent of `num_speculative_tokens` in vLLM. Cannot be used with `assistant_confidence_threshold`. |
 | assistant_confidence_threshold |||| float | This parameter determines confidence level for continuing generation. If draft model generates token with confidence below that threshold, it stops generation for the current cycle and main model starts validation. Cannot be used with `num_assistant_tokens`. |

+#### Prompt lookup decoding specific
+
+Note that the parameters below are valid only for the prompt lookup pipeline. Add `"prompt_lookup": true` to `plugin_config` in your graph config node options to serve it.
+
+| Param | OpenVINO Model Server | OpenAI /chat/completions API | vLLM Serving Sampling Params | Type | Description |
+|-------|----------|----------|----------|---------|-----|
+| num_assistant_tokens |||| int | Number of candidate tokens proposed after an ngram match is found |
+| max_ngram_size |||| int | The maximum ngram size to use when looking for matches in the prompt |
+
+**Note**: vLLM does not support these parameters as sampling parameters, but enables prompt lookup decoding by setting them in the [LLM config](https://docs.vllm.ai/en/stable/features/spec_decode.html#speculating-by-matching-n-grams-in-the-prompt)
+
 #### Unsupported params from OpenAI service:
 - logit_bias
 - top_logprobs
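For reference, a hedged sketch of a chat request that passes these parameters through the OpenAI Python client; the base URL, port, and model name are assumptions for illustration, and the graph must be served with "prompt_lookup": true in plugin_config for them to take effect.

from openai import OpenAI

# Assumed endpoint and model name; the api_key value is only a placeholder.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed served model name
    messages=[{"role": "user", "content": "Repeat the list back to me: one, two, three."}],
    max_tokens=100,
    extra_body={                    # non-OpenAI params are sent via extra_body
        "num_assistant_tokens": 5,  # candidate tokens proposed after an ngram match
        "max_ngram_size": 3,        # longest ngram searched for in the prompt
    },
)
print(response.choices[0].message.content)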

docs/model_server_rest_api_completions.md

Lines changed: 11 additions & 0 deletions
@@ -88,6 +88,17 @@ curl http://localhost/v3/completions \
 | num_assistant_tokens ||| ⚠️ | int | This value defines how many tokens should a draft model generate before main model validates them. Equivalent of `num_speculative_tokens` in vLLM. Cannot be used with `assistant_confidence_threshold`. |
 | assistant_confidence_threshold |||| float | This parameter determines confidence level for continuing generation. If draft model generates token with confidence below that threshold, it stops generation for the current cycle and main model starts validation. Cannot be used with `num_assistant_tokens`. |

+#### Prompt lookup decoding specific
+
+Note that the parameters below are valid only for the prompt lookup pipeline. Add `"prompt_lookup": true` to `plugin_config` in your graph config node options to serve it.
+
+| Param | OpenVINO Model Server | OpenAI /completions API | vLLM Serving Sampling Params | Type | Description |
+|-------|----------|----------|----------|---------|-----|
+| num_assistant_tokens |||| int | Number of candidate tokens proposed after an ngram match is found |
+| max_ngram_size |||| int | The maximum ngram size to use when looking for matches in the prompt |
+
+**Note**: vLLM does not support these parameters as sampling parameters, but enables prompt lookup decoding by setting them in the [LLM config](https://docs.vllm.ai/en/stable/features/spec_decode.html#speculating-by-matching-n-grams-in-the-prompt)
+
 #### Unsupported params from OpenAI service:
 - logit_bias
 - suffix
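The same parameters can be sent to the completions endpoint. The sketch below again uses the OpenAI Python client with an assumed base URL, port, and model name, and requires the graph to be served with "prompt_lookup": true in plugin_config.

from openai import OpenAI

# Assumed endpoint and model name; the api_key value is only a placeholder.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

response = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed served model name
    prompt="To be, or not to be, that is the question:",
    max_tokens=64,
    extra_body={                    # non-OpenAI params are sent via extra_body
        "num_assistant_tokens": 5,  # candidate tokens proposed after an ngram match
        "max_ngram_size": 3,        # longest ngram searched for in the prompt
    },
)
print(response.choices[0].text)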
