`docs/llm/reference.md` (10 additions, 20 deletions)
```diff
@@ -158,16 +158,16 @@ When using models with more complex templates and support for `tools` or `reason
 __Tool parsers:__
 - `hermes3` (also works for Qwen3 models)
 - `llama3`
-- `phi4` (no streaming support)
+- `phi4`
 - `mistral` (no streaming support)
+- `gptoss`
+- `qwen3coder`
 
 __Reasoning parsers:__
 - `qwen3`
 
-Those are the only acceptable values at the moment since OVMS supports `tools` handling in these particular models and `reasoning` in `Qwen3`.
-
 Note that using `tools` might require a chat template other than the original.
-We recommend using templates from [vLLM repository](https://github.com/vllm-project/vllm/tree/main/examples) for `hermes3`, `llama3`, `phi4`and `mistral` models. Save selected template as `chat_template.jinja` in model directory and it will be used instead of the default one.
+We recommend using templates from the [vLLM repository](https://github.com/vllm-project/vllm/tree/main/examples) for `hermes3`, `llama3`, `phi4`, `mistral`, `gptoss`, and `qwen3coder` models (if available). Save the selected template as `chat_template.jinja` in the model directory and it will be used instead of the default one. If a template is not available for your model, please refer to the model's documentation or use the default template provided by the model server.
 
 
 When `tool_parser` is used, it's possible to leverage tool guided generation with the `enable_tool_guided_generation` option. That setting pushes the model to generate tool calls that match the schemas specified in `tools`.
```
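For context, the request side of this feature is the standard OpenAI-style `tools` field. Below is a minimal Python sketch of such a call; the base URL, port, model name, and the `get_weather` tool are illustrative placeholders, not values taken from this change:

```python
from openai import OpenAI

# Placeholder endpoint and model name -- adjust to your OVMS deployment.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

# Hypothetical tool definition, used only for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is the weather in Krakow?"}],
    tools=tools,
)

# With a matching tool parser configured (e.g. `llama3`), the raw model
# output is returned as structured tool calls rather than plain text.
print(response.choices[0].message.tool_calls)
```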
````diff
@@ -214,10 +214,10 @@ In node configuration we set `models_path` indicating location of the directory
 ├── openvino_tokenizer.bin
 ├── openvino_tokenizer.xml
 ├── tokenizer_config.json
-├── template.jinja
+├── chat_template.jinja
 ```
 
-Main model as well as tokenizer and detokenizer are loaded from `.xml` and `.bin` files and all of them are required. `tokenizer_config.json` and `template.jinja` are loaded to read information required for chat template processing. Model directory may also contain `generation_config.json` which specifies recommended generation parameters.
+Main model as well as tokenizer and detokenizer are loaded from `.xml` and `.bin` files and all of them are required. `tokenizer_config.json` and `chat_template.jinja` are loaded to read information required for chat template processing. Model directory may also contain `generation_config.json` which specifies recommended generation parameters.
 If such a file exists, the model server will use it to load the default generation configuration for processing requests to that model.
 
 Additionally, Visual Language Models have encoder and decoder models for text and vision and potentially other auxiliary models.
````
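The `generation_config.json` behavior described above can be sketched in a few lines of Python. This is only an illustration of the documented role of the file (the server itself is implemented in C++), and the assumption that per-request parameters override the file's defaults is ours, not stated in the diff:

```python
import json
from pathlib import Path

def load_generation_defaults(models_path: str) -> dict:
    """Read recommended generation parameters, if the optional file exists."""
    config_file = Path(models_path) / "generation_config.json"
    if not config_file.is_file():
        return {}
    return json.loads(config_file.read_text())

def resolve_generation_params(models_path: str, request_params: dict) -> dict:
    # Start from the defaults shipped with the model, then apply the
    # values explicitly set in the incoming request (assumed to win).
    params = load_generation_defaults(models_path)
    params.update(request_params)
    return params

# E.g. with {"temperature": 0.6, "top_p": 0.9} in the file, a request setting
# temperature=0.2 would run with temperature=0.2 and top_p=0.9.
print(resolve_generation_params("./model", {"temperature": 0.2}))
```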
````diff
@@ -239,20 +239,13 @@ When sending a request to `/completions` endpoint, model server adds `bos_token_
 When sending a request to `/chat/completions` endpoint, model server will try to apply chat template to request `messages` contents.
 
 Loading chat template proceeds as follows:
-1. If `tokenizer.jinja` is present, try to load template from it.
-2. If there is no `tokenizer.jinja` and `tokenizer_config.json` exists, try to read template from its `chat_template` field. If it's not present, use default template.
+1. If `chat_template.jinja` is present, try to load template from it.
+2. If there is no `chat_template.jinja` and `tokenizer_config.json` exists, try to read template from its `chat_template` field. If it's not present, use default template.
 3. If `tokenizer_config.json` exists, try to read `eos_token` and `bos_token` fields. If they are not present, both values are set to empty strings.
 
-**Note**: If both `template.jinja` file and `chat_completion` field from `tokenizer_config.json` are successfully loaded, `template.jinja` takes precedence over `tokenizer_config.json`.
-
-If no chat template has been specified, default template is applied. The template looks as follows:
-```
-"{% if messages|length != 1 %} {{ raise_exception('This servable accepts only single message requests') }}{% endif %}{{ messages[0]['content'] }}"
-```
-
-When default template is loaded, servable accepts `/chat/completions` calls when `messages` list contains only single element (otherwise returns error) and treats `content` value of that single message as an input prompt for the model.
+If both `chat_template.jinja` file and `chat_template` field from `tokenizer_config.json` are successfully loaded, `chat_template.jinja` takes precedence over `tokenizer_config.json`.
 
-**Note:**Template is not applied for calls to `/completions`, so it doesn't have to exist, if you plan to work only with `/completions`.
+Template is not applied for calls to `/completions`, so it doesn't have to exist if you plan to work only with `/completions`.
 
 
 Errors during configuration file processing (access issues, corrupted files, incorrect content) result in servable loading failure.
````
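The renamed lookup order is easy to misread in diff form, so here is a short Python sketch of steps 1 and 2 plus the precedence rule. It is illustrative only; the real server implements this in C++, and the default template string is the one visible in the removed lines above:

```python
import json
from pathlib import Path

# Default template from the (removed) docs text: accepts a single message only.
DEFAULT_TEMPLATE = (
    "{% if messages|length != 1 %} {{ raise_exception('This servable accepts "
    "only single message requests') }}{% endif %}{{ messages[0]['content'] }}"
)

def load_chat_template(model_dir: str) -> str:
    """Illustrative sketch of the chat template loading order."""
    root = Path(model_dir)

    # 1. chat_template.jinja takes precedence when present.
    template_file = root / "chat_template.jinja"
    if template_file.is_file():
        return template_file.read_text()

    # 2. Otherwise fall back to the chat_template field of tokenizer_config.json.
    config_file = root / "tokenizer_config.json"
    if config_file.is_file():
        template = json.loads(config_file.read_text()).get("chat_template")
        if template:
            return template

    # Neither source provided a template: use the default single-message one.
    return DEFAULT_TEMPLATE
```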
```diff
@@ -276,9 +269,6 @@ There are several known limitations which are expected to be addressed in the co
 
 - Metrics related to text generation are not exposed via the `metrics` endpoint. Key metrics from LLM calculators are included in the server logs, with information about active requests scheduled for text generation and KV cache usage. It is possible to track the number of active generation requests in the metrics using the metric called `ovms_current_graphs`. Tracking statistics for requests and responses is also possible. [Learn more](../metrics.md)
 - `logprobs` parameter is not supported currently in streaming mode. It includes only a single logprob and does not include values for input tokens
-- Server logs might sporadically include a message "PCRE2 substitution failed with error code -55" - this message can be safely ignored. It will be removed in next version
-- using `tools` is supported only for Hermes3, Llama3, Phi4 and Qwen3 models
-- using `tools` is not supported in streaming mode
 - using `tools` is not supported in configuration without Python
 
 Some servable types introduce additional limitations:
```
0 commit comments