Commit da74488

Update LLM reference docs (#3761)
1 parent eb84e11 commit da74488

2 files changed (+11, -21 lines)

docs/llm/reference.md

Lines changed: 10 additions & 20 deletions
@@ -158,16 +158,16 @@ When using models with more complex templates and support for `tools` or `reason
 __Tool parsers:__
 - `hermes3` (also works for Qwen3 models)
 - `llama3`
-- `phi4` (no streaming support)
+- `phi4`
 - `mistral` (no streaming support)
+- `gptoss`
+- `qwen3coder`

 __Reasoning parsers:__
 - `qwen3`

-Those are the only acceptable values at the moment since OVMS supports `tools` handling in these particular models and `reasoning` in `Qwen3`.
-
 Note that using `tools` might require a chat template other than the original.
-We recommend using templates from [vLLM repository](https://github.com/vllm-project/vllm/tree/main/examples) for `hermes3`, `llama3`, `phi4` and `mistral` models. Save selected template as `chat_template.jinja` in model directory and it will be used instead of the default one.
+We recommend using templates from the [vLLM repository](https://github.com/vllm-project/vllm/tree/main/examples) for `hermes3`, `llama3`, `phi4`, `mistral`, `gptoss`, and `qwen3coder` models (if available). Save the selected template as `chat_template.jinja` in the model directory and it will be used instead of the default one. If a template is not available for your model, refer to the model's documentation or use the default template provided by the model server.

 When `tool_parser` is used, it's possible to leverage tool-guided generation with the `enable_tool_guided_generation` option. That setting pushes the model to generate tool calls that match the schemas specified in `tools`.
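
As a rough illustration of how a `tools` request looks once a tool parser is configured, the sketch below sends a single chat completion with one tool schema to a running model server. The port, the `/v3/chat/completions` path, the servable name, and the `get_weather` tool are assumptions made for this example, not values taken from this commit.

```python
# Minimal sketch, assuming OVMS listens on localhost:8000 and serves a model
# with a tool parser (e.g. "llama3") configured; names below are hypothetical.
import json
import requests

payload = {
    "model": "llama3-servable",  # hypothetical servable name
    "messages": [{"role": "user", "content": "What is the weather in Gdansk?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

resp = requests.post("http://localhost:8000/v3/chat/completions", json=payload)
resp.raise_for_status()
# With a tool parser enabled, parsed calls arrive in the OpenAI-style
# "tool_calls" field of the assistant message instead of raw text.
print(json.dumps(resp.json()["choices"][0]["message"], indent=2))
```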

@@ -214,10 +214,10 @@ In node configuration we set `models_path` indicating location of the directory
 ├── openvino_tokenizer.bin
 ├── openvino_tokenizer.xml
 ├── tokenizer_config.json
-├── template.jinja
+├── chat_template.jinja
 ```

-Main model as well as tokenizer and detokenizer are loaded from `.xml` and `.bin` files and all of them are required. `tokenizer_config.json` and `template.jinja` are loaded to read information required for chat template processing. Model directory may also contain `generation_config.json` which specifies recommended generation parameters.
+The main model as well as the tokenizer and detokenizer are loaded from `.xml` and `.bin` files, and all of them are required. `tokenizer_config.json` and `chat_template.jinja` are loaded to read information required for chat template processing. The model directory may also contain `generation_config.json`, which specifies recommended generation parameters.
 If such a file exists, the model server will use it to load the default generation configuration for processing requests to that model.

 Additionally, Visual Language Models have encoder and decoder models for text and vision, and potentially other auxiliary models.
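
The file layout above lends itself to a quick preflight check before starting the server. The sketch below assumes the main model files are named `openvino_model.xml`/`.bin` and that the detokenizer files follow the `openvino_detokenizer.*` convention (neither shown in the snippet above), and it treats `chat_template.jinja` and `generation_config.json` as optional, per the description.

```python
# Sketch: sanity-check an LLM model directory against the documented layout.
from pathlib import Path

REQUIRED = [
    "openvino_model.xml", "openvino_model.bin",          # assumed main model names
    "openvino_tokenizer.xml", "openvino_tokenizer.bin",
    "openvino_detokenizer.xml", "openvino_detokenizer.bin",
    "tokenizer_config.json",
]
OPTIONAL = ["chat_template.jinja", "generation_config.json"]

def check_model_dir(models_path: str) -> None:
    root = Path(models_path)
    missing = [name for name in REQUIRED if not (root / name).exists()]
    if missing:
        raise FileNotFoundError(f"servable cannot load, missing files: {missing}")
    for name in OPTIONAL:
        state = "found" if (root / name).exists() else "absent (defaults apply)"
        print(f"{name}: {state}")

check_model_dir("./models/my-llm")  # hypothetical models_path
```
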
@@ -239,20 +239,13 @@ When sending a request to `/completions` endpoint, model server adds `bos_token_
 When sending a request to the `/chat/completions` endpoint, the model server will try to apply the chat template to the request's `messages` contents.

 Loading the chat template proceeds as follows:
-1. If `tokenizer.jinja` is present, try to load template from it.
-2. If there is no `tokenizer.jinja` and `tokenizer_config.json` exists, try to read template from its `chat_template` field. If it's not present, use default template.
+1. If `chat_template.jinja` is present, try to load the template from it.
+2. If there is no `chat_template.jinja` and `tokenizer_config.json` exists, try to read the template from its `chat_template` field. If it's not present, use the default template.
 3. If `tokenizer_config.json` exists, try to read the `eos_token` and `bos_token` fields. If they are not present, both values are set to an empty string.

-**Note**: If both `template.jinja` file and `chat_completion` field from `tokenizer_config.json` are successfully loaded, `template.jinja` takes precedence over `tokenizer_config.json`.
-
-If no chat template has been specified, default template is applied. The template looks as follows:
-```
-"{% if messages|length != 1 %} {{ raise_exception('This servable accepts only single message requests') }}{% endif %}{{ messages[0]['content'] }}"
-```
-
-When default template is loaded, servable accepts `/chat/completions` calls when `messages` list contains only single element (otherwise returns error) and treats `content` value of that single message as an input prompt for the model.
+If both the `chat_template.jinja` file and the `chat_template` field from `tokenizer_config.json` are successfully loaded, `chat_template.jinja` takes precedence over `tokenizer_config.json`.

-**Note:** Template is not applied for calls to `/completions`, so it doesn't have to exist, if you plan to work only with `/completions`.
+The template is not applied for calls to `/completions`, so it doesn't have to exist if you plan to work only with `/completions`.

 Errors during configuration file processing (access issues, corrupted files, incorrect content) result in servable loading failure.
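
The resolution order above can be condensed into a short sketch; this mirrors the documented precedence only and is not the server's actual implementation.

```python
# Sketch of the documented chat template resolution order (illustration only).
import json
from pathlib import Path

def resolve_chat_template(model_dir: str) -> str | None:
    root = Path(model_dir)
    # 1. A chat_template.jinja file takes precedence when present.
    template_file = root / "chat_template.jinja"
    if template_file.exists():
        return template_file.read_text()
    # 2. Otherwise fall back to the chat_template field of tokenizer_config.json.
    config_path = root / "tokenizer_config.json"
    if config_path.exists():
        config = json.loads(config_path.read_text())
        return config.get("chat_template")  # None means the default template applies
    return None
```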

@@ -276,9 +269,6 @@ There are several known limitations which are expected to be addressed in the co

 - Metrics related to text generation are not exposed via the `metrics` endpoint. Key metrics from LLM calculators are included in the server logs, with information about active requests scheduled for text generation and KV cache usage. It is possible to track the number of active generation requests using the metric called `ovms_current_graphs`. Tracking statistics for requests and responses is also possible. [Learn more](../metrics.md)
 - The `logprobs` parameter is currently not supported in streaming mode. It includes only a single logprob and does not include values for input tokens
-- Server logs might sporadically include a message "PCRE2 substitution failed with error code -55" - this message can be safely ignored. It will be removed in next version
-- using `tools` is supported only for Hermes3, Llama3, Phi4 and Qwen3 models
-- using `tools` is not supported in streaming mode
 - using `tools` is not supported in configurations without Python

 Some servable types introduce additional limitations:

windows_prepare_llm_models.bat

Lines changed: 1 addition & 1 deletion
@@ -202,7 +202,7 @@ if not exist "%~1\%MISTRAL_MODEL%\%TOKENIZER_FILE%" (
 if exist "%~1\%GPTOSS_MODEL%\%TOKENIZER_FILE%" (
     echo Models file %~1\%GPTOSS_MODEL%\%TOKENIZER_FILE% exists. Skipping downloading models.
 ) else (
-    echo Downloading tokenizer and detokenizer for Mistral model to %~1\%GPTOSS_MODEL% directory.
+    echo Downloading tokenizer and detokenizer for GPT-OSS model to %~1\%GPTOSS_MODEL% directory.
     mkdir "%~1\%GPTOSS_MODEL%"
     convert_tokenizer "%GPTOSS_MODEL%" --with_detokenizer -o "%~1\%GPTOSS_MODEL%"
     if !errorlevel! neq 0 exit /b !errorlevel!
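
For reference, the `convert_tokenizer` step invoked by the script is also available as a Python API from the `openvino-tokenizers` package; a sketch of an equivalent conversion is below, with a hypothetical model id and output path.

```python
# Sketch: rough Python equivalent of the convert_tokenizer CLI call above.
import openvino as ov
from openvino_tokenizers import convert_tokenizer
from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")  # hypothetical id
ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True)
ov.save_model(ov_tokenizer, "models/gpt-oss/openvino_tokenizer.xml")
ov.save_model(ov_detokenizer, "models/gpt-oss/openvino_detokenizer.xml")
```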
