The 2025.1 is a major release adding support for vision language models and enabling text generation on the NPU accelerator.
VLM support
The `chat/completions` endpoint has been extended to support vision language models. It is now possible to send images in the chat context. Vision language models can be deployed just like LLM models.
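A minimal request sketch, assuming the OpenAI-compatible REST path `/v3/chat/completions` used in the OVMS demos; the host, port, model name, and image file are placeholders, and the image is sent as a base64 data URI:

```
# Encode a local image and include it in a chat/completions request.
IMG_B64=$(base64 -w0 image.jpg)
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-vl",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this picture?"},
          {"type": "image_url",
           "image_url": {"url": "data:image/jpeg;base64,'"$IMG_B64"'"}}
        ]
      }
    ],
    "max_tokens": 128
  }'
```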
Check the end-to-end demo: Link
Updated API reference: Link
Text Generation on NPU
It is now possible to deploy LLM and VLM models on the NPU accelerator. Text generation is exposed over the `completions` and `chat/completions` endpoints. From the client perspective it works the same way as with GPU and CPU deployments, but it does not support the continuous batching algorithm. NPU is targeted at AI PC use cases with low concurrency.
Check the NPU LLM demo and NPU VLM demo.
Model management improvements
- Option to start MediaPipe graphs and generative endpoints from the CLI without a configuration file. Simply point the `--model_path` CLI argument to a directory containing a MediaPipe graph (see the example after this list).
- Unified JSON configuration file structure for models and graphs under the `models_config_list` section.
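A minimal sketch of a config-less startup; the host path /opt/graphs/my_graph, the model name my_graph, and the port are illustrative placeholders:

```
# Serve a MediaPipe graph directly from a directory, no config.json needed.
docker run -d --rm -p 8000:8000 \
  -v /opt/graphs/my_graph:/graph:ro \
  openvino/model_server:2025.1 \
  --rest_port 8000 \
  --model_name my_graph \
  --model_path /graph
```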
Breaking changes
- The gRPC server is now optional and there is no default gRPC port. The `--port` parameter is mandatory to start the gRPC server. It is possible to start only the REST API server with the `--rest_port` parameter. At least one port number needs to be defined to start OVMS from the CLI (`--port` for gRPC or `--rest_port` for REST). Starting OVMS via the C-API does not require any port to be defined. See the sketch below this list.
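The valid port combinations, sketched with placeholder model paths, model names, and port numbers:

```
# gRPC only
ovms --port 9000 --model_name my_model --model_path /models/my_model

# REST only
ovms --rest_port 8000 --model_name my_model --model_path /models/my_model

# Both interfaces
ovms --port 9000 --rest_port 8000 --model_name my_model --model_path /models/my_model

# Starting from the CLI with neither --port nor --rest_port is rejected
```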
Other changes
- Updated scalability demonstration using multiple instances: Link
- Increased the allowed number of text generation stop words in a request from 4 to 16
- Enabled and tested OVMS integration with the Continue extension for Visual Studio Code. OpenVINO Model Server can be used as a backend for code completion and the built-in IDE chat assistant. Check out the instructions: Link
- Performance improvements: enhancements in OpenVINO Runtime and in the text sampling generation algorithm, which should increase throughput under high concurrency load
Bug fixes
- Fixed handling of the LLM context length: OVMS now stops generating text when the model context is exceeded. An error is raised when the prompt is longer than the context or when `max_tokens` plus the input tokens exceed the model context.
- Security and stability improvements
- Fixed cancellation of text generation workloads: clients can stop generation in non-streaming scenarios by simply closing the connection
Known issues and limitations
The `chat/completions` API accepts images encoded in base64 format but does not accept the URL format.
Qwen Vision models deployed on GPU might hit an execution error when the input image resolution is too high. It is recommended to edit the model's `preprocessor_config.json` and lower the `max_pixels` parameter. This ensures images are automatically resized to a smaller resolution, which avoids the failure on GPU and improves performance. In some cases, accuracy might be impacted, though.
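A sketch of lowering `max_pixels` with `jq`; the file path and the value 1003520 are illustrative only:

```
# Rewrite max_pixels in place; pick a value that fits your GPU
jq '.max_pixels = 1003520' /models/qwen2-vl/preprocessor_config.json > tmp.json \
  && mv tmp.json /models/qwen2-vl/preprocessor_config.json
```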
Note that by default, NPU limits the prompt length to 1024 tokens. You can modify that limit with the `--max_prompt_len` parameter of the model export script, or by manually changing the `MAX_PROMPT_LEN` plugin config parameter in `graph.pbtxt`.
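As a hedged sketch, assuming the `export_model.py` script referenced in the NPU demos; the model name, paths, and flags other than `--max_prompt_len` are assumptions, so check the demo for the exact options:

```
# Hypothetical export invocation raising the NPU prompt length limit to 2048 tokens;
# verify the flag set against the NPU LLM demo before use
python export_model.py text_generation \
  --source_model meta-llama/Llama-3.2-3B-Instruct \
  --target_device NPU \
  --max_prompt_len 2048 \
  --model_repository_path models
```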
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2025.1` - CPU device support
`docker pull openvino/model_server:2025.1-gpu` - GPU, NPU and CPU device support
or use the provided binary packages.
The prebuilt image is also available on the Red Hat Ecosystem Catalog.