
OpenVINO™ Model Server 2025.1

2025.1 is a major release adding support for vision language models and enabling text generation on the NPU accelerator.

VLM support

The chat/completions endpoint has been extended to support vision language models. It is now possible to send images as part of the chat context. Vision language models can be deployed just like LLM models.

Check the end-to-end demo: Link

Updated API reference: Link
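As a minimal sketch of the new capability, the request below sends a base64-encoded image to the chat/completions endpoint. The host, port and model name are illustrative assumptions; only the message structure with an image_url content part reflects the documented behavior.

```bash
# Encode a local image and send it in the chat context.
# localhost:8000 and the model name are illustrative assumptions.
IMG_B64=$(base64 -w0 zebra.jpeg)
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "model": "OpenGVLab/InternVL2-2B",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this picture?"},
        {"type": "image_url",
         "image_url": {"url": "data:image/jpeg;base64,${IMG_B64}"}}
      ]
    }
  ],
  "max_tokens": 100
}
EOF
```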

Text Generation on NPU

It is now possible to deploy LLM and VLM models on the NPU accelerator. Text generation is exposed over the completions and chat/completions endpoints. From the client perspective it works the same way as with GPU and CPU deployments, but it does not support the continuous batching algorithm. NPU is targeted at AI PC use cases with low concurrency.

Check the NPU LLM demo and NPU VLM demo.
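From the client's point of view, a request to an NPU-served model looks the same as one sent to a CPU or GPU deployment. The sketch below assumes the server listens on localhost:8000 and serves the model under the name shown; both are illustrative.

```bash
# Plain text completion request; identical for NPU, GPU and CPU deployments.
# Endpoint host/port and model name are assumptions.
curl -s http://localhost:8000/v3/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "prompt": "What is OpenVINO?",
        "max_tokens": 64
      }'
```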

Model management improvements

  • Option to start MediaPipe graphs and generative endpoints from the CLI without a configuration file. Simply point the --model_path CLI argument to a directory containing a MediaPipe graph (see the sketch after this list).
  • Unified JSON configuration file structure for models and graphs under the models_config_list section.
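The sketch below illustrates both points. The paths, model names and config values are assumptions; only the --model_path option and the models_config_list section name come from the notes above.

```bash
# Start OVMS straight from a directory containing a MediaPipe graph (graph.pbtxt),
# with no configuration file. Paths and names are illustrative assumptions.
ovms --rest_port 8000 \
     --model_name my_graph \
     --model_path /models/my_graph

# Equivalent config-file deployment: classic models and graphs can now be listed
# together under the models_config_list section (field values assumed).
cat > config.json <<'EOF'
{
  "models_config_list": [
    {"config": {"name": "resnet",   "base_path": "/models/resnet"}},
    {"config": {"name": "my_graph", "base_path": "/models/my_graph"}}
  ]
}
EOF
ovms --rest_port 8000 --config_path config.json
```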

Breaking changes

  • The gRPC server is now optional and there is no default gRPC port. The --port parameter is mandatory to start the gRPC server. It is possible to start only the REST API server with the --rest_port parameter. At least one port number needs to be defined to start OVMS from the CLI (--port for gRPC or --rest_port for REST). Starting OVMS via the C-API does not require any port to be defined.
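For illustration, the commands below (config path assumed) show the three valid ways to choose which servers to start:

```bash
# REST API only: no gRPC server is started.
ovms --rest_port 8000 --config_path config.json

# gRPC only: --port must now be given explicitly, since there is no default.
ovms --port 9000 --config_path config.json

# Both interfaces.
ovms --port 9000 --rest_port 8000 --config_path config.json
```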

Other changes

  • Updated scalability demonstration using multiple instances: Link

  • Increased the allowed number of text generation stop words in a request from 4 to 16 (see the request sketch after this list)

  • Enabled and tested OVMS integration with the Continue extension for Visual Studio Code. OpenVINO Model Server can be used as a backend for code completion and the built-in IDE chat assistant. Check out the instructions: Link

  • Performance improvements: enhancements in OpenVINO Runtime and in the text sampling generation algorithm, which should increase throughput under high concurrency load
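The request sketch below exercises the raised stop-word limit by passing more than four stop sequences. Host, port, model name and the sequences themselves are illustrative.

```bash
# chat/completions request with several stop sequences (up to 16 are now accepted).
# Host, port and model name are assumptions.
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "List three colors."}],
        "stop": ["###", "Observation:", "User:", "Assistant:", "</answer>"],
        "max_tokens": 128
      }'
```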

Bug fixes

  • Fixed handling of the LLM context length: OVMS now stops generating text when the model context is exceeded. An error is raised when the prompt is longer than the context or when max_tokens plus the input tokens exceed the model context.

  • Security and stability improvements

  • Fixed cancellation of text generation workloads: clients can stop generation in non-streaming scenarios by simply closing the connection

Known issues and limitations

The chat/completions API accepts images encoded in base64 format but does not accept image URLs.

Qwen Vision models deployed on GPU might hit an execution error when the input image resolution is too high. It is recommended to edit the model's preprocessor_config.json and lower the max_pixels parameter. This ensures images are automatically resized to a smaller resolution, which avoids the failure on GPU and improves performance. In some cases, accuracy might be impacted, though.
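A hedged way to apply that workaround is sketched below; the model path and the chosen max_pixels value are purely illustrative, and jq is assumed to be available.

```bash
# Lower max_pixels so oversized images are resized before reaching the GPU.
# Path and value are illustrative; pick a limit suited to your accuracy needs.
CFG=/models/Qwen2.5-VL-7B-Instruct/preprocessor_config.json
jq '.max_pixels = 1003520' "$CFG" > "$CFG.tmp" && mv "$CFG.tmp" "$CFG"
```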

Note that by default, NPU limits the prompt length to 1024 tokens. You can modify that limit with the --max_prompt_len parameter in the model export script, or by manually modifying the MAX_PROMPT_LEN plugin config parameter in graph.pbtxt.
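For example, an export invocation along the lines below raises the limit. Only --max_prompt_len is the parameter named above; the script name, the other flags and the chosen value are assumptions based on the OVMS export tooling.

```bash
# Export a model for NPU with a larger prompt-length limit.
# Only --max_prompt_len is the documented parameter; the rest is illustrative.
python export_model.py text_generation \
    --source_model meta-llama/Llama-3.2-1B-Instruct \
    --target_device NPU \
    --max_prompt_len 2048 \
    --model_repository_path models
```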

You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:

  • docker pull openvino/model_server:2025.1 - CPU device support
  • docker pull openvino/model_server:2025.1-gpu - GPU, NPU and CPU device support

or use the provided binary packages.
The prebuilt image is also available on the Red Hat Ecosystem Catalog.
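As a usage sketch, the pulled image can be started like this; the mounted directory, model name and port are assumptions.

```bash
# Serve a model from the host directory ./models with the CPU image,
# exposing only the REST API on port 8000 (values assumed).
docker run -d --rm -p 8000:8000 \
  -v "$(pwd)/models:/models:ro" \
  openvino/model_server:2025.1 \
  --rest_port 8000 \
  --model_name resnet \
  --model_path /models/resnet
```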