Releases: openvinotoolkit/model_server
OpenVINO™ Model Server 2025.1
The 2025.1 release is a major release adding support for vision language models and enabling text generation on the NPU accelerator.
VLM support
The `chat/completions` endpoint has been extended to support vision language models. It is now possible to send images as part of the chat context. Vision language models can be deployed just like LLM models.
Check the end-to-end demo: Link
Updated API reference: Link
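For illustration, a minimal client sketch is shown below. It assumes a server listening on localhost:8000, a served model named my_vlm, and a local image file; none of these names come from this release note, so substitute the values from your own deployment.

```python
# Hypothetical sketch of sending a base64-encoded image to the chat/completions endpoint.
# Host, port, model name, and image path are placeholders.
import base64
import requests

with open("sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "my_vlm",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this picture?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    "max_tokens": 128,
}

response = requests.post("http://localhost:8000/v3/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```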
Text Generation on NPU
It is now possible to deploy LLM and VLM models on the NPU accelerator. Text generation is exposed over the completions and chat/completions endpoints. From the client perspective it works the same way as with GPU and CPU deployments, but the continuous batching algorithm is not supported. NPU is targeted at AI PC use cases with low concurrency.
Check the NPU LLM demo and NPU VLM demo.
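A minimal completions request might look like the sketch below; the host, port, and model name are placeholders, and the same client call applies regardless of whether the model was loaded on CPU, GPU, or NPU.

```python
# Minimal sketch of a non-streaming completions request. Host, port, and model name
# are placeholders for your own deployment.
import requests

payload = {
    "model": "my_llm",
    "prompt": "OpenVINO Model Server is",
    "max_tokens": 64,
    "temperature": 0,
}
response = requests.post("http://localhost:8000/v3/completions", json=payload)
print(response.json()["choices"][0]["text"])
```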
Model management improvements
- Option to start MediaPipe graphs and generative endpoints from the CLI without a configuration file. Simply point the `--model_path` CLI argument to the directory with the MediaPipe graph.
- Unified JSON configuration file structure for models and graphs under the `models_config_list` section.
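As a rough illustration of the unified layout, the sketch below writes a config file with a `models_config_list` section; the entry fields ("config", "name", "base_path") are assumptions here, so consult the configuration documentation for the authoritative schema.

```python
# Hypothetical illustration of a unified config file: plain models and MediaPipe
# graphs listed under "models_config_list". Entry field names are assumptions.
import json

config = {
    "models_config_list": [
        {"config": {"name": "resnet", "base_path": "/models/resnet"}},
        {"config": {"name": "text_graph", "base_path": "/models/text_graph"}},
    ]
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
```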
Breaking changes
- The gRPC server is now optional. There is no default gRPC port set, and the `--port` parameter is mandatory to start the gRPC server. It is possible to start only the REST API server with the `--rest_port` parameter. At least one port number needs to be defined to start OVMS from the CLI (`--port` for gRPC or `--rest_port` for REST). Starting OVMS via the C-API does not require any port to be defined.
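For example, a REST-only start could look like the sketch below (launched through Python's subprocess for consistency with the other examples); the binary location, model name, paths, and port number are placeholders.

```python
# Minimal sketch: start OVMS with only the REST endpoint enabled.
# Add "--port <number>" to also expose gRPC. All values below are placeholders.
import subprocess

subprocess.run([
    "ovms",
    "--model_name", "my_llm",
    "--model_path", "/models/my_llm",
    "--rest_port", "8000",
])
```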
Other changes
- Updated scalability demonstration using multiple instances: Link
- Increased the allowed number of text generation stop words in a request from 4 to 16
- Enabled and tested OVMS integration with the Continue extension for Visual Studio Code. OpenVINO Model Server can be used as a backend for code completion and the built-in IDE chat assistant. Check out the instructions: Link
- Performance improvements: enhancements in OpenVINO Runtime and in the text sampling generation algorithm, which should increase throughput under high-concurrency load
Bug fixes
- Fixed handling of the LLM context length: OVMS now stops generating text when the model context is exceeded. An error is raised when the prompt is longer than the context or when `max_tokens` plus the input tokens exceed the model context.
- Security and stability improvements
- Fixed cancellation of text generation workloads: clients can stop the generation in non-streaming scenarios by simply closing the connection
Known issues and limitations
The `chat/completions` API accepts images encoded in base64 format but does not accept the URL format.
Qwen Vision models deployed on GPU might experience an execution error when the input image resolution is too high. It is recommended to edit the model's `preprocessor_config.json` and lower the `max_pixels` parameter. This ensures images are automatically resized to a smaller resolution, which avoids the failure on GPU and improves performance, though in some cases accuracy might be impacted.
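A minimal sketch of that workaround is shown below; the file path and the target value are examples only.

```python
# Sketch of lowering "max_pixels" in preprocessor_config.json so images are resized
# to a smaller resolution before inference. Path and value are examples only.
import json

config_path = "models/my_vlm/preprocessor_config.json"
with open(config_path) as f:
    preprocessor = json.load(f)

preprocessor["max_pixels"] = 1003520  # example target value; pick one fitting your GPU
with open(config_path, "w") as f:
    json.dump(preprocessor, f, indent=4)
```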
Note that by default, NPU limits the prompt length to 1024 tokens. You can modify that limit with the `--max_prompt_len` parameter in the model export script, or by manually modifying the `MAX_PROMPT_LEN` plugin config parameter in `graph.pbtxt`.
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2025.1` - CPU device support
`docker pull openvino/model_server:2025.1-gpu` - GPU, NPU and CPU device support
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2025.0
The 2025.0 release is a major release adding support for native Windows deployments and improvements to the generative use cases.
New feature - Windows native server deployment
- This release enables model server deployment on Windows operating systems as a binary application
- Full support for generative endpoints: text generation and embeddings based on the OpenAI API, and reranking based on the Cohere API
- Functional parity with the Linux version with several minor differences: cloud storage, C-API interface, and DAG pipelines - read more
- It is targeted at client machines with Windows 11 and data center environments with Windows Server 2022
- Demos are updated to work on both Linux and Windows. Check the installation guide
Other Changes and Improvements
- Added official support for Battlemage GPU, Arrow Lake CPU, iGPU and NPU, and Lunar Lake CPU, iGPU and NPU
- Updated base Docker images: added Ubuntu 24 and RedHat UBI 9, dropped Ubuntu 20 and RedHat UBI 8
- Extended the chat/completions API to support the `max_completion_tokens` parameter and message content passed as an array. These changes keep the API compatible with the OpenAI API (see the sketch after this list).
- Truncate option in the embeddings endpoint: it is now possible to export the embeddings model with an option to truncate the input automatically to match the embeddings context length. By default, an error is raised when an input that is too long is passed.
- Speculative decoding algorithm added to text generation: check the demo
- Added direct support for models without named outputs: when models don't have named outputs, generic names are assigned during model initialization with the pattern `out_<index>`
- Added a histogram metric for tracking MediaPipe graph processing duration
- Performance improvements
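As a sketch of the chat/completions additions mentioned above (max_completion_tokens and message content as an array of parts), a request might look like this; the host, port, and model name are placeholders.

```python
# Sketch of a chat/completions request using max_completion_tokens and message
# content passed as an array of parts. Host, port, and model name are placeholders.
import requests

payload = {
    "model": "my_llm",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain what a model server does"},
            {"type": "text", "text": "in a single sentence."},
        ],
    }],
    "max_completion_tokens": 100,
}
response = requests.post("http://localhost:8000/v3/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```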
Breaking changes
- Discontinued support for NVIDIA plugin
Bug fixes
- Corrected behavior of cancelling text generation for disconnected clients
- Fixed detection of the model context length for the embeddings endpoint
- Security and stability improvements
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2025.0` - CPU device support
`docker pull openvino/model_server:2025.0-gpu` - GPU, NPU and CPU device support
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2024.5
The 2024.5 release comes with support for embeddings and rerank endpoints, as well as an experimental Windows version.
Changes and improvements
- The OpenAI API text embedding endpoint has been added, enabling OVMS to be used as a building block for AI applications like RAG (see the sketch after this list).
- The rerank endpoint has been added based on the Cohere API, enabling easy similarity detection between a query and a set of documents. It is one of the building blocks for AI applications like RAG and makes integration with frameworks such as LangChain easy.
- The `echo` sampling parameter together with `logprobs` in the `completions` endpoint is now supported.
- Performance increase on both CPU and GPU for LLM text generation.
- LLM dynamic_split_fuse for the GPU target device boosts throughput in high-concurrency scenarios.
- The procedure for LLM service deployment and model repository preparation has been simplified.
- Improvements in LLM test coverage and stability.
- Instructions for building an experimental version of a Windows binary package (a native model server for Windows OS) are available. This release includes a set of limitations and has limited test coverage. It is intended for testing, while the production-ready release is expected with 2025.0. All feedback is welcome.
- OpenVINO Model Server C-API now supports asynchronous inference, improves performance with the ability to set outputs, and enables using OpenCL and VA surfaces on both inputs and outputs for the GPU target device.
- The KServe REST API model metadata endpoint can now provide additional model_info references.
- Included support for NPU and iGPU on MTL and LNL platforms
- Security and stability improvements
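A minimal sketch of calling the new embeddings and rerank endpoints is shown below; the host, port, and served model names are placeholders for your own deployment.

```python
# Sketch of the embeddings (OpenAI-style) and rerank (Cohere-style) endpoints.
# Host, port, and the served model names are placeholders.
import requests

base_url = "http://localhost:8000/v3"

embeddings = requests.post(f"{base_url}/embeddings", json={
    "model": "my_embeddings_model",
    "input": ["OpenVINO Model Server", "serves models over gRPC and REST"],
}).json()
print(len(embeddings["data"]), "embedding vectors returned")

rerank = requests.post(f"{base_url}/rerank", json={
    "model": "my_rerank_model",
    "query": "What serves OpenVINO models?",
    "documents": ["A note about cooking", "OpenVINO Model Server hosts models"],
}).json()
print(rerank["results"])
```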
Breaking changes
No breaking changes.
Bug fixes:
- Fixed support for URL-encoded model names in the KServe REST API
- OpenAI text generation endpoints now accept requests with both the v3 and v3/v1 path prefixes
- Fixed metrics reporting in the video stream benchmark client
- Fixed a sporadic INVALID_ARGUMENT error on the completions endpoint
- Fixed an incorrect LLM finish reason when stop was expected but length was returned
Discontinuation plans
In future releases, support for the following build options will not be maintained:
- Ubuntu 20 as the base image
- OpenVINO NVIDIA plugin
You can use the OpenVINO Model Server public Docker images based on Ubuntu 22.04 via the following commands:
`docker pull openvino/model_server:2024.5` - CPU device support
`docker pull openvino/model_server:2024.5-gpu` - GPU, NPU and CPU device support
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2024.4
The 2024.4 release brings official support for OpenAI API text generation. It is now recommended for production usage. It comes with a set of added features and improvements.
Changes and improvements
- Significant performance improvements for the multinomial sampling algorithm
- `finish_reason` in the response now correctly distinguishes between reaching max_tokens (length) and completing the sequence (stop)
- Added automatic cancellation of text generation for disconnected clients
- Included the prefix caching feature, which speeds up text generation by caching the prompt evaluation
- Option to compress the KV cache to lower precision; it reduces memory consumption with minimal impact on accuracy
- Added support for the `stop` sampling parameter. It can define a sequence which stops text generation (see the sketch after this list).
- Added support for the `logprobs` sampling parameter. It returns the probabilities of the generated tokens.
- Included generic metrics related to the execution of MediaPipe graphs. The `ovms_current_graphs` metric can be used for autoscaling based on the current load and level of concurrency. Counters like `ovms_requests_accepted` and `ovms_responses` can track the activity of the server.
- Included a demo of text generation horizontal scalability
- Configurable handling of non-UTF-8 responses from the model: the detokenizer can now automatically change them to the Unicode replacement character
- Included support for Llama 3.1 models
- Text generation is supported on both CPU and GPU; check the demo
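A minimal sketch of a request using the stop and logprobs parameters is shown below; the host, port, and model name are placeholders.

```python
# Sketch of a completions request using the stop and logprobs sampling parameters.
# Host, port, and model name are placeholders.
import requests

payload = {
    "model": "my_llm",
    "prompt": "List three Intel hardware targets:",
    "max_tokens": 64,
    "stop": ["\n\n"],  # generation ends when this sequence is produced
    "logprobs": 1,     # return log-probabilities for the generated tokens
}
response = requests.post("http://localhost:8000/v3/completions", json=payload)
choice = response.json()["choices"][0]
print(choice["text"], choice["finish_reason"])
```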
Breaking changes
No breaking changes.
Bug fixes
- Security and stability improvements
- Fixed handling of model templates without `bos_token`
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2024.4` - CPU device support with the image based on Ubuntu 22.04
`docker pull openvino/model_server:2024.4-gpu` - CPU, GPU and NPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2024.3
The 2024.3 release focuses mostly on improvements in the OpenAI API text generation implementation.
Changes and improvements
A set of improvements in OpenAI API text generation:
- Significantly better performance thanks to numerous improvements in OpenVINO Runtime and sampling algorithms
- Added config parameters `best_of_limit` and `max_tokens_limit` to avoid memory overconsumption caused by invalid requests. Read more
- Added reporting of LLM metrics in the server logs. Read more
- Added extra sampling parameters `diversity_penalty`, `length_penalty`, and `repetition_penalty`. Read more
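A request using the extra sampling parameters might look like the sketch below; the host, port, and model name are placeholders, and length_penalty and diversity_penalty are shown commented out since they typically apply to beam-search style sampling.

```python
# Sketch of a completions request with the extra sampling parameters listed above.
# Host, port, and model name are placeholders.
import requests

payload = {
    "model": "my_llm",
    "prompt": "Write a short note about inference servers.",
    "max_tokens": 60,
    "repetition_penalty": 1.1,
    # "length_penalty": 1.0,     # typically used with beam-search style sampling
    # "diversity_penalty": 0.5,  # typically used with grouped beam search
}
response = requests.post("http://localhost:8000/v3/completions", json=payload)
print(response.json()["choices"][0]["text"])
```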
Improvements in documentation and demos:
- Added RAG demo with OpenAI API
- Added K8S deployment demo for text generation scenarios
- Simplified model initialization for a set of demos with MediaPipe graphs using the pose_detection model. TFLite models don't require any conversions. Check the demo
Breaking changes
No breaking changes.
Bug fixes
- Resolved an issue with sporadic text generation hangs via the OpenAI API endpoints
- Fixed an issue with the chat streamer impacting incomplete UTF-8 sequences
- Corrected the format of the last streaming event in the `completions` endpoint
- Fixed an issue with requests hanging when running out of available cache
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2024.3` - CPU device support with the image based on Ubuntu 22.04
`docker pull openvino/model_server:2024.3-gpu` - GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2024.2
The major new functionality in 2024.2 is a preview of the OpenAI-compatible API for text generation, along with state-of-the-art techniques like continuous batching and paged attention for improving the efficiency of generative workloads.
Changes and improvements
- Updated OpenVINO Runtime backend to 2024.2
- OpenVINO Model Server can now be used for text generation use cases via an OpenAI-compatible API
- Added support for the continuous batching and PagedAttention algorithms, enabling fast and efficient text generation under high-concurrency load, especially on Intel Xeon processors. Learn more about it.
- Added an LLM text generation OpenAI API demo.
- Added a notebook showcasing the RAG algorithm with the online scope changes delegated to the model server. Link
- Enabled Python 3.12 for Python clients, samples and demos.
- Updated the RedHat UBI base image to 8.10
Breaking changes
No breaking changes.
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2024.2` - CPU device support with the image based on Ubuntu 22.04
`docker pull openvino/model_server:2024.2-gpu` - GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2024.1
The 2024.1 release has a few improvements in the serving functionality, demo enhancements, and bug fixes.
Changes and improvements
- Updated OpenVINO Runtime backend to 2024.1 Link
- Added support for OpenVINO models with string data type on output. Together with the features introduced in 2024.0, OVMS can now support models with input and output of string type. That way you can take advantage of the tokenization built into the model as the first layer. You can also rely on any post-processing embedded into the model which returns just text. Check the universal sentence encoder demo and the image classification with string output demo
- Updated MediaPipe Python calculators to support relative paths for all related configuration and Python code files. Now the complete graph configuration folder can be deployed in an arbitrary path without any code changes. It is demonstrated in the updated text generation demo.
- Extended support for the KServe REST API for MediaPipe graph endpoints. Now you can send the data in a KServe JSON body (see the sketch after this list). Check how it is used in the text generation use case.
- Added a demo showcasing the full RAG algorithm entirely delegated to the model server Link
- Added a RedHat UBI-based Dockerfile for Python demos, usage documented in the python demos
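A minimal sketch of such a request is shown below; the servable name, input name, shape, and port are placeholders, so adjust them to the graph you deploy.

```python
# Sketch of a KServe-style REST inference call with a JSON body sent to a MediaPipe
# graph endpoint. Servable name "my_graph", input name "in", and port are placeholders.
import requests

payload = {
    "inputs": [
        {"name": "in", "shape": [1], "datatype": "BYTES", "data": ["Hello, world"]}
    ]
}
response = requests.post("http://localhost:8000/v2/models/my_graph/infer", json=payload)
print(response.json())
```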
Breaking changes
No breaking changes.
Bug fixes
- Improvements in error handling for invalid requests and incorrect configuration
- Fixes in the demos and documentation
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2024.1` - CPU device support with the image based on Ubuntu 22.04
`docker pull openvino/model_server:2024.1-gpu` - GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2024.0
The 2024.0 release includes a new version of the OpenVINO™ backend and several improvements in the serving functionality.
Changes and improvements
- Updated OpenVINO™ Runtime backend to 2024.0. Link
- Extended text generation demo to support multi batch size both with streaming and unary clients. Link to demo
- Added support for REST client for servables based on MediaPipe graphs including python pipeline nodes. Link to demo
- Added additional MediaPipe calculators which can be reused for multiple image analysis scenarios. Link to new calculators
- Added support for models with a `string` input data type, including the tokenization extension. Link to demo
- Security-related updates in versions of included dependencies.
Deprecation notices
Batch Size AUTO and Shape AUTO are deprecated and will be removed.
Use the Dynamic Model Shape feature instead.
Breaking changes
No breaking changes.
Bug fixes
- Improvements in error handling for invalid requests and incorrect configuration
- Minor fixes in the demos and documentation
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2024.0` - CPU device support with the image based on Ubuntu 22.04
`docker pull openvino/model_server:2024.0-gpu` - GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2023.3
The 2023.3 is a major release with a new feature and numerous improvements.
Changes and improvements
- Included a set of new demos using custom nodes implemented as Python code. They include LLM text generation, stable diffusion, and seq2seq translation.
- Improvements in the demo highlighting video stream analysis. A simple client example can now process a video stream from a local camera, a video file, or an RTSP stream. The data can be sent to the model server via unary gRPC calls or gRPC streaming.
- Changes in the public release artifacts: the base image of the public model server images is now updated to Ubuntu 22.04 and RHEL 8.8. Public Docker images include support for Python custom nodes but without custom Python dependencies. The public binary distribution of the model server is also targeted at Ubuntu 22.04 and RHEL 8.8 but without Python support (it can be deployed on bare metal hosts without Python installed). Check the building from source guide.
- Improvements in the documentation https://docs.openvino.ai/2023.3/ovms_what_is_openvino_model_server.html
New Features (Preview)
- Added support for serving MediaPipe graphs with custom nodes implemented as Python code. It greatly simplifies exposing GenAI algorithms based on the Hugging Face and Optimum libraries. It can also be applied for arbitrary pre- and post-processing in AI solutions. Learn more about it
Stable Feature
gRPC streaming support is out of preview and considered stable.
Breaking changes
No breaking changes.
Deprecation notices
Batch Size AUTO and Shape AUTO are deprecated and will be removed.
Use the Dynamic Model Shape feature instead.
Bug fixes
- OVMS now handles boolean parameters in the plugin config #2197
- Fixed sporadic failures in the IrisTracking demo using gRPC stream #2161
- Fixed handling of incorrect MediaPipe graphs producing multiple outputs with the same name #2161
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2023.3` - CPU device support with the image based on Ubuntu 22.04
`docker pull openvino/model_server:2023.3-gpu` - GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2023.2
The 2023.2 is a major release with several new features and improvements.
Changes
- Updated OpenVINO backend to version 2023.2.
- The MediaPipe framework has been updated to the latest version, 0.10.3.
- The Model API used in the OpenVINO Inference MediaPipe Calculator has been updated and included with all its features.
New Features
- Introduced an extension of the KServe gRPC API with a stream on input and output. The extension is enabled for servables with MediaPipe graphs. The MediaPipe graph is persistent in the scope of the user session, which improves processing performance and supports stateful graphs, for example tracking algorithms. It also enables the use of source calculators. Check more details.
- Added a demo showcasing gRPC streaming with a MediaPipe graph. Check more details.
- Added parameters for gRPC quota configuration and changed the default gRPC channel arguments to add rate limits. This minimizes the risk of the service being impacted by an uncontrolled flow of requests. Check more details.
- Updated Python client requirements to match a wide range of Python versions, from 3.7 to 3.11
Breaking changes
No breaking changes.
Bug fixes
- Handled the situation when a MediaPipe graph is added with the same name as a previously loaded DAG.
- Fixed the HTTP status code returned when a MediaPipe graph/DAG is not loaded yet (previously 404, now 503).
- Corrected the "Unsupported method" error message returned via HTTP when using a method other than GET for the metadata endpoint.
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2023.2` - CPU device support with the image based on Ubuntu 20.04
`docker pull openvino/model_server:2023.2-gpu` - GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.