
Could not find an implementation for QuantizeLinear(23) #25932

@ultranationalism

Description


Describe the issue

When exporting a model that contains an Attention block, I deliberately target ONNX opset 23 so that the single Attention operator (introduced in opset 23) is kept intact instead of being decomposed into many small primitives. The exported FP32 model runs correctly with the current ONNX Runtime CPU build.
Afterwards I apply static INT8 quantization via onnxruntime.quantization.quantize_static. The resulting graph contains QuantizeLinear nodes at opset 23, and at runtime the session fails to initialize with:

[ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for QuantizeLinear(23)

The same workflow using opset 21 works without error.
Inspection of the ORT source tree shows that the kernels

ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 23, uint8_t, QuantizeLinear)
ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 23, int8_t, QuantizeLinear)

are indeed registered for opset 23, yet the binary that is loaded at runtime appears to lack them.
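
A quick way to confirm which opset the quantized graph actually imports (a minimal diagnostic sketch using the onnx package; the path matches the repro below):

import onnx

m = onnx.load('tmp_model/contentvec_int8.onnx')

# The default-domain opset import is what ORT uses to resolve kernel versions.
print([(imp.domain, imp.version) for imp in m.opset_import])

# Enumerate the QuantizeLinear nodes the quantizer inserted.
for node in m.graph.node:
    if node.op_type == 'QuantizeLinear':
        print(node.name, list(node.input), '->', list(node.output))

In my case this prints an empty default domain at version 23, matching the kernel version in the error message.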

To reproduce

import os, torch, numpy as np, onnxruntime as ort
from torch import nn
from onnxruntime.quantization import quantize_static, QuantFormat, QuantType, CalibrationMethod, CalibrationDataReader

# Minimal stand-in model: a single Conv2d wrapped so it accepts a (1, 128) input.
class TinyConv(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, 3, padding=1)

    def forward(self, x):
        return self.conv(x.unsqueeze(1).unsqueeze(2)).squeeze(2)

os.makedirs('tmp_model', exist_ok=True)
model_path = 'tmp_model/contentvec.onnx'
int8_path  = 'tmp_model/contentvec_int8.onnx'

model = TinyConv().eval()
dummy = torch.randn(1, 128)
# Export with the dynamo-based exporter, deliberately targeting opset 23.
torch.onnx.export(model, dummy, model_path,
                  opset_version=23, dynamo=True,
                  input_names=['input_values'], output_names=['hidden_states'])

# Calibration reader feeding a few random batches, subclassing the documented
# CalibrationDataReader interface.
class DummyReader(CalibrationDataReader):
    def __init__(self):
        self.data = [np.random.randn(1, 128).astype(np.float32) for _ in range(4)]
        self.idx = 0

    def get_next(self):
        if self.idx >= len(self.data):
            return None
        out = {'input_values': self.data[self.idx]}
        self.idx += 1
        return out

# Static INT8 quantization; the inserted QuantizeLinear/DequantizeLinear nodes
# inherit the model's opset 23.
quantize_static(model_path, int8_path, DummyReader(),
                quant_format=QuantFormat.QOperator,
                activation_type=QuantType.QUInt8,
                weight_type=QuantType.QInt8,
                calibrate_method=CalibrationMethod.MinMax)

# Fails here with NOT_IMPLEMENTED: no CPU kernel for QuantizeLinear(23) in this build.
sess = ort.InferenceSession(int8_path, providers=['CPUExecutionProvider'])
print(sess.run(None, {'input_values': np.random.randn(1, 128).astype(np.float32)})[0].shape)
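
A possible stop-gap, sketched here without having verified it, and only applicable when the graph contains no operators that genuinely need opset 22/23 semantics (so it does not help the Attention case that motivated targeting opset 23 in the first place): since QuantizeLinear's behavior for uint8/int8 inputs did not change between opsets 21 and 23, retargeting the default-domain opset import to 21 may let ORT resolve the existing opset-21 kernel.

import onnx

m = onnx.load(int8_path)
for imp in m.opset_import:
    # Retarget the default ONNX domain from 23 down to 21.
    # Assumption: no node in this graph relies on opset-22/23-only behavior.
    if imp.domain in ('', 'ai.onnx'):
        imp.version = 21
onnx.save(m, int8_path.replace('.onnx', '_ops21.onnx'))

For the repro model above this should be safe, since QuantizeLinear, QLinearConv, and DequantizeLinear all have opset-21 CPU kernels, but I have not tested it beyond that.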

Urgency

[HIGH PRIORITY] Missing CPU Kernel for QuantizeLinear at ONNX Opset 23 After Static Quantization

Platform

Windows

OS Version

26100.4946

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.23.0.dev20250902003

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response
