Description
Describe the issue
When exporting a model that contains an Attention block, I deliberately target ONNX opset 23 so that the single Attention operator (introduced in opset 23) is kept intact instead of being decomposed into many smaller primitives. The exported FP32 model runs correctly with the current ONNX Runtime CPU build.
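For reference, both properties of the export can be checked with the onnx package; a minimal sketch, assuming the FP32 export path from the repro script below:

import onnx

m = onnx.load('tmp_model/contentvec.onnx')  # FP32 export from the repro below
# The default-domain opset import should report 23.
print([(imp.domain, imp.version) for imp in m.opset_import])
# For the real model, the fused Attention op should survive as a single node.
print([n.op_type for n in m.graph.node if n.op_type == 'Attention'])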
Afterwards I apply static INT8 quantization via onnxruntime.quantization.quantize_static. The resulting graph contains QuantizeLinear nodes at opset 23, and the session then fails to initialize with:

[ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for QuantizeLinear(23)
The same workflow using opset 21 works without error.
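The claim about the quantized graph can be verified the same way; a minimal sketch, assuming the int8 output path from the repro script below:

import onnx

q = onnx.load('tmp_model/contentvec_int8.onnx')  # quantized model from the repro below
# The default-domain import stays at 23 after quantize_static.
print([(imp.domain, imp.version) for imp in q.opset_import])
# QuantizeLinear/DequantizeLinear nodes inserted by quantization.
print(sorted({n.op_type for n in q.graph.node if 'QuantizeLinear' in n.op_type}))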
Inspection of the ORT source tree shows that the kernels
ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 23, uint8_t, QuantizeLinear)
ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 23, int8_t, QuantizeLinear)
are indeed registered for opset 23, yet the binary that is loaded at runtime appears to lack them.
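To separate the kernel lookup from the quantization tooling, the failure can be isolated with a hand-built single-node model; a minimal sketch using the onnx helper API (the IR version written by make_model depends on the installed onnx package):

import numpy as np
from onnx import TensorProto, helper
import onnxruntime as ort

for opset in (21, 23):
    # One QuantizeLinear node: float32 input, scalar scale / uint8 zero point.
    node = helper.make_node('QuantizeLinear', ['x', 'scale', 'zp'], ['y'])
    graph = helper.make_graph(
        [node], 'qtest',
        [helper.make_tensor_value_info('x', TensorProto.FLOAT, [4])],
        [helper.make_tensor_value_info('y', TensorProto.UINT8, [4])],
        initializer=[
            helper.make_tensor('scale', TensorProto.FLOAT, [], [0.1]),
            helper.make_tensor('zp', TensorProto.UINT8, [], [128]),
        ],
    )
    model = helper.make_model(graph, opset_imports=[helper.make_opsetid('', opset)])
    try:
        sess = ort.InferenceSession(model.SerializeToString(),
                                    providers=['CPUExecutionProvider'])
        sess.run(None, {'x': np.arange(4, dtype=np.float32)})
        print(f'opset {opset}: OK')
    except Exception as e:
        # On the affected build, opset 23 is expected to raise NOT_IMPLEMENTED here.
        print(f'opset {opset}: {type(e).__name__}')

On the affected build this should print OK for opset 21 and fail for opset 23, reproducing the gap independently of quantize_static.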
To reproduce
import os

import numpy as np
import onnxruntime as ort
import torch
from torch import nn
from onnxruntime.quantization import (CalibrationMethod, QuantFormat,
                                      QuantType, quantize_static)


class TinyConv(nn.Module):
    # Minimal stand-in for the real model.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, 3, padding=1)

    def forward(self, x):
        return self.conv(x.unsqueeze(1).unsqueeze(2)).squeeze(2)


os.makedirs('tmp_model', exist_ok=True)
model_path = 'tmp_model/contentvec.onnx'
int8_path = 'tmp_model/contentvec_int8.onnx'

# Export at opset 23 with the dynamo-based exporter.
model = TinyConv().eval()
dummy = torch.randn(1, 128)
torch.onnx.export(model, dummy, model_path,
                  opset_version=23, dynamo=True,
                  input_names=['input_values'], output_names=['hidden_states'])


class DummyReader:
    # Feeds a few random batches for MinMax calibration.
    def __init__(self):
        self.data = [np.random.randn(1, 128).astype(np.float32) for _ in range(4)]
        self.idx = 0

    def get_next(self):
        if self.idx >= len(self.data):
            return None
        out = {'input_values': self.data[self.idx]}
        self.idx += 1
        return out


quantize_static(model_path, int8_path, DummyReader(),
                quant_format=QuantFormat.QOperator,
                activation_type=QuantType.QUInt8,
                weight_type=QuantType.QInt8,
                calibrate_method=CalibrationMethod.MinMax)

# Session creation fails here with NOT_IMPLEMENTED for QuantizeLinear(23).
sess = ort.InferenceSession(int8_path, providers=['CPUExecutionProvider'])
print(sess.run(None, {'input_values': np.random.randn(1, 128).astype(np.float32)})[0].shape)
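As an interim workaround (an untested sketch, not a fix), the default-domain opset import of the quantized model could be rewritten from 23 down to 21. This should be legal for uint8/int8 QuantizeLinear, since those types are already covered by the opset-21 definition, but not if the graph still uses operators introduced after opset 21, such as the fused Attention op itself:

import onnx

m = onnx.load('tmp_model/contentvec_int8.onnx')
for imp in m.opset_import:
    # Rewrite only the default ONNX domain; leave com.microsoft etc. alone.
    if imp.domain in ('', 'ai.onnx'):
        imp.version = 21
onnx.checker.check_model(m)  # should flag any node that genuinely requires opset 23
onnx.save(m, 'tmp_model/contentvec_int8_opset21.onnx')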
Urgency
[HIGH PRIORITY] The missing CPU kernel for QuantizeLinear at ONNX opset 23 blocks running any statically quantized opset-23 model.
Platform
Windows
OS Version
26100.4946
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.23.0.dev20250902003
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response