Commit c8b9aca

ssjia committed
Update on "[ET-VK][ez][qconv] Add auto-selection to prefer im2col for q8ta_conv2d"
The q8ta_conv2d operator previously always delegated to the general (sliding window) implementation, even though the im2col implementation is 2-5x faster for non-grouped convolutions with in_channels % 4 == 0. This change adds runtime auto-selection logic that checks the groups parameter and input channel alignment, then dispatches to q8ta_conv2d_im2col when its constraints are met.

On ResNet50 int8, this reduces Vulkan inference latency from 14.2ms to 6.8ms (2.1x speedup) on Samsung Galaxy S24, making it 30% faster than XNNPACK (9.7ms).

Also adds performance test cases for deep-channel small-spatial scenarios (512ch 7x7, 1024→2048ch 1x1 stride-2) that stress-test the optimization.

Differential Revision: [D93768637](https://our.internmc.facebook.com/intern/diff/D93768637/)

[ghstack-poisoned]
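The selection rule described in the commit message can be sketched as below. This is only an illustration of the stated constraints (non-grouped, `in_channels % 4 == 0`). The `ConvParams` struct, the function names, and the `"q8ta_conv2d_general"` label are hypothetical and are not taken from the ExecuTorch sources.

```cpp
#include <cstdint>
#include <string>

// Hypothetical parameter bundle used only for this sketch.
struct ConvParams {
  int64_t in_channels;
  int64_t groups;
};

// True when the im2col constraints named in the commit message hold:
// a non-grouped convolution whose input channel count is a multiple of 4.
bool can_use_im2col(const ConvParams& p) {
  return p.groups == 1 && p.in_channels % 4 == 0;
}

// Returns the name of the implementation to dispatch to.
// "q8ta_conv2d_general" is an assumed label for the sliding window path.
std::string select_q8ta_conv2d_impl(const ConvParams& p) {
  return can_use_im2col(p) ? "q8ta_conv2d_im2col" : "q8ta_conv2d_general";
}
```

Under these assumptions, a 512-channel non-grouped convolution would take the im2col path, while a depthwise convolution or one with 3 input channels would fall back to the general path.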
2 parents bf70d8a + 99a8402 commit c8b9aca

3 files changed: 6 additions and 14 deletions

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -384,7 +384,7 @@ for basics.
 - `Release notes: quantization`: changes to quantization.
 - `Release notes: ops & kernels`: changes to the opset and any new / changed kernel implementations.
 - `Release notes: api`: changes to public facing apis (any interfaces, pybinded runtime methods, etc.).
-- `Release notes: backends`: changes to any of the backend delegates.
+- `Release notes: <backend>`: changes to any of the backend delegates (e.g: `Release notes: apple`, `Release notes: arm`, etc).
 - `Release notes: build`: changes related to the build system, including major dependency upgrades, notable build flags, optimizations, etc.
 - `Release notes: devtools`: changes to any of ExecuTorch's developer tools, for example the debugger & profiler.
 - `Release notes: examples`: changes to any code under `examples/`.

backends/vulkan/runtime/graph/ops/glsl/q8ta_linear_gemv.glsl

Lines changed: 5 additions & 11 deletions
@@ -106,20 +106,14 @@ void main() {
   memoryBarrierShared();
   barrier();
 
-  // Tree reduction to combine partial results
-  for (int i = WGS / 2; i > 0; i /= 2) {
-    if (lid < i) {
+  // Only the first thread writes the result
+  if (lid == 0) {
+    for (int i = 1; i < WGS; ++i) {
       [[unroll]] for (int tile_n4 = 0; tile_n4 < TILE_N4; ++tile_n4) {
-        partial_accums[lid].data[0][tile_n4] +=
-            partial_accums[lid + i].data[0][tile_n4];
+        partial_accums[0].data[0][tile_n4] +=
+            partial_accums[i].data[0][tile_n4];
       }
     }
-    memoryBarrierShared();
-    barrier();
-  }
-
-  // Only the first thread writes the result
-  if (lid == 0) {
     out_accum = partial_accums[0];
 
     FPPerOutChannelParams weight_scales_tile;
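
The hunk above swaps a barrier-synchronized tree reduction for a serial loop run by thread 0. A minimal CPU-side sketch of the two reduction orders follows. It assumes scalar `int32_t` partial sums (the shader's actual accumulator type is not shown in this diff), a non-empty buffer, and a power-of-two element count for the tree variant, since the halving loop only covers every slot in that case. The function names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Tree reduction: each step folds the upper half of the active range into
// the lower half. The inner loop stands in for the parallel threads (lid);
// on the GPU a barrier separates consecutive steps.
int32_t tree_reduce(std::vector<int32_t> partial) {
  for (std::size_t i = partial.size() / 2; i > 0; i /= 2) {
    for (std::size_t lid = 0; lid < i; ++lid) {
      partial[lid] += partial[lid + i];
    }
  }
  return partial[0];
}

// Serial reduction: a single thread accumulates every other slot into
// slot 0, so no barrier is needed inside the loop.
int32_t serial_reduce(std::vector<int32_t> partial) {
  for (std::size_t i = 1; i < partial.size(); ++i) {
    partial[0] += partial[i];
  }
  return partial[0];
}
```

With integer accumulators both orders produce the same sum as long as nothing overflows. With floating-point accumulators the results could differ slightly because the additions are grouped differently.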

third-party/gflags.bzl

Lines changed: 0 additions & 2 deletions
@@ -12,7 +12,6 @@ def define_gflags():
         srcs = srcs,
         headers = headers,
         exported_headers = exported_headers,
-        enable_static_variant = True,
         threads = True,
     )
 
@@ -21,6 +20,5 @@ def define_gflags():
         srcs = srcs,
         headers = headers,
         exported_headers = exported_headers,
-        enable_static_variant = True,
         threads = False,
     )
