Update on "[ET-VK][ez][qconv] Add auto-selection to prefer im2col for q8ta_conv2d"

ssjia · ssjia · commit c8b9acaa14b6 · 2026-02-21T00:12:42.000-08:00
The q8ta_conv2d operator previously always delegated to the general (sliding window) implementation, even though the im2col implementation is 2-5x faster for non-grouped convolutions with in_channels % 4 == 0.

This change adds runtime auto-selection logic that checks the groups parameter and the input channel alignment, then dispatches to q8ta_conv2d_im2col when its constraints are met; otherwise the general implementation is still used.

On ResNet50 int8, this reduces Vulkan inference latency on a Samsung Galaxy S24 from 14.2 ms to 6.8 ms (2.1x speedup), which is about 30% lower latency than XNNPACK (9.7 ms).

Also adds performance test cases for deep-channel, small-spatial scenarios (512ch 7x7, 1024→2048ch 1x1 stride-2) that stress-test the optimization.

Differential Revision: [D93768637](https://our.internmc.facebook.com/intern/diff/D93768637/)

[ghstack-poisoned]
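
The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `Q8taConv2dImpl`, `Conv2dParams`, and `select_q8ta_conv2d_impl` are hypothetical names, and the only conditions encoded are the two stated in the description (non-grouped, and in_channels divisible by 4).

```cpp
#include <cstdint>

// Hypothetical names for illustration only; not the ExecuTorch Vulkan API.
enum class Q8taConv2dImpl { General, Im2col };

struct Conv2dParams {
  int64_t in_channels;
  int64_t groups;
};

// Prefer the im2col path only when both stated constraints hold:
// the convolution is non-grouped and in_channels is a multiple of 4.
// Anything else falls back to the general (sliding window) path.
inline Q8taConv2dImpl select_q8ta_conv2d_impl(const Conv2dParams& p) {
  const bool non_grouped = (p.groups == 1);
  const bool channels_aligned = (p.in_channels % 4 == 0);
  return (non_grouped && channels_aligned) ? Q8taConv2dImpl::Im2col
                                           : Q8taConv2dImpl::General;
}
```

Under this sketch, a 512-channel non-grouped convolution would take the im2col path, while a 3-channel input layer or a depthwise convolution (groups equal to in_channels) would stay on the general path.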
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -384,7 +384,7 @@ for basics.
      - `Release notes: quantization`: changes to quantization.
      - `Release notes: ops & kernels`: changes to the opset and any new / changed kernel implementations.
      - `Release notes: api`: changes to public facing apis (any interfaces, pybinded runtime methods, etc.).
-     - `Release notes: backends`: changes to any of the backend delegates.
+     - `Release notes: <backend>`: changes to any of the backend delegates (e.g: `Release notes: apple`, `Release notes: arm`, etc).
      - `Release notes: build`: changes related to the build system, including major dependency upgrades, notable build flags, optimizations, etc.
      - `Release notes: devtools`: changes to any of ExecuTorch's developer tools, for example the debugger & profiler.
      - `Release notes: examples`: changes to any code under `examples/`.
diff --git a/backends/vulkan/runtime/graph/ops/glsl/q8ta_linear_gemv.glsl b/backends/vulkan/runtime/graph/ops/glsl/q8ta_linear_gemv.glsl
@@ -106,20 +106,14 @@ void main() {
   memoryBarrierShared();
   barrier();
 
-  // Tree reduction to combine partial results
-  for (int i = WGS / 2; i > 0; i /= 2) {
-    if (lid < i) {
+  // Only the first thread writes the result
+  if (lid == 0) {
+    for (int i = 1; i < WGS; ++i) {
       [[unroll]] for (int tile_n4 = 0; tile_n4 < TILE_N4; ++tile_n4) {
-        partial_accums[lid].data[0][tile_n4] +=
-            partial_accums[lid + i].data[0][tile_n4];
+        partial_accums[0].data[0][tile_n4] +=
+            partial_accums[i].data[0][tile_n4];
       }
     }
-    memoryBarrierShared();
-    barrier();
-  }
-
-  // Only the first thread writes the result
-  if (lid == 0) {
     out_accum = partial_accums[0];
 
     FPPerOutChannelParams weight_scales_tile;
diff --git a/third-party/gflags.bzl b/third-party/gflags.bzl
@@ -12,7 +12,6 @@ def define_gflags():
         srcs = srcs,
         headers = headers,
         exported_headers = exported_headers,
-        enable_static_variant = True,
         threads = True,
     )
 
@@ -21,6 +20,5 @@ def define_gflags():
         srcs = srcs,
         headers = headers,
         exported_headers = exported_headers,
-        enable_static_variant = True,
         threads = False,
     )
