
Conversation

@Jiawei-Shao (Contributor) commented Oct 31, 2025

Description

This patch implements the Split-K optimization on Conv|MatMul. With Split-K, when K is large, we re-arrange the computation into multiple workgroups to increase parallelism on the platforms where Split-K is confirmed to be useful.

  1. Support Split-K in MakeMatMulPackedVec4Source() to split a workgroup with a large K into several smaller ones (see the sketch after this list). In this patch we only support Split-K with batch_size == 1 and vec4 on Conv|MatMul.
  2. Support Split-K in MatMulWriteFnSource() (accumulate the partial results into the output with atomic built-in functions).
  3. Implement SplitKConfig to decide whether Split-K should be used, together with all the related thresholds.
  4. Implement MatMulFillBiasBeforeSplitKProgram to initialize the output with bias or 0 when Split-K is used.
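
As a rough illustration of item 1 above (this is not the exact generated shader; the uniform names and load helpers below are hypothetical placeholders), each workgroup along the dispatch z dimension reduces only its own slice of K and later merges its partial sum into the shared output:

```wgsl
// Hypothetical sketch of the Split-K inner loop; k_split_size, dim_inner,
// load_a_vec4, and load_b_vec4 are illustrative names, not from this PR.
let k_begin = workgroup_id.z * uniforms.k_split_size;
let k_end = min(k_begin + uniforms.k_split_size, uniforms.dim_inner);

var acc = vec4<f32>(0.0);
for (var k = k_begin; k < k_end; k = k + 4u) {
  // Accumulate the partial result for this K-slice only.
  acc = acc + load_a_vec4(row, k) * load_b_vec4(k, col);
}
// Several workgroups now contribute to the same output element, so the
// partial result must be combined atomically (see item 2).
```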

Motivation and Context

In the current implementation, when K (dim_inner) is large, each invocation performs the whole reduction sequentially in one very large loop, which may not make full use of all the EUs on a GPU.

With Split-K we can split such a large amount of computation (K) across multiple workgroups that each do less work (kSplitK, smaller than K), which can greatly improve parallelism.

With this patch we get about a 15% performance improvement on efficientnet-lite-f16-demo and a 9% improvement on mobilenetv2-12-f16-demo on Lunar Lake and Meteor Lake.

@Jiawei-Shao (Contributor, Author) commented

@jchen10 @xhcao
This is my first PR to support Split-K in Conv|MatMul. PTAL, thanks!

@jchen10 (Contributor) commented Oct 31, 2025

LGTM, thanks!

@Jiawei-Shao Jiawei-Shao marked this pull request as ready for review November 3, 2025 05:19
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Nov 4, 2025
@Jiawei-Shao Jiawei-Shao changed the title Implement Split-K on Conv|MatMul [webgpu] Implement Split-K on Conv|MatMul Nov 5, 2025
@Jiawei-Shao Jiawei-Shao changed the title [webgpu] Implement Split-K on Conv|MatMul [WebGPU] Implement Split-K on Conv|MatMul Nov 5, 2025
@fs-eire (Contributor) commented Nov 5, 2025

I have a general question regarding the "reduce" operation in WebGPU. In my understanding, there are generally 2 ways to implement a reduce operation:

  • use an atomic primitive, specifically atomicCompareExchangeWeak in this case, just like this PR
  • use tree reduction (implemented in reduction_ops.cc) (is it possible to apply this algorithm to Split-K?)

In general, what are the pros and cons of the 2 different approaches?

@Jiawei-Shao (Contributor, Author) commented

> I have a general question regarding the "reduce" operation in WebGPU. In my understanding, there are generally 2 ways to implement a reduce operation:
>
>   • use an atomic primitive, specifically atomicCompareExchangeWeak in this case, just like this PR
>   • use tree reduction (implemented in reduction_ops.cc) (is it possible to apply this algorithm to Split-K?)
>
> In general, what are the pros and cons of the 2 different approaches?

I think tree reduction works best when we only need to increase parallelism by increasing the number of invocations within one workgroup, while Split-K is used when we want not only more invocations but also more workgroups and more Shared Local Memory in flight. On modern Intel GPUs, dispatching more workgroups means utilizing more Xe cores (each Xe core has its own vector engines and Shared Local Memory), which is why we see performance improvements with Split-K on matrix multiplications with a large K.
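
For reference, the atomic path can be sketched as below. This is a minimal illustration, assuming the output is bound as an array<atomic<u32>> holding f32 bits; the shader actually generated in gemm_utils.cc (which works on i32-cast values) differs in its details:

```wgsl
// Minimal sketch, not the generated shader: merge a partial sum into
// output[i], where output is assumed to be declared as
// var<storage, read_write> output: array<atomic<u32>>;
fn atomic_add_f32(i: u32, value: f32) {
  var old_bits = atomicLoad(&output[i]);
  loop {
    let new_bits = bitcast<u32>(bitcast<f32>(old_bits) + value);
    let result = atomicCompareExchangeWeak(&output[i], old_bits, new_bits);
    if (result.exchanged) {
      break;  // our partial sum has been merged
    }
    old_bits = result.old_value;  // another workgroup wrote first; retry
  }
}
```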

@Jiawei-Shao (Contributor, Author) commented

@fs-eire @guschmue @qjia7 PTAL, thanks!

- Renamed `MatMulFillBiasBeforeSplitKProgram` to `MatMulFillBiasOrZeroBeforeSplitKProgram`
- Fill one vec4 value (0 or bias) per invocation in `MatMulFillBiasOrZeroBeforeSplitKProgram` (a sketch follows below)
- Renamed `CreateMatMulProgram()` to `ComputeMatMul()` and run both `MatMulProgram` and
  `MatMulFillBiasOrZeroBeforeSplitKProgram` in `ComputeMatMul()`
- Removed `ShaderUsage::UseAtomicU32ForSplitK` and use `ProgramOutput::Atomic` instead
- Removed `data_type` in the `CacheHint` of `MatMulFillBiasOrZeroBeforeSplitKProgram`
- Updated the value of `config.split_dim_inner_` to 512 after more experiments
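
For context, a minimal sketch of what the fill program might look like (the binding layout and workgroup size here are illustrative assumptions, not taken from this PR):

```wgsl
// Hypothetical sketch of MatMulFillBiasOrZeroBeforeSplitKProgram:
// every invocation initializes exactly one vec4 of the output.
@group(0) @binding(0) var<storage, read> bias: array<vec4<f32>>;
@group(0) @binding(1) var<storage, read_write> output: array<vec4<f32>>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
  let global_idx = global_id.x;
  if (global_idx >= arrayLength(&output)) {
    return;
  }
  // Bias is per output column, so it repeats for every row; when there is
  // no bias, the program stores vec4<f32>(0.0) instead.
  output[global_idx] = bias[global_idx % arrayLength(&bias)];
}
```
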
@Jiawei-Shao Jiawei-Shao requested a review from qjia7 November 12, 2025 07:37
- Move the query of `SplitKConfig` into `ComputeMatMul()`. It's safe
  because in `MatMul::ComputeInternal()` `is_channels_last` is always
  false, while currently `Split-K` only supports `is_channels_last`
  being true.
- Add a comment about avoiding the use of `global_id` or `global_idx`
- Directly pass the temporary output shape in FillBiasOrZeroProgram
- Merge multiple `if(needs_split_k)` into one in `ComputeMatMul()`
- Use `global_idx` instead of `global_id.x`
@Jiawei-Shao Jiawei-Shao requested a review from qjia7 November 13, 2025 07:21
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements the Split-K optimization for Conv and MatMul operations in the WebGPU execution provider. Split-K improves performance when the inner dimension (K) is large by splitting computation across multiple workgroups to increase parallelism, particularly beneficial on Intel GPUs (Lunar Lake and Meteor Lake architectures).

Key Changes:

  • Adds Split-K support in MakeMatMulPackedVec4Source() to divide large K dimensions into smaller chunks across multiple workgroups
  • Implements atomic operations in MatMulWriteFnSource() to accumulate partial results from Split-K workgroups
  • Introduces SplitKConfig class with hardware-specific thresholds and heuristics for determining when to use Split-K
  • Adds MatMulFillBiasOrZeroBeforeSplitKProgram to initialize output buffers before Split-K computation

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/providers/cpu/nn/conv_op_test.cc | Adds test cases for Conv2D with MatMul-like shapes to validate the Split-K implementation with and without bias |
| onnxruntime/test/providers/cpu/nn/conv_fp16_test.cc | Adds FP16 test coverage for Split-K Conv2D operations with and without bias |
| onnxruntime/core/providers/webgpu/shader_helper.cc | Extends atomic type support to Float16x4 and Float32x4 for Split-K output buffers |
| onnxruntime/core/providers/webgpu/nn/conv.cc | Updates the Conv implementation to use the new ComputeMatMul function that handles Split-K |
| onnxruntime/core/providers/webgpu/math/matmul_packed.h | Defines MatMulProgram with a Split-K parameter and the new MatMulFillBiasOrZeroBeforeSplitKProgram class |
| onnxruntime/core/providers/webgpu/math/matmul_packed.cc | Implements shader generation for Split-K MatMul and the bias/zero initialization program |
| onnxruntime/core/providers/webgpu/math/matmul.h | Introduces the SplitKConfig class and converts CreateMatMulProgram into the ComputeMatMul function |
| onnxruntime/core/providers/webgpu/math/matmul.cc | Implements Split-K configuration logic, dispatch size calculations, and bias initialization for Split-K |
| onnxruntime/core/providers/webgpu/math/gemm_utils.h | Extends the MatMulWriteFnSource signature to support Split-K and output variable types |
| onnxruntime/core/providers/webgpu/math/gemm_utils.cc | Implements atomic accumulation logic for Split-K using compare-exchange operations on i32-cast float values |


@Jiawei-Shao Jiawei-Shao requested a review from Copilot November 14, 2025 03:10
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (2)

onnxruntime/core/providers/webgpu/math/matmul_packed.cc:1

  • Corrected spelling of 'MatMulFillBiasBeforeSplitKProgram' to 'MatMulFillBiasOrZeroBeforeSplitKProgram'.

onnxruntime/core/providers/webgpu/math/gemm_utils.cc:1

  • [nitpick] The comment states that is_channels_last is not used for Split-K, but this may be misleading since the enforcement at line 209 only checks for has_bias. Consider clarifying whether is_channels_last could be supported in the future or if there are other constraints.


@Jiawei-Shao Jiawei-Shao requested a review from Copilot November 14, 2025 03:17
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.



@Jiawei-Shao Jiawei-Shao requested a review from Copilot November 14, 2025 03:20
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.



- Cache `SplitKConfig` to `WebGPUContext` and only initialize it once
- Early return false when `enable_split_k_` is false
- Use `WORKGROUP_SIZE` (in Program.h) instead of declaring another one
@Jiawei-Shao Jiawei-Shao requested review from Copilot and qjia7 November 14, 2025 06:55
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.



@Jiawei-Shao Jiawei-Shao requested a review from Copilot November 14, 2025 07:19
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.



@Jiawei-Shao Jiawei-Shao requested a review from Copilot November 14, 2025 07:30
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.



@qjia7 (Contributor) left a comment

No more comments for me.
Please refactor the related shaders to eliminate any reliance on global_id.xxx or workgroup_id.xxx in your follow-up PRs. This dependency presents a potential risk and could lead to unforeseen issues.

@qjia7 qjia7 requested review from fs-eire and guschmue November 14, 2025 09:07
@fs-eire (Contributor) commented Nov 14, 2025

May need a merge or rebase onto the latest main branch to include the fix for the CI pipeline.
