[WebGPU] Implement Split-K on Conv|MatMul #26461
Conversation
This patch implements the `Split-K` optimization on `Conv|MatMul`. With `Split-K`, the computation can be re-arranged into multiple workgroups when `K` is large, which increases parallelism.
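The re-arrangement can be pictured as a two-pass flow: first the output is initialized with `bias` (or zero), then each K-slice adds its partial sum on top. Below is a sequential C++ emulation of that flow under stated assumptions — the function name, shapes, and structure are illustrative only, not the PR's actual WGSL kernels:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Emulates the Split-K two-pass flow for out = A*b + bias (A is M x K, b is
// a K-vector, i.e. N == 1 for simplicity). Pass 1 fills the output with bias
// (or 0); pass 2 runs one "workgroup" per K-slice, each accumulating its
// partial dot product into the shared output. Names/shapes are illustrative.
std::vector<float> split_k_matmul(const std::vector<std::vector<float>>& a,
                                  const std::vector<float>& b,
                                  const std::vector<float>& bias,
                                  std::size_t k_split) {
  const std::size_t m = a.size();
  const std::size_t k = b.size();
  std::vector<float> out(m);

  // Pass 1: initialize output with bias, or 0 when there is no bias.
  for (std::size_t i = 0; i < m; ++i) out[i] = bias.empty() ? 0.0f : bias[i];

  // Pass 2: one dispatch per K-slice; each slice adds its partial result.
  for (std::size_t k0 = 0; k0 < k; k0 += k_split) {
    const std::size_t k1 = std::min(k0 + k_split, k);
    for (std::size_t i = 0; i < m; ++i) {
      float partial = 0.0f;
      for (std::size_t kk = k0; kk < k1; ++kk) partial += a[i][kk] * b[kk];
      out[i] += partial;  // on the GPU this step is the atomic accumulation
    }
  }
  return out;
}
```

On the GPU the K-slices run concurrently, so the final `out[i] += partial` must be atomic; the sequential loop above only mirrors the data flow.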
LGTM, thanks!
I have a general question regarding the "reduce" operation in WebGPU. In my understanding, there are generally two ways to implement a reduce operation:
In general, what are the pros and cons of the two different approaches?
- Renamed `MatMulFillBiasBeforeSplitKProgram` to `MatMulFillBiasOrZeroBeforeSplitKProgram`
- Fill one `vec4` value (0 or bias) per invocation in `MatMulFillBiasOrZeroBeforeSplitKProgram`
- Renamed `CreateMatMulProgram()` to `ComputeMatMul()` and run both `MatMulProgram` and `MatMulFillBiasOrZeroBeforeSplitKProgram` in `ComputeMatMul()`
- Removed `ShaderUsage::UseAtomicU32ForSplitK` and use `ProgramOutput::Atomic` instead
- Removed `data_type` in the `CacheHint` of `MatMulFillBiasOrZeroBeforeSplitKProgram`
- Updated the value of `config.split_dim_inner_` to 512 after more experiments
- Move the query of `SplitKConfig` into `ComputeMatMul()`. It's safe because in `MatMul::ComputeInternal()` `is_channels_last` is always false, while currently `Split-K` only supports `is_channels_last` being true.
- Add a comment about avoiding the use of `global_id` or `global_idx`
- Directly pass the temporary output shape in `FillBiasOrZeroProgram`
- Merge multiple `if (needs_split_k)` checks into one in `ComputeMatMul()`
- Use `global_idx` instead of `global_id.x`
Pull Request Overview
This PR implements the Split-K optimization for Conv and MatMul operations in the WebGPU execution provider. Split-K improves performance when the inner dimension (K) is large by splitting computation across multiple workgroups to increase parallelism, particularly beneficial on Intel GPUs (Lunar Lake and Meteor Lake architectures).
Key Changes:
- Adds Split-K support in `MakeMatMulPackedVec4Source()` to divide large K dimensions into smaller chunks across multiple workgroups
- Implements atomic operations in `MatMulWriteFnSource()` to accumulate partial results from Split-K workgroups
- Introduces a `SplitKConfig` class with hardware-specific thresholds and heuristics for determining when to use Split-K
- Adds `MatMulFillBiasOrZeroBeforeSplitKProgram` to initialize output buffers before Split-K computation
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/nn/conv_op_test.cc | Adds test cases for Conv2D with MatMul-like shapes to validate Split-K implementation with and without bias |
| onnxruntime/test/providers/cpu/nn/conv_fp16_test.cc | Adds FP16 test coverage for Split-K Conv2D operations with and without bias |
| onnxruntime/core/providers/webgpu/shader_helper.cc | Extends atomic type support to Float16x4 and Float32x4 for Split-K output buffers |
| onnxruntime/core/providers/webgpu/nn/conv.cc | Updates Conv implementation to use new ComputeMatMul function that handles Split-K |
| onnxruntime/core/providers/webgpu/math/matmul_packed.h | Defines MatMulProgram with Split-K parameter and new MatMulFillBiasOrZeroBeforeSplitKProgram class |
| onnxruntime/core/providers/webgpu/math/matmul_packed.cc | Implements shader generation for Split-K MatMul and bias/zero initialization program |
| onnxruntime/core/providers/webgpu/math/matmul.h | Introduces SplitKConfig class and converts CreateMatMulProgram to ComputeMatMul function |
| onnxruntime/core/providers/webgpu/math/matmul.cc | Implements Split-K configuration logic, dispatch size calculations, and bias initialization for Split-K |
| onnxruntime/core/providers/webgpu/math/gemm_utils.h | Extends MatMulWriteFnSource signature to support Split-K and output variable types |
| onnxruntime/core/providers/webgpu/math/gemm_utils.cc | Implements atomic accumulation logic for Split-K using compare-exchange operations on i32-cast float values |
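The compare-exchange approach noted for `gemm_utils.cc` exists because WGSL has no native atomic add for floats: the partial sum is accumulated by bit-casting the float to an integer and looping on a CAS. Here is a host-side C++ sketch of the same pattern — the function name is hypothetical and this is not the PR's shader code:

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// WGSL has no atomicAdd for f32, so Split-K partial results are accumulated
// with a compare-exchange loop on the float's bit pattern stored as u32/i32.
// This is an illustrative host-side version of that technique.
float atomic_add_f32(std::atomic<uint32_t>& slot, float value) {
  uint32_t old_bits = slot.load(std::memory_order_relaxed);
  while (true) {
    float old_val;
    std::memcpy(&old_val, &old_bits, sizeof(float));  // bitcast u32 -> f32
    const float new_val = old_val + value;
    uint32_t new_bits;
    std::memcpy(&new_bits, &new_val, sizeof(float));  // bitcast f32 -> u32
    // If the slot still holds old_bits, store new_bits; on failure old_bits
    // is refreshed with the slot's current contents and we retry.
    if (slot.compare_exchange_weak(old_bits, new_bits,
                                   std::memory_order_relaxed)) {
      return new_val;
    }
  }
}
```

The retry loop is what makes the add correct under contention: if another workgroup updates the slot between the load and the exchange, the exchange fails and the addition is recomputed against the fresh value.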
Pull Request Overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (2)
onnxruntime/core/providers/webgpu/math/matmul_packed.cc:1
- Corrected spelling of 'MatMulFillBiasBeforeSplitKProgram' to 'MatMulFillBiasOrZeroBeforeSplitKProgram'.
onnxruntime/core/providers/webgpu/math/gemm_utils.cc:1
- [nitpick] The comment states that `is_channels_last` is not used for Split-K, but this may be misleading since the enforcement at line 209 only checks for `has_bias`. Consider clarifying whether `is_channels_last` could be supported in the future or if there are other constraints.
- Cache `SplitKConfig` in `WebGPUContext` and only initialize it once
- Early return false when `enable_split_k_` is false
- Use `WORKGROUP_SIZE` (in Program.h) instead of declaring another one
qjia7 left a comment:
No more comments for me.
Please refactor the related shaders to eliminate any reliance on global_id.xxx or workgroup_id.xxx in your follow-up PRs. This dependency presents a potential risk and could lead to unforeseen issues.
May need a merge or rebase onto the latest main branch to include the fix to the CI pipeline.
Description
This patch implements the `Split-K` optimization on `Conv|MatMul`. With `Split-K` we can re-arrange the computation into multiple workgroups when `K` is large to increase parallelism on the platforms where `Split-K` is confirmed to be useful.
- Support `Split-K` in `MakeMatMulPackedVec4Source()` to split a workgroup with a large K into smaller ones. In this patch we only support `Split-K` with `batch_size == 1` and `vec4` on `Conv|MatMul`.
- Support `Split-K` in `MatMulWriteFnSource()` (add the partial result to the output with atomic built-in functions).
- Add `SplitKConfig` to decide whether `Split-K` should be used or not, and all the related thresholds.
- Add `MatMulFillBiasBeforeSplitKProgram` to initialize the output with `bias` or 0 when `Split-K` is used.
Motivation and Context
In the current implementation, when `K` (`dim_inner`) is large, each invocation does the computation one by one in a very large loop, which may not make full use of all EUs on a GPU.
With `Split-K` we can split such a large amount of computation (`K`) into multiple workgroups with less computation each (`kSplitK`, smaller than `K`), which can greatly improve parallelism.
With this patch we get about a 15% performance improvement on `efficientnet-lite-f16-demo` and a 9% improvement on `mobilenetv2-12-f16-demo` on Lunar Lake and Meteor Lake.
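To see where the extra parallelism comes from, note that without Split-K the dispatch size depends only on M and N, so a skinny shape with a huge K launches very few workgroups; Split-K multiplies the dispatch by the number of K-slices. A rough sketch of that arithmetic — the tile sizes and helper names here are illustrative assumptions, though the split size of 512 matches the `split_dim_inner_` value mentioned in this PR's commits:

```cpp
#include <cstddef>

// Illustrative dispatch-size arithmetic for Split-K. Tile sizes are assumed;
// only the split size of 512 is taken from this PR's commit notes.
struct Dispatch { std::size_t x, y, z; };

constexpr std::size_t kTileM = 32, kTileN = 32, kSplitK = 512;

constexpr std::size_t ceil_div(std::size_t a, std::size_t b) {
  return (a + b - 1) / b;
}

// Without Split-K: one workgroup per M x N output tile; K is a serial loop.
constexpr Dispatch plain_dispatch(std::size_t m, std::size_t n) {
  return {ceil_div(n, kTileN), ceil_div(m, kTileM), 1};
}

// With Split-K: the same tiles, replicated once per K-slice; each slice
// computes a partial sum and atomically accumulates it into the output.
constexpr Dispatch split_k_dispatch(std::size_t m, std::size_t n,
                                    std::size_t k) {
  return {ceil_div(n, kTileN), ceil_div(m, kTileM), ceil_div(k, kSplitK)};
}
```

For example, a 64x64 output with K = 4096 goes from 4 workgroups to 32, which is far more likely to occupy all EUs on the GPU, at the cost of the extra initialization pass and atomic writes.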