
Conversation

@Jiawei-Shao (Contributor) commented Oct 31, 2025

Description

This patch implements the Split-K optimization on Conv|MatMul. With Split-K, when K is large, we re-arrange the computation into multiple workgroups to increase parallelism on the platforms where Split-K is confirmed to be useful.

  1. Support Split-K in MakeMatMulPackedVec4Source() to split a workgroup with a large K into several smaller ones (see the sketch after this list). In this patch we only support Split-K with batch_size == 1 and vec4 on Conv|MatMul.
  2. Support Split-K in MatMulWriteFnSource() (accumulate the partial results into the output with atomic built-in functions).
  3. Implement SplitKConfig to decide whether Split-K should be used, together with all the related thresholds.
  4. Implement MatMulFillBiasBeforeSplitKProgram to initialize the output with bias or 0 when Split-K is used.
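
As a rough illustration of item 1 above (this is not the exact generated shader; the uniform names and load helpers below are hypothetical placeholders), each workgroup along the dispatch z dimension reduces only its own slice of K and later merges its partial sum into the shared output:

```wgsl
// Hypothetical sketch of the Split-K inner loop; k_split_size, dim_inner,
// load_a_vec4, and load_b_vec4 are illustrative names, not from this PR.
let k_begin = workgroup_id.z * uniforms.k_split_size;
let k_end = min(k_begin + uniforms.k_split_size, uniforms.dim_inner);

var acc = vec4<f32>(0.0);
for (var k = k_begin; k < k_end; k = k + 4u) {
  // Accumulate the partial result for this K-slice only.
  acc = acc + load_a_vec4(row, k) * load_b_vec4(k, col);
}
// Several workgroups now contribute to the same output element, so the
// partial result must be combined atomically (see item 2).
```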

Motivation and Context

In the current implementation, when K (dim_inner) is large, each invocation performs the whole reduction sequentially in one very large loop, which may not make full use of all the EUs on a GPU.

With Split-K we can split such a large amount of computation (K) across multiple workgroups that each do less work (kSplitK, smaller than K), which can greatly improve parallelism.

With this patch we get about a 15% performance improvement on efficientnet-lite-f16-demo and a 9% improvement on mobilenetv2-12-f16-demo on Lunar Lake and Meteor Lake.

@Jiawei-Shao (Contributor, Author) commented

@jchen10 @xhcao
This is my first PR to support Split-K in Conv|MatMul. PTAL, thanks!

@jchen10 (Contributor) commented Oct 31, 2025

LGTM, thanks!

@Jiawei-Shao Jiawei-Shao marked this pull request as ready for review November 3, 2025 05:19
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Nov 4, 2025
@Jiawei-Shao Jiawei-Shao changed the title Implement Split-K on Conv|MatMul [webgpu] Implement Split-K on Conv|MatMul Nov 5, 2025
@Jiawei-Shao Jiawei-Shao changed the title [webgpu] Implement Split-K on Conv|MatMul [WebGPU] Implement Split-K on Conv|MatMul Nov 5, 2025
@fs-eire (Contributor) commented Nov 5, 2025

I have a general question regarding the "reduce" operation in WebGPU. In my understanding, there are generally 2 ways to implement a reduce operation:

  • use an atomic primitive, specifically atomicCompareExchangeWeak in this case, just like this PR
  • use tree reduction (implemented in reduction_ops.cc) (is it possible to apply this algorithm to Split-K?)

In general, what are the pros and cons of the 2 different approaches?

@Jiawei-Shao (Contributor, Author) commented

> I have a general question regarding the "reduce" operation in WebGPU. In my understanding, there are generally 2 ways to implement a reduce operation:
>
>   • use an atomic primitive, specifically atomicCompareExchangeWeak in this case, just like this PR
>   • use tree reduction (implemented in reduction_ops.cc) (is it possible to apply this algorithm to Split-K?)
>
> In general, what are the pros and cons of the 2 different approaches?

I think tree reduction works best when we only need to increase parallelism by increasing the number of invocations within one workgroup, while Split-K is used when we want not only more invocations but also more workgroups and more Shared Local Memory in flight. On modern Intel GPUs, dispatching more workgroups means utilizing more Xe cores (each Xe core has its own vector engines and Shared Local Memory), which is why we see performance improvements with Split-K on matrix multiplications with a large K.
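
For reference, the atomic path can be sketched as below. This is a minimal illustration, assuming the output is bound as an array<atomic<u32>> holding f32 bits; the shader actually generated in gemm_utils.cc (which works on i32-cast values) differs in its details:

```wgsl
// Minimal sketch, not the generated shader: merge a partial sum into
// output[i], where output is assumed to be declared as
// var<storage, read_write> output: array<atomic<u32>>;
fn atomic_add_f32(i: u32, value: f32) {
  var old_bits = atomicLoad(&output[i]);
  loop {
    let new_bits = bitcast<u32>(bitcast<f32>(old_bits) + value);
    let result = atomicCompareExchangeWeak(&output[i], old_bits, new_bits);
    if (result.exchanged) {
      break;  // our partial sum has been merged
    }
    old_bits = result.old_value;  // another workgroup wrote first; retry
  }
}
```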

@Jiawei-Shao (Contributor, Author) commented

@fs-eire @guschmue @qjia7 PTAL, thanks!

- Renamed `MatMulFillBiasBeforeSplitKProgram` to `MatMulFillBiasOrZeroBeforeSplitKProgram`
- Fill one vec4 value (0 or bias) per invocation in `MatMulFillBiasOrZeroBeforeSplitKProgram` (a sketch follows below)
- Renamed `CreateMatMulProgram()` to `ComputeMatMul()` and run both `MatMulProgram` and
  `MatMulFillBiasOrZeroBeforeSplitKProgram` in `ComputeMatMul()`
- Removed `ShaderUsage::UseAtomicU32ForSplitK` and use `ProgramOutput::Atomic` instead
- Removed `data_type` in the `CacheHint` of `MatMulFillBiasOrZeroBeforeSplitKProgram`
- Updated the value of `config.split_dim_inner_` to 512 after more experiments
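
For context, a minimal sketch of what the fill program might look like (the binding layout and workgroup size here are illustrative assumptions, not taken from this PR):

```wgsl
// Hypothetical sketch of MatMulFillBiasOrZeroBeforeSplitKProgram:
// every invocation initializes exactly one vec4 of the output.
@group(0) @binding(0) var<storage, read> bias: array<vec4<f32>>;
@group(0) @binding(1) var<storage, read_write> output: array<vec4<f32>>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
  let global_idx = global_id.x;
  if (global_idx >= arrayLength(&output)) {
    return;
  }
  // Bias is per output column, so it repeats for every row; when there is
  // no bias, the program stores vec4<f32>(0.0) instead.
  output[global_idx] = bias[global_idx % arrayLength(&bias)];
}
```
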
@Jiawei-Shao Jiawei-Shao requested a review from qjia7 November 12, 2025 07:37
- Move the query of `SplitKConfig` into `ComputeMatMul()`. It's safe
  because in `MatMul::ComputeInternal()` `is_channels_last` is always
  false, while currently `Split-K` only supports `is_channels_last`
  being true.
- Add a comment about avoiding the use of `global_id` or `global_idx`
- Directly pass the temporary output shape in FillBiasOrZeroProgram
- Merge multiple `if(needs_split_k)` into one in `ComputeMatMul()`
- Use `global_idx` instead of `global_id.x`
@Jiawei-Shao Jiawei-Shao requested a review from qjia7 November 13, 2025 07:21
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements the Split-K optimization for Conv and MatMul operations in the WebGPU execution provider. Split-K improves performance when the inner dimension (K) is large by splitting computation across multiple workgroups to increase parallelism, particularly beneficial on Intel GPUs (Lunar Lake and Meteor Lake architectures).

Key Changes:

  • Adds Split-K support in MakeMatMulPackedVec4Source() to divide large K dimensions into smaller chunks across multiple workgroups
  • Implements atomic operations in MatMulWriteFnSource() to accumulate partial results from Split-K workgroups
  • Introduces SplitKConfig class with hardware-specific thresholds and heuristics for determining when to use Split-K
  • Adds MatMulFillBiasOrZeroBeforeSplitKProgram to initialize output buffers before Split-K computation

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/providers/cpu/nn/conv_op_test.cc | Adds test cases for Conv2D with MatMul-like shapes to validate the Split-K implementation with and without bias |
| onnxruntime/test/providers/cpu/nn/conv_fp16_test.cc | Adds FP16 test coverage for Split-K Conv2D operations with and without bias |
| onnxruntime/core/providers/webgpu/shader_helper.cc | Extends atomic type support to Float16x4 and Float32x4 for Split-K output buffers |
| onnxruntime/core/providers/webgpu/nn/conv.cc | Updates the Conv implementation to use the new ComputeMatMul function that handles Split-K |
| onnxruntime/core/providers/webgpu/math/matmul_packed.h | Defines MatMulProgram with a Split-K parameter and the new MatMulFillBiasOrZeroBeforeSplitKProgram class |
| onnxruntime/core/providers/webgpu/math/matmul_packed.cc | Implements shader generation for Split-K MatMul and the bias/zero initialization program |
| onnxruntime/core/providers/webgpu/math/matmul.h | Introduces the SplitKConfig class and converts CreateMatMulProgram into the ComputeMatMul function |
| onnxruntime/core/providers/webgpu/math/matmul.cc | Implements Split-K configuration logic, dispatch size calculations, and bias initialization for Split-K |
| onnxruntime/core/providers/webgpu/math/gemm_utils.h | Extends the MatMulWriteFnSource signature to support Split-K and output variable types |
| onnxruntime/core/providers/webgpu/math/gemm_utils.cc | Implements atomic accumulation logic for Split-K using compare-exchange operations on i32-cast float values |


@Jiawei-Shao Jiawei-Shao requested a review from Copilot November 14, 2025 03:10
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (2)

onnxruntime/core/providers/webgpu/math/matmul_packed.cc:1

  • Corrected spelling of 'MatMulFillBiasBeforeSplitKProgram' to 'MatMulFillBiasOrZeroBeforeSplitKProgram'.

onnxruntime/core/providers/webgpu/math/gemm_utils.cc:1

  • [nitpick] The comment states that is_channels_last is not used for Split-K, but this may be misleading since the enforcement at line 209 only checks for has_bias. Consider clarifying whether is_channels_last could be supported in the future or if there are other constraints.


@Jiawei-Shao Jiawei-Shao requested a review from Copilot November 14, 2025 03:17
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.



@Jiawei-Shao Jiawei-Shao requested a review from Copilot November 14, 2025 03:20
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.



- Cache `SplitKConfig` to `WebGPUContext` and only initialize it once
- Early return false when `enable_split_k_` is false
- Use `WORKGROUP_SIZE` (in Program.h) instead of declaring another one
@Jiawei-Shao Jiawei-Shao requested review from Copilot and qjia7 November 14, 2025 06:55
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.



@Jiawei-Shao Jiawei-Shao requested a review from Copilot November 14, 2025 07:19
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.



@Jiawei-Shao Jiawei-Shao requested a review from Copilot November 14, 2025 07:30
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.



@qjia7 (Contributor) left a comment

No more comments for me.
Please refactor the related shaders to eliminate any reliance on global_id.xxx or workgroup_id.xxx in your follow-up PRs. This dependency presents a potential risk and could lead to unforeseen issues.

@qjia7 qjia7 requested review from fs-eire and guschmue November 14, 2025 09:07
@fs-eire (Contributor) commented Nov 14, 2025

May need a merge or rebase onto the latest main branch to include the fix for the CI pipeline.
