
[RemoveLayout] Remove convert layout op for any layout if the user is tt.store with block pointer #4751

Open. chengjunlu wants to merge 2 commits into main from chengjun/improve_remove_layout.

Conversation

@chengjunlu (Contributor) commented Jul 21, 2025

It is always lower cost to store the value directly to the memory referred to by the block pointer, without a layout conversion.

@Copilot Copilot AI left a comment

Pull Request Overview

This PR removes layout conversions feeding tt.store operations that use block pointers. It eliminates the conditional checks that previously blocked layout-conversion removal in scenarios involving DPAS encoding, making the optimization more aggressive: a direct store without a layout conversion is always preferred. An illustrative before/after sketch follows the summary bullets below.

  • Removes the DPAS encoding validation checks for convert operations
  • Eliminates the restriction on forwarding DPAS encodings when the output type already has a DPAS encoding
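
For illustration, a minimal TTGIR sketch of the rewrite this enables, assuming a typical GEMM epilogue; the value names, shapes, layouts (#dpas, #blocked), and attribute values below are placeholders rather than code from this PR:

// Layout definitions elided: #dpas stands for a #ttig.dpas<...> encoding,
// #blocked for a #ttg.blocked<...> encoding.
// Before: the accumulator is converted out of the DPAS layout only so it can
// be stored through the block pointer.
%ptr = tt.make_tensor_ptr %base, [%c256_i64, %c256_i64], [%c256_i64, %c1_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<64x256xf16, #blocked>>
%cvt = ttg.convert_layout %acc : tensor<64x256xf16, #dpas> -> tensor<64x256xf16, #blocked>
tt.store %ptr, %cvt {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<64x256xf16, #blocked>>

// After: the tensor-pointer type adopts the source (#dpas) layout, the
// convert_layout is removed, and the accumulator is stored directly.
%ptr = tt.make_tensor_ptr %base, [%c256_i64, %c256_i64], [%c256_i64, %c1_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<64x256xf16, #dpas>>
tt.store %ptr, %acc {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<64x256xf16, #dpas>>

The intuition is that a ttg.convert_layout generally requires redistributing data across threads (often through shared local memory), while a 2D block store can write the DPAS-laid-out tile directly.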

@whitneywhtsang whitneywhtsang left a comment

Missing test.
Can you ensure there is no performance regression?

@chengjunlu force-pushed the chengjun/improve_remove_layout branch 2 times, most recently from 42fa047 to ca1a49d on July 22, 2025 at 05:56
@chengjunlu (Contributor, Author)

…block pointer. It is always lower cost to store the value to the memory referred to by the block pointer, without layout conversion.

Signed-off-by: Lu,Chengjun <chengjun.lu@intel.com>

@etiotto (Contributor) commented Jul 28, 2025

Possible degradation in FlexAttention (see attached benchmark image).

Need to rebase the PR and rerun. New run: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/16578249148

#blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 16], warpsPerCTA = [2, 2], order = [1, 0]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 16], warpsPerCTA = [1, 4], order = [1, 0]}>
module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, "ttg.threads-per-warp" = 16 : i32, "ttig.support_sg_2d_block"} {
tt.func public @matmul_kernel_with_block_pointers(%arg0: !tt.ptr<f16>, %arg1: !tt.ptr<f16>, %arg2: !tt.ptr<f16>, %arg3: i32, %arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32, %arg8: i32) {

None of the arguments are used by the kernel. Remove them.

Except %arg2.

%c1_i64 = arith.constant 1 : i64
%c256_i64 = arith.constant 256 : i64
%cst = arith.constant dense<0.000000e+00> : tensor<64x256xf16, #blocked>
%25 = ttg.convert_layout %cst : tensor<64x256xf16, #blocked> -> tensor<64x256xf16, #blocked1>

Add a CHECK-NOT to ensure the generated code doesn't contain a convert_layout operation; see the sketch below.
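
A minimal sketch of such lit-test directives, assuming the standard CHECK prefix and anchoring on the kernel shown above (the exact placement in the test file is an assumption):

// CHECK-LABEL: tt.func public @matmul_kernel_with_block_pointers
// CHECK-NOT:   ttg.convert_layout
// CHECK:       tt.store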

@whitneywhtsang (Contributor)

Flex Attention performance is good with the latest run.

@etiotto (Contributor) commented Jul 28, 2025

> Flex Attention performance is good with the latest run.

Yup, it is. No performance degradations in the microbenchmarks due to this PR.
