[RemoveLayout] Remove convert layout op for any layout if the user is tt.store with block pointer #4751
base: main
Conversation
Pull Request Overview
This PR removes layout conversions whose only user is a tt.store through a block pointer. The change eliminates conditional checks that previously prevented layout conversion removal in specific scenarios involving DPAS encoding, making the optimization more aggressive by always preferring a direct store to memory without a layout conversion.
- Removes DPAS encoding validation checks for convert operations
- Eliminates the restriction on forwarding DPAS encodings when the output type already has a DPAS encoding
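As a rough sketch of the intended effect in Triton GPU IR (the layouts mirror the test below; the value names and store attributes are illustrative, not copied from the actual diff):
// Before: the value is converted to #blocked1 only so that it matches the block pointer's layout.
%v1 = ttg.convert_layout %v : tensor<64x256xf16, #blocked> -> tensor<64x256xf16, #blocked1>
tt.store %ptr1, %v1 {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<64x256xf16, #blocked1>>
// After: the convert_layout is dropped and the block pointer/store are rewritten to use the producer's layout directly.
tt.store %ptr0, %v {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<64x256xf16, #blocked>>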
third_party/intel/lib/TritonIntelGPUTransforms/RemoveLayoutConversions.cpp
Missing test.
Can you ensure there is no performance regression?
Force-pushed from 42fa047 to ca1a49d.
…block pointer. It is always lower cost to store the value to the memory referred to by the block pointer without a layout conversion. Signed-off-by: Lu,Chengjun <chengjun.lu@intel.com>
I don't know why the launch has been in the queue for so long. Here are the new runs:
Possible degradation in FlexAttention: need to rebase the PR and rerun. New run: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/16578249148
#blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 16], warpsPerCTA = [2, 2], order = [1, 0]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 16], warpsPerCTA = [1, 4], order = [1, 0]}>
module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, "ttg.threads-per-warp" = 16 : i32, "ttig.support_sg_2d_block"} {
  tt.func public @matmul_kernel_with_block_pointers(%arg0: !tt.ptr<f16>, %arg1: !tt.ptr<f16>, %arg2: !tt.ptr<f16>, %arg3: i32, %arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32, %arg8: i32) {
None of the arguments are used by the kernel. Remove them.
except %arg2
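For illustration, a trimmed signature that keeps only the output pointer could look like this (hypothetical; the argument name follows the test above):
  tt.func public @matmul_kernel_with_block_pointers(%arg2: !tt.ptr<f16>) {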
    %c1_i64 = arith.constant 1 : i64
    %c256_i64 = arith.constant 256 : i64
    %cst = arith.constant dense<0.000000e+00> : tensor<64x256xf16, #blocked>
    %25 = ttg.convert_layout %cst : tensor<64x256xf16, #blocked> -> tensor<64x256xf16, #blocked1>
Add a CHECK-NOT to ensure the generated code doesn't contain a convert layout operation.
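For example (the label is assumed to match the function in the lit test above):
// CHECK-LABEL: tt.func public @matmul_kernel_with_block_pointers
// CHECK-NOT: ttg.convert_layout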
Flex Attention performance is good with the latest run.
Yup, it is. No performance degradation in the microbenchmarks due to this PR.
It is always lower cost to store the value directly to the memory referred to by the block pointer, without a layout conversion.