Convert block ptr type layouts to Subgroup2DBlockEncoding layouts #4463

Draft: wants to merge 14 commits into base: main
181 changes: 181 additions & 0 deletions test/TritonIntelGPU/optimize-block-io-encoding.mlir
@@ -0,0 +1,181 @@
// RUN: triton-opt %s -split-input-file -allow-unregistered-dialect --tritonintelgpu-optimize-block-io-encoding | FileCheck %s

// COM: test complete example
#blocked = #ttg.blocked<{sizePerThread = [4, 4], threadsPerWarp = [1, 16], warpsPerCTA = [8, 4], order = [1, 0]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [4, 4], warpsPerCTA = [32, 1], order = [1, 0]}>
#blocked2 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [1, 16], warpsPerCTA = [16, 2], order = [1, 0]}>
// CHECK-DAG: #[[$SUBGROUP_BLOCK_A:.+]] = #ttig.subgroup_2d_block<{warpsPerCTA = [8, 4], instrShape = [8, 16], numBlocks=2, order=[1, 0], kWidth=1, threadsPerWarp=16}>
// CHECK-DAG: #[[$SUBGROUP_BLOCK_B:.+]] = #ttig.subgroup_2d_block<{warpsPerCTA = [8, 4], instrShape = [16, 16], numBlocks=2, order=[0, 1], kWidth=2, threadsPerWarp=16}>
// CHECK-DAG: #[[$DPAS:.+]] = #ttig.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>
#mma = #ttig.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>
module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 32 : i32, "ttg.threads-per-warp" = 16 : i32, ttig.min_sg_size = 16 : i32, ttig.support_dpas, ttig.support_sg_2d_block} {
tt.func public @matmul_kernel_with_block_pointers(%arg0: !tt.ptr<f16>, %arg1: !tt.ptr<f16>, %arg2: !tt.ptr<f16>) {
%c4_i32 = arith.constant 4 : i32
%c256_i32 = arith.constant 256 : i32
%c1024_i64 = arith.constant 1024 : i64
%c5120_i64 = arith.constant 5120 : i64
%c1_i64 = arith.constant 1 : i64
%c0_i32 = arith.constant 0 : i32
%c4096_i64 = arith.constant 4096 : i64
%c32_i32 = arith.constant 32 : i32
%c64_i32 = arith.constant 64 : i32
%c5120_i32 = arith.constant 5120 : i32
%cst = arith.constant dense<0.000000e+00> : tensor<256x256xf32, #blocked>

// CHECK: %[[MAKE_TENSOR_PTR_A:.*]] = tt.make_tensor_ptr {{.*}} : <tensor<256x32xf16, #[[$SUBGROUP_BLOCK_A]]>>
%10 = tt.make_tensor_ptr %arg0, [%c1024_i64, %c5120_i64], [%c5120_i64, %c1_i64], [%c256_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<256x32xf16, #blocked1>>
// CHECK: %[[MAKE_TENSOR_PTR_B:.*]] = tt.make_tensor_ptr {{.*}} : <tensor<32x256xf16, #[[$SUBGROUP_BLOCK_B]]>>
%12 = tt.make_tensor_ptr %arg1, [%c5120_i64, %c4096_i64], [%c4096_i64, %c1_i64], [%c0_i32, %c256_i32] {order = array<i32: 1, 0>} : <tensor<32x256xf16, #blocked2>>
// CHECK: %[[RES:.*]]:3 = scf.for {{.*}} iter_args({{.*}} = {{.*}}, %[[ARG5:.*]] = %[[MAKE_TENSOR_PTR_A]], %[[ARG6:.*]] = %[[MAKE_TENSOR_PTR_B]])
%13:3 = scf.for %arg3 = %c0_i32 to %c5120_i32 step %c32_i32 iter_args(%arg4 = %cst, %arg5 = %10, %arg6 = %12) -> (tensor<256x256xf32, #blocked>, !tt.ptr<tensor<256x32xf16, #blocked1>>, !tt.ptr<tensor<32x256xf16, #blocked2>>) : i32 {
Contributor:
Add a CHECK to ensure that the loop init args have the new tensor ptr type with %mma layout.
The return types of the scf.for operation should be CHECKED too.

Contributor Author:
The return type of the scf.for is not modified, so I am not sure what we need to check. The tt.dot return type is unchanged, and the make_tensor_ptr / store use the same layout as before.

 %26 = tt.dot %24, %25, %23, inputPrecision = tf32 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma2, kWidth = 1}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma2, kWidth = 2}>> -> tensor<256x256xf32, #mma2>
      %27 = ttg.convert_layout %26 : tensor<256x256xf32, #mma2> -> tensor<256x256xf32, #blocked>
      %28 = tt.advance %arg5, [%c0_i32, %c32_i32] : <tensor<256x32xf16, #mma>>
      %29 = tt.advance %arg6, [%c32_i32, %c0_i32] : <tensor<32x256xf16, #mma1>>
      scf.yield %27, %28, %29 : tensor<256x256xf32, #blocked>, !tt.ptr<tensor<256x32xf16, #mma>>, !tt.ptr<tensor<32x256xf16, #mma1>>
    }
    %14 = tt.make_tensor_ptr %arg2, [%c1024_i64, %c4096_i64], [%c4096_i64, %c1_i64], [%9, %11] {order = array<i32: 1, 0>} : <tensor<256x256xf16, #blocked2>>
    %15 = arith.truncf %13#0 : tensor<256x256xf32, #blocked> to tensor<256x256xf16, #blocked>
    %16 = ttg.convert_layout %15 : tensor<256x256xf16, #blocked> -> tensor<256x256xf16, #blocked2>
    tt.store %14, %16 {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<256x256xf16, #blocked2>>
    tt.return

I can add a check to make sure the arguments of the scf.for loop are using the new values, but the verifier for scf.for should not allow a type mismatch, so I don't think we need another type check (I briefly looked through other lit examples and did not see arg type checking for scf.for).
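For reference, one way to express the reviewer's request would be to extend the existing scf.for CHECK above with the result types as well (hypothetical directive, not part of the test; it reuses the capture names already defined):

// CHECK: %[[RES:.*]]:3 = scf.for {{.*}} iter_args({{.*}} = {{.*}}, %[[ARG5:.*]] = %[[MAKE_TENSOR_PTR_A]], %[[ARG6:.*]] = %[[MAKE_TENSOR_PTR_B]]) -> (tensor<256x256xf32, #blocked>, !tt.ptr<tensor<256x32xf16, #[[$SUBGROUP_BLOCK_A]]>>, !tt.ptr<tensor<32x256xf16, #[[$SUBGROUP_BLOCK_B]]>>)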

%17 = tt.load %arg5 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #blocked1>>
// CHECK: %[[A_LOAD:.*]] = tt.load %[[ARG5]] {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #[[$SUBGROUP_BLOCK_A]]>>
// CHECK: {{.*}} = ttg.convert_layout %[[A_LOAD]] : tensor<256x32xf16, #[[$SUBGROUP_BLOCK_A]]> -> tensor<256x32xf16, #blocked1>
%18 = tt.load %arg6 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<32x256xf16, #blocked2>>
// CHECK: %[[B_LOAD:.*]] = tt.load %[[ARG6]] {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<32x256xf16, #[[$SUBGROUP_BLOCK_B]]>>
// CHECK: {{.*}} = ttg.convert_layout %[[B_LOAD]] : tensor<32x256xf16, #[[$SUBGROUP_BLOCK_B]]> -> tensor<32x256xf16, #blocked2>
%19 = ttg.convert_layout %17 : tensor<256x32xf16, #blocked1> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>>
%20 = ttg.convert_layout %18 : tensor<32x256xf16, #blocked2> -> tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>>
%21 = ttg.convert_layout %arg4 : tensor<256x256xf32, #blocked> -> tensor<256x256xf32, #mma>
%22 = ttg.convert_layout %19 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>>
%23 = ttg.convert_layout %20 : tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
// CHECK: tt.dot {{.*}} : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #[[$DPAS]], kWidth = 1}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #[[$DPAS]], kWidth = 2}>> -> tensor<256x256xf32, #[[$DPAS]]>
%24 = tt.dot %22, %23, %21, inputPrecision = tf32 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x256xf32, #mma>
%25 = ttg.convert_layout %24 : tensor<256x256xf32, #mma> -> tensor<256x256xf32, #blocked>
// CHECK: %[[ADVANCE_A:.*]] = tt.advance {{.*}} : <tensor<256x32xf16, #[[$SUBGROUP_BLOCK_A]]>>
%26 = tt.advance %arg5, [%c0_i32, %c32_i32] : <tensor<256x32xf16, #blocked1>>
// CHECK: %[[ADVANCE_B:.*]] = tt.advance {{.*}} : <tensor<32x256xf16, #[[$SUBGROUP_BLOCK_B]]>>
%27 = tt.advance %arg6, [%c32_i32, %c0_i32] : <tensor<32x256xf16, #blocked2>>
// CHECK: scf.yield {{.*}}, %[[ADVANCE_A]], %[[ADVANCE_B]]
scf.yield %25, %26, %27 : tensor<256x256xf32, #blocked>, !tt.ptr<tensor<256x32xf16, #blocked1>>, !tt.ptr<tensor<32x256xf16, #blocked2>>
}
%14 = tt.make_tensor_ptr %arg2, [%c1024_i64, %c4096_i64], [%c4096_i64, %c1_i64], [%c0_i32, %c256_i32] {order = array<i32: 1, 0>} : <tensor<256x256xf16, #blocked2>>
// CHECK: arith.truncf %[[RES]]#0 : tensor<256x256xf32, #blocked> to tensor<256x256xf16, #blocked>
%15 = arith.truncf %13#0 : tensor<256x256xf32, #blocked> to tensor<256x256xf16, #blocked>
Contributor:
Add CHECK to verify that %13#0 has the dpas layout.

Contributor Author:
It won't have the DPAS layout; it will be blocked. The DPAS layout is converted to blocked in %25.
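For reference, the conversion being referred to, quoted from the test body above:

%25 = ttg.convert_layout %24 : tensor<256x256xf32, #mma> -> tensor<256x256xf32, #blocked>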

%16 = ttg.convert_layout %15 : tensor<256x256xf16, #blocked> -> tensor<256x256xf16, #blocked2>
tt.store %14, %16 {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<256x256xf16, #blocked2>>
tt.return
}
}
Contributor:
Here is a test containing an scf.while loop:

#blocked = #ttg.blocked<{sizePerThread = [4, 4], threadsPerWarp = [1, 16], warpsPerCTA = [8, 4], order = [1, 0]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [4, 4], warpsPerCTA = [32, 1], order = [1, 0]}>
#blocked2 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [1, 16], warpsPerCTA = [16, 2], order = [1, 0]}>
#mma = #ttig.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>

module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 32 : i32, "ttg.threads-per-warp" = 16 : i32} {
  tt.func public @test_while_loop(%arg0: !tt.ptr<f16>, %arg1: !tt.ptr<f16>, %arg2: i1, %B: tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>)  {
    %c0_i32 = arith.constant 0 : i32
    %c1_i64 = arith.constant 1 : i64
    %c32_i32 = arith.constant 32 : i32
    %c256_i32 = arith.constant 256 : i32
    %c1024_i64 = arith.constant 1024 : i64
    %cst = arith.constant dense<0.000000e+00> : tensor<256x256xf32, #mma>
    %0 = tt.make_tensor_ptr %arg1, [%c1024_i64, %c1_i64], [%c1_i64, %c1024_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<256x32xf16, #blocked1>>
    %1:2 = scf.while (%arg3 = %0, %arg4 = %c0_i32) : (!tt.ptr<tensor<256x32xf16, #blocked1>>, i32) -> (!tt.ptr<tensor<256x32xf16, #blocked1>>, i32) {
      scf.condition(%arg2) %arg3, %arg4 : !tt.ptr<tensor<256x32xf16, #blocked1>>, i32
    } do {
    ^bb0(%arg3: !tt.ptr<tensor<256x32xf16, #blocked1>>, %arg4: i32):
      %3 = tt.advance %arg3, [%c256_i32, %c0_i32] : <tensor<256x32xf16, #blocked1>>
      %A = tt.load %3 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #blocked1>>
      %convA1 = ttg.convert_layout %A : tensor<256x32xf16, #blocked1> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>>
      %convA2 = ttg.convert_layout %convA1 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>>
      %res = tt.dot %convA2, %B, %cst, inputPrecision = tf32 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x256xf32, #mma>
      %7 = arith.addi %arg4, %c32_i32 : i32
      scf.yield %3, %7 : !tt.ptr<tensor<256x32xf16, #blocked1>>, i32
    }
    tt.return
  }
}

This is currently failing; I'd expect it to work given that the pass contains some code to handle while loops.

PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: /home/jovyan/intel-xpu-backend-for-triton/build/cmake.linux-x86_64-cpython-3.10/bin/triton-opt /home/jovyan/tmp/test4.mlir --split-input-file --tritonintelgpu-optimize-block-io-encoding
 #0 0x00005633d27e59b7 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/jovyan/intel-xpu-backend-for-triton/build/cmake.linux-x86_64-cpython-3.10/bin/triton-opt+0x483d9b7)
 #1 0x00005633d27e361e llvm::sys::RunSignalHandlers() (/home/jovyan/intel-xpu-backend-for-triton/build/cmake.linux-x86_64-cpython-3.10/bin/triton-opt+0x483b61e)
 #2 0x00005633d27e60b5 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
 #3 0x00007fbcfefb1520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00007fbcff0059fc pthread_kill (/lib/x86_64-linux-gnu/libc.so.6+0x969fc)
 #5 0x00007fbcfefb1476 gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x42476)
 #6 0x00007fbcfef977f3 abort (/lib/x86_64-linux-gnu/libc.so.6+0x287f3)
 #7 0x00005633d2793231 (/home/jovyan/intel-xpu-backend-for-triton/build/cmake.linux-x86_64-cpython-3.10/bin/triton-opt+0x47eb231)
 #8 0x00005633ce562bf1 mlir::triton::getMakeTensorPtrOp(mlir::Value) /home/jovyan/intel-xpu-backend-for-triton/lib/Dialect/Triton/IR/Utility.cpp:93:1
 #9 0x00005633ce562639 getMakeTensorPtrOpImpl(mlir::Operation*, mlir::Value) /home/jovyan/intel-xpu-backend-for-triton/lib/Dialect/Triton/IR/Utility.cpp:30:34
#10 0x00005633ce562989 mlir::triton::getMakeTensorPtrOp(mlir::Value) /home/jovyan/intel-xpu-backend-for-triton/lib/Dialect/Triton/IR/Utility.cpp:69:34
#11 0x00005633cebcf32e mlir::triton::gpu::intel::TritonIntelGPUOptimizeBlockIOEncodingPass::getSubgroup2DBlockLayoutForOperand(mlir::Value, mlir::triton::gpu::intel::DpasEncodingAttr, llvm::MapVector<mlir::Operation*, mlir::Attribute, llvm::DenseMap<mlir::Operation*, unsigned int, llvm::DenseMapInfo<mlir::Operation*, void>, llvm::detail::DenseMapPair<mlir::Operation*, unsigned int>>, llvm::SmallVector<std::pair<mlir::Operation*, mlir::Attribute>, 0u>>&) /home/jovyan/intel-xpu-backend-for-triton/third_party/intel/lib/TritonIntelGPUTransforms/OptimizeBlockIOEncoding.cpp:245:57
#12 0x00005633cebcf8a3 mlir::triton::gpu::intel::TritonIntelGPUOptimizeBlockIOEncodingPass::runOnOperation()::'lambda'(mlir::triton::DotOp)::operator()(mlir::triton::DotOp) const /home/jovyan/intel-xpu-backend-for-triton/third_party/intel/lib/TritonIntelGPUTransforms/OptimizeBlockIOEncoding.cpp:300:41
#13 0x00005633cebd253f _ZZN4mlir6detail4walkILNS_9WalkOrderE1ENS_15ForwardIteratorEZNS_6triton3gpu5intel41TritonIntelGPUOptimizeBlockIOEncodingPass14runOnOperationEvEUlNS4_5DotOpEE_S8_vEENSt9enable_ifIXaantsrSt11disjunctionIJSt7is_sameIT2_PNS_9OperationEESC_ISD_PNS_6RegionEESC_ISD_PNS_5BlockEEEE5valuesrSC_IT3_vE5valueESO_E4typeESF_OT1_ENKUlSF_E_clESF_ /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/Visitors.h:336:20
#14 0x00005633cebd3c98 _ZN4llvm12function_refIFvPN4mlir9OperationEEE11callback_fnIZNS1_6detail4walkILNS1_9WalkOrderE1ENS1_15ForwardIteratorEZNS1_6triton3gpu5intel41TritonIntelGPUOptimizeBlockIOEncodingPass14runOnOperationEvEUlNSB_5DotOpEE_SF_vEENSt9enable_ifIXaantsrSt11disjunctionIJSt7is_sameIT2_S3_ESJ_ISK_PNS1_6RegionEESJ_ISK_PNS1_5BlockEEEE5valuesrSJ_IT3_vE5valueEST_E4typeES3_OT1_EUlS3_E_EEvlS3_ /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/llvm/ADT/STLFunctionalExtras.h:47:40
#15 0x00005633ce354d47 llvm::function_ref<void (mlir::Operation*)>::operator()(mlir::Operation*) const /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/llvm/ADT/STLFunctionalExtras.h:69:62
#16 0x00005633ce352580 void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/Visitors.h:187:1
#17 0x00005633ce3524ef void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/Visitors.h:179:7
#18 0x00005633ce3524ef void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/Visitors.h:179:7
#19 0x00005633ce3524ef void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/Visitors.h:179:7
#20 0x00005633cebd25b4 _ZN4mlir6detail4walkILNS_9WalkOrderE1ENS_15ForwardIteratorEZNS_6triton3gpu5intel41TritonIntelGPUOptimizeBlockIOEncodingPass14runOnOperationEvEUlNS4_5DotOpEE_S8_vEENSt9enable_ifIXaantsrSt11disjunctionIJSt7is_sameIT2_PNS_9OperationEESC_ISD_PNS_6RegionEESC_ISD_PNS_5BlockEEEE5valuesrSC_IT3_vE5valueESO_E4typeESF_OT1_ /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/Visitors.h:341:38
#21 0x00005633cebd12f0 _ZN4mlir9Operation4walkILNS_9WalkOrderE1ENS_15ForwardIteratorEZNS_6triton3gpu5intel41TritonIntelGPUOptimizeBlockIOEncodingPass14runOnOperationEvEUlNS4_5DotOpEE_vEENSt9enable_ifIXeqsrN4llvm15function_traitsINSt5decayIT1_E4typeEXsrSt8is_classISG_E5valueEEE8num_argsLi1EET2_E4typeEOSE_ /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/Operation.h:798:75
#22 0x00005633cebd019f _ZN4mlir7OpState4walkILNS_9WalkOrderE1ENS_15ForwardIteratorEZNS_6triton3gpu5intel41TritonIntelGPUOptimizeBlockIOEncodingPass14runOnOperationEvEUlNS4_5DotOpEE_vEENSt9enable_ifIXeqsrN4llvm15function_traitsINSt5decayIT1_E4typeEXsrSt8is_classISG_E5valueEEE8num_argsLi1EET2_E4typeEOSE_ /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/OpDefinition.h:169:68
#23 0x00005633cebcf962 mlir::triton::gpu::intel::TritonIntelGPUOptimizeBlockIOEncodingPass::runOnOperation() /home/jovyan/intel-xpu-backend-for-triton/third_party/intel/lib/TritonIntelGPUTransforms/OptimizeBlockIOEncoding.cpp:311:21
...

Contributor Author:
I think it is probably failing because the advance occurs before the load and the argument to advance is not a MakeTensorPtrOp but the while loop construction. Why do you expect this to work? Can this ever be generated from a Triton kernel written in Python? I have run this PR on all the tests and benchmarks and not seen this error, so if it is legal then we are missing unit test coverage for this pattern.
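The chain in question, quoted from the reviewer's test above: the pointer fed to the load is produced by a tt.advance whose operand is the block argument of the scf.while body, not a tt.make_tensor_ptr, so walking defining ops from the load reaches the loop rather than the make-tensor-ptr op.

^bb0(%arg3: !tt.ptr<tensor<256x32xf16, #blocked1>>, %arg4: i32):
  %3 = tt.advance %arg3, [%c256_i32, %c0_i32] : <tensor<256x32xf16, #blocked1>>
  %A = tt.load %3 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #blocked1>>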


// -----

// COM: Test while loop / nested tt.advance
#blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [4, 4], warpsPerCTA = [32, 1], order = [1, 0]}>
#mma = #ttig.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>
// CHECK-DAG: #[[$BLOCKED:.+]] = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [4, 4], warpsPerCTA = [32, 1], order = [1, 0]}>
// CHECK-DAG: #[[$SUBGROUP_2D_BLOCK:.+]] = #ttig.subgroup_2d_block<{warpsPerCTA = [8, 4], instrShape = [8, 16], numBlocks=2, order=[1, 0], kWidth=1, threadsPerWarp=16}>
// CHECK-DAG: #[[$DPAS:.+]] = #ttig.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>
module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 32 : i32, "ttg.threads-per-warp" = 16 : i32, ttig.min_sg_size = 16 : i32, ttig.support_dpas, ttig.support_sg_2d_block} {
tt.func public @matmul_kernel_with_block_pointers(%arg0: !tt.ptr<f16>) {
%c1024_i64 = arith.constant 1024 : i64
%c5120_i64 = arith.constant 5120 : i64
%c1_i64 = arith.constant 1 : i64
%c256_i32 = arith.constant 256 : i32
%c0_i32 = arith.constant 0 : i32
%c32_i32 = arith.constant 32 : i32

// CHECK: %[[A_PTR:.*]] = tt.make_tensor_ptr %arg0, {{.*}} : <tensor<256x32xf16, #[[$SUBGROUP_2D_BLOCK]]>>
%a_ptr = tt.make_tensor_ptr %arg0, [%c1024_i64, %c5120_i64], [%c5120_i64, %c1_i64], [%c256_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<256x32xf16, #blocked1>>

// CHECK: scf.while {{.*}} : (!tt.ptr<tensor<256x32xf16, #[[$SUBGROUP_2D_BLOCK]]>>) -> !tt.ptr<tensor<256x32xf16, #[[$SUBGROUP_2D_BLOCK]]>>
%1 = scf.while (%a_ptr_crt = %a_ptr) : (!tt.ptr<tensor<256x32xf16, #blocked1>>) -> (!tt.ptr<tensor<256x32xf16, #blocked1>>) {
%2 = "dummy.evaluate_condition"() : () -> i1
// CHECK: scf.condition({{.*}}) {{.*}} : !tt.ptr<tensor<256x32xf16, #[[$SUBGROUP_2D_BLOCK]]>>
scf.condition(%2) %a_ptr_crt : !tt.ptr<tensor<256x32xf16, #blocked1>>
} do {
^bb0(%a_ptr_crt: !tt.ptr<tensor<256x32xf16, #blocked1>>):
// CHECK: ^bb0({{.*}}: !tt.ptr<tensor<256x32xf16, #[[$SUBGROUP_2D_BLOCK]]>>):

// CHECK: %[[A_LOAD:.*]] = tt.load {{.*}} : !tt.ptr<tensor<256x32xf16, #[[$SUBGROUP_2D_BLOCK]]>>
%3 = tt.load %a_ptr_crt {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #blocked1>>
// CHECK: ttg.convert_layout %[[A_LOAD]] : tensor<256x32xf16, #[[$SUBGROUP_2D_BLOCK]]> -> tensor<256x32xf16, #[[$BLOCKED]]>
// CHECK: ttg.convert_layout {{.*}} : tensor<256x32xf16, #[[$BLOCKED]]> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #[[$DPAS]], kWidth = 1}>>
%4 = ttg.convert_layout %3 : tensor<256x32xf16, #blocked1> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>>

%cstB = arith.constant dense<0.000000e+00> : tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
%cst = arith.constant dense<0.000000e+00> : tensor<256x256xf32, #mma>

// CHECK: tt.dot {{.*}} : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #[[$DPAS]], kWidth = 1}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #[[$DPAS]], kWidth = 2}>> -> tensor<256x256xf32, #[[$DPAS]]>
%5 = tt.dot %4, %cstB, %cst, inputPrecision = tf32 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x256xf32, #mma>
%6 = ttg.convert_layout %5 : tensor<256x256xf32, #mma> -> tensor<256x256xf32, #blocked1>
// CHECK: tt.advance {{.*}} : <tensor<256x32xf16, #[[$SUBGROUP_2D_BLOCK]]>>
%7 = tt.advance %a_ptr_crt, [%c0_i32, %c32_i32] : <tensor<256x32xf16, #blocked1>>

// CHECK: scf.yield {{.*}} : !tt.ptr<tensor<256x32xf16, #[[$SUBGROUP_2D_BLOCK]]>>
scf.yield %a_ptr_crt : !tt.ptr<tensor<256x32xf16, #blocked1>>
}
tt.return
}
}

// -----

// COM: test complex control flow
// COM: Note that instead of using tt.advance we make a new tensor ptr each time. This is nice because it lets us test that we can find the MakeTensorPtrOp inside the scf.if.
#blocked = #ttg.blocked<{sizePerThread = [4, 4], threadsPerWarp = [1, 16], warpsPerCTA = [8, 4], order = [1, 0]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [4, 4], warpsPerCTA = [32, 1], order = [1, 0]}>
#blocked2 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [1, 16], warpsPerCTA = [16, 2], order = [1, 0]}>
// CHECK-DAG: #[[$SUBGROUP_BLOCK_A:.+]] = #ttig.subgroup_2d_block<{warpsPerCTA = [8, 4], instrShape = [8, 16], numBlocks=2, order=[1, 0], kWidth=1, threadsPerWarp=16}>
// CHECK-DAG: #[[$SUBGROUP_BLOCK_B:.+]] = #ttig.subgroup_2d_block<{warpsPerCTA = [8, 4], instrShape = [16, 16], numBlocks=2, order=[0, 1], kWidth=2, threadsPerWarp=16}>
// CHECK-DAG: #[[$DPAS:.+]] = #ttig.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>
#mma = #ttig.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>
module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 32 : i32, "ttg.threads-per-warp" = 16 : i32, ttig.min_sg_size = 16 : i32, ttig.support_dpas, ttig.support_sg_2d_block} {
// CHECK-LABEL: @matmul_change_block_ptr_in_prologue
tt.func @matmul_change_block_ptr_in_prologue(%a_base: !tt.ptr<f16>,
%b_base: !tt.ptr<f16>) {
%c0_i64 = arith.constant 0 : i64
%c1_i64 = arith.constant 1 : i64
%k_tiles = arith.constant 32 : i64
%true = arith.constant true
%false = arith.constant false

%zero = arith.constant dense<0.0> : tensor<128x128xf32, #blocked>

// CHECK: %[[A_UNDEF:.*]] = ub.poison : !tt.ptr<tensor<128x64xf16, #[[$SUBGROUP_BLOCK_A]]>>
// CHECK: %[[B_UNDEF:.*]] = ub.poison : !tt.ptr<tensor<64x128xf16, #[[$SUBGROUP_BLOCK_B]]>>
%a_ptr_undef = ub.poison : !tt.ptr<tensor<128x64xf16, #blocked1>>
%b_ptr_undef = ub.poison : !tt.ptr<tensor<64x128xf16, #blocked2>>
// CHECK: scf.for {{.*}} iter_args({{.*}} = {{.*}}, %[[A_PTR:.*]] = %[[A_UNDEF]], %[[B_PTR:.*]] = %[[B_UNDEF]])
scf.for %k = %c0_i64 to %k_tiles step %c1_i64 iter_args(%acc = %zero, %flag = %true, %a_ptr = %a_ptr_undef, %b_ptr = %b_ptr_undef) -> (tensor<128x128xf32, #blocked>, i1, !tt.ptr<tensor<128x64xf16, #blocked1>>, !tt.ptr<tensor<64x128xf16, #blocked2>>) : i64 {
%do_prologue = "prologue_cond"(%k) : (i64) -> i1
// CHECK: %[[PTRS:.*]]:2 = scf.if {{.*}} -> (!tt.ptr<tensor<128x64xf16, #[[$SUBGROUP_BLOCK_A]]>>, !tt.ptr<tensor<64x128xf16, #[[$SUBGROUP_BLOCK_B]]>>)
%cur_a_ptr, %cur_b_ptr = scf.if %do_prologue -> (!tt.ptr<tensor<128x64xf16, #blocked1>>, !tt.ptr<tensor<64x128xf16, #blocked2>>) {
%off_m, %off_n, %off_k = "get_offsets"(%k) : (i64) -> (i32, i32, i32)
// CHECK: tt.make_tensor_ptr {{.*}} : <tensor<128x64xf16, #[[$SUBGROUP_BLOCK_A]]>>
%next_a_ptr = tt.make_tensor_ptr %a_base, [%k, %k], [%c1_i64, %c1_i64], [%off_m, %off_k] {order = array<i32: 1, 0>} : <tensor<128x64xf16, #blocked1>>
// CHECK: tt.make_tensor_ptr {{.*}} : <tensor<64x128xf16, #[[$SUBGROUP_BLOCK_B]]>>
%next_b_ptr = tt.make_tensor_ptr %b_base, [%k, %k], [%c1_i64, %c1_i64], [%off_n, %off_k] {order = array<i32: 1, 0>} : <tensor<64x128xf16, #blocked2>>
// CHECK: scf.yield {{.*}} : !tt.ptr<tensor<128x64xf16, #[[$SUBGROUP_BLOCK_A]]>>, !tt.ptr<tensor<64x128xf16, #[[$SUBGROUP_BLOCK_B]]>>
scf.yield %next_a_ptr, %next_b_ptr : !tt.ptr<tensor<128x64xf16, #blocked1>>, !tt.ptr<tensor<64x128xf16, #blocked2>>
} else {
// CHECK: scf.yield {{.*}} : !tt.ptr<tensor<128x64xf16, #[[$SUBGROUP_BLOCK_A]]>>, !tt.ptr<tensor<64x128xf16, #[[$SUBGROUP_BLOCK_B]]>>
scf.yield %a_ptr, %b_ptr : !tt.ptr<tensor<128x64xf16, #blocked1>>, !tt.ptr<tensor<64x128xf16, #blocked2>>
}

// CHECK: %[[A:.*]] = tt.load %[[PTRS]]#0 {{.*}} : !tt.ptr<tensor<128x64xf16, #[[$SUBGROUP_BLOCK_A]]>>
%a = tt.load %cur_a_ptr {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<128x64xf16, #blocked1>>
// CHECK: ttg.convert_layout %[[A]] : tensor<128x64xf16, #[[$SUBGROUP_BLOCK_A]]> -> tensor<128x64xf16, #blocked1>
// CHECK: %[[B:.*]] = tt.load %[[PTRS]]#1 {{.*}} : !tt.ptr<tensor<64x128xf16, #[[$SUBGROUP_BLOCK_B]]>>
%b = tt.load %cur_b_ptr {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<64x128xf16, #blocked2>>
// CHECK: {{.*}} = ttg.convert_layout %[[B]] : tensor<64x128xf16, #[[$SUBGROUP_BLOCK_B]]> -> tensor<64x128xf16, #blocked2>
%a_dot = ttg.convert_layout %a : tensor<128x64xf16, #blocked1> -> tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>>
%b_dot = ttg.convert_layout %b : tensor<64x128xf16, #blocked2> -> tensor<64x128xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>>
%a_dot_dpas = ttg.convert_layout %a_dot : tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> -> tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>>
%b_dot_dpas = ttg.convert_layout %b_dot : tensor<64x128xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<64x128xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
%accum = ttg.convert_layout %acc : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
%c = tt.dot %a_dot_dpas, %b_dot_dpas, %accum, inputPrecision = tf32 : tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>> * tensor<64x128xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
%c_out = ttg.convert_layout %c : tensor<128x128xf32, #mma> -> tensor<128x128xf32, #blocked>

%do_epilogue = arith.cmpi eq, %k, %c0_i64 : i64
%use_acc = arith.select %do_epilogue, %false, %true : i1
scf.if %do_epilogue {
"acc_user"(%c_out) : (tensor<128x128xf32, #blocked>) -> ()
}
// CHECK: scf.yield {{.*}} : {{.*}}, i1, !tt.ptr<tensor<128x64xf16, #[[$SUBGROUP_BLOCK_A]]>>, !tt.ptr<tensor<64x128xf16, #[[$SUBGROUP_BLOCK_B]]>>
scf.yield %c_out, %use_acc, %cur_a_ptr, %cur_b_ptr : tensor<128x128xf32, #blocked>, i1, !tt.ptr<tensor<128x64xf16, #blocked1>>, !tt.ptr<tensor<64x128xf16, #blocked2>>
}

tt.return
}
}
1 change: 1 addition & 0 deletions third_party/intel/backend/compiler.py
@@ -280,6 +280,7 @@ def make_ttgir(mod, metadata, opt, properties):

intel.passes.ttgpuir.add_accelerate_matmul(pm)
intel.passes.ttgpuir.add_materialize_block_pointer(pm)
intel.passes.ttgpuir.add_optimize_block_load_encoding(pm)
intel.passes.ttgpuir.add_remove_layout_conversions(pm)
intel.passes.ttgpuir.add_optimize_dot_operands(pm)
intel.passes.ttgpuir.add_pipeline(pm, opt.num_stages, XPUBackend.get_split_barrier_scope(opt))
@@ -409,4 +409,22 @@ def TritonIntelGPUReduceVariableLiveness
"mlir::scf::SCFDialect",
"mlir::arith::ArithDialect"];
}

def TritonIntelGPUOptimizeBlockIOEncodingPass
: Pass<"tritonintelgpu-optimize-block-io-encoding", "mlir::ModuleOp"> {
let summary = "Set encodings on candidates for Subgroup 2D Block IO ops";

let description = [{
Contributor:
Question: tt.load operations feeding a tt.dot operation are expected to have "dot" encoding in their type. How does this work with the new encoding? Will we need a layout conversion operation to change the layout from "BlockEncoding" to "Dot" encoding?

Contributor Author:
Yes, currently a tt.load op is expected to have DotOperandEncoding and the block_io tag if we want to lower it to Subgroup 2D Block loads. This pass changes the type of the tt.load to have a Subgroup2DBlockEncoding layout but inserts a layout conversion back to the original type. Because the Subgroup2DBlockEncoding layout is not an anchor layout, it is replaced in RemoveLayoutConversions with the DotOperandEncoding layout.
In #4500 I make Subgroup2DBlockEncoding an anchor layout and teach LoadStoreOpToLLVM to "see through" the layout conversion, which becomes a no-op. But the ultimate goal (which I am working on now) is to move the layout conversion (from subgroup 2d block load to DPAS in registers) out of LoadStoreOpToLLVM and into a ConvertLayoutOp, so eventually LoadStoreOpToLLVM would not depend directly on DPAS to lower block ptr / tensor desc loads to 2d block IO instructions.
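A rough before/after sketch of that rewrite, using the A operand from the lit test in this PR (illustrative only; #subgroup_2d_block_a stands in for the full #ttig.subgroup_2d_block attribute):

// Before: the load result already carries the layout its users expect.
%17 = tt.load %arg5 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #blocked1>>

// After: the tensor ptr and load use the Subgroup2DBlock encoding, and a
// convert_layout back to the original encoding keeps downstream users unchanged.
%17 = tt.load %arg5 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #subgroup_2d_block_a>>
%cvt = ttg.convert_layout %17 : tensor<256x32xf16, #subgroup_2d_block_a> -> tensor<256x32xf16, #blocked1>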

Set the Subgroup2DBlock encoding on tensor ptr types that are candidates for Subgroup 2D Block IO lowering.

The goal is to change the tensor ptr type to use the new encoding so the LoadOp will use the new encoding, allowing the
encoding to be an anchor layout during RemoveLayoutConversions. To avoid duplicating work in RemoveLayoutConversions, a
ConvertLayout op to the existing encoding replaces the result of the LoadOp.
}];

let dependentDialects = ["mlir::triton::gpu::TritonGPUDialect",
"mlir::triton::gpu::intel::TritonIntelGPUDialect",
"mlir::triton::TritonDialect"];
}

#endif // TRITON_INTEL_GPU_PASSES
@@ -5,6 +5,7 @@ add_triton_library(TritonIntelGPUTransforms
DistributeToWarps.cpp
MatchTargetSize.cpp
MaterializeBlockPointer.cpp
OptimizeBlockIOEncoding.cpp
OptimizeDotOperands.cpp
OptimizeReductionLocality.cpp
Pipeliner/MatmulLoopPipeline.cpp