Introduce a new pass to change LoadOp layouts to Subgroup2DBlock layouts #4362

@alexbaden

In today's pass pipeline, a load op feeding a dot op is tagged with the block_io attribute during the MaterializeBlockPointer pass, which runs after AccelerateMatmul has applied the DPAS layout. The load op layout is then changed to a ttg.dot_op operand layout during RemoveLayoutConversions.

IR before RemoveLayoutConversions:

    %10 = tt.make_tensor_ptr %arg0, [%c1024_i64, %c5120_i64], [%c5120_i64, %c1_i64], [%9, %c0_i32] {order = array<i32: 1, 0>} : <tensor<256x32xf16, #blocked1>> loc(#loc12)
    %11 = arith.muli %8, %c256_i32 : i32 loc(#loc13)
    %12 = tt.make_tensor_ptr %arg1, [%c5120_i64, %c4096_i64], [%c4096_i64, %c1_i64], [%c0_i32, %11] {order = array<i32: 1, 0>} : <tensor<32x256xf16, #blocked2>> loc(#loc14)
    %13:3 = scf.for %arg3 = %c0_i32 to %c5120_i32 step %c32_i32 iter_args(%arg4 = %cst, %arg5 = %10, %arg6 = %12) -> (tensor<256x256xf32, #blocked>, !tt.ptr<tensor<256x32xf16, #blocked1>>, !tt.ptr<tensor<32x256xf16, #blocked2>>)  : i32 {
      %17 = tt.load %arg5 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #blocked1>> loc(#loc16)
      %18 = tt.load %arg6 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<32x256xf16, #blocked2>> loc(#loc17)
      %19 = ttg.convert_layout %17 : tensor<256x32xf16, #blocked1> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> loc(#loc16)
      %20 = ttg.convert_layout %18 : tensor<32x256xf16, #blocked2> -> tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> loc(#loc17)
      %21 = ttg.convert_layout %arg4 : tensor<256x256xf32, #blocked> -> tensor<256x256xf32, #mma> loc(#loc1)
      %22 = ttg.convert_layout %19 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>> loc(#loc16)
      %23 = ttg.convert_layout %20 : tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> loc(#loc17)
      %24 = tt.dot %22, %23, %21, inputPrecision = tf32 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x256xf32, #mma> loc(#loc18)

Note the block_io attribute on both loads, followed by the conversions to ttg.dot_op layouts, first with a blocked parent (%19, %20) and then with the DPAS (#mma) parent (%22, %23).

After RemoveLayoutConversions, the blocked layouts have been removed and the load ops now produce ttg.dot_op layouts with the DPAS parent directly:

      %17 = tt.load %arg5 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>>> loc(#loc16)
      %18 = tt.load %arg6 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>> loc(#loc17)
      %19 = tt.dot %17, %18, %arg4, inputPrecision = tf32 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x256xf32, #mma> loc(#loc18)

The Subgroup2DBlockIO layout should be applied during this process. I propose adding a new pass that runs after MaterializeBlockPointer but before RemoveLayoutConversions. The new pass would apply the subgroup layout to the load and update downstream layout conversions to use the new layout. MaterializeBlockPointer would still apply the block_io tag to the LoadOp, and the new pass would treat that tag as the signal to apply the layout conversion.
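
The two steps of the proposed pass can be sketched on a toy model of the IR. This is only an illustration in Python, not the MLIR implementation: `Op`, `SUBGROUP_2D_BLOCK`, and `apply_subgroup_2d_block_layout` are hypothetical names, layouts are reduced to plain strings, and the real layout attribute would carry parameters that are not modeled here.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Placeholder string (assumption); the real layout attribute is not modeled.
SUBGROUP_2D_BLOCK = "#subgroup_2d_block"


@dataclass
class Op:
    """Toy stand-in for an MLIR operation with a single result."""
    name: str                         # e.g. "tt.load", "ttg.convert_layout"
    result_layout: str                # layout encoding of the result tensor
    operands: List["Op"] = field(default_factory=list)
    attrs: dict = field(default_factory=dict)
    src_layout: Optional[str] = None  # source layout, convert_layout only


def apply_subgroup_2d_block_layout(ops: List[Op]) -> List[Op]:
    """Hypothetical pass body: retag block_io loads, then fix up conversions."""
    # Step 1: the block_io tag is the signal to switch the load's layout.
    for op in ops:
        if op.name == "tt.load" and "ttig.block_io" in op.attrs:
            op.result_layout = SUBGROUP_2D_BLOCK
    # Step 2: downstream conversions must now convert from the new layout.
    for op in ops:
        if op.name == "ttg.convert_layout":
            op.src_layout = op.operands[0].result_layout
    return ops


# Mirrors %17 and %19 from the IR above, with layouts as plain strings.
load = Op("tt.load", "#blocked1", attrs={"ttig.block_io": "row_major"})
cvt = Op("ttg.convert_layout", "#dot_op<opIdx=0, parent=#blocked>",
         operands=[load], src_layout="#blocked1")
apply_subgroup_2d_block_layout([load, cvt])
```

In this model the conversion's result layout is left untouched, so everything downstream of the first ttg.convert_layout keeps its existing types; only the load result and the conversion's source change. A load without the block_io tag is not modified.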

Note that we could probably move the decision about when to apply the block_io tag and use the Subgroup2DBlock layout into the new pass. But I think it is easier to introduce the new pass in stages, giving it more responsibility once we have demonstrated that it works as expected within the existing pipeline.
