Description
In today's pass pipeline, a load op feeding a dot op is tagged with the `ttig.block_io` attribute during the `MaterializeBlockPointer` pass, after the DPAS layout has been applied by `AccelerateMatmul`. The load op's layout is then changed to a `ttg.dot_op` operand layout during `RemoveLayoutConversions`.
IR before `RemoveLayoutConversions`:
```mlir
%10 = tt.make_tensor_ptr %arg0, [%c1024_i64, %c5120_i64], [%c5120_i64, %c1_i64], [%9, %c0_i32] {order = array<i32: 1, 0>} : <tensor<256x32xf16, #blocked1>> loc(#loc12)
%11 = arith.muli %8, %c256_i32 : i32 loc(#loc13)
%12 = tt.make_tensor_ptr %arg1, [%c5120_i64, %c4096_i64], [%c4096_i64, %c1_i64], [%c0_i32, %11] {order = array<i32: 1, 0>} : <tensor<32x256xf16, #blocked2>> loc(#loc14)
%13:3 = scf.for %arg3 = %c0_i32 to %c5120_i32 step %c32_i32 iter_args(%arg4 = %cst, %arg5 = %10, %arg6 = %12) -> (tensor<256x256xf32, #blocked>, !tt.ptr<tensor<256x32xf16, #blocked1>>, !tt.ptr<tensor<32x256xf16, #blocked2>>) : i32 {
  %17 = tt.load %arg5 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #blocked1>> loc(#loc16)
  %18 = tt.load %arg6 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<32x256xf16, #blocked2>> loc(#loc17)
  %19 = ttg.convert_layout %17 : tensor<256x32xf16, #blocked1> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> loc(#loc16)
  %20 = ttg.convert_layout %18 : tensor<32x256xf16, #blocked2> -> tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> loc(#loc17)
  %21 = ttg.convert_layout %arg4 : tensor<256x256xf32, #blocked> -> tensor<256x256xf32, #mma> loc(#loc1)
  %22 = ttg.convert_layout %19 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>> loc(#loc16)
  %23 = ttg.convert_layout %20 : tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> loc(#loc17)
  %24 = tt.dot %22, %23, %21, inputPrecision = tf32 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x256xf32, #mma> loc(#loc18)
```
Note the `ttig.block_io` attribute on both loads, and the subsequent conversion to a `ttg.dot_op` layout with a blocked parent.
After `RemoveLayoutConversions`, the blocked layouts have been removed and the load ops now carry `ttg.dot_op` layouts with the DPAS parent:
```mlir
%17 = tt.load %arg5 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>>> loc(#loc16)
%18 = tt.load %arg6 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>> loc(#loc17)
%19 = tt.dot %17, %18, %arg4, inputPrecision = tf32 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x256xf32, #mma> loc(#loc18)
```
The Subgroup2DBlockIO layout should be applied during this process. I propose adding a new pass that runs after `MaterializeBlockPointer` but before `RemoveLayoutConversions`. The new pass would apply the subgroup layout and modify downstream layout conversions to use the new layout. `MaterializeBlockPointer` would still be responsible for applying the `ttig.block_io` tag to the `LoadOp`, and the new pass would use that tag as the signal to apply the layout conversion.
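The resulting pipeline ordering would look roughly like this. The flag spellings below are approximate, and the name of the new pass is a placeholder, not a final choice:

```
// Sketch of the proposed pass ordering (names illustrative):
tritonintelgpu-accelerate-matmul            // applies the DPAS layout to dot ops
tritonintelgpu-materialize-block-pointer    // tags qualifying loads with ttig.block_io
tritonintelgpu-optimize-block-io-encoding   // NEW: applies the Subgroup2DBlockIO encoding
tritongpu-remove-layout-conversions         // folds remaining layout conversions away
```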
Note that we could probably shift the decision-making about when to apply the `block_io` tag and when to use the Subgroup2DBlock layout into the new pass as well. But I think it is easier to introduce the new pass in stages, giving it more responsibility after we have demonstrated that it works as expected within the existing pipeline.
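To make the intended rewrite concrete, here is a toy Python model of the core logic the new pass would perform. It uses plain dicts as a stand-in for MLIR ops, and the encoding name `subgroup_2d_block` is a placeholder; everything here is illustrative rather than the actual C++ implementation:

```python
def apply_block_io_encoding(ops):
    """Toy model: rewrite the result encoding of every tt.load that is
    tagged with ttig.block_io and (transitively, through convert_layout
    ops) feeds a tt.dot. Returns the SSA names of the rewritten loads.

    Each op is a dict: {"name", "results", "operands", "attrs", "encoding"}.
    """
    # Map each SSA value to its defining op.
    def_of = {r: op for op in ops for r in op["results"]}

    def feeds_dot(value):
        # Follow convert_layout chains to see whether `value` reaches a tt.dot.
        for op in ops:
            if value in op["operands"]:
                if op["name"] == "tt.dot":
                    return True
                if op["name"] == "ttg.convert_layout" and feeds_dot(op["results"][0]):
                    return True
        return False

    rewritten = []
    for op in ops:
        if (op["name"] == "tt.load"
                and "ttig.block_io" in op["attrs"]
                and feeds_dot(op["results"][0])):
            # Placeholder for the real Subgroup2DBlockIO encoding attribute.
            op["encoding"] = "subgroup_2d_block"
            rewritten.append(op)

    # Downstream convert_layouts from rewritten loads now convert *from* the
    # new encoding; RemoveLayoutConversions can later fold them away.
    for op in ops:
        if op["name"] == "ttg.convert_layout":
            src = def_of.get(op["operands"][0])
            if src in rewritten:
                op["attrs"]["src_encoding"] = src["encoding"]

    return [op["results"][0] for op in rewritten]
```

The real pass would of course operate on MLIR ops and attributes, but the shape of the logic is the same: the `block_io` tag is the trigger, the dot-op use chain is the filter, and the downstream conversions are updated to keep the IR consistent.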