
[FlexAttention] The flex attention test takes too long to run in CI. #4265

Description

@chengjunlu

The flex attention test cases take roughly 5 hours to run, which is not acceptable in CI.

The major problems for now are:

  • Some configurations in the autotune list are suboptimal for Intel GPUs. They force IGC to recompile the kernel up to 4 times during codegen to reduce the register spill size.
    -- Maybe we can try to use the auto GRF mode on the Triton side.
    -- Need to update Torch to use Intel-tuned configurations instead of the common configurations inherited from CUDA (see the first sketch after this list).
  • Some test cases use high-precision fp32 matmul. These fall back to the FMA GEMM version, which takes a long time to compile because the generated kernel has ~30,000 instructions in one function (see the second sketch after this list).
    -- Maybe we can try to enhance the FMA GEMM with an optimized layout to reduce the kernel size and register pressure.
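
A minimal sketch of the Intel-tuned-configuration idea: keep the autotune list short so IGC never has to iterate on register-spilling variants. Everything here is a placeholder (the kernel, the `INTEL_TUNED_CONFIGS` name, and the block sizes are illustrative, not validated tunings); the real list would live in the Inductor flex attention template:

```python
import triton
import triton.language as tl

# Illustrative short autotune list; block sizes and num_warps are
# placeholders, not validated Intel tunings.
INTEL_TUNED_CONFIGS = [
    triton.Config({"BLOCK": 128}, num_warps=8, num_stages=2),
    triton.Config({"BLOCK": 64}, num_warps=4, num_stages=2),
]

@triton.autotune(configs=INTEL_TUNED_CONFIGS, key=["n_elements"])
@triton.jit
def toy_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Trivial body standing in for the flex attention template; the
    # point is only the pruned config list above.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * 2.0, mask=mask)
```

With only a couple of configs per key, the worst case during autotuning is two IGC compilations instead of one per CUDA-oriented config, several of which spill on Intel hardware.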
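A second sketch, for the precision fallback: recent Triton exposes an `input_precision` argument on `tl.dot`, where `"ieee"` requests exact fp32 math (the slow FMA GEMM path described above) while `"tf32"` allows the DPAS path. The tests presumably end up on the IEEE path via PyTorch's fp32 matmul precision setting. The kernel and shapes below are illustrative only, not the actual test case:

```python
import triton
import triton.language as tl

@triton.jit
def dot_kernel(a_ptr, b_ptr, c_ptr,
               M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    # M, N, K must be powers of 2 and >= 16 for tl.dot.
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs_k = tl.arange(0, K)
    a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
    b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
    # "ieee" forces exact fp32 math; per the issue, on XPU this lowers
    # to the huge FMA GEMM kernel. "tf32" would keep the DPAS path.
    c = tl.dot(a, b, input_precision="ieee")
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)
```

Shrinking what the FMA path emits (e.g. via a better layout, as suggested above) would cut both the instruction count and the compile time for these cases.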
