
[FlexAttention] The flex attention test takes too long to run in CI. #4265

Description

@chengjunlu

The flex attention test cases take roughly 5 hours to run, which is not acceptable in CI.

The major problems for now are:

  • Some configurations in the autotune list are suboptimal for Intel GPUs. They force IGC to recompile the kernel up to 4 times during codegen to reduce the register spill size.
    -- Maybe we can try to use the auto GRF mode on the Triton side.
    -- Need to update Torch to use Intel-tuned configurations instead of the common configurations inherited from CUDA (see the first sketch after this list).
  • Some test cases use high-precision fp32 matmul. These fall back to the FMA GEMM version, which takes a long time to compile because the generated kernel has ~30,000 instructions in one function (see the second sketch after this list).
    -- Maybe we can try to enhance the FMA GEMM with an optimized layout to reduce the kernel size and register pressure.
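
A minimal sketch of the Intel-tuned-configuration idea: keep the autotune list short so IGC never has to iterate on register-spilling variants. Everything here is a placeholder (the kernel, the `INTEL_TUNED_CONFIGS` name, and the block sizes are illustrative, not validated tunings); the real list would live in the Inductor flex attention template:

```python
import triton
import triton.language as tl

# Illustrative short autotune list; block sizes and num_warps are
# placeholders, not validated Intel tunings.
INTEL_TUNED_CONFIGS = [
    triton.Config({"BLOCK": 128}, num_warps=8, num_stages=2),
    triton.Config({"BLOCK": 64}, num_warps=4, num_stages=2),
]

@triton.autotune(configs=INTEL_TUNED_CONFIGS, key=["n_elements"])
@triton.jit
def toy_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Trivial body standing in for the flex attention template; the
    # point is only the pruned config list above.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * 2.0, mask=mask)
```

With only a couple of configs per key, the worst case during autotuning is two IGC compilations instead of one per CUDA-oriented config, several of which spill on Intel hardware.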
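A second sketch, for the precision fallback: recent Triton exposes an `input_precision` argument on `tl.dot`, where `"ieee"` requests exact fp32 math (the slow FMA GEMM path described above) while `"tf32"` allows the DPAS path. The tests presumably end up on the IEEE path via PyTorch's fp32 matmul precision setting. The kernel and shapes below are illustrative only, not the actual test case:

```python
import triton
import triton.language as tl

@triton.jit
def dot_kernel(a_ptr, b_ptr, c_ptr,
               M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    # M, N, K must be powers of 2 and >= 16 for tl.dot.
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs_k = tl.arange(0, K)
    a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
    b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
    # "ieee" forces exact fp32 math; per the issue, on XPU this lowers
    # to the huge FMA GEMM kernel. "tf32" would keep the DPAS path.
    c = tl.dot(a, b, input_precision="ieee")
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)
```

Shrinking what the FMA path emits (e.g. via a better layout, as suggested above) would cut both the instruction count and the compile time for these cases.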
