The flex attention test cases take about 5 hours, which is not acceptable in CI.
The major problems for now are:
Some configurations in the autotune lists are suboptimal. They cause up to a 4x slowdown in IGC codegen, which spends the extra time trying to reduce the register spilling size.
-- Maybe we can try to use the auto GRF mode on the Triton side.
-- Need to update Torch to use Intel-tuned configurations instead of the common configurations inherited from CUDA.
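The idea of replacing the CUDA-derived autotune list with an Intel-tuned one can be sketched as below. This is a minimal illustration with placeholder config values (the block sizes, warp counts, and the register-budget heuristic are assumptions for this sketch, not validated tunings):

```python
# Sketch: prune a CUDA-oriented autotune list down to configs that are
# unlikely to spill registers on Intel GPUs. All numbers are illustrative.

def prune_configs(configs, max_block_product):
    """Drop configs whose tile footprint likely exceeds the register budget."""
    return [
        c for c in configs
        if c["BLOCK_M"] * c["BLOCK_N"] <= max_block_product
    ]

# A CUDA-style list often includes large tiles that spill on Intel hardware.
cuda_style_configs = [
    {"BLOCK_M": 128, "BLOCK_N": 128, "num_warps": 8},
    {"BLOCK_M": 128, "BLOCK_N": 64,  "num_warps": 8},
    {"BLOCK_M": 64,  "BLOCK_N": 64,  "num_warps": 4},
    {"BLOCK_M": 32,  "BLOCK_N": 32,  "num_warps": 4},
]

# Keep only the tiles small enough to stay within the assumed budget.
intel_tuned_configs = prune_configs(cuda_style_configs, max_block_product=64 * 64)
print(len(intel_tuned_configs))  # 2
```

In practice the same filtering would be applied to the `triton.Config` lists that Torch Inductor passes to `triton.autotune`, so the compiler never even benchmarks the spilling candidates.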
Some test cases use high precision for fp32 matmul. This falls back to the FMA GEMM version, which takes a long time to compile because it generates ~30000 instructions in one function.
-- Maybe we can try to enhance the FMA GEMM with an optimized layout to reduce the kernel size and register pressure.
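The fallback above is driven by the fp32 matmul precision setting (in PyTorch, `torch.set_float32_matmul_precision`: "highest" demands exact fp32, while "high"/"medium" allow tf32/bf16-style reduced-precision paths). A minimal sketch of that dispatch decision, with hypothetical backend labels ("fma_fp32", "dpas_tf32" are illustrative names, not real identifiers):

```python
# Sketch: how an fp32 matmul precision setting can select the kernel family.
# The backend names are hypothetical labels for this illustration; the real
# dispatch lives inside the compiler/runtime.

def select_matmul_backend(precision: str) -> str:
    """Map a float32 matmul precision setting to a kernel family.

    "highest" demands exact fp32 accumulation, so only the slow-to-compile
    FMA GEMM qualifies; "high"/"medium" permit reduced-precision systolic
    (DPAS) kernels that compile quickly.
    """
    if precision == "highest":
        return "fma_fp32"   # exact fp32: huge unrolled kernel, slow codegen
    if precision in ("high", "medium"):
        return "dpas_tf32"  # reduced-precision systolic path, fast codegen
    raise ValueError(f"unknown precision: {precision}")

print(select_matmul_backend("highest"))  # fma_fp32
print(select_matmul_backend("high"))     # dpas_tf32
```

If the affected tests do not actually need bit-exact fp32, lowering their precision setting would avoid the ~30000-instruction FMA kernel entirely; otherwise the FMA path itself needs the layout optimization proposed above.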