
[V0.9.1] Replace FA interface with FA_V2 to optimize perf in SelfAttention #1701


Open · rjg-lyh wants to merge 1 commit into v0.9.1-dev from pr-fa

Conversation

@rjg-lyh (Contributor) commented Jul 9, 2025

What this PR does / why we need it?

The FA interface does not support passing compressed masks, so performance degraded significantly in long-sequence scenarios and could even hit functional issues such as OOM errors. This PR switches the self-attention computation to the FA_V2 interface, preserving functionality while greatly improving performance.
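
As a rough illustration of the OOM risk (not from the PR; the fp16 mask dtype, the 128K sequence length, and the 2048×2048 compressed-mask tile size are all assumptions for this example), the dense attention mask alone grows quadratically with sequence length, while a compressed mask stays fixed in size:

```python
# Back-of-the-envelope mask footprint; all concrete numbers here are assumptions
# for illustration, not values taken from the PR.

def mask_bytes(rows: int, cols: int, bytes_per_elem: int = 2) -> int:
    """Memory footprint of a dense [rows, cols] attention mask (fp16 assumed)."""
    return rows * cols * bytes_per_elem

seq_len = 128 * 1024                          # a long-sequence prefill (assumed)
full_mask = mask_bytes(seq_len, seq_len)      # dense [S, S] mask grows as S^2
tile_mask = mask_bytes(2048, 2048)            # hypothetical fixed compressed tile

print(f"dense mask:      {full_mask / 2**30:.1f} GiB")  # 32.0 GiB at S = 128K
print(f"compressed tile: {tile_mask / 2**20:.1f} MiB")  # 8.0 MiB, independent of S
```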

Does this PR introduce any user-facing change?

No.

How was this patch tested?

CI passed with the existing tests.

Signed-off-by: rjg-lyh <1318825571@qq.com>
The review thread below is attached to this diff context in the PR:

```python
num_heads=self.num_heads,
num_kv_heads=self.num_kv_heads,
out=output)
torch_npu.atb._npu_flash_attention_v2(
```
A collaborator commented:

Do you have the performance data for this PR?

@weijinqian0 (Contributor) replied:

Operating on the compressed mask during the prefill phase saves HBM spent on activations and improves FA computation efficiency, especially in long-sequence scenarios.
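
For readers who only see the fragment quoted above, here is a minimal sketch of what the call-site change might look like. It is illustrative only: `torch_npu.atb._npu_flash_attention_v2` and the keyword names `num_heads`, `num_kv_heads`, and `out` appear in the diff context, while every other argument (query/key/value, the compressed mask, sequence lengths, scale) is an assumption and may not match the real kernel signature.

```python
# Illustrative sketch only, not the actual vllm-ascend code. Keyword names other
# than num_heads / num_kv_heads / out are assumptions about the kernel signature.
import torch_npu  # requires an Ascend NPU software stack


def prefill_self_attention(query, key, value, compressed_mask, seq_lens,
                           scale, num_heads, num_kv_heads, output):
    # Old path (FA): took a dense mask, which for a long prefill means a full
    # [S, S] tensor in HBM and, per the PR description, OOM at long sequence
    # lengths.
    #
    # torch_npu._npu_flash_attention(query, key, value, mask=dense_mask, ...,
    #                                num_heads=num_heads,
    #                                num_kv_heads=num_kv_heads,
    #                                out=output)

    # New path (FA_V2): the entry point this PR switches to; only the kwargs
    # named in the diff context are known, the rest are hypothetical.
    torch_npu.atb._npu_flash_attention_v2(
        query, key, value,
        mask=compressed_mask,   # assumed kwarg: compressed mask instead of a full [S, S] mask
        seq_len=seq_lens,       # assumed kwarg
        scale_value=scale,      # assumed kwarg
        num_heads=num_heads,
        num_kv_heads=num_kv_heads,
        out=output)
```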

@rjg-lyh rjg-lyh closed this Jul 22, 2025
@rjg-lyh rjg-lyh deleted the pr-fa branch July 22, 2025 12:14
@rjg-lyh rjg-lyh restored the pr-fa branch July 22, 2025 12:15
@rjg-lyh rjg-lyh reopened this Jul 22, 2025