
Conversation

nuclearwu (Contributor) commented Jul 23, 2025

Signed-off-by: wuzhongjian wuzhongjian_yewu@cmss.chinamobile.com

What this PR does / why we need it?

self.hidden_size_per_attention_head = dist_utils.divide(
            projection_size, num_heads)
self.origin_hidden_size_per_attention_head = self.hidden_size_per_attention_head
if self.hidden_size_per_attention_head > MIN_PAD_SIZE and self.hidden_size_per_attention_head < MAX_PAD_SIZE:
            self.hidden_size_per_attention_head = MAX_PAD_SIZE

The intent of this code in the __init__ method is: when the hidden size per attention head is between 64 and 128, it is padded to 128 to optimize compute performance on the Ascend platform.
However, in the forward method, when torch_npu._npu_flash_attention_unpad is called, scale_value uses the original origin_hidden_size_per_attention_head rather than the hidden_size_per_attention_head that may have been padded:

scale_value=self.origin_hidden_size_per_attention_head**-0.5,

If hidden_size_per_attention_head is padded to 128 but scale_value still uses origin_hidden_size_per_attention_head (for example, 84), the scaling ratio is incorrect, which affects the accuracy of the attention weights.
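For illustration, a minimal sketch of the scaling mismatch in plain torch (the NPU kernel is not modeled here; the head dim of 80 and the 64/128 pad bounds are assumed values chosen for the example):

import torch

# Hypothetical sizes: original head dim 80, padded up to 128
# (assuming MIN_PAD_SIZE = 64 and MAX_PAD_SIZE = 128).
orig_dim, padded_dim = 80, 128

# Random logits stand in for the raw q @ k.T scores of one attention head.
logits = torch.randn(4, 4)

# The two candidate scale_value choices differ by a constant factor of
# sqrt(padded_dim / orig_dim) ≈ 1.26, so the softmax inputs are rescaled.
weights_orig_scale = torch.softmax(logits * orig_dim ** -0.5, dim=-1)   # current scale_value
weights_pad_scale = torch.softmax(logits * padded_dim ** -0.5, dim=-1)  # scale from the padded size

print((weights_orig_scale - weights_pad_scale).abs().max())  # non-zero: attention weights change

Which of the two denominators is correct depends on how the kernel treats the zero-padded part of the head dimension, which is what the review discussion below turns on.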

Does this PR introduce any user-facing change?

How was this patch tested?

nuclearwu (Contributor, Author) commented Jul 23, 2025

@wangxiyuan @Yikun @ApsarasX Please review. Thank you!

codecov bot commented Jul 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 65.78%. Comparing base (ac0bf13) to head (458a0b5).
⚠️ Report is 39 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1958   +/-   ##
=======================================
  Coverage   65.78%   65.78%           
=======================================
  Files          78       78           
  Lines        8406     8406           
=======================================
  Hits         5530     5530           
  Misses       2876     2876           
Flag Coverage Δ
unittests 65.78% <ø> (ø)


Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
wangxiyuan (Collaborator) commented

Let me check and run it locally, thanks for the fix

nuclearwu (Contributor, Author) commented

Let me check and run it locally, thanks for the fix

@wangxiyuan Could you share the results after running it on your end?

wangxiyuan added the accuracy-test (enable all accuracy test for PR) and ready-for-test (start test by label for PR) labels on Jul 30, 2025
zouyida2052 (Contributor) commented

If convenient, please provide specific results on the dataset under both precisions. The scale value here is used for normalization; we want to avoid having the zero-padded regions affect the original data. If the size is increased from 80 to 128, the padded areas will influence the normalization process, which could ultimately impact the model's accuracy.
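A quick check of this normalization point in plain torch, assuming the kernel zero-fills the padded region and computes an ordinary dot product over the full padded dimension (the 80 to 128 padding is the hypothetical case mentioned above):

import torch

orig_dim, padded_dim, seq_len = 80, 128, 4
q = torch.randn(seq_len, orig_dim)
k = torch.randn(seq_len, orig_dim)

# Zero-pad the head dimension from 80 to 128.
q_pad = torch.nn.functional.pad(q, (0, padded_dim - orig_dim))
k_pad = torch.nn.functional.pad(k, (0, padded_dim - orig_dim))

# The zero-padded region contributes nothing to q @ k.T, so attention weights
# computed on the padded tensors match the unpadded reference only when the
# scale uses the original head dim.
ref = torch.softmax((q @ k.T) * orig_dim ** -0.5, dim=-1)
w_orig_scale = torch.softmax((q_pad @ k_pad.T) * orig_dim ** -0.5, dim=-1)
w_pad_scale = torch.softmax((q_pad @ k_pad.T) * padded_dim ** -0.5, dim=-1)

print(torch.allclose(ref, w_orig_scale))  # True
print(torch.allclose(ref, w_pad_scale))   # False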

wangxiyuan (Collaborator) commented

@nuclearwu any feedback about the accuracy problem?

moguizhizi commented

Since we are optimizing performance, why isn't padding applied in the non-VL pipeline?
