[Bug] use hidden_size_per_attention_head for scale_value #1958
Conversation
@wangxiyuan @Yikun @ApsarasX Please review. Thank you!
Codecov Report ✅ All modified and coverable lines are covered by tests. Additional details and impacted files:
@@           Coverage Diff           @@
##             main    #1958   +/-   ##
=======================================
  Coverage   65.78%   65.78%
=======================================
  Files          78       78
  Lines        8406     8406
=======================================
  Hits         5530     5530
  Misses       2876     2876
=======================================
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Let me check and run it locally, thanks for the fix.
@wangxiyuan Could you share the results after running it on your end?
If convenient, please provide specific results on the dataset under both precisions. The scale value here is used for normalization; we want to avoid having the zero-padded regions affect the original data. If the size is increased from 80 to 128, the padded areas will influence the normalization process, which could ultimately impact the model's accuracy.
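For reference, a minimal numeric sketch of the two candidate scale values, assuming the usual 1/sqrt(head_size) convention for scale_value (the exact formula is not shown in this thread):

```python
import math

# Scale derived from the original head size (80) vs. the padded head size (128).
# Zero-padded dimensions contribute nothing to the Q.K dot products, which is
# why the head size used for normalization matters for accuracy.
scale_from_original_size = 1.0 / math.sqrt(80)   # ~0.1118
scale_from_padded_size   = 1.0 / math.sqrt(128)  # ~0.0884
print(scale_from_original_size, scale_from_padded_size)
```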
@nuclearwu
@nuclearwu any feedback about the accuracy problem?
Since we are optimizing performance, why isn't padding applied in the non-VL pipeline?
Signed-off-by: wuzhongjian wuzhongjian_yewu@cmss.chinamobile.com
What this PR does / why we need it?
The intention of the code in the `__init__` method is: when the hidden size of each attention head is between 64 and 128, it is padded to 128 to optimize computing performance on the Ascend platform. However, in the `forward` method, when calling `torch_npu._npu_flash_attention_unpad`, `scale_value` is computed from the original `origin_hidden_size_per_attention_head` rather than the possibly padded `hidden_size_per_attention_head`. If `hidden_size_per_attention_head` is padded to 128 but `scale_value` still uses `origin_hidden_size_per_attention_head` (for example, 84), the scaling ratio is incorrect, which affects the accuracy of the attention weights.
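As a rough sketch of the described mismatch and the intended fix (a hypothetical class, not the actual vllm-ascend implementation; the 1/sqrt(d) form of `scale_value` and the exact padding condition are assumptions based on this description):

```python
import math

class AttentionHeadSizeSketch:
    """Hedged sketch of the padding / scale_value interaction described above.

    Only the two head-size attribute names and the 64-to-128 padding rule come
    from the PR description; everything else is illustrative.
    """

    def __init__(self, hidden_size_per_attention_head: int) -> None:
        self.origin_hidden_size_per_attention_head = hidden_size_per_attention_head
        # Pad head sizes between 64 and 128 up to 128 for better Ascend kernel performance.
        if 64 < hidden_size_per_attention_head < 128:
            hidden_size_per_attention_head = 128
        self.hidden_size_per_attention_head = hidden_size_per_attention_head

    def scale_value_before_fix(self) -> float:
        # Before this PR: scale derived from the original (unpadded) head size.
        return 1.0 / math.sqrt(self.origin_hidden_size_per_attention_head)

    def scale_value_after_fix(self) -> float:
        # After this PR: scale derived from the (possibly padded) head size,
        # matching the tensors actually passed to torch_npu._npu_flash_attention_unpad.
        return 1.0 / math.sqrt(self.hidden_size_per_attention_head)
```

For the 84-dimensional example above, `scale_value_before_fix()` is roughly 0.109 while `scale_value_after_fix()` is roughly 0.088, which is the scaling mismatch this PR targets.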
Does this PR introduce any user-facing change?
How was this patch tested?