Fix precision in Metal fused attention by awni · Pull Request #3119 · ml-explore/mlx

awni · 2026-02-10T20:22:51Z

In our vector and NAX attention kernels we keep the scale factor in fp32. In the non-NAX fused attention, however, it gets downcast to bf16, and the rounding error is made worse because the scale is also multiplied by another scale factor.
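To give a rough feel for the rounding involved, here is a minimal sketch in plain Python that rounds a scale factor to bf16 and reports the relative error. It is not the Metal kernel code: `to_bf16` is a hypothetical helper, `head_dim = 128` is an assumed head dimension, and `extra = log2(e)` is only an assumed stand-in for the second factor, which the description above does not name.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    Valid for finite, non-overflowing inputs, which is all this sketch needs.
    """
    # Reinterpret the fp32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the top 16 bits; add a rounding bias so ties go to even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


head_dim = 128                     # assumed head dimension (illustration only)
scale = 1.0 / math.sqrt(head_dim)  # typical attention scale
extra = math.log2(math.e)          # assumed second factor (illustration only)

exact = scale * extra              # combined factor kept at full precision
rounded = to_bf16(exact)           # combined factor after a bf16 downcast
rel_err = abs(rounded - exact) / exact

print(f"scale rounded to bf16: {to_bf16(scale)!r}")
print(f"relative error of the combined factor in bf16: {rel_err:.2e}")
```

Because bf16 carries only 8 bits of significand precision, the relative error of any rounded factor can be as large as 2^-8, and that error is applied to every attention logit.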

It seems to have a real impact on model quality in some cases: ml-explore/mlx-lm#868 (comment)

In terms of performance, I think the regression is acceptable: it is well under 1% (at most 0.5%) for all the cases I tried.

In terms of accuracy, the difference between the fused attention in bf16 and in fp32 with a scaling factor is noticeably lower after this change.

For some random inputs with a typical scaling factor based on the head dimension, the maximum absolute difference goes down by a factor of 6:

Pre: 0.00585938
Post: 0.000976562
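A comparison of this kind can be sketched in plain Python. This is not the benchmark used for the numbers above and will not reproduce them: it runs a single-query softmax attention in double precision and changes only the scale, so it isolates the effect of rounding the scale to bf16. The `attention` and `max_abs_diff` helpers, the shapes, and the seed are all assumptions made for illustration.

```python
import math
import random


def attention(q, keys, values, scale):
    """Single-query softmax attention computed in double precision."""
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(dim)
    ]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


random.seed(0)
head_dim, seq_len = 128, 64  # assumed sizes (illustration only)
q = [random.gauss(0, 1) for _ in range(head_dim)]
keys = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]
values = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]

scale_full = 1.0 / math.sqrt(head_dim)
scale_bf16 = 0.08837890625   # 1/sqrt(128) rounded to the nearest bfloat16

ref = attention(q, keys, values, scale_full)
low = attention(q, keys, values, scale_bf16)
print(f"max abs diff from scale rounding alone: {max_abs_diff(ref, low):.3e}")
```

In a real measurement the inputs and accumulations would also be in bf16, so the absolute numbers differ. The sketch only shows that the scale on its own contributes a measurable error.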

angeloskath

Nice! Thanks

fix

fa378dc

awni requested a review from angeloskath February 10, 2026 20:23

awni mentioned this pull request Feb 10, 2026

LongCat MLA ml-explore/mlx-lm#868

Merged

angeloskath approved these changes Feb 10, 2026


awni merged commit 4c86c1e into main Feb 10, 2026
16 checks passed

awni deleted the fix_attn_precision branch February 10, 2026 22:18

awni merged 1 commit into main from fix_attn_precision
