[Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Model #27018
Conversation
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Code Review
This pull request introduces a fused CUDA kernel for Q-Norm, K-Norm, and RoPE operations for the Qwen3 model, which shows significant performance improvements. The implementation is well-structured and follows patterns from existing high-performance kernels.
I've identified a critical issue regarding the use of this fused kernel with dynamic RoPE scaling methods, which could lead to incorrect results. I've also pointed out a minor issue with const correctness in the CUDA code.
Overall, this is a great performance optimization. Addressing the identified issues will make it robust and ready for merging.
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
```python
# If set, use the fuse QKNorm and RoPE kernel
"VLLM_FUSE_QKNORM_AND_ROPE": lambda: bool(
    int(os.getenv("VLLM_FUSE_QKNORM_AND_ROPE", "0"))
),
```
Are there any downsides to using the fused implementation? Just asking because I'm wondering whether it would make sense to enable by default or even remove the config option.
This implementation does not support RoPE variants with custom forward implementations that differ from RotaryEmbedding, such as DualChunkRotaryEmbedding(CustomOp).
Although Qwen3 might not use this type of RoPE, I've added the VLLM_FUSE_QKNORM_AND_ROPE environment variable for safety. Additionally, in qwen3.py & qwen3_moe.py, I added an isinstance(self.rotary_emb, RotaryEmbedding) check to filter out unsupported RoPE implementations, as sketched below.
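A rough sketch of that guard (simplified; `can_use_fused_qk_norm_rope` is a hypothetical helper name, and the actual wiring in qwen3.py / qwen3_moe.py may differ):

```python
# Sketch of the guard described above; helper name is illustrative only.
import vllm.envs as envs
from vllm.model_executor.layers.rotary_embedding import RotaryEmbedding


def can_use_fused_qk_norm_rope(rotary_emb) -> bool:
    # Take the fused path only when the flag is set and the layer uses the
    # plain RotaryEmbedding forward; variants such as DualChunkRotaryEmbedding
    # (a CustomOp, not a RotaryEmbedding) fall back to the unfused path.
    return envs.VLLM_FUSE_QKNORM_AND_ROPE and isinstance(rotary_emb, RotaryEmbedding)
```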
@izhuhaoran Do you think we could formulate this as a custom pass for torch.compile to substitute the CUDA kernel in for? This would generalize across arches and also take care of greedily enabling by default when valid for Qwen.
Might be tough for this particular op, but worth a try cc @ProExpertProg
Agree, I will try it.
Also please compare performance with 2.9 as we improved the Inductor-generated kernels
Also please compare performance with 2.9 as we improved the Inductor-generated kernels
ok, I will bench again with torch 2.9
@izhuhaoran Do you think we could formulate this as a custom pass for torch.compile to substitute the CUDA kernel in for? This would generalize across arches and also take care of greedily enabling by default when valid for Qwen. Might be tough for this particular op, but worth a try cc @ProExpertProg
I tried converting this fusion into a custom torch.compile pass, referencing the existing fusion passes. However, my initial pass pattern isn't triggering: no fusion replacement occurs, and both the e2e tests and unit tests fail.
The draft code is here: https://github.com/izhuhaoran/vllm/tree/fuse-qknorm-rope-compile
I don't have too much experience with torch.compile, so I might be missing something. @mgoin @ProExpertProg Could someone help me convert this into a custom pass?
Yep, responded on Slack - can you open a draft PR?
Yep, responded on Slack - can you open a draft PR?
The follow-up draft PR is #27165. Thanks for your help!
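For readers following the pass discussion: vLLM's fusion passes are built on torch.compile's pattern-matching infrastructure, and the snippet below is only a minimal, generic torch.fx illustration of the "substitute a fused op for a matched subgraph" idea. The norm/RoPE math is deliberately simplified and `fused_norm_rope` is a stand-in, not this PR's kernel or its actual pass.

```python
# Generic torch.fx pattern-replacement sketch (not vLLM's real fusion pass).
import torch
import torch.fx
from torch.fx import symbolic_trace, subgraph_rewriter


def fused_norm_rope(q, weight):
    # Stand-in for a fused kernel; here just the eager math so the example runs.
    return q * torch.rsqrt(q.pow(2).mean(-1, keepdim=True) + 1e-6) * weight


torch.fx.wrap("fused_norm_rope")  # keep it as a single node when traced


def pattern(q, weight):
    # Unfused subgraph to look for (simplified RMS-norm-style scaling).
    return q * torch.rsqrt(q.pow(2).mean(-1, keepdim=True) + 1e-6) * weight


def replacement(q, weight):
    return fused_norm_rope(q, weight)


class Attn(torch.nn.Module):
    def forward(self, q, weight):
        return q * torch.rsqrt(q.pow(2).mean(-1, keepdim=True) + 1e-6) * weight


gm = symbolic_trace(Attn())
matches = subgraph_rewriter.replace_pattern(gm, pattern, replacement)
print(len(matches))                       # 1 -> the subgraph now calls fused_norm_rope
out = gm(torch.randn(2, 8), torch.ones(8))  # rewritten graph still runs
```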
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Here is the new evaluation with torch 2.9:
Bench Serve
setting: NVIDIA H20, TP=2, model=qwen3-30b-a3b-fp8, num_prompts=32, max-concurrency=32, input_len=out_len=1024
GPU Timeline Profile
CC @ProExpertProg, we can see the fused qknorm_rope kernel also outperforms torch.compile (2.9).
Those are great results! Just one question: I see gaps between the kernels in the profile; are you using cudagraphs when profiling? If not, could you show a profile with cudagraphs?
Purpose
This PR fuses QNorm, KNorm, and RoPE into a single CUDA kernel for the Qwen3 model, improving inference performance. Users can enable this fusion by setting VLLM_FUSE_QKNORM_AND_ROPE=1.
Test Result
GPU Trace
Bench Serve
setting: NVIDIA H20, TP=2, model=qwen3-30b-a3b-fp8, num_prompts=32, max-concurrency=8, input_len=out_len=1024
result: TTFT improved from 388.56 ms to 192.26 ms, and TPOT from 10.29 ms to 9.93 ms
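For quick context, the relative gains implied by those numbers (simple arithmetic on the figures quoted above):

```python
# Relative improvements computed from the reported measurements.
ttft_base, ttft_fused = 388.56, 192.26  # ms
tpot_base, tpot_fused = 10.29, 9.93     # ms

print(f"TTFT reduced by {100 * (1 - ttft_fused / ttft_base):.1f}%")  # ~50.5%
print(f"TPOT reduced by {100 * (1 - tpot_fused / tpot_base):.1f}%")  # ~3.5%
```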
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.