
Conversation

izhuhaoran
Contributor

@izhuhaoran izhuhaoran commented Oct 19, 2025

Purpose

This PR is a follow-up to #27018. It fuses QNorm, KNorm, and RoPE into a single CUDA kernel for the Qwen3 model to improve inference performance. We convert this fusion into a custom torch.compile pass, which users can enable with:

 --compilation-config='{"use_inductor": 1,  "pass_config": {"enable_qk_norm_rope_fusion": 1}}'

For more details, see #27018.
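
The same option can also be set through the offline API. A minimal sketch, assuming the dict form of compilation_config and an illustrative model name (only pass_config.enable_qk_norm_rope_fusion is introduced by this PR):

    from vllm import LLM

    # Model name is illustrative; any Qwen3 checkpoint with QK norm applies.
    llm = LLM(
        model="Qwen/Qwen3-8B",
        compilation_config={
            "use_inductor": True,
            "pass_config": {"enable_qk_norm_rope_fusion": True},
        },
    )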


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a fused CUDA kernel for QK Normalization and RoPE for the Qwen model, aiming to improve inference performance. The fusion is implemented as a torch.compile pass. The changes include the CUDA kernel, its PyTorch bindings, the fusion pass logic, and integration into the model and build system. A new test is also added to verify the fusion.

The overall approach is solid and follows existing patterns in the codebase for custom ops and fusions. However, I've found a critical issue in the fusion pass implementation that causes the fusion to produce incorrect results. The output of the fused operation is not correctly propagated in the graph, making the fusion effectively a no-op. Please see the detailed comment for the fix.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +148 to +162
def apply_qk_norm_rope(self, qkv, positions):
    if self.use_fused_qk_norm_rope:
        ops.fused_qk_norm_rope(
            qkv,
            self.num_heads,
            self.num_kv_heads,
            self.num_kv_heads,
            self.head_dim,
            self.q_norm.variance_epsilon,
            self.q_norm.weight,
            self.k_norm.weight,
            self.rotary_emb.cos_sin_cache,
            self.rotary_emb.is_neox_style,
            positions.view(-1),
        )


P1: Cast RoPE cache to tensor dtype before fused op

When VLLM_FUSE_QKNORM_AND_ROPE is enabled, the new fast path calls ops.fused_qk_norm_rope using self.rotary_emb.cos_sin_cache directly. RotaryEmbedding initializes this buffer in float32 and only casts it to the query’s dtype inside RotaryEmbedding.forward() via _match_cos_sin_cache_dtype. Because the fused path bypasses that method, the buffer typically remains float32 while the kernel enforces bfloat16 (CHECK_INPUT(cos_sin_cache, torch::kBFloat16)), causing an immediate TORCH_CHECK failure the first time the fused kernel is used. Before invoking the fused op, the RoPE cache should be converted to the same dtype/device as qkv just as the unfused path does.
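
A minimal sketch of the kind of cast being suggested (the helper name below is hypothetical; the existing _match_cos_sin_cache_dtype mentioned above could be reused instead):

    import torch

    def _ensure_cos_sin_cache_dtype(rotary_emb, qkv: torch.Tensor) -> None:
        # Align the RoPE cache with qkv's dtype/device before invoking the fused
        # kernel, mirroring what the unfused RotaryEmbedding.forward() path does.
        cache = rotary_emb.cos_sin_cache
        if cache.dtype != qkv.dtype or cache.device != qkv.device:
            rotary_emb.cos_sin_cache = cache.to(dtype=qkv.dtype, device=qkv.device)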


Comment on lines +306 to +319
def apply_qk_norm_rope(self, qkv, positions):
    if self.use_fused_qk_norm_rope:
        ops.fused_qk_norm_rope(
            qkv,
            self.num_heads,
            self.num_kv_heads,
            self.num_kv_heads,
            self.head_dim,
            self.q_norm.variance_epsilon,
            self.q_norm.weight,
            self.k_norm.weight,
            self.rotary_emb.cos_sin_cache,
            self.rotary_emb.is_neox_style,
            positions.view(-1),


P1: Ensure MoE RoPE cache matches fused kernel dtype

The MoE variant has the same issue: the fused path invokes ops.fused_qk_norm_rope without first aligning self.rotary_emb.cos_sin_cache to the query tensor’s dtype/device. The buffer starts as float32, while the CUDA kernel checks for bfloat16, so enabling the fused kernel leads to a runtime TORCH_CHECK error before any computation occurs. Mirror the unfused path by calling _match_cos_sin_cache_dtype (or otherwise casting) before the fused call.


Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Comment on lines +1317 to +1320
# If set, use the fuse QKNorm and RoPE kernel
"VLLM_FUSE_QKNORM_AND_ROPE": lambda: bool(
    int(os.getenv("VLLM_FUSE_QKNORM_AND_ROPE", "0"))
),
Contributor

We should use pass config instead of env vars.

Contributor Author

We should use pass config instead of env vars.

The PR is still WIP. This environment variable comes from #27018; this PR is converting it into a custom pass. Users will need to use (as you mentioned):
--compilation_config='{"use_inductor": 1, "pass_config": {"enable_qk_norm_rope_fusion": 1}}'
Once this PR is complete, the old env-var code will be cleaned up.

@ZJY0516
Contributor

ZJY0516 commented Oct 19, 2025

call_function  split_with_sizes        aten.split_with_sizes.default       (mm_3, [4096, 1024, 1024], -1)                                   {}
call_function  getitem_6               <built-in function getitem>         (split_with_sizes, 0)                                            {}
call_function  getitem_7               <built-in function getitem>         (split_with_sizes, 1)                                            {}
call_function  getitem_8               <built-in function getitem>         (split_with_sizes, 2)                                            {}
call_function  empty                   aten.empty.memory_format            ([arg1_1, 32, 128],)                                             {'dtype': torch.bfloat16, 'layout': torch.strided, 'device': device(type='cuda', index=0), 'pin_memory': False}
call_function  permute_4               aten.permute.default                (empty, [0, 1, 2])                                               {}
call_function  view_1                  aten.reshape.default                (getitem_6, [arg1_1, 32, 128])                                   {}
call_function  clone                   aten.clone.default                  (view_1,)                                                        {'memory_format': torch.contiguous_format}
call_function  auto_functionalized_2   auto_functionalized                 (<OpOverload(op='_C.rms_norm', overload='default')>,)            {'result': permute_4, 'input': clone, 'weight': arg9_1, 'epsilon': 1e-06}
call_function  getitem_10              <built-in function getitem>         (auto_functionalized_2, 1)                                       {}
call_function  empty_1                 aten.empty.memory_format            ([arg1_1, 8, 128],)                                              {'dtype': torch.bfloat16, 'layout': torch.strided, 'device': device(type='cuda', index=0), 'pin_memory': False}
call_function  permute_5               aten.permute.default                (empty_1, [0, 1, 2])                                             {}
call_function  view_3                  aten.reshape.default                (getitem_7, [arg1_1, 8, 128])                                    {}
call_function  clone_1                 aten.clone.default                  (view_3,)                                                        {'memory_format': torch.contiguous_format}
call_function  auto_functionalized_3   auto_functionalized                 (<OpOverload(op='_C.rms_norm', overload='default')>,)            {'result': permute_5, 'input': clone_1, 'weight': arg10_1, 'epsilon': 1e-06}
call_function  getitem_12              <built-in function getitem>         (auto_functionalized_3, 1)                                       {}
call_function  view_5                  aten.reshape.default                (getitem_10, [arg1_1, 4096])                                     {}
call_function  view_6                  aten.reshape.default                (getitem_12, [arg1_1, 1024])                                     {}
call_function  auto_functionalized_4   auto_functionalized                 (<OpOverload(op='_C.rotary_embedding', overload='default')>,)    {'positions': arg11_1, 'query': view_5, 'key': view_6, 'head_size': 128, 'cos_sin_cache': arg13_1, 'is_neox': True}

The target graph for replacement is quite large. Using pattern matching here, as we do in other passes, may not scale effectively and could become a maintenance burden.
Do you have any suggestions? @ProExpertProg

@izhuhaoran
Contributor Author

The target graph for replacement is quite large. Using pattern matching here, as we do in other passes, may not scale effectively and could become a maintenance burden. Do you have any suggestions? @ProExpertProg

I'm also aware of the same issue, which makes the pattern extremely hacky. In my initial implementation, when enable_qk_norm_rope_fusion is enabled, I set rms_norm and rope as custom ops, but this isn't the optimal solution either. I'm wondering whether we should abandon converting this fusion into a custom pass and directly use the implementation from #27018 (which is also the current state in TRT-LLM). I'd like to hear your thoughts on this, @ProExpertProg.

Collaborator

@ProExpertProg ProExpertProg left a comment


This looks like the right approach! Once you're done, please clean up the code and add E2E performance and lm-eval numbers.

"input_global_scale",
),
)
# # Defunctionalize fused_qk_norm_rope to remove higher-order wrapper.
Collaborator

Is this supposed to be removed or uncommented?

Comment on lines +86 to +93
# split qkv -> q,k,v
# q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
split_tuple = SPLIT_SIZES_OP(
    qkv, [self.q_size, self.kv_size, self.kv_size], -1
)
q = operator.getitem(split_tuple, 0)
k = operator.getitem(split_tuple, 1)
v = operator.getitem(split_tuple, 2)
Collaborator

I think this should work; pattern tracing is very close to forward-code tracing:

Suggested change
-# split qkv -> q,k,v
-# q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
-split_tuple = SPLIT_SIZES_OP(
-    qkv, [self.q_size, self.kv_size, self.kv_size], -1
-)
-q = operator.getitem(split_tuple, 0)
-k = operator.getitem(split_tuple, 1)
-v = operator.getitem(split_tuple, 2)
+# split qkv -> q,k,v
+q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)

q_out = EMPTY_LIKE_OP(q_by_head)
q_by_head_contiguous = CONTIGUOUS_OP(q_by_head)

qn = auto_functionalized(
Collaborator

Please use MatcherRMSNorm so that we can match even with rms_norm disabled (i.e., using the torch implementation in forward_native).

]

# # Register variants across rope ops and with/without contiguous()
# # Ensure view ops are canonicalized to reshape in the traced pattern
Collaborator

Was this not needed?

Comment on lines +272 to +273
if not current_platform.is_cuda_alike():
    return
Collaborator

Silent disablement is not good; this should not be enabled at all on non-CUDA-alike platforms.
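
A minimal sketch of failing loudly instead (function name and call site are hypothetical; current_platform.is_cuda_alike() is the check already used in the diff above):

    from vllm.platforms import current_platform

    def validate_qk_norm_rope_fusion(enabled: bool) -> None:
        # Reject the config up front rather than silently skipping the pass later.
        if enabled and not current_platform.is_cuda_alike():
            raise ValueError(
                "enable_qk_norm_rope_fusion is only supported on CUDA-alike platforms."
            )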

"QK Norm+RoPE fusion enabled, but no Attention layers were discovered."
)
return
layer_name, layer = next(iter(attn_layers.items()))
Collaborator

This will only register the pattern using one layer; is that intended? Are we sure this will always pick the same layer?

Collaborator

I see now that you don't care which layer it is because you only need the shapes; please add a comment explaining that.

    rope_op: torch._ops.OpOverload,
    is_neox: bool,
) -> None:
    self.layer = layer
Collaborator

Are you just using layer for these sizes? If yes, don't save the layer object on the pattern object; just extract the size properties.
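
A minimal sketch of that suggestion (class and field names are illustrative, not the PR's actual pattern class): keep only the shape/config fields needed for tracing instead of a reference to the whole attention layer.

    class QkNormRopePatternShapes:
        # Holds only what the traced pattern needs; no attention-layer reference.
        def __init__(
            self,
            num_heads: int,
            num_kv_heads: int,
            head_dim: int,
            eps: float,
            is_neox: bool,
        ) -> None:
            self.num_heads = num_heads
            self.num_kv_heads = num_kv_heads
            self.head_dim = head_dim
            self.eps = eps
            self.is_neox = is_neox
            # Derived sizes used when building the q/k/v split in the pattern.
            self.q_size = num_heads * head_dim
            self.kv_size = num_kv_heads * head_dim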

Comment on lines +551 to +553
if self.pass_config.enable_qk_norm_rope_fusion:
    self.custom_ops.append("+rms_norm")
    self.custom_ops.append("+rotary_embedding")
Collaborator

Let's try to remove this requirement: definitely for rms_norm, and hopefully for RoPE as well, although RoPE is less important. I assume all custom RoPE ops would be fused away anyway, right?

rope_scaling=rope_scaling,
dual_chunk_attention_config=dual_chunk_attention_config,
)
# Determine if we can use fused QK norm + RoPE
Collaborator

Please remove the model definition changes now that we have the fusion pass.

rope_scaling=rope_scaling,
dual_chunk_attention_config=dual_chunk_attention_config,
)
# Determine if we can use fused QK norm + RoPE
Collaborator

Here as well


Labels

ci/build, qwen (Related to Qwen models)

3 participants