[Performance] Support FP8 flashinfer TRTLLM MOE on Qwen3 and Qwen-3next #27492
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
@mgoin @pavanimajety, could you help review this PR?
If this PR is merged, can vLLM still run with an older flashinfer? We are internally just upgrading to flashinfer nightly-v0.4.1-20251027, and this PR seems to bump the flashinfer version again. Is it possible to consider some backward compatibility with older flashinfer versions?
Hi @mxz297,
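For illustration only (not the maintainer's actual reply): a minimal sketch of one way such a version gate could look. The PyPI distribution name `flashinfer-python` and the threshold version are assumptions, not taken from this PR.

```python
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version


def flashinfer_at_least(minimum: str) -> bool:
    """Best-effort check of the installed flashinfer version.

    The distribution name "flashinfer-python" and the threshold passed
    in by the caller are illustrative assumptions.
    """
    try:
        return Version(version("flashinfer-python")) >= Version(minimum)
    except PackageNotFoundError:
        return False


# e.g. only take the new TRTLLM MoE path on a sufficiently new flashinfer:
use_trtllm_moe = flashinfer_at_least("0.4.1")
```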
```diff
 routing_method_type = getattr(layer, "routing_method_type", 2)
 return torch.ops.vllm.flashinfer_fused_moe_blockscale_fp8(
-    routing_logits=router_logits.to(torch.float32),
+    routing_logits=router_logits.to(torch.float32)
+    if routing_method_type == 2
```
We should use the enum rather than a raw int.
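A minimal sketch of the suggested fix, assuming the `RoutingMethodType` enum defined below is in scope; the helper name is hypothetical and the call site is abbreviated:

```python
import torch


def _to_routing_logits(
    layer: torch.nn.Module, router_logits: torch.Tensor
) -> torch.Tensor:
    # Hypothetical helper: read the enum member off the layer instead of
    # defaulting to the magic number 2.
    routing_method_type = getattr(
        layer, "routing_method_type", RoutingMethodType.DeepSeekV3
    )
    # DeepSeekV3 routing expects float32 logits; others pass through unchanged.
    if routing_method_type == RoutingMethodType.DeepSeekV3:
        return router_logits.to(torch.float32)
    return router_logits
```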
```python
from enum import IntEnum


# The type of method in top-k routing.
# Please keep this in sync with the counterpart defined in
# https://github.com/flashinfer-ai/flashinfer/blob/main/include/flashinfer/trtllm/fused_moe/runner.h
class RoutingMethodType(IntEnum):
    # Default: Softmax -> TopK
    Default = 0
    # Renormalize: TopK -> Softmax
    Renormalize = 1
    # DeepSeekV3: Sigmoid -> RoutingBiasAdd -> Top2 in group -> Top4 groups
    # -> Top8 experts from the Top4 groups
    DeepSeekV3 = 2
    # Llama4: Top1 -> Sigmoid
    Llama4 = 3
    # RenormalizeNaive: Softmax -> TopK -> Renormalize
    RenormalizeNaive = 4
    # TopK: TopK (no softmax)
    TopK = 5
    # Unspecified
    Unspecified = 6
```
I like the idea of having a routing method type so we can reduce the need for hacks like checking the Llama 4 custom routing function within the quant method.
However, I think directly tying the values to the flashinfer TRTLLM fused MoE is short-sighted if we are to leverage this across the codebase. If we do this right, we can actually remove other arguments we have in FusedMoE, such as renormalize.
So I think this is the important design change in the PR. We could derive the routing type from existing arguments, and of course allow for an explicit override. I'm interested to hear @bnellnm's thoughts too.
I don't necessarily want to block this PR on getting the final design right, but I do want agreement on my other comments that this makes sense as a more explicit control for routing types across all fused MoE methods.
I agree with @mgoin that it would be nice to derive the routing type from existing arguments. Would it make more sense to have a collection of router objects/functions that could be passed in directly?
Added logic in FusedMoE to check the routing method from the given params.
Use by other backends etc. might need more discussion and design, which might be out of scope for this PR :)
@mgoin please let me know if this makes sense 😄
```python
if self.use_grouped_topk:
    self.routing_method_type = RoutingMethodType.DeepSeekV3
elif self.top_k == 1:
    self.routing_method_type = RoutingMethodType.Llama4
```
Not directly related to (or suggested for) this PR, but we could also set `apply_weights_on_input` to True for this case and get rid of the runtime parameter. cc @mgoin
What is this param used for? I searched the code base and only found `supports_apply_weight_on_input`, no `apply_weights_on_input`.
> What is this param used for? I searched the code base and only found `supports_apply_weight_on_input`, no `apply_weights_on_input`.
It's actually `apply_router_weight_on_input`. I just couldn't remember the exact name when I wrote the comment. Afaik, it is only used for Llama when topk==1, so I was wondering if we could detect and store it here while deriving the routing method. We could also remove it as an extra argument to `apply`. You don't need to make this change for this PR. I just wanted to point it out.
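For context, a minimal sketch of what applying the router weight on the input means for the top_k == 1 case; the function and tensor names are illustrative, not the actual vLLM implementation:

```python
import torch


def apply_router_weight_on_input(
    hidden_states: torch.Tensor,  # [num_tokens, hidden_size]
    topk_weights: torch.Tensor,   # [num_tokens, top_k]
) -> torch.Tensor:
    # Folding the routing weight into the activations before the expert
    # GEMMs is only equivalent to scaling the expert outputs afterwards
    # when top_k == 1 (the Llama 4 pattern).
    assert topk_weights.shape[-1] == 1, "requires top_k == 1"
    return hidden_states * topk_weights.to(hidden_states.dtype)
```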
@mgoin, could you help re-review?
Blocked by flashinfer-ai/flashinfer#2032; don't merge before the issue is fixed.
LGTM!
This pull request has merge conflicts that must be resolved before it can be merged.

Purpose
Test Plan
Qwen3-Next-80B-A3B-Instruct-FP8 on 2xB200 TP2
Qwen3-30B-A3B-Instruct-2507-FP8 on 2xB200 TP2
Test Result
Qwen3-Next-80B-A3B-Instruct-FP8
Qwen/Qwen3-30B-A3B-Instruct-2507-FP8
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.