[Bugfix] Fix aclgraph sizes capture error in Qwen3 Moe model case #2827
Conversation
Code Review

This pull request addresses an ACLgraph size capture error for MoE models with data parallelism by adjusting the stream buffer size and accounting for extra stream usage in MoE models. The changes look reasonable, but I've identified a potential issue in the newly introduced `is_moe_model` function. The current implementation for detecting MoE models is too broad and could lead to false positives. I've suggested a more robust implementation to improve accuracy.
```python
def is_moe_model(vllm_config: VllmConfig):
    config = vllm_config.model_config.hf_config
    return any('experts' in key.lower() for key in config.to_dict())
```
The current implementation detects a Mixture-of-Experts (MoE) model by checking whether 'experts' is a substring of any configuration key. This approach is fragile and might lead to false positives if a non-MoE model has a configuration key containing this substring (e.g., `use_shared_experts`). This could cause incorrect logic to be applied, potentially leading to suboptimal performance or runtime errors.

A more robust approach would be to check for specific attributes that are characteristic of a MoE architecture, such as `num_experts` and `num_experts_per_tok`, and ensure they have meaningful values. This will make the detection more accurate and prevent incorrect behavior.
Suggested change:

```python
def is_moe_model(vllm_config: VllmConfig):
    config = vllm_config.model_config.hf_config
    # A more robust check for MoE models by verifying specific attributes
    # that are characteristic of such architectures.
    return (hasattr(config, "num_experts") and
            isinstance(config.num_experts, int) and config.num_experts > 0 and
            hasattr(config, "num_experts_per_tok"))
```
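To make the false-positive concern concrete, here is a minimal, self-contained sketch; the `SimpleNamespace` objects below are hypothetical stand-ins for `hf_config`, not real model configs:

```python
from types import SimpleNamespace

# Hypothetical configs, for illustration only; real hf_config objects come
# from transformers and carry many more fields.
dense_cfg = SimpleNamespace(use_shared_experts=False, num_hidden_layers=48)
moe_cfg = SimpleNamespace(num_experts=128, num_experts_per_tok=8, num_hidden_layers=48)

def substring_check(config) -> bool:
    # Analogous to the current implementation: any key containing 'experts'.
    return any('experts' in key.lower() for key in vars(config))

def attribute_check(config) -> bool:
    # Analogous to the suggested implementation: MoE-specific attributes.
    return (hasattr(config, "num_experts") and
            isinstance(config.num_experts, int) and config.num_experts > 0 and
            hasattr(config, "num_experts_per_tok"))

print(substring_check(dense_cfg), attribute_check(dense_cfg))  # True False -> false positive avoided
print(substring_check(moe_cfg), attribute_check(moe_cfg))      # True True
```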
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Force-pushed from 79698d9 to d7755b8.
Force-pushed from fd26cec to 917f4de.
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
```python
# TODO: Find out whether we need to take into account the pp_size
parallel_factor = 1 + num_comm_groups + int(
    parallel_config.enable_expert_parallel)
if is_moe_model(vllm_config):
```
We still need to check whether it is a MoE model, right?
Yeah, for MoE, we'll have to do `prepare` and `finalize` around `quant_method.apply`, which will have extra communication.
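To illustrate where that extra communication sits, here is a minimal runnable sketch of the prepare → apply → finalize pattern; the function bodies are placeholder stubs for illustration, not the vLLM or vLLM Ascend API:

```python
import torch

def prepare(hidden_states: torch.Tensor) -> torch.Tensor:
    # In a real MoE layer with DP/EP, this step dispatches tokens to the
    # ranks owning their routed experts (an all-to-all) -- the extra
    # communication dense models do not have. Stubbed as a no-op here.
    return hidden_states

def apply_experts(hidden_states: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Stand-in for quant_method.apply: the local expert computation.
    return hidden_states @ w

def finalize(expert_out: torch.Tensor) -> torch.Tensor:
    # In a real layer, this combines results back to the original token
    # layout (a second all-to-all). Stubbed as a no-op here.
    return expert_out

x = torch.randn(4, 8)
w = torch.randn(8, 8)
y = finalize(apply_experts(prepare(x), w))
print(y.shape)  # torch.Size([4, 8])
```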
UT failed due to some known issues, which are unrelated to this PR. Let's merge this after the full e2e test passes.
Please rewrite the PR title as a meaningful one. I didn't get any human-readable info from the original title.
Already done.
```python
    parallel_factor += (parallel_config.data_parallel_size > 1)
# Calculate maximum supported batch sizes considering model architecture on the A2 Hardware Device
# Assume the following case:
# MAX_CAPTURE_SIZE = 1920, num_hidden_layers = 48, data_parallel_size is 1, tensor_parallel_size is 4,
```
Also update the note.
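For context when updating that note, here is a rough sketch of how the factor works out once DP is enabled on a MoE model; the concrete values (e.g. `num_comm_groups = 2`) are assumptions for illustration, not taken from this PR:

```python
# Assumed example values, not from the PR: a Qwen3-MoE run with DP enabled.
num_comm_groups = 2                # e.g. TP and DP communication groups (assumption)
enable_expert_parallel = False
data_parallel_size = 2
is_moe = True

parallel_factor = 1 + num_comm_groups + int(enable_expert_parallel)
if is_moe:
    # The new branch: DP on a MoE model uses one extra stream.
    parallel_factor += int(data_parallel_size > 1)

print(parallel_factor)  # 4 under these assumed values
```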
### What this PR does / why we need it?
1. Fixes the problem that, in the Qwen3 MoE model case, enabling DP uses an extra stream, causing an ACLgraph sizes capture error.
2. Experiments showed that in many cases some operators occupy more streams than expected, so the stream buffer reserved for ACLgraph was not large enough. After discussion, an extra 120 streams were added as buffer.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: main
- vLLM main: vllm-project/vllm@0ae43db

Signed-off-by: lilinsiman <lilinsiman@gmail.com>