[BugFix] Add additional check for mc2 on different hardware. #3429
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a valuable hardware-specific check for the MC2 communication method in MoE models, preventing potential runtime issues by enforcing token limits on different Ascend hardware versions. The logic for calculating token capacity and raising an informative error is sound. The change to restrict a related profiling run to only MoE models is also appropriate. I have one suggestion to improve the maintainability of the hardware limit implementation.
limit = None
if soc_version in {AscendSocVersion.A3}:
    limit = 512
elif soc_version in {AscendSocVersion.A2}:
    limit = 256
The use of an if/elif chain to determine the token limit for different SoC versions is functional but can be improved for better maintainability and extensibility. As more hardware variants are supported in the future, this chain will grow, increasing complexity and the chance of errors. A dictionary mapping SoC versions to their limits would be a cleaner and more scalable approach.
Suggested change:
- limit = None
- if soc_version in {AscendSocVersion.A3}:
-     limit = 512
- elif soc_version in {AscendSocVersion.A2}:
-     limit = 256
+ limit_map = {
+     AscendSocVersion.A3: 512,
+     AscendSocVersion.A2: 256,
+ }
+ limit = limit_map.get(soc_version)
    limit = 256

if limit is not None and num_tokens_per_tp_rank > limit:
    raise ValueError(
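For context, here is a minimal sketch of how the hardware-specific check could look once the dictionary suggestion is applied. The names soc_version, num_tokens_per_tp_rank, AscendSocVersion, and the import path are taken from the diff above; the error message and the helper name check_mc2_token_limit are illustrative, not the PR's actual code.

from vllm_ascend.utils import AscendSocVersion  # import path assumed for illustration

# Per-SoC upper bound on tokens per TP rank that the MC2 kernel accepts.
MC2_TOKEN_LIMITS = {
    AscendSocVersion.A3: 512,
    AscendSocVersion.A2: 256,
}

def check_mc2_token_limit(soc_version, num_tokens_per_tp_rank):
    """Fail fast when the MC2 kernel cannot handle this many tokens per TP rank."""
    limit = MC2_TOKEN_LIMITS.get(soc_version)  # None means no known restriction
    if limit is not None and num_tokens_per_tp_rank > limit:
        raise ValueError(
            f"MC2 supports at most {limit} tokens per TP rank on {soc_version}, "
            f"but got {num_tokens_per_tp_rank}; reduce max_num_seqs or "
            f"max_num_batched_tokens, or choose another MoE communication method.")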
Perhaps in this scenario it would be better to fall back to other MoE communication methods that support the current token size (such as all2all) rather than just reporting an error? I think it is weird to force users to use a smaller max_num_seqs because of a kernel restriction.
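A rough sketch of this fallback idea, reusing the MC2_TOKEN_LIMITS mapping from the sketch above and assuming the MoECommType enum has MC2 and ALLTOALL members (the member name ALLTOALL and the selection logic are assumptions, not what the PR implements):

def select_moe_comm_method(soc_version, num_tokens_per_tp_rank):
    """Prefer MC2, but fall back to all-to-all when the token count exceeds
    the per-SoC MC2 limit, instead of raising an error."""
    limit = MC2_TOKEN_LIMITS.get(soc_version)
    if limit is None or num_tokens_per_tp_rank <= limit:
        return MoECommType.MC2
    # MC2 cannot handle this many tokens on this SoC; all-to-all has no such
    # restriction, so switch methods instead of forcing a smaller max_num_seqs.
    return MoECommType.ALLTOALL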
It seems we can make a better selection here. But the problem is that aclgraph mode requires a unified communication schema, so we need to ensure that each communication op in a model stays unchanged across different capture sizes. cc @yiz-liu
This pull request has conflicts, please resolve those before we can evaluate the pull request.
        self.mc2_tokens_capacity,
        with_prefill=True) == MoECommType.MC2:
    self._dummy_run(self.mc2_tokens_capacity, with_prefill=True)
if self._is_moe_model():
You may replace this with model_config.get_num_experts().
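If model_config.get_num_experts() is available as suggested (I have not verified that this accessor exists in the vLLM version the PR targets), the check could look roughly like this; the defensive getattr fallback is my own addition:

def _is_moe_model(self):
    # Hypothetical: relies on the reviewer-suggested accessor. If this vLLM
    # version does not expose get_num_experts(), keep the existing check.
    get_num_experts = getattr(self.model_config, "get_num_experts", None)
    if get_num_experts is None:
        return False
    num_experts = get_num_experts()
    return num_experts is not None and num_experts > 0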
            kv_connector_output=kv_connector_output,
        )

    def _is_moe_model(self):
Use vllm_ascend.utils.is_moe_model instead.
# Since not all models have MoE modules and require MC2, we defer
# the initialization of MC2-related parameters until later.
self.mc2_tokens_capacity = 0
self.reserved_mc2_mask = None
nice change
if self._is_moe_model():
    self._initialize_mc2()
if self.max_num_tokens > self.mc2_tokens_capacity and \
        self._select_moe_comm_method(
_select_moe_comm_method has already been called in _initialize_mc2 and is called again in _dummy_run. You should consider merging them into one.
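One possible shape for that merge, reusing the method names visible in the diff (_initialize_mc2, _select_moe_comm_method, _dummy_run); _compute_mc2_tokens_capacity and the cached moe_comm_method attribute are hypothetical, and the real signatures in the PR may differ:

def _initialize_mc2(self):
    """Set up MC2 parameters and pick the comm method once, so callers such as
    _dummy_run can reuse the cached decision instead of re-selecting it."""
    self.mc2_tokens_capacity = self._compute_mc2_tokens_capacity()  # hypothetical helper
    self.moe_comm_method = self._select_moe_comm_method(
        self.mc2_tokens_capacity, with_prefill=True)
    if self.moe_comm_method == MoECommType.MC2:
        self._dummy_run(self.mc2_tokens_capacity, with_prefill=True)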
This pull request has conflicts, please resolve those before we can evaluate the pull request.
- def _init_mc2_tokens_capacity(self):
+ def _init_mc2(self):
+     """Initialize MC2-related parameters and verify their validity."""
Need to change _init_mc2_tokens_capacity in torchair_model_runner.py as well.
Sure, I will revert this change.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Angazenn <supperccell@163.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
This is an extension of #3411. A hardware-specific restriction for MC2 is added. In addition, I limit the scenarios where this check takes effect to MoE models that might use MC2.
Does this PR introduce any user-facing change?
For situations where MC2 does not support an excessive number of input tokens, we now throw an error earlier in vLLM-Ascend.
How was this patch tested?