[WIP][BugFix] Fix accuracy issues caused by wrong etp_size passed into FusedMoEParallelConfig when using vLLM 0.9.0 #961
What this PR does / why we need it?
This PR fixes accuracy issues caused by the code that adapts to `FusedMoEParallelConfig` in vLLM 0.9.0: the `tp_size` used to split the weights is passed incorrectly. The root cause is that the vLLM community and vLLM-Ascend use different methods to decide whether to use Expert Parallel (EP).

vLLM:
vLLM uses a flag, `enable_expert_parallel`, to indicate whether to use EP, and decides `ep_size` with the logic sketched below.
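A minimal sketch of that decision, paraphrased from vLLM 0.9.0's `FusedMoEParallelConfig.make` (simplified, not the verbatim upstream code):

```python
# Sketch of how vLLM 0.9.0 decides ep_size from enable_expert_parallel
# inside FusedMoEParallelConfig.make (simplified).
def decide_vllm_moe_parallel(tp_size: int, tp_rank: int,
                             dp_size: int, dp_rank: int,
                             enable_expert_parallel: bool) -> dict:
    use_ep = dp_size * tp_size > 1 and enable_expert_parallel
    if not use_ep:
        # No EP: the MoE layer keeps its TP split and EP stays trivial.
        return {"tp_size": tp_size, "tp_rank": tp_rank,
                "ep_size": 1, "ep_rank": 0, "use_ep": False}
    # EP: the TP and DP ranks are flattened into one EP group, and the
    # MoE layer itself stops applying TP (its tp_size collapses to 1).
    return {"tp_size": 1, "tp_rank": 0,
            "ep_size": tp_size * dp_size,
            "ep_rank": tp_rank + tp_size * dp_rank,
            "use_ep": True}
```

For example, with `tp_size=4`, `dp_size=2`, and `enable_expert_parallel=True`, every MoE layer ends up with `ep_size=8` and `tp_size=1`.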
vLLM-Ascend:
vLLM-Ascend uses `etp` to specify the Tensor Parallel size inside MoE layers (see the sketch below). So there will be conflicts if we simply combine these two pieces of code together.
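A hypothetical sketch of the vLLM-Ascend convention; the variable names here are illustrative, not the exact identifiers in the repo:

```python
# Hypothetical sketch: vLLM-Ascend derives ep_size from an explicit
# etp_size (expert tensor parallel size) rather than from a boolean flag.
def decide_ascend_moe_parallel(world_size: int, etp_size: int) -> dict:
    assert world_size % etp_size == 0, "etp_size must divide world_size"
    return {
        "etp_size": etp_size,               # TP degree used to split MoE weights
        "ep_size": world_size // etp_size,  # remaining ranks form the EP group
    }
```

If the Ascend-side `etp_size` is then passed into `FusedMoEParallelConfig` as if it were the upstream `tp_size` (as this PR's title describes), the upstream `use_ep` branch above makes a different decision than vLLM-Ascend intended, and the MoE weights are split with the wrong size, causing the accuracy drop.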
Does this PR introduce any user-facing change?
How was this patch tested?