[Bugfix] Add Eagle-3 Support for Qwen3 #2215
base: main
Conversation
Signed-off-by: yuminjun <ymjlvh@163.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Signed-off-by: yuminjun <ymjlvh@163.com>
Codecov Report
❌ Patch coverage is 11.53%. Your patch check has failed because the patch coverage (11.53%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:
@@ Coverage Diff @@
##             main    #2215      +/-   ##
==========================================
- Coverage   76.75%   76.62%   -0.14%
==========================================
  Files         113      113
  Lines       12743    12769      +26
==========================================
+ Hits         9781     9784       +3
- Misses       2962     2985      +23

View full report in Codecov by Sentry.
LGTM
@yyoean If the KV sizes are different, the hybrid_kv_cache_manager needs to be enabled. However, the vLLM code explicitly states that this is not supported (as shown in the figure below). I attempted to force modifications, but it resulted in a NotImplementedError. Therefore, I suggest you use Eagle-3 weights with the same KV dimensions, or I can further investigate whether this can be implemented, although the progress might be relatively slow. I'm also not sure why the AngelSlim team modified the KV size. @ApsarasX
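For illustration, here is a minimal sketch (assuming standard Hugging Face config.json fields; the model paths are placeholders) of how the target and draft KV geometries can be compared:

```python
# Sketch only: compare the KV-cache geometry of the target model and the
# Eagle-3 draft weights by reading their config.json files.
import json
from pathlib import Path

def kv_geometry(model_dir: str):
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    num_kv_heads = cfg.get("num_key_value_heads", cfg.get("num_attention_heads"))
    head_dim = cfg.get("head_dim", cfg["hidden_size"] // cfg["num_attention_heads"])
    return num_kv_heads, head_dim

# Placeholder paths for locally stored weights.
target = kv_geometry("model_path_of_Qwen3-32B")
draft = kv_geometry("model_path_of_Qwen3-32B_eagle3")
if target != draft:
    print(f"KV geometry differs: target={target}, draft={draft} "
          "- this is the case that would require hybrid_kv_cache_manager.")
```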
I use the AngelSlim/Qwen3-32B_eagle3.
@yyoean
server.sh (python -m vllm.entrypoints.openai.api_server) and benchmark.sh
@yyoean
Modifying the model arch is not allowed now, and we're working on removing all model arch code from vllm-ascend. Can you take another approach, or contribute to vLLM directly?
@wangxiyuan
@CSlearnerZM does vLLM work with qwen3_8b_eagle3 on GPU?
@wangxiyuan It looks like v0.10.1 already has native support.
@CSlearnerZM the native support doesn't change the model arch, so can we do the same as vLLM?
@wangxiyuan The Eagle-3 algorithm needs to obtain the hidden states of three of the model's intermediate layers, and this cannot be avoided; the official implementation in vLLM also involves this. Currently,
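This is not vLLM's actual code, but a toy sketch of the idea being described: the target model's forward pass records hidden states at a few fixed intermediate layers, and the Eagle-3 draft head consumes the concatenation of those states (the layer indices and sizes here are made up):

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    def __init__(self, num_layers=36, hidden=64, aux_layer_ids=(2, 18, 33)):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_layers))
        self.aux_layer_ids = set(aux_layer_ids)  # hypothetical "middle" layers

    def forward(self, x):
        aux_hidden_states = []
        for i, layer in enumerate(self.layers):
            x = torch.relu(layer(x))
            if i in self.aux_layer_ids:
                aux_hidden_states.append(x)  # captured for the Eagle-3 draft head
        # Eagle-3 concatenates the captured states per token and feeds them
        # to the draft model alongside the final hidden state.
        return x, torch.cat(aux_hidden_states, dim=-1)

final, aux = ToyDecoder()(torch.randn(4, 64))
print(final.shape, aux.shape)  # torch.Size([4, 64]) torch.Size([4, 192])
```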
We plan to drop CustomQwen3ForCausalLM in vllm-ascend and follow vLLM's original model arch in the next few weeks.
@wangxiyuan OK, if that's the case, then there's indeed no point in making the modification.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
@CSlearnerZM Have you tried any way to support it without changing the model file?
What this PR does / why we need it?
Adapts Eagle-3 to the Qwen3 series based on the current open-source weights of Eagle-3.
Weight source (accepted by the official Eagle team): qwen3_8b_eagle3
By modifying parts of the Qwen3 implementation, this PR ensures that the Eagle-3 weights corresponding to Qwen3-8B can be properly loaded and run, producing correct outputs.
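For reference, an offline-inference sketch that mirrors the serve command in the testing section below; it assumes a recent vLLM that accepts a speculative_config dict, and the model paths are placeholders:

```python
from vllm import LLM, SamplingParams

# Placeholder paths; the "eagle3" method and token count follow the serve command below.
llm = LLM(
    model="model_path_of_Qwen3-8B",
    max_model_len=2048,
    enforce_eager=True,
    speculative_config={
        "model": "model_path_of_qwen3_8b_eagle3",
        "method": "eagle3",
        "num_speculative_tokens": 2,
    },
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```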
Does this PR introduce any user-facing change?
1. Because the `SpeculativeConfig` of the vLLM framework limits Eagle-3 to running via llama, and no modification suitable for vllm-ascend has been found, users are required to make the following modifications to vLLM.
2. Since the open-source qwen3_8b_eagle3 is adapted for SGLang, its config.json file needs to be modified (see the illustrative sketch after this list).
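The exact config.json edits are not spelled out here, so the following is only a hypothetical illustration of inspecting (and, if needed, patching) the draft weights' config; the fields shown are standard Hugging Face keys, not a definitive list of what must change:

```python
import json
from pathlib import Path

cfg_path = Path("model_path_of_qwen3_8b_eagle3") / "config.json"  # placeholder path
cfg = json.loads(cfg_path.read_text())

# Fields worth checking when moving SGLang-oriented Eagle-3 weights to vLLM.
print(cfg.get("architectures"))        # class name the loader will try to resolve
print(cfg.get("num_key_value_heads"))  # should line up with the target model
print(cfg.get("head_dim"))

# Writing back an edited field (the value is a placeholder, not the required change):
# cfg["architectures"] = ["<architecture-expected-by-vllm>"]
# cfg_path.write_text(json.dumps(cfg, indent=2))
```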
How was this patch tested?
After making the modifications to vLLM mentioned in the second point, the patch can be tested using the test_eagle_correctness unit test in the vllm-ascend test suite. The test commands and results are as follows:
command:
results:

Service startup command and test results are as follows:
command:
vllm serve model_path_of_Qwen3-8B --host 0.0.0.0 --port 12345 --trust-remote-code --gpu-memory-utilization 0.90 --max-model-len 8192 --served-model-name qwen --speculative-config '{"model": "model_path_of_qwen3_8b_eagle3", "method": "eagle3", "num_speculative_tokens": 2}' --max_model_len 2048 --enforce_eager
result:
