
Conversation

@CSlearnerZM commented Aug 5, 2025

What this PR does / why we need it?

Adapts Eagle-3 to the Qwen3 series based on the currently available open-source Eagle-3 weights.

Weight source (accepted by the official Eagle team): qwen3_8b_eagle3

By modifying parts of the Qwen3 implementation, this PR ensures that the Eagle-3 weights corresponding to Qwen3-8B can be properly loaded and run, producing correct outputs.

Does this PR introduce any user-facing change?

1. Because vLLM's SpeculativeConfig restricts Eagle-3 to Llama models, and no workaround that stays within vllm-ascend has been found, users need to make the following modification to vLLM.

Note: the vLLM main branch has added support for recognizing Qwen (as of 2025-08-05), but this is not yet available in v0.10.0.

# path: vllm/config.py
# origin
if self.method == "eagle3" and self.target_model_config and \
    "llama" not in self.target_model_config.hf_text_config.model_type:
    raise ValueError(
        "Eagle3 is only supported for Llama models. "
        f"Got {self.target_model_config.hf_text_config.model_type=}")

# change
if self.method == "eagle3" and self.target_model_config and \
        ("llama" not in self.target_model_config.hf_text_config.model_type
         and "qwen3" not in self.target_model_config.hf_text_config.model_type):
    raise ValueError(
        "Eagle3 is only supported for Llama models. "
        f"Got {self.target_model_config.hf_text_config.model_type=}")

2. Since the open-source qwen3_8b_eagle3 weights are adapted for SGLang, the config.json file needs to be modified.

// origin
"architectures": [
    "LlamaForCausalLMEagle3"
],

// change
"architectures": [
    "LlamaForCausalLM"
],
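
If you prefer to script this edit instead of changing config.json by hand, a minimal sketch (the weight directory path is a placeholder):

# Minimal sketch: rewrite the "architectures" field of the Eagle-3 draft weights'
# config.json so that vLLM loads them as LlamaForCausalLM.
# "model_path_of_qwen3_8b_eagle3" is a placeholder for the local weight directory.
import json
from pathlib import Path

config_path = Path("model_path_of_qwen3_8b_eagle3") / "config.json"
config = json.loads(config_path.read_text())
config["architectures"] = ["LlamaForCausalLM"]
config_path.write_text(json.dumps(config, indent=2))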

How was this patch tested?

After applying the modifications to vLLM described in the previous section, the patch can be tested using the test_eagle_correctness unit test in the vllm-ascend test suite. The test command and results are as follows:

command:

pytest -sv tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py::test_eagle_correctness[eagle3]

results:
(screenshot: eagle3-for-qwen3 test output)

Service startup command and test results are as follows:

command:

vllm serve model_path_of_Qwen3-8B --host 0.0.0.0 --port 12345 --trust-remote-code --gpu-memory-utilization 0.90 --max-model-len 8192 --served-model-name qwen --speculative-config '{"model": "model_path_of_qwen3_8b_eagle3", "method": "eagle3", "num_speculative_tokens": 2}' --max_model_len 2048 --enforce_eager

result:
(screenshot: serving test results)
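
For anyone reproducing this, the running service can also be sanity-checked against the OpenAI-compatible completions endpoint (a minimal sketch; port 12345 and the served model name "qwen" follow the serve command above):

# Minimal sketch: send one completion request to the server started above.
import requests

resp = requests.post(
    "http://localhost:12345/v1/completions",
    json={
        "model": "qwen",           # --served-model-name from the serve command
        "prompt": "The future of AI is",
        "max_tokens": 64,
        "temperature": 0,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])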

Signed-off-by: yuminjun <ymjlvh@163.com>

github-actions bot commented Aug 5, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Signed-off-by: yuminjun <ymjlvh@163.com>

codecov bot commented Aug 5, 2025

Codecov Report

❌ Patch coverage is 11.53846% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.62%. Comparing base (126cdfc) to head (6c917ea).
⚠️ Report is 241 commits behind head on main.

Files with missing lines | Patch % | Lines
vllm_ascend/models/qwen3.py | 11.53% | 23 Missing ⚠️

❌ Your patch check has failed because the patch coverage (11.53%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2215      +/-   ##
==========================================
- Coverage   76.75%   76.62%   -0.14%     
==========================================
  Files         113      113              
  Lines       12743    12769      +26     
==========================================
+ Hits         9781     9784       +3     
- Misses       2962     2985      +23     
Flag | Coverage Δ
unittests | 76.62% <11.53%> (-0.14%) ⬇️

@ApsarasX (Collaborator) commented Aug 6, 2025

LGTM

@yyoean commented Aug 6, 2025

Hi, did you test Qwen3-32B Eagle-3? I used this PR and got an error: "ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type."
(screenshot: error traceback)

@CSlearnerZM (Author) commented Aug 7, 2025

@yyoean Hello, may I ask if you loaded the weights from the AngelSlim/Qwen3-32B_eagle3 repository? I found that the main cause of this issue is that the KV size configuration of that repository differs from Qwen3-32B's, as shown below.

Qwen3-32B: (screenshot of KV size configuration)

AngelSlim/Qwen3-32B_eagle3: (screenshot of KV size configuration)

If the KV sizes are different, the hybrid_kv_cache_manager needs to be enabled. However, the vllm code explicitly states that this is not supported (as shown in the figure below).
(screenshot: vLLM code)

I attempted to force modifications, but it resulted in a NotImplementedError. Therefore, I suggest you use Eagle-3 weights with the same KV dimensions, or I can further investigate whether this can be implemented, although the progress might be relatively slow. I'm also not sure why the AngelSlim team modified the KV size.
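
One quick way to catch this kind of mismatch before launching the server is to compare the KV-related fields of the two config.json files (a minimal sketch; the directory paths are placeholders and the field list is an assumption based on standard Hugging Face Qwen3 configs):

# Minimal sketch: compare KV-cache-relevant config fields of the target model and
# the Eagle-3 draft model. If they differ, vLLM cannot unify the KV cache specs
# while the hybrid KV cache manager is disabled. Both paths are placeholders.
import json
from pathlib import Path

def kv_fields(model_dir):
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    keys = ("num_key_value_heads", "head_dim", "num_attention_heads", "hidden_size")
    return {k: cfg.get(k) for k in keys}

target = kv_fields("model_path_of_Qwen3-32B")
draft = kv_fields("model_path_of_Qwen3-32B_eagle3")
for key, value in target.items():
    flag = "OK" if value == draft[key] else "MISMATCH"
    print(f"{key}: target={value} draft={draft[key]} [{flag}]")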

@ApsarasX
hello, I'd like to ask if the team has any plans to support hybrid_kv_cache_manager in the future? I'm new here and not quite sure who I should consult.

@yyoean commented Aug 7, 2025


I used the AngelSlim/Qwen3-32B_eagle3 weights.
Also, did you run the benchmark for Qwen3-8B? I can start the server and get responses from curl successfully, but when I try to run the benchmark, it fails with an error.

(screenshots: benchmark error logs)

@CSlearnerZM (Author) commented:

@yyoean Could you tell me the specific path of the benchmark script and the command used to start the vLLM model service? I'd like to try it as well.

@yyoean commented Aug 7, 2025


server.sh
export ASCEND_LAUNCH_BLOCKING=1
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=2
export VLLM_VERSION=0.9.1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

python -m vllm.entrypoints.openai.api_server \
    --model=xxxl/Qwen3-8B \
    --trust-remote-code \
    --served-model-name auto \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=4 \
    --enforce-eager \
    --max-num-seqs 48 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --no-enable-chunked-prefill \
    --no-enable-prefix-caching \
    --additional-config '{"expert_tensor_parallel_size":4,"ascend_scheduler_config":{"enabled":true}}' \
    --gpu-memory-utilization 0.9 \
    --speculative_config '{"method": "eagle3", "model": "xxx/qwen3_8b_eagle3", "num_speculative_tokens": 2, "max_model_len": 128}' \
    --enable-prompt-tokens-details &> run.log &
disown

benchmark.sh
python xxx/benchmark_serving.py \
    --backend vllm \
    --trust-remote-code \
    --model auto \
    --tokenizer xxx/Qwen3-8B \
    --dataset-name random \
    --random-input-len 4096 \
    --random-output-len 1536 \
    --ignore-eos \
    --num-prompts 100 \
    --max-concurrency 48 \
    --request-rate 0.7 \
    --metric-percentiles "50,90,99" \
    --base-url http://localhost:8006

@CSlearnerZM (Author) commented:

@yyoean
Hello, I've done some initial troubleshooting on this issue, and the problem doesn't appear to be in this PR's code. At the moment I suspect it's related to paged attention; it's proving somewhat challenging and I haven't been able to resolve it yet 😞. This issue also existed previously with the originally supported Llama models, so it's probably a common problem arising from the Eagle-3 adaptation. I'll continue investigating, but it might take some more time. If you have any thoughts, please feel free to reach out. Thanks!

@wangxiyuan (Collaborator) commented:

Modifying the model arch is not allowed now, and we're working on removing all model arch code from vllm-ascend. Can you find another way, or contribute to vLLM directly?

@CSlearnerZM (Author) commented:

@wangxiyuan
Thank you for your reply. Since Eagle-3 needs to obtain intermediate results during the forward pass, the algorithm will inevitably require modifications to the model implementation files. Currently, I don't have a better approach.

@wangxiyuan (Collaborator) commented:

@CSlearnerZM does vLLM work with qwen3_8b_eagle3 on GPU?

@CSlearnerZM (Author) commented:

@wangxiyuan It looks like version v0.10.1 already has native support.
(screenshot)

@wangxiyuan (Collaborator) commented:

@CSlearnerZM the native support doesn't change the model arch, so can we do the same as vLLM?

@CSlearnerZM (Author) commented:

@wangxiyuan The Eagle-3 algorithm needs the hidden states of three of the model's intermediate layers, and this cannot be avoided; vLLM's official implementation also involves this. Since Qwen2Model in vLLM already provides part of the needed functionality, CustomQwen3Model in vllm-ascend indeed does not require modification. However, CustomQwen3ForCausalLM still needs the set_aux_hidden_state_layers and get_eagle3_aux_hidden_state_layers functions added, unless it inherits from vLLM's Qwen3ForCausalLM class.
After a preliminary review, the inheritance approach should be feasible. Should I proceed with the modifications and testing? I'd also like to ask: will the CustomQwen3ForCausalLM class still be retained in vllm-ascend going forward?
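
For reference, the inheritance approach might look roughly like the sketch below. This is only a sketch, not the final implementation: it assumes the model underlying vLLM's Qwen3ForCausalLM already honors an aux_hidden_state_layers attribute (the partial functionality mentioned above), and the exact method signatures may differ across vLLM versions.

# Sketch only: adapt vllm-ascend's Qwen3 model for Eagle-3 by inheriting from
# vLLM's Qwen3ForCausalLM and adding just the two hooks discussed above.
from vllm.model_executor.models.qwen3 import Qwen3ForCausalLM


class CustomQwen3ForCausalLM(Qwen3ForCausalLM):

    def set_aux_hidden_state_layers(self, layers):
        # Record which decoder layers should also expose their hidden states
        # during the forward pass for the Eagle-3 draft model.
        self.model.aux_hidden_state_layers = layers

    def get_eagle3_aux_hidden_state_layers(self):
        # Pick three intermediate layers (early / middle / late), matching the
        # "three intermediate layers" mentioned in the discussion above.
        num_layers = len(self.model.layers)
        return (2, num_layers // 2, num_layers - 2)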

@wangxiyuan (Collaborator) commented:

will the CustomQwen3ForCausalLM class still be retained in vllm-ascend going forward?

We plan to drop CustomQwen3ForCausalLM in vllm-ascend and follow vLLM's original model arch in the next few weeks.

@CSlearnerZM (Author) commented:


@wangxiyuan OK. If that's the case, then there's indeed no point in making the modification.

@wangxiyuan added the guide label Aug 28, 2025

This pull request has conflicts; please resolve them before we can evaluate the pull request.

@wangxiyuan (Collaborator) commented:

@CSlearnerZM Have you tried any way to support it without changing the model files?
