
Conversation

@wxsIcey (Collaborator) commented Sep 25, 2025

What this PR does / why we need it?

Fixes an `attn_mask` index out of range error that appears under high concurrency by pre-allocating the attention mask cache to `max_model_len` instead of resizing it dynamically (see the review summary and the linked issue below).
Does this PR introduce any user-facing change?

N/A

How was this patch tested?

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name Eagle3 \
    --model Qwen/Qwen3-8B \
    --seed 42 \
    -tp 1 \
    --speculative_config '{"model": "Tengyunw/qwen3_8b_eagle3", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 5, "method": "eagle3"}'

Co-authored-by: liuruijin17 <ricklrj@outlook.com>


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request addresses an index out of range error related to attn_mask in high-concurrency scenarios. The fix involves pre-allocating the attention mask cache to its maximum possible size by using max_model_len, which effectively prevents race conditions during dynamic resizing. This also simplifies the code by removing a dependency on an environment variable and its corresponding os import. While the change is effective, I have raised a concern about the potential for increased memory consumption, which could be significant for models with a large max_model_len.
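For illustration, a minimal sketch of the pattern the review describes; the class and method names are hypothetical, not vllm-ascend's actual API. The idea is to allocate the full-size mask once at construction, so request handling only ever takes read-only slices and no resize path remains to race on:

```python
# Minimal sketch of the fix pattern (names are hypothetical, not
# vllm-ascend's actual code): pre-allocate the full causal mask once,
# then serve every request with a slice of it.
import torch


class AttentionMaskBuilder:
    def __init__(self, max_model_len: int, dtype: torch.dtype = torch.bool):
        # Allocate the (max_model_len, max_model_len) mask up front; True
        # marks positions a token may NOT attend to (strict upper triangle),
        # a common convention for boolean attention masks.
        self._cache = torch.triu(
            torch.ones(max_model_len, max_model_len, dtype=dtype),
            diagonal=1,
        )

    def get_mask(self, seq_len: int) -> torch.Tensor:
        # Slicing a pre-sized tensor is a read-only view; there is no
        # growth path left for concurrent requests to race on, and no
        # index can exceed the cache's bounds.
        return self._cache[:seq_len, :seq_len]
```

The memory concern flagged above follows directly from this shape: a boolean mask costs max_model_len² bytes, e.g. 32768² bytes = 1 GiB at a 32K context length.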

@wangxiyuan (Collaborator) commented:

@liuruijin17 can you take a look at this PR? Thanks.

@wangxiyuan added the `ready` (read for review) and `ready-for-test` (start test by label for PR) labels on Sep 26, 2025
@wxsIcey (Collaborator, Author) commented Sep 28, 2025

Please see: https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L333
In the model runner, `vllm_config.model_config.max_model_len` is also used when constructing the `AttentionMask`, so sizing the mask cache from it is consistent with existing behavior.
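For illustration only, a hedged sketch of the call-site pattern being referenced, reusing the hypothetical `AttentionMaskBuilder` from the sketch above; the real constructor arguments at the linked model_runner_v1.py line may differ:

```python
# Hypothetical call site mirroring the referenced model_runner_v1.py line:
# the mask cache is sized from the model config's max_model_len, the same
# value the fix now uses for pre-allocation.
max_model_len = 32768  # assumed context length; vLLM reads this from the model config

mask_builder = AttentionMaskBuilder(max_model_len=max_model_len)
mask = mask_builder.get_mask(seq_len=1024)  # read-only slice, never resizes
```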

Development

Successfully merging this pull request may close these issues.

[Bug]: Multiple calls (maybe >100) to eagle3-qwen3-8b often incurs "attn_mask index out of range"