support FULL graph mode in Qwen #3369
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces support for the FULL_AND_PIECEWISE graph compilation mode. The changes involve updating the attention mechanism to use npu_fused_infer_attention_score, modifying the platform configuration, and adjusting the model runner. I've found a few critical issues in the implementation: an incorrect assertion in platform.py that will always fail, a misleading log message with missing logic, and an incorrect calculation of query_start_loc in the model runner's dummy run that will break graph capture. These issues need to be addressed to ensure the new functionality works correctly.
vllm_ascend/platform.py
Outdated
assert compilation_config.level == CompilationLevel.PIECEWISE, \
    "When enabling piecewise aclgraph, please make sure compilation_config.level == CompilationLevel.PIECEWISE and compilation_config.cudagraph_mode == CUDAGraphMode.PIECEWISE"
The assertion on line 242 will always fail because it checks if compilation_config.cudagraph_mode is CUDAGraphMode.PIECEWISE, but this code block is only executed when compilation_config.cudagraph_mode is CUDAGraphMode.FULL_AND_PIECEWISE. This will prevent the FULL_AND_PIECEWISE mode from working.
assert compilation_config.level == CompilationLevel.PIECEWISE, \
    "When enabling piecewise aclgraph, please make sure compilation_config.level == CompilationLevel.PIECEWISE and compilation_config.cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEWISE"

self.query_start_loc[:num_reqs + 1] = num_tokens
self.query_start_loc_cpu[:num_reqs + 1] = num_tokens
self.query_start_loc and self.query_start_loc_cpu are being incorrectly set to a scalar value num_tokens. These tensors are expected to store the cumulative sum of token lengths for each request in the batch. Broadcasting a scalar value will result in incorrect query_start_loc values, which will likely cause errors or incorrect behavior during graph capture and profiling dummy runs. The correct logic should compute the cumulative sum of tokens per request, similar to how it's handled in the _prepare_inputs method.
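For illustration, a minimal sketch of the intended cumulative-sum logic (not the PR's actual code). It assumes the dummy run spreads `num_tokens` evenly across `num_reqs` requests; the real fix should mirror what `_prepare_inputs` does.

```python
# Minimal sketch: build cumulative offsets instead of broadcasting a scalar.
# Assumes the dummy run distributes num_tokens evenly over num_reqs.
import torch

num_reqs, num_tokens = 4, 10
tokens_per_req = torch.full((num_reqs,), num_tokens // num_reqs, dtype=torch.int32)
tokens_per_req[:num_tokens % num_reqs] += 1  # distribute the remainder

query_start_loc_cpu = torch.zeros(num_reqs + 1, dtype=torch.int32)
query_start_loc_cpu[1:] = torch.cumsum(tokens_per_req, dim=0)
print(query_start_loc_cpu)  # tensor([ 0,  3,  6,  8, 10], dtype=torch.int32)
```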
vllm_ascend/platform.py
Outdated
if compilation_config.level == CompilationLevel.PIECEWISE:
    logger.warning(
        "NEW NPU does not support %s compilation level. Setting CUDAGraphMode to NONE",
        compilation_config.level)
The log message at line 183 is misleading. It states "Setting CUDAGraphMode to NONE", but the code does not actually modify compilation_config.cudagraph_mode. This can cause confusion and incorrect behavior if PIECEWISE compilation level is not supported. Additionally, "NEW NPU" appears to be a typo and should likely be "NPU".
if compilation_config.level == CompilationLevel.PIECEWISE:
    logger.warning(
        "NPU does not support %s compilation level. Setting CUDAGraphMode to NONE",
        compilation_config.level)
    compilation_config.cudagraph_mode = CUDAGraphMode.NONE
This pull request has conflicts, please resolve those before we can evaluate the pull request.
graph_params.attn_params[num_tokens_origin].append((
    query,
    self.key_cache,
    self.value_cache,
    key,
    value,
    attn_metadata.block_tables,
    block_size,
    seq_lens,
    query_start_loc,
    self.num_kv_heads,
    self.num_heads,
    self.scale,
    attn_metadata.block_tables,
    attn_metadata.seq_lens,
    output,
    attn_output,
    softmax_lse
))
Please use weak_ref_tensor like #3331.
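For reference, a rough sketch of what this asks for, assuming `weak_ref_tensor` is importable from `vllm.utils` (the exact helper and its location in vllm-ascend and in #3331 may differ): wrap each captured tensor in a weak reference before stashing it in `graph_params`, so the saved parameters do not keep the original buffers alive across graph captures.

```python
# Rough sketch only; the weak_ref_tensor import path is an assumption.
import torch
from vllm.utils import weak_ref_tensor

def stash_attn_params(param_list: list, *tensors: torch.Tensor) -> None:
    """Append weak-referenced views so graph capture does not pin the buffers."""
    param_list.append(tuple(weak_ref_tensor(t) for t in tensors))

# Usage with dummy tensors standing in for query / key_cache / value_cache / ...
params: list = []
stash_attn_params(params, torch.zeros(4, 8), torch.zeros(16, 8))
```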
output = torch.empty_like(query)
softmax_lse = torch.empty(num_tokens,
                          dtype=query.dtype,
                          device=query.device)
query_start_loc = attn_metadata.query_start_loc[1:].cpu().int().tolist()
seq_lens = attn_metadata.seq_lens.cpu().int().tolist()
Try not to create or modify these parameters in forward(); refactor them into prepare_inputs (or somewhere similar) so we only do this once per step, not once per layer.
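A hedged sketch of that refactor, using hypothetical names (`AttnStepMeta`, `build_step_meta`): do the device-to-host conversions once per step while building the attention metadata, and let every layer's forward just read the cached lists.

```python
# Hedged sketch; AttnStepMeta / build_step_meta are illustrative names only,
# not the PR's actual classes.
from dataclasses import dataclass
from typing import List

import torch

@dataclass
class AttnStepMeta:
    query_start_loc_list: List[int]
    seq_lens_list: List[int]

def build_step_meta(query_start_loc: torch.Tensor,
                    seq_lens: torch.Tensor) -> AttnStepMeta:
    # One device->host sync per step instead of one per attention layer.
    return AttnStepMeta(
        query_start_loc_list=query_start_loc[1:].cpu().int().tolist(),
        seq_lens_list=seq_lens.cpu().int().tolist(),
    )

# forward() would then read attn_metadata.query_start_loc_list /
# attn_metadata.seq_lens_list instead of recomputing them per layer.
```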
force-pushed from ee7e275 to 8fe8978
force-pushed from 0f6e3d6 to 81291e9
force-pushed from f9ad80d to bcda351
This pull request has conflicts, please resolve those before we can evaluate the pull request.
seq_lens: torch.Tensor = None

query_start_loc: torch.Tensor = None
seq_lens_list: List[int] = None
Please add a note explaining these new fields.
    num_block, block_size, -1)
value = self.value_cache.view(  # type: ignore
    num_block, block_size, -1)
softmax_lse = torch.empty(num_tokens,
Move the softmax_lse allocation to __init__ so it is created once instead of on every forward.
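A minimal sketch of that change, with assumed names (`AttentionImplSketch` and `max_num_tokens` are stand-ins): allocate the buffer once in `__init__` and slice it per call.

```python
# Minimal sketch; the class and max_num_tokens are illustrative, not the PR's code.
import torch

class AttentionImplSketch:
    def __init__(self, max_num_tokens: int, dtype: torch.dtype, device: str):
        # Allocated once, reused by every forward call.
        self._softmax_lse = torch.empty(max_num_tokens, dtype=dtype, device=device)

    def lse_for(self, num_tokens: int) -> torch.Tensor:
        return self._softmax_lse[:num_tokens]

impl = AttentionImplSketch(max_num_tokens=4096, dtype=torch.float32, device="cpu")
lse = impl.lse_for(128)  # shape: (128,)
```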
This pull request has conflicts, please resolve those before we can evaluate the pull request.
forward_context: ForwardContext = get_forward_context()
if torch.version.cann.startswith("8.3"):
    if forward_context.capturing:
        output = self.full_graph_attention(query, key, value, attn_metadata, 128, output)
Why hard-code block_size to 128 here?
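A possible alternative, sketched under the assumption that the cache config is reachable here (for example via `get_current_vllm_config()`, which vLLM provides): read the configured kv-cache block size and fall back to the old literal only if it is unset.

```python
# Hedged sketch; whether this attention path has (or should be given) access
# to the vLLM config object at this point is an assumption.
from vllm.config import get_current_vllm_config

def resolve_block_size(default: int = 128) -> int:
    cache_config = get_current_vllm_config().cache_config
    block_size = getattr(cache_config, "block_size", None)
    return block_size if block_size else default  # fall back to the old literal
```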
This pull request has conflicts, please resolve those before we can evaluate the pull request.
force-pushed from 55bee48 to bdb15fb
This pull request has conflicts, please resolve those before we can evaluate the pull request.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
force-pushed from a98f46f to 81e3e8d
force-pushed from 55e6279 to a6354ec
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Moved to PR: https://github.com/vllm-project/vllm-ascend/actions/runs/19228494454/job/54961305233?pr=3970
What this PR does / why we need it?
The current library only supports the FullDecodeOnly graph mode, which enables full graph execution only during the decode phase. This PR extends that support to allow full graph execution during both the prefill and decode phases, referred to as FULL graph mode.

support FULL graph mode:
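For context, a hypothetical offline-inference snippet showing how a user might request full-graph capture through vLLM's `compilation_config`; the accepted mode strings and their behaviour on Ascend depend on the vLLM / vllm-ascend versions, so treat the exact knob values as assumptions rather than documentation of this PR.

```python
# Hypothetical usage sketch; the "FULL" cudagraph_mode value and its Ascend
# behaviour are assumptions, not taken from this PR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    compilation_config={"cudagraph_mode": "FULL"},
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```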
Does this PR introduce any user-facing change?
How was this patch tested?