MLA layer eliminates redundant index operators #993
Conversation
- Please add a PR description
- Run `bash format.sh` locally to fix the lint failures
vllm_ascend/attention/attention.py
Outdated

```python
q_pe = self.rope_single(q_pe, cos, sin)
k_pe, k_nope = self.exec_kv(hidden_states_or_kv_c_normed, cos, sin,
if self.layer_idx == 0 or self.cos is None or self.sin is None:
```
Could you add a comment explaining why `self.cos` and `self.sin` are updated only when `layer_idx == 0`?
During autoregressive decoding, the cos and sin values are exactly the same for every layer (e.g., all 61 layers). Therefore they only need to be computed in the first layer, and subsequent layers can reuse them directly.
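For illustration, a minimal, self-contained sketch of this reuse pattern (class and attribute names are illustrative, not the PR's exact code):

```python
import torch


class MLALayer:
    # Cache shared by all layer instances; only the first layer fills it.
    _cos: torch.Tensor = None
    _sin: torch.Tensor = None

    def __init__(self, layer_idx: int, cos_table: torch.Tensor,
                 sin_table: torch.Tensor):
        self.layer_idx = layer_idx
        self.cos_table = cos_table  # [max_pos, rot_dim // 2], precomputed
        self.sin_table = sin_table

    def cos_sin(self, positions: torch.Tensor):
        # The gather (IndexByTensor) depends only on `positions`, which is
        # identical for every layer within one decode step, so run it once.
        if self.layer_idx == 0 or MLALayer._cos is None:
            MLALayer._cos = self.cos_table[positions]
            MLALayer._sin = self.sin_table[positions]
        return MLALayer._cos, MLALayer._sin


# Usage: 61 layers all see the same positions during a decode step.
max_pos, rot_dim, num_layers = 4096, 64, 61
inv_freq = 1.0 / 10000 ** (torch.arange(0, rot_dim, 2).float() / rot_dim)
angles = torch.arange(max_pos).unsqueeze(1).float() * inv_freq
cos_table, sin_table = angles.cos(), angles.sin()
layers = [MLALayer(i, cos_table, sin_table) for i in range(num_layers)]
positions = torch.tensor([17, 42])  # next-token positions for a batch of 2
for layer in layers:
    cos, sin = layer.cos_sin(positions)  # index ops execute only in layer 0
```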
Thanks for the explanation; let's add this as a comment in the code.
fixed
vllm_ascend/models/deepseek_v2.py
Outdated

```python
kv_a_layernorm=self.kv_a_layernorm,
kv_b_proj=self.kv_b_proj,
o_proj=self.o_proj,
ascend_prefix=prefix,
```
I recommend passing `debug_layer_idx` instead of `ascend_prefix`.
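A hypothetical sketch of that suggestion — derive the layer index from the module prefix once at construction time and pass the integer down (the helper below is illustrative, not the PR's actual code):

```python
def extract_layer_index(prefix: str) -> int:
    # e.g. "model.layers.7.self_attn" -> 7; fall back to 0 otherwise.
    for token in prefix.split("."):
        if token.isdigit():
            return int(token)
    return 0


# Instead of ascend_prefix=prefix, the attention module would receive
#     debug_layer_idx=extract_layer_index(prefix),
# so the attention code can test `self.debug_layer_idx == 0` directly.
assert extract_layer_index("model.layers.7.self_attn") == 7
```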
fixed
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Tested with `pytest tests/multicard/test_offline_inference_distributed.py` on deepseek-ai/DeepSeek-V2-Lite.
Some bugs in the original solution are being fixed.
I am really looking forward to using your optimization; it will help our TPOT. How long until we can use it?
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected overlapping being:

```
| q_rmsnorm |   | kv_norm_rope_cache |   | q_rope |
| matmul W_DQ | matmul W_DKV | index | index | matmul W_UQ | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by the computation of `cos` and `sin` can't be offloaded to the secondary stream due to a known bug in the graph fusion optimization pass. So we instead keep them in the main stream, only requiring that they be computed before `matmul W_UQ` to avoid hindering later overlapping. The problem may be solved by a later optimization (#993), which hoists the computation of `cos` and `sin` up to the first layer.

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted to False.

### How was this patch tested?
Tested on a 1x16 910 node, with a tailored 2-layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
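For illustration, a minimal sketch of the two-stream overlap pattern described above, written against the CUDA stream API as a stand-in for the NPU equivalent (function and weight names are illustrative):

```python
import torch
import torch.nn.functional as F


def mla_two_stream_step(x, w_dq, w_dkv, side_stream):
    main = torch.cuda.current_stream()
    q_latent = x @ w_dq                # matmul W_DQ on the main stream
    side_stream.wait_stream(main)      # side stream must see q_latent first
    with torch.cuda.stream(side_stream):
        # Vector op overlaps with matmul W_DKV still queued on the main stream.
        q_norm = F.rms_norm(q_latent, q_latent.shape[-1:])
    kv_latent = x @ w_dkv              # matmul W_DKV on the main stream
    main.wait_stream(side_stream)      # join before q_norm is consumed
    # Real code would also handle allocator bookkeeping (record_stream).
    return q_norm, kv_latent
```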
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: huiying <chenhuiying4@huawei.com>
You can try again.
ok
```python
) -> Tuple[torch.Tensor, torch.Tensor]:
    if (envs.VLLM_USE_V1 and attn_metadata is not None
            and attn_metadata.num_decodes is not None
            and attn_metadata.atten_state is not None):
```
When dp != 1 and _dummy_run is triggered, attn_metadata is None, so attn_metadata.num_decodes and attn_metadata.atten_state do not exist.
If attn_metadata is None, this branch will not be entered and the original code path will be used.
If attn_metadata is an empty metadata object, it has no num_decodes or atten_state attributes. I changed the condition to `if (envs.VLLM_USE_V1 and attn_metadata is not None):` and it runs normally. How much performance gain do you see?
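A minimal sketch of a defensive version of this guard, using getattr so an empty metadata object cannot raise AttributeError (names follow the diff above; the helper itself is illustrative):

```python
def can_reuse_cos_sin(envs, attn_metadata) -> bool:
    # Dummy runs (e.g. dp != 1 triggering _dummy_run) pass attn_metadata=None,
    # and an empty metadata object may lack num_decodes / atten_state entirely.
    return (
        envs.VLLM_USE_V1
        and attn_metadata is not None
        and getattr(attn_metadata, "num_decodes", None) is not None
        and getattr(attn_metadata, "atten_state", None) is not None
    )
```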
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
During autoregressive decoding, the cos and sin values are exactly the same for every layer (e.g., all 61 layers). Therefore they only need to be computed in the first layer, and subsequent layers can reuse them directly.
Does this PR introduce any user-facing change?
None
How was this patch tested?
pytest tests/multicard/test_offline_inference_distributed.py