Conversation

@huiyingCCCC huiyingCCCC commented May 28, 2025

What this PR does / why we need it?

During autoregressive decoding, the cos and sin values are identical across all layers (e.g., all 61 layers of the model). Therefore, they only need to be computed in the first layer, and subsequent layers can reuse them directly.
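To make the reuse pattern concrete, here is a minimal sketch of the idea (class and attribute names are illustrative only, not the exact vllm-ascend implementation; the real change lives in the MLA attention layer):

```python
import torch


class MLARoPECache:
    # Cache shared by all decoder layers: cos/sin depend only on the token
    # positions, not on the layer index, so one computation per step suffices.
    _cos = None
    _sin = None

    def __init__(self, layer_idx, rotary_emb):
        self.layer_idx = layer_idx
        self.rotary_emb = rotary_emb  # callable: positions -> (cos, sin)

    def get_cos_sin(self, positions: torch.Tensor):
        cls = type(self)
        # Only the first layer pays for the index/gather ops; later layers
        # reuse the cached tensors.
        if self.layer_idx == 0 or cls._cos is None or cls._sin is None:
            cls._cos, cls._sin = self.rotary_emb(positions)
        return cls._cos, cls._sin
```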

Does this PR introduce any user-facing change?

None

How was this patch tested?

pytest tests/multicard/test_offline_inference_distributed.py

Collaborator

@MengqingCao MengqingCao left a comment

  1. Please add a PR description
  2. Run bash format.sh locally to fix the lint failures


q_pe = self.rope_single(q_pe, cos, sin)
k_pe, k_nope = self.exec_kv(hidden_states_or_kv_c_normed, cos, sin,
if self.layer_idx == 0 or self.cos is None or self.sin is None:
Collaborator

Could you add a comment explaining why self.cos and self.sin are only updated when layer_idx == 0?

Author

During autoregressive decoding, the cos and sin values are identical across all layers (e.g., all 61 layers of the model). Therefore, they only need to be computed in the first layer, and subsequent layers can reuse them directly.

Collaborator

Thanks for the explanation, let's add this comment to the code.

Author

fixed

kv_a_layernorm=self.kv_a_layernorm,
kv_b_proj=self.kv_b_proj,
o_proj=self.o_proj,
ascend_prefix=prefix,
Collaborator

I recommend passing debug_layer_idx instead of ascend_prefix.
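For context, a minimal sketch of what this suggestion amounts to: derive the layer index once at construction time and pass the integer instead of the whole prefix string. The prefix layout and the helper below are assumptions based on vLLM's usual "model.layers.<idx>..." naming, not code from this PR:

```python
def extract_layer_idx(prefix: str) -> int:
    # "model.layers.7.self_attn.attn" -> 7 (assumes the vLLM-style prefix layout)
    for part in prefix.split("."):
        if part.isdigit():
            return int(part)
    raise ValueError(f"no layer index found in prefix: {prefix!r}")


# Hypothetical usage when constructing the attention implementation:
debug_layer_idx = extract_layer_idx("model.layers.7.self_attn.attn")
assert debug_layer_idx == 7
```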

Author

fixed

@huiyingCCCC huiyingCCCC force-pushed the main branch 3 times, most recently from 742a3ee to 8cd3c81 Compare May 29, 2025 09:28
@huiyingCCCC huiyingCCCC changed the title from "MLA层消除冗余index小算子" (eliminate redundant index operators in the MLA layer) to "MLA layer eliminates redundant index operators" May 30, 2025
@huiyingCCCC huiyingCCCC force-pushed the main branch 3 times, most recently from c07e6e1 to 7ab8189 Compare June 3, 2025 06:16
@wangxiyuan wangxiyuan mentioned this pull request Jun 4, 2025
76 tasks

github-actions bot commented Jun 4, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.


github-actions bot commented Jun 5, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@ttanzhiqiang
Contributor

pytest tests/multicard/test_offline_inference_distributed.py tests deepseek-ai/DeepSeek-V2-Lite.
Have you tried DeepSeek-R1?

@ttanzhiqiang
Contributor

ttanzhiqiang commented Jun 6, 2025

[Screenshot 2025-06-06 17:27:23] I used https://github.com//pull/1101 as the baseline and found an error.

@huiyingCCCC huiyingCCCC closed this Jun 9, 2025
@huiyingCCCC huiyingCCCC reopened this Jun 11, 2025
@huiyingCCCC
Author

> [Screenshot 2025-06-06 17:27:23] I used #1101 (https://github.com/vllm-project/vllm-ascend/pull/1101) as the baseline and found an error.

Some bugs in the original solution are being fixed.

@ttanzhiqiang
Contributor

> [Screenshot 2025-06-06 17:27:23] I used #1101 as the baseline and found an error.
>
> Some bugs in the original solution are being fixed.

I am very much looking forward to using your optimization, which will help our TPOT. How long until it is ready to use?

@huiyingCCCC huiyingCCCC force-pushed the main branch 2 times, most recently from bfc94f3 to e0efc4d Compare June 11, 2025 13:39
ganyi1996ppo pushed a commit that referenced this pull request Jun 12, 2025
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlapping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by the computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug in the graph fusion optimization pass. So we instead keep them in
the main stream, only requiring that they be computed before `matmul W_UQ` to
avoid hindering later overlapping. The problem may be solved by a later
optimization (#993), which hoists the computation of `cos` and `sin` up
to the first layer.

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, which defaults
to False.

### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: huiying <chenhuiying4@huawei.com>
@huiyingCCCC
Author

> [Screenshot 2025-06-06 17:27:23] I used #1101 as the baseline and found an error.
>
> Some bugs in the original solution are being fixed.
>
> I am very much looking forward to using your optimization, which will help our TPOT. How long until it is ready to use?

@huiyingCCCC
Author

> This pull request has conflicts, please resolve those before we can evaluate the pull request.

@huiyingCCCC huiyingCCCC reopened this Jun 16, 2025
@huiyingCCCC
Author

> [Screenshot 2025-06-06 17:27:23] I used #1101 as the baseline and found an error.
>
> Some bugs in the original solution are being fixed.
>
> I am very much looking forward to using your optimization, which will help our TPOT. How long until it is ready to use?

You can try again.

@ttanzhiqiang
Contributor

> [Screenshot 2025-06-06 17:27:23] I used #1101 as the baseline and found an error.
>
> Some bugs in the original solution are being fixed.
>
> I am very much looking forward to using your optimization, which will help our TPOT. How long until it is ready to use?
>
> You can try again.

ok

) -> Tuple[torch.Tensor, torch.Tensor]:
if (envs.VLLM_USE_V1 and attn_metadata is not None
and attn_metadata.num_decodes is not None
and attn_metadata.atten_state is not None):
Contributor

When dp != 1 and _dummy_run is triggered, attn_metadata is None, so attn_metadata.num_decodes and attn_metadata.atten_state do not exist.

Author

@huiyingCCCC huiyingCCCC Jun 17, 2025

> When dp != 1 and _dummy_run is triggered, attn_metadata is None, so attn_metadata.num_decodes and attn_metadata.atten_state do not exist.

If attn_metadata is None, this branch will not be entered and the original code path will be used.

Contributor

If attn_metadata is None, it has no num_decodes and atten_state attributes. I changed the condition to `if (envs.VLLM_USE_V1 and attn_metadata is not None):` and it runs normally. How much gain do you see from this optimization?

momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.
