[Performance]: custom ascendc kernel(rotary_embedding) performance #802


Open
ttanzhiqiang opened this issue May 9, 2025 · 5 comments

@ttanzhiqiang

Proposal to improve performance

I wrote a benchmark test script @wangxiyuan @ganyi1996ppo
https://github.com/ttanzhiqiang/vllm-ascend/blob/rotary_embedding_fix/benchmarks/ops/ben_rotary_embedding.py
The custom kernel outperforms the torch version when seq_len < 1024, but is slower than the torch version when seq_len > 1024.
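For context, here is a minimal NumPy sketch of the neox-style rotary embedding that such a reference implementation computes (the function name `rotary_embedding_ref` and the cos/sin cache layout are assumptions for illustration, not the actual vllm-ascend code; shapes mirror one of the test configurations below):

```python
import numpy as np

def rotary_embedding_ref(positions, query, cos_sin_cache):
    # query: [num_tokens, num_heads, head_size]; cache: [max_pos, rot_dim]
    # with cos in the first half of the last dim and sin in the second half.
    rot_dim = cos_sin_cache.shape[-1]
    cos = cos_sin_cache[positions, : rot_dim // 2][:, None, :]  # [T, 1, D/2]
    sin = cos_sin_cache[positions, rot_dim // 2 :][:, None, :]  # [T, 1, D/2]
    x1, x2 = query[..., : rot_dim // 2], query[..., rot_dim // 2 :]
    # Rotate each (x1, x2) pair by the position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)

# Mirror one test configuration: seq_len=4, 32 query heads, head_size=64.
rng = np.random.default_rng(0)
positions = np.arange(4)
query = rng.standard_normal((4, 32, 64)).astype(np.float32)
inv_freq = 1.0 / (10000.0 ** (np.arange(0, 32) / 32.0))
angles = np.outer(np.arange(4103), inv_freq)
cache = np.concatenate([np.cos(angles), np.sin(angles)], axis=-1).astype(np.float32)
out = rotary_embedding_ref(positions, query, cache)
print(out.shape)  # (4, 32, 64)
```

Because this is a pure rotation, the per-token norm of each head is preserved, which is a quick correctness check for any custom kernel.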

[root@71a951642766 ops]# pytest -s ben_rotary_embedding.py
================================================================= test session starts =================================================================
platform linux -- Python 3.11.6, pytest-8.3.5, pluggy-1.5.0
rootdir: /mnt/deepseek/tanzhiqiang.tzq/code/0508/custom_op/vllm-ascend
configfile: pytest.ini
plugins: anyio-4.9.0
collecting ... INFO 05-09 17:53:40 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-09 17:53:40 [init.py:30] Available plugins for group vllm.platform_plugins:
INFO 05-09 17:53:40 [init.py:32] name=ascend, value=vllm_ascend:register
INFO 05-09 17:53:40 [init.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 05-09 17:53:40 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-09 17:53:40 [init.py:44] plugin ascend loaded.
INFO 05-09 17:53:40 [init.py:230] Platform plugin ascend is activated
WARNING 05-09 17:53:41 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
collected 9 items

ben_rotary_embedding.py (all 9 tests passed)

Common configuration: 32 query heads, 1 key head, head size 64. Tensor shapes: Positions [seq_len], Query [seq_len, 32, 64], Key [seq_len, 1, 64], Cos/Sin cache [cache_size, 64].

| Seq len | Cache size | Reference (ms) | Custom (ms) | Speedup |
|--------:|-----------:|---------------:|------------:|--------:|
| 1       | 4103       | 0.252          | 0.027       | 9.36x   |
| 4       | 4103       | 0.223          | 0.027       | 8.26x   |
| 16      | 4103       | 0.221          | 0.030       | 7.44x   |
| 64      | 4103       | 0.219          | 0.044       | 5.02x   |
| 256     | 4103       | 0.192          | 0.071       | 2.71x   |
| 512     | 4103       | 0.201          | 0.075       | 2.68x   |
| 1024    | 4103       | 0.205          | 0.135       | 1.52x   |
| 4091    | 4096       | 0.357          | 0.491       | 0.73x   |
| 8192    | 4116       | 0.517          | 0.981       | 0.53x   |
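Given the crossover around seq_len ≈ 1024 in these numbers, one possible mitigation is to dispatch on token count. This is only a sketch; `custom_rope` and `torch_rope` are hypothetical stand-ins for the two code paths, not real vllm-ascend APIs:

```python
# Hypothetical dispatch wrapper: use the custom AscendC kernel only where the
# benchmark shows it winning (below ~1024 tokens), otherwise fall back to the
# torch reference path. Names are illustrative, not the real API.
CUSTOM_KERNEL_MAX_TOKENS = 1024

def rotary_embedding(positions, query, key, cos_sin_cache, custom_rope, torch_rope):
    num_tokens = positions.shape[0]
    if num_tokens < CUSTOM_KERNEL_MAX_TOKENS:
        return custom_rope(positions, query, key, cos_sin_cache)
    return torch_rope(positions, query, key, cos_sin_cache)
```

The threshold would need to be re-measured per device; hardcoding 1024 simply reflects the table above.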

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
@Yikun
Collaborator

Yikun commented May 10, 2025

Thanks for the report. So are you suggesting we should disable the custom ops by default, or improve them?

@ganyi1996ppo
Collaborator

ganyi1996ppo commented May 10, 2025

Strange, this custom kernel is supposed to get better as the size grows, not worse. Of course, this kernel can still be improved by adopting double buffering or by tuning the load size, but this result is unexpected. I'll take a deep look when I get time. Thanks for the report.
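The double-buffering idea mentioned here can be sketched host-side in Python. This is purely illustrative: on the NPU the `load`/`compute` stages would be asynchronous AscendC data moves between global and local memory, so the two buffers actually overlap in time rather than running sequentially as below:

```python
def process_double_buffered(data, tile, load, compute):
    # Ping-pong between two tile buffers so that, on hardware with async
    # copies, loading tile i+1 can overlap computing tile i.
    buffers = [None, None]
    results = []
    n_tiles = (len(data) + tile - 1) // tile
    buffers[0] = load(data, 0, tile)                         # prefetch first tile
    for i in range(n_tiles):
        nxt = (i + 1) % 2
        if i + 1 < n_tiles:
            buffers[nxt] = load(data, (i + 1) * tile, tile)  # prefetch next tile
        results.extend(compute(buffers[i % 2]))              # consume current tile
    return results
```

Tuning `tile` (the "load size" mentioned above) trades local-memory pressure against the number of copy/compute round trips.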

@ttanzhiqiang
Author

Thanks for the report. So are you suggesting we should disable the custom ops by default, or improve them?

I think custom operators are a good idea. Open-sourcing your NPU operators, like vLLM's custom operators, allows users to optimize them. I think new features must be accompanied by new operators, and custom operators will accelerate the development of the vllm-ascend community.

@ttanzhiqiang
Author

Strange, this custom kernel is supposed to get better as the size grows, not worse. Of course, this kernel can still be improved by adopting double buffering or by tuning the load size, but this result is unexpected. I'll take a deep look when I get time. Thanks for the report.

You can refer to https://gitee.com/bonnie-boxi-liu/atb-op-plugin/tree/br_feature_cann_8.1.RC1_228POC_20250331/ascend-op-common-lib/mixops/rope/op_kernel.
It adds pipeline optimization and parallel optimization strategies.

@ganyi1996ppo
Collaborator

You can refer to https://gitee.com/bonnie-boxi-liu/atb-op-plugin/tree/br_feature_cann_8.1.RC1_228POC_20250331/ascend-op-common-lib/mixops/rope/op_kernel. It adds pipeline optimization and parallel optimization strategies.

Actually, I think this performance regression is not related to multi-stage or parallel execution. I suspect it is caused by the strict restriction on leading-dim support and the assumption that head_dim != rope_dim; those two assumptions mean we can only run ops with a limited calc instruction width, which impacts the overall performance.
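A rough host-side analogy for this point (not AscendC code, just an illustration of the effect): if layout restrictions force the kernel to issue many narrow operations, one per head, instead of one wide operation over all heads, the per-call overhead scales with seq_len even though both forms compute the same result:

```python
import numpy as np

# Wide vs narrow execution of the same per-head scaling: the narrow loop
# issues 32 small ops where the wide form issues one large op. On real
# vector hardware the narrow form wastes instruction width in the same way.
rng = np.random.default_rng(1)
x = rng.standard_normal((1024, 32, 64)).astype(np.float32)
scale = rng.standard_normal(64).astype(np.float32)

wide = x * scale                      # one op over all 32 heads at once
narrow = np.empty_like(x)
for h in range(x.shape[1]):           # 32 narrow ops, one per head
    narrow[:, h, :] = x[:, h, :] * scale

assert np.allclose(wide, narrow)      # identical results, different op count
```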
