[Performance]: custom AscendC kernel (rotary_embedding) performance #802
Comments
Thanks for the report. Are you suggesting we should disable the custom ops by default, or improve them?
Strange, this custom kernel is supposed to get better as the size grows, not worse. Of course the kernel can still be improved by adopting double buffering or by tuning the load size, but this result is unexpected. I'll take a deeper look when I get time. Thanks for the report.
I think custom operators are a good idea. Open-sourcing your NPU operators, like vLLM's custom operators, allows users to optimize them. I think new features must be accompanied by new operators, and custom operators will accelerate the development of the vllm-ascend community.
You can refer to https://gitee.com/bonnie-boxi-liu/atb-op-plugin/tree/br_feature_cann_8.1.RC1_228POC_20250331/ascend-op-common-lib/mixops/rope/op_kernel. |
Actually, I think this performance regression is not related to multi-stage or parallel execution. I guess it is caused by the strict restriction on leading-dim support and the assumption of …
Proposal to improve performance
I wrote a benchmark test script (cc @wangxiyuan @ganyi1996ppo):
https://github.com/ttanzhiqiang/vllm-ascend/blob/rotary_embedding_fix/benchmarks/ops/ben_rotary_embedding.py
It shows that the custom kernel is faster than the torch version when seq_len < 1024 and slower than the torch version when seq_len > 1024.
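For context, the comparison the script performs looks roughly like the sketch below. This is a minimal illustration, not the script itself: the pure-torch rotary implementation (NeoX-style, non-interleaved), the helper names `rotary_embedding_ref` and `benchmark`, and the CPU timing loop are all assumptions inferred from the output; the real script runs on NPU and has to synchronize the device around the timers.

```python
import time
import torch

def rotary_embedding_ref(positions, query, key, cos_sin_cache):
    """Pure-torch rotary embedding (NeoX-style, non-interleaved layout).

    positions:     [num_tokens]            int64 indices into the cache
    query:         [num_tokens, num_q_heads, head_size]
    key:           [num_tokens, num_k_heads, head_size]
    cos_sin_cache: [max_position, head_size] (first half cos, second half sin)
    """
    cos, sin = cos_sin_cache[positions].chunk(2, dim=-1)
    cos = cos.unsqueeze(1)  # broadcast across the head dimension
    sin = sin.unsqueeze(1)

    def rotate(x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)

    return rotate(query), rotate(key)

def benchmark(fn, iters=100, warmup=10):
    """Average wall-clock time per call in ms (CPU; an NPU run must sync)."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e3

# One configuration from the results below: seq_len=1024, 32 q heads, 1 k head.
seq_len, cache_size, head_size = 1024, 4103, 64
positions = torch.randint(0, cache_size, (seq_len,))
query = torch.randn(seq_len, 32, head_size)
key = torch.randn(seq_len, 1, head_size)
cache = torch.randn(cache_size, head_size)
ref_ms = benchmark(lambda: rotary_embedding_ref(positions, query, key, cache))
print(f"Reference implementation: {ref_ms:.3f} ms")
```

The actual pytest run and its per-configuration results follow.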
[root@71a951642766 ops]# pytest -s ben_rotary_embedding.py
================================================================= test session starts =================================================================
platform linux -- Python 3.11.6, pytest-8.3.5, pluggy-1.5.0
rootdir: /mnt/deepseek/tanzhiqiang.tzq/code/0508/custom_op/vllm-ascend
configfile: pytest.ini
plugins: anyio-4.9.0
collecting ... INFO 05-09 17:53:40 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-09 17:53:40 [__init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 05-09 17:53:40 [__init__.py:32] name=ascend, value=vllm_ascend:register
INFO 05-09 17:53:40 [__init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 05-09 17:53:40 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-09 17:53:40 [__init__.py:44] plugin ascend loaded.
INFO 05-09 17:53:40 [__init__.py:230] Platform plugin ascend is activated
WARNING 05-09 17:53:41 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
collected 9 items
ben_rotary_embedding.py

All nine test configurations use 32 query heads, 1 key head, and a head size of 64; per test, the tensor shapes are Positions [seq_len], Query [seq_len, 32, 64], Key [seq_len, 1, 64], and Cos/Sin cache [cache_size, 64]. All nine tests passed; the measured results are:

Seq len | Cache size | Reference (ms) | Custom (ms) | Speedup
--------|------------|----------------|-------------|--------
1       | 4103       | 0.252          | 0.027       | 9.36x
4       | 4103       | 0.223          | 0.027       | 8.26x
16      | 4103       | 0.221          | 0.030       | 7.44x
64      | 4103       | 0.219          | 0.044       | 5.02x
256     | 4103       | 0.192          | 0.071       | 2.71x
512     | 4103       | 0.201          | 0.075       | 2.68x
1024    | 4103       | 0.205          | 0.135       | 1.52x
4091    | 4096       | 0.357          | 0.491       | 0.73x
8192    | 4116       | 0.517          | 0.981       | 0.53x
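For clarity, the speedup column is reference time divided by custom time, so values below 1.0x mean the custom kernel is slower: at seq_len 8192, 0.517 ms / 0.981 ms ≈ 0.53x, i.e. the custom kernel takes roughly 1.9x as long as the torch reference.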