
[Performance]: custom ascendc kernel(rotary_embedding) performance #802

Open
@ttanzhiqiang

Description


Proposal to improve performance

I wrote a benchmark test script @wangxiyuan @ganyi1996ppo:
https://github.com/ttanzhiqiang/vllm-ascend/blob/rotary_embedding_fix/benchmarks/ops/ben_rotary_embedding.py
The custom kernel performs better than the torch version when seq_len < 1024, but worse than the torch version when seq_len > 1024.
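
The script times both paths over the shapes listed in the configurations below. A minimal sketch of the harness (my own simplification, not the exact script linked above; the device handling via torch_npu, the warmup/iteration counts, and the placeholder function names are assumptions):

```python
# Minimal timing-harness sketch, assuming an Ascend NPU is available via torch_npu.
import time

import torch
import torch_npu  # noqa: F401  (assumption: registers the "npu" device backend)


def make_inputs(seq_len, num_q_heads=32, num_kv_heads=1, head_size=64, cache_size=4103):
    # Same tensor shapes as in the test configurations below.
    positions = torch.randint(0, cache_size, (seq_len,), device="npu")
    query = torch.randn(seq_len, num_q_heads, head_size, dtype=torch.float16, device="npu")
    key = torch.randn(seq_len, num_kv_heads, head_size, dtype=torch.float16, device="npu")
    cos_sin_cache = torch.randn(cache_size, head_size, dtype=torch.float16, device="npu")
    return positions, query, key, cos_sin_cache


def bench(fn, args, warmup=10, iters=100):
    # Warm up, then report average milliseconds per call.
    for _ in range(warmup):
        fn(*args)
    torch.npu.synchronize()  # assumption: NPU stream must be drained before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.npu.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0


# Example usage (placeholders -- wire these to the actual implementations):
#   args = make_inputs(seq_len=1024)
#   ref_ms = bench(reference_rotary_embedding, args)  # torch-native path
#   cus_ms = bench(custom_rotary_embedding, args)     # custom AscendC kernel
#   print(f"speedup: {ref_ms / cus_ms:.2f}x")
```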

[root@71a951642766 ops]# pytest -s ben_rotary_embedding.py
================================================================= test session starts =================================================================
platform linux -- Python 3.11.6, pytest-8.3.5, pluggy-1.5.0
rootdir: /mnt/deepseek/tanzhiqiang.tzq/code/0508/custom_op/vllm-ascend
configfile: pytest.ini
plugins: anyio-4.9.0
collecting ... INFO 05-09 17:53:40 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-09 17:53:40 [__init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 05-09 17:53:40 [__init__.py:32] name=ascend, value=vllm_ascend:register
INFO 05-09 17:53:40 [__init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 05-09 17:53:40 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-09 17:53:40 [__init__.py:44] plugin ascend loaded.
INFO 05-09 17:53:40 [__init__.py:230] Platform plugin ascend is activated
WARNING 05-09 17:53:41 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
collected 9 items

ben_rotary_embedding.py
Test Configuration:
Sequence length: 1
Cache size: 4103
Query heads: 32
Key heads: 1
Head size: 64

Tensor shapes:
Positions: torch.Size([1])
Query: torch.Size([1, 32, 64])
Key: torch.Size([1, 1, 64])
Cos/Sin cache: torch.Size([4103, 64])

Performance Results:
Reference implementation: 0.252 ms
Custom implementation: 0.027 ms
Speedup: 9.36x
.
Test Configuration:
Sequence length: 4
Cache size: 4103
Query heads: 32
Key heads: 1
Head size: 64

Tensor shapes:
Positions: torch.Size([4])
Query: torch.Size([4, 32, 64])
Key: torch.Size([4, 1, 64])
Cos/Sin cache: torch.Size([4103, 64])

Performance Results:
Reference implementation: 0.223 ms
Custom implementation: 0.027 ms
Speedup: 8.26x
.
Test Configuration:
Sequence length: 16
Cache size: 4103
Query heads: 32
Key heads: 1
Head size: 64

Tensor shapes:
Positions: torch.Size([16])
Query: torch.Size([16, 32, 64])
Key: torch.Size([16, 1, 64])
Cos/Sin cache: torch.Size([4103, 64])

Performance Results:
Reference implementation: 0.221 ms
Custom implementation: 0.030 ms
Speedup: 7.44x
.
Test Configuration:
Sequence length: 64
Cache size: 4103
Query heads: 32
Key heads: 1
Head size: 64

Tensor shapes:
Positions: torch.Size([64])
Query: torch.Size([64, 32, 64])
Key: torch.Size([64, 1, 64])
Cos/Sin cache: torch.Size([4103, 64])

Performance Results:
Reference implementation: 0.219 ms
Custom implementation: 0.044 ms
Speedup: 5.02x
.
Test Configuration:
Sequence length: 256
Cache size: 4103
Query heads: 32
Key heads: 1
Head size: 64

Tensor shapes:
Positions: torch.Size([256])
Query: torch.Size([256, 32, 64])
Key: torch.Size([256, 1, 64])
Cos/Sin cache: torch.Size([4103, 64])

Performance Results:
Reference implementation: 0.192 ms
Custom implementation: 0.071 ms
Speedup: 2.71x
.
Test Configuration:
Sequence length: 512
Cache size: 4103
Query heads: 32
Key heads: 1
Head size: 64

Tensor shapes:
Positions: torch.Size([512])
Query: torch.Size([512, 32, 64])
Key: torch.Size([512, 1, 64])
Cos/Sin cache: torch.Size([4103, 64])

Performance Results:
Reference implementation: 0.201 ms
Custom implementation: 0.075 ms
Speedup: 2.68x
.
Test Configuration:
Sequence length: 1024
Cache size: 4103
Query heads: 32
Key heads: 1
Head size: 64

Tensor shapes:
Positions: torch.Size([1024])
Query: torch.Size([1024, 32, 64])
Key: torch.Size([1024, 1, 64])
Cos/Sin cache: torch.Size([4103, 64])

Performance Results:
Reference implementation: 0.205 ms
Custom implementation: 0.135 ms
Speedup: 1.52x
.
Test Configuration:
Sequence length: 4091
Cache size: 4096
Query heads: 32
Key heads: 1
Head size: 64

Tensor shapes:
Positions: torch.Size([4091])
Query: torch.Size([4091, 32, 64])
Key: torch.Size([4091, 1, 64])
Cos/Sin cache: torch.Size([4096, 64])

Performance Results:
Reference implementation: 0.357 ms
Custom implementation: 0.491 ms
Speedup: 0.73x
.
Test Configuration:
Sequence length: 8192
Cache size: 4116
Query heads: 32
Key heads: 1
Head size: 64

Tensor shapes:
Positions: torch.Size([8192])
Query: torch.Size([8192, 32, 64])
Key: torch.Size([8192, 1, 64])
Cos/Sin cache: torch.Size([4116, 64])

Performance Results:
Reference implementation: 0.517 ms
Custom implementation: 0.981 ms
Speedup: 0.53x
.
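
The numbers above cross over around seq_len ≈ 1024: 9.36x at seq_len 1, 1.52x at 1024, dropping to 0.73x at 4091 and 0.53x at 8192. For context, the torch reference path being benchmarked computes, roughly, a neox-style rotation driven by the cos/sin cache. A minimal sketch of that reference (assuming the cache stores cos in the first half and sin in the second half of the last dim, and that rot_dim equals head_size as in these runs):

```python
import torch


def rope_reference(positions, query, key, cos_sin_cache):
    """Neox-style rotary embedding, applied in place on query and key.

    Shapes as in the runs above:
      positions:     [num_tokens]
      query:         [num_tokens, num_q_heads, head_size]
      key:           [num_tokens, num_kv_heads, head_size]
      cos_sin_cache: [max_position, rot_dim]  (cos | sin concatenated -- assumption)
    """
    cos, sin = cos_sin_cache[positions].chunk(2, dim=-1)  # each [num_tokens, rot_dim // 2]
    cos = cos.unsqueeze(1)  # broadcast over the heads dimension
    sin = sin.unsqueeze(1)

    def rotate(x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)

    query.copy_(rotate(query))
    key.copy_(rotate(key))
    return query, key
```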

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
