[Performance]: custom ascendc kernel(rotary_embedding) performance #802


Open
ttanzhiqiang opened this issue May 9, 2025 · 5 comments

@ttanzhiqiang

Proposal to improve performance

I wrote a benchmark test script @wangxiyuan @ganyi1996ppo
https://github.com/ttanzhiqiang/vllm-ascend/blob/rotary_embedding_fix/benchmarks/ops/ben_rotary_embedding.py
The custom kernel outperforms the torch version when seq_len < 1024, but is slower than the torch version when seq_len > 1024.
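For context, here is a minimal NumPy sketch of the neox-style rotary embedding that such a reference implementation computes (the function name `rotary_embedding_ref` and the cos/sin cache layout are assumptions for illustration, not the actual vllm-ascend code; shapes mirror one of the test configurations below):

```python
import numpy as np

def rotary_embedding_ref(positions, query, cos_sin_cache):
    # query: [num_tokens, num_heads, head_size]; cache: [max_pos, rot_dim]
    # with cos in the first half of the last dim and sin in the second half.
    rot_dim = cos_sin_cache.shape[-1]
    cos = cos_sin_cache[positions, : rot_dim // 2][:, None, :]  # [T, 1, D/2]
    sin = cos_sin_cache[positions, rot_dim // 2 :][:, None, :]  # [T, 1, D/2]
    x1, x2 = query[..., : rot_dim // 2], query[..., rot_dim // 2 :]
    # Rotate each (x1, x2) pair by the position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)

# Mirror one test configuration: seq_len=4, 32 query heads, head_size=64.
rng = np.random.default_rng(0)
positions = np.arange(4)
query = rng.standard_normal((4, 32, 64)).astype(np.float32)
inv_freq = 1.0 / (10000.0 ** (np.arange(0, 32) / 32.0))
angles = np.outer(np.arange(4103), inv_freq)
cache = np.concatenate([np.cos(angles), np.sin(angles)], axis=-1).astype(np.float32)
out = rotary_embedding_ref(positions, query, cache)
print(out.shape)  # (4, 32, 64)
```

Because this is a pure rotation, the per-token norm of each head is preserved, which is a quick correctness check for any custom kernel.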

[root@71a951642766 ops]# pytest -s ben_rotary_embedding.py
================================================================= test session starts =================================================================
platform linux -- Python 3.11.6, pytest-8.3.5, pluggy-1.5.0
rootdir: /mnt/deepseek/tanzhiqiang.tzq/code/0508/custom_op/vllm-ascend
configfile: pytest.ini
plugins: anyio-4.9.0
collecting ... INFO 05-09 17:53:40 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-09 17:53:40 [init.py:30] Available plugins for group vllm.platform_plugins:
INFO 05-09 17:53:40 [init.py:32] name=ascend, value=vllm_ascend:register
INFO 05-09 17:53:40 [init.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 05-09 17:53:40 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-09 17:53:40 [init.py:44] plugin ascend loaded.
INFO 05-09 17:53:40 [init.py:230] Platform plugin ascend is activated
WARNING 05-09 17:53:41 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
collected 9 items

ben_rotary_embedding.py (all 9 tests passed)

Common configuration: 32 query heads, 1 key head, head size 64. Tensor shapes: Positions [seq_len], Query [seq_len, 32, 64], Key [seq_len, 1, 64], Cos/Sin cache [cache_size, 64].

| Seq len | Cache size | Reference (ms) | Custom (ms) | Speedup |
|--------:|-----------:|---------------:|------------:|--------:|
| 1       | 4103       | 0.252          | 0.027       | 9.36x   |
| 4       | 4103       | 0.223          | 0.027       | 8.26x   |
| 16      | 4103       | 0.221          | 0.030       | 7.44x   |
| 64      | 4103       | 0.219          | 0.044       | 5.02x   |
| 256     | 4103       | 0.192          | 0.071       | 2.71x   |
| 512     | 4103       | 0.201          | 0.075       | 2.68x   |
| 1024    | 4103       | 0.205          | 0.135       | 1.52x   |
| 4091    | 4096       | 0.357          | 0.491       | 0.73x   |
| 8192    | 4116       | 0.517          | 0.981       | 0.53x   |
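Given the crossover around seq_len ≈ 1024 in these numbers, one possible mitigation is to dispatch on token count. This is only a sketch; `custom_rope` and `torch_rope` are hypothetical stand-ins for the two code paths, not real vllm-ascend APIs:

```python
# Hypothetical dispatch wrapper: use the custom AscendC kernel only where the
# benchmark shows it winning (below ~1024 tokens), otherwise fall back to the
# torch reference path. Names are illustrative, not the real API.
CUSTOM_KERNEL_MAX_TOKENS = 1024

def rotary_embedding(positions, query, key, cos_sin_cache, custom_rope, torch_rope):
    num_tokens = positions.shape[0]
    if num_tokens < CUSTOM_KERNEL_MAX_TOKENS:
        return custom_rope(positions, query, key, cos_sin_cache)
    return torch_rope(positions, query, key, cos_sin_cache)
```

The threshold would need to be re-measured per device; hardcoding 1024 simply reflects the table above.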

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
@Yikun
Collaborator

Yikun commented May 10, 2025

Thanks for the report. So are you suggesting we should disable the custom ops by default, or improve them?

@ganyi1996ppo
Collaborator

ganyi1996ppo commented May 10, 2025

Strange, this custom kernel is supposed to get better as the size grows, not worse. Of course, this kernel can still be improved by adopting double buffering or by tuning the load size, but this result is unexpected. I'll take a deep look when I get time. Thanks for the report.
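The double-buffering idea mentioned here can be sketched host-side in Python. This is purely illustrative: on the NPU the `load`/`compute` stages would be asynchronous AscendC data moves between global and local memory, so the two buffers actually overlap in time rather than running sequentially as below:

```python
def process_double_buffered(data, tile, load, compute):
    # Ping-pong between two tile buffers so that, on hardware with async
    # copies, loading tile i+1 can overlap computing tile i.
    buffers = [None, None]
    results = []
    n_tiles = (len(data) + tile - 1) // tile
    buffers[0] = load(data, 0, tile)                         # prefetch first tile
    for i in range(n_tiles):
        nxt = (i + 1) % 2
        if i + 1 < n_tiles:
            buffers[nxt] = load(data, (i + 1) * tile, tile)  # prefetch next tile
        results.extend(compute(buffers[i % 2]))              # consume current tile
    return results
```

Tuning `tile` (the "load size" mentioned above) trades local-memory pressure against the number of copy/compute round trips.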

@ttanzhiqiang
Author

Thanks for the report. So are you suggesting we should disable the custom ops by default, or improve them?

I think custom operators are a good idea. Open-sourcing your NPU operators, like vLLM's custom operators, allows users to optimize them. I think new features must be accompanied by new operators, and custom operators will accelerate the development of the vllm-ascend community.

@ttanzhiqiang
Author

Strange, this custom kernel is supposed to get better as the size grows, not worse. Of course, this kernel can still be improved by adopting double buffering or by tuning the load size, but this result is unexpected. I'll take a deep look when I get time. Thanks for the report.

You can refer to https://gitee.com/bonnie-boxi-liu/atb-op-plugin/tree/br_feature_cann_8.1.RC1_228POC_20250331/ascend-op-common-lib/mixops/rope/op_kernel.
It adds pipeline optimization and parallel optimization strategies.

@ganyi1996ppo
Collaborator

You can refer to https://gitee.com/bonnie-boxi-liu/atb-op-plugin/tree/br_feature_cann_8.1.RC1_228POC_20250331/ascend-op-common-lib/mixops/rope/op_kernel. It adds pipeline optimization and parallel optimization strategies.

Actually, I think this performance regression is not related to multi-stage or parallel execution. I suspect it is caused by the strict restriction on leading-dim support and the assumption that head_dim != rope_dim; those two assumptions mean we can only run ops with a limited calc instruction width, which impacts the overall performance.
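A rough host-side analogy for this point (not AscendC code, just an illustration of the effect): if layout restrictions force the kernel to issue many narrow operations, one per head, instead of one wide operation over all heads, the per-call overhead scales with seq_len even though both forms compute the same result:

```python
import numpy as np

# Wide vs narrow execution of the same per-head scaling: the narrow loop
# issues 32 small ops where the wide form issues one large op. On real
# vector hardware the narrow form wastes instruction width in the same way.
rng = np.random.default_rng(1)
x = rng.standard_normal((1024, 32, 64)).astype(np.float32)
scale = rng.standard_normal(64).astype(np.float32)

wide = x * scale                      # one op over all 32 heads at once
narrow = np.empty_like(x)
for h in range(x.shape[1]):           # 32 narrow ops, one per head
    narrow[:, h, :] = x[:, h, :] * scale

assert np.allclose(wide, narrow)      # identical results, different op count
```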
