[RFC]: Custom Ascendc Kernel Of 'Prepare Input' in Multi-Step Feature. #807

wonderful199082 · 2025-05-11T11:18:52Z

Motivation.

In the current vLLM_Ascend V0 Engine, the advance_step function in attention.py uses Python-based logic to update input_tokens, seq_lens, input_positions, and slot_mapping on every decode step.

This logic was marked with a clear TODO:

# TODO optimize these codes using ascendc just like flash attention backend using cuda

indicating an explicit need for optimization using custom operators.

Proposed Change.

This RFC proposes replacing the Python logic above with a custom operator implemented in AscendC. The operator executes directly on the NPU, which should improve efficiency in multi-step decoding scenarios.

The logic covered by this operator includes:

Updating model_input.input_tokens
Updating model_input.input_positions
Incrementing and updating seq_lens_tensor
Computing slot_mapping using block_tables
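
For reference, the four updates above can be sketched in plain Python. This is only an illustration of the per-sequence arithmetic such a kernel would need to reproduce, not the code in attention.py and not the proposed AscendC operator. The function name advance_step_reference, the list-based arguments, and the block_size parameter are hypothetical, and the slot formula assumes the usual paged KV cache layout (physical block number times block size, plus the offset inside the block).

```python
def advance_step_reference(sampled_token_ids, seq_lens, block_tables, block_size):
    """Pure-Python reference for one decode step (illustrative, hypothetical names)."""
    # The next input token of each sequence is the token sampled in the previous step.
    input_tokens = list(sampled_token_ids)
    # Every running sequence grows by exactly one token.
    new_seq_lens = [n + 1 for n in seq_lens]
    # 0-based position of the newly appended token.
    input_positions = [n - 1 for n in new_seq_lens]
    slot_mapping = []
    for pos, table in zip(input_positions, block_tables):
        # Assumption: table[i] is the physical block number of logical block i.
        block_number = table[pos // block_size]
        slot_mapping.append(block_number * block_size + pos % block_size)
    return input_tokens, new_seq_lens, input_positions, slot_mapping


# Two sequences of length 3 and 8, with a block size of 4.
tokens, lens, positions, slots = advance_step_reference(
    sampled_token_ids=[11, 22],
    seq_lens=[3, 8],
    block_tables=[[5, 9], [2, 7, 4]],
    block_size=4,
)
print(tokens, lens, positions, slots)
```

Each sequence is handled independently of the others, which is why this step lends itself to a single device-side kernel instead of several small tensor operations launched from Python.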

Feedback Period.

This RFC will be open for feedback until 2025-05-18, one week after the initial submission date.

Please leave your comments, questions, or suggestions before this date. The author will address all feedback and revise the proposal accordingly if needed.

CC List.

@Yikun @wangxiyuan

Any Other Things.

No response

wangxiyuan · 2025-05-12T06:43:47Z

Nice, the contribution is welcome.

wonderful199082 added the RFC Request For Comments label May 11, 2025

wonderful199082 mentioned this issue May 13, 2025

[Performance]: Custom AscendC Kernel of Multi-Step Prepare Input #814

Open
