Motivation.
In the current implementation of the vLLM_Ascend V0 Engine, the advance_step function in attention.py contains Python-based logic that updates input_tokens, seq_lens, input_positions, and slot_mapping.
This logic is marked with a TODO:
# TODO optimize these codes using ascendc just like flash attention backend using cuda
which indicates an explicit need for optimization with custom operators.
Proposed Change.
This RFC proposes to replace the above Python logic with a custom operator implemented in AscendC that executes directly on the NPU, with the goal of improving efficiency in multi-step decoding scenarios.
The logic covered by this operator includes:
Updating model_input.input_tokens
Updating model_input.input_positions
Incrementing and updating seq_lens_tensor
Computing slot_mapping using block_tables
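For reviewers, the four updates listed above can be sketched as a plain-Python reference. This is only an illustration of the intended semantics, not the code in attention.py and not the proposed AscendC operator. The function name advance_step_reference and the parameters num_queries, block_size, and sampled_token_ids are assumptions introduced for this sketch, and Python lists stand in for NPU tensors.

```python
from typing import List


def advance_step_reference(
    num_queries: int,              # hypothetical: number of sequences being decoded
    block_size: int,               # hypothetical: KV-cache slots per block
    input_tokens: List[int],
    sampled_token_ids: List[int],  # hypothetical: tokens sampled in the previous step
    input_positions: List[int],
    seq_lens: List[int],
    slot_mapping: List[int],
    block_tables: List[List[int]],
) -> None:
    """Update the per-sequence decode inputs in place for the next step."""
    for i in range(num_queries):
        # The token sampled in the previous step becomes the next input token.
        input_tokens[i] = sampled_token_ids[i]

        # The sequence grows by one token; the new token sits at the last index.
        next_seq_len = seq_lens[i] + 1
        next_pos = next_seq_len - 1
        seq_lens[i] = next_seq_len
        input_positions[i] = next_pos

        # Map the logical position to a physical KV-cache slot via the block table.
        block_id = block_tables[i][next_pos // block_size]
        slot_mapping[i] = block_id * block_size + next_pos % block_size


# Example with two sequences and a block size of 4.
input_tokens = [0, 0]
input_positions = [2, 7]
seq_lens = [3, 8]
slot_mapping = [0, 0]
block_tables = [[5, 7, 0], [2, 9, 11]]

advance_step_reference(
    num_queries=2,
    block_size=4,
    input_tokens=input_tokens,
    sampled_token_ids=[101, 202],
    input_positions=input_positions,
    seq_lens=seq_lens,
    slot_mapping=slot_mapping,
    block_tables=block_tables,
)
```

Each sequence is updated independently of the others, which is why this logic is a natural fit for a single parallel kernel on the device rather than a series of separate Python-level tensor operations.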
Feedback Period.
This RFC will be open for feedback until 2025-05-18, one week after the initial submission date.
Please leave your comments, questions, or suggestions before this date. The author will address all feedback and revise the proposal accordingly if needed.
CC List.
@Yikun @wangxiyuan
Any Other Things.
No response