[Doc] Add model customization doc #1737

Open · wants to merge 1 commit into base: main
1 change: 1 addition & 0 deletions docs/source/developer_guide/modeling/index.md
@@ -7,4 +7,5 @@ This section provides tutorials of how to implement and register a new model int
:maxdepth: 1
adding_a_new_model
adding_a_new_multimodal_model
model_customization
:::
37 changes: 37 additions & 0 deletions docs/source/developer_guide/modeling/model_customization.md
@@ -0,0 +1,37 @@
# Customization of Models

This document describes the customizations made to models in vllm-ascend, including what is added or patched to each model and why.

## DeepSeek Series

### DeepSeek-V2

In the DeepSeek-V2 model, `CustomDeepseekV2MLAAttention`, `CustomDeepseekV2MoE`, and `CustomDeepseekV2MLP` are used inside `CustomDeepseekV2DecoderLayer`.

In the `CustomDeepseekV2MLAAttention` layer, multi-stream MLA can be enabled for `npu_prefetch()`, and the forward code is customized for the torchair graph mode.

In the `CustomDeepseekV2MoE` layer, customized `AscendFusedMoE` ops replace the expert layers.

In the `CustomDeepseekV2MLP` layer, customized layers are used, including `CustomDeepseekV2MergedReplicatedLinear` and `CustomDeepseekV2SiluAndMul` (which calls `torch_npu.npu_dequant_swiglu_quant()`), for better performance on NPU devices.
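
To make the composition concrete, below is a minimal sketch of how these customized submodules could fit together in a decoder layer. The class names match this doc, but the constructor arguments, placeholder stand-ins, and forward wiring are illustrative assumptions, not the actual vllm-ascend implementation.

```python
import torch
import torch.nn as nn

# Placeholder stand-ins; the real classes live in vllm-ascend's DeepSeek-V2
# modeling code and carry the NPU-specific logic described above.
class CustomDeepseekV2MLAAttention(nn.Identity): ...
class CustomDeepseekV2MoE(nn.Identity): ...
class CustomDeepseekV2MLP(nn.Identity): ...

class CustomDeepseekV2DecoderLayer(nn.Module):
    """Simplified composition: MLA attention plus a dense MLP or sparse MoE."""

    def __init__(self, hidden_size: int, is_sparse: bool) -> None:
        super().__init__()
        self.input_layernorm = nn.RMSNorm(hidden_size)
        self.self_attn = CustomDeepseekV2MLAAttention()
        self.post_attention_layernorm = nn.RMSNorm(hidden_size)
        # Sparse layers use the AscendFusedMoE-based MoE module;
        # dense layers use the customized MLP.
        self.mlp = CustomDeepseekV2MoE() if is_sparse else CustomDeepseekV2MLP()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = hidden_states + self.self_attn(
            self.input_layernorm(hidden_states))
        hidden_states = hidden_states + self.mlp(
            self.post_attention_layernorm(hidden_states))
        return hidden_states
```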

### DeepSeek-MTP

In the DeepSeek-MTP model, `CustomDeepSeekMultiTokenPredictorLayer` is added, and the original MTP layers are replaced with `CustomDeepseekV2DecoderLayer`.
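
A hedged sketch of that swap is shown below. The attribute name `mtp_block` is an assumption for illustration; the point is that the MTP transformer block reuses the customized V2 decoder layer and therefore inherits its NPU customizations.

```python
import torch.nn as nn

class CustomDeepseekV2DecoderLayer(nn.Identity):
    """Placeholder stand-in for the customized decoder layer above."""

class CustomDeepSeekMultiTokenPredictorLayer(nn.Module):
    """Sketch: the MTP block is the customized DeepSeek-V2 decoder layer."""

    def __init__(self) -> None:
        super().__init__()
        # Hypothetical attribute name; the original MTP block is replaced
        # with the customized decoder layer.
        self.mtp_block = CustomDeepseekV2DecoderLayer()
```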

### DeepSeek-DBO

In the DeepSeek-DBO model, if `VLLM_ASCEND_ENABLE_DBO` is enabled, the multi-stream feature can be used to optimize the forward pass of sparse layers. Specifically, an `ms_pre_layer` is added before the MoE layers and an `ms_post_layer` after them.

In addition, `CustomDeepseekDBOMLAAttention`, `CustomDeepseekDBOMoE`, and `CustomDeepseekDBOMLP` are added to the decoder layers. Find more details in the [code](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/models/deepseek_dbo.py).
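
The sketch below illustrates the intent under stated assumptions: `ms_pre_layer` and `ms_post_layer` are modeled as opaque modules that split work across streams before the MoE forward and merge it afterwards; the gating on `VLLM_ASCEND_ENABLE_DBO` mirrors this doc, while the wrapper class itself is hypothetical.

```python
import os
import torch.nn as nn

class SparseLayerWithDBO(nn.Module):
    """Hypothetical wrapper showing where ms_pre_layer/ms_post_layer sit."""

    def __init__(self, moe: nn.Module, ms_pre_layer: nn.Module,
                 ms_post_layer: nn.Module) -> None:
        super().__init__()
        self.enable_dbo = os.environ.get("VLLM_ASCEND_ENABLE_DBO", "0") == "1"
        self.ms_pre_layer = ms_pre_layer    # splits micro-batches across streams
        self.moe = moe
        self.ms_post_layer = ms_post_layer  # merges the streams back together

    def forward(self, hidden_states):
        if not self.enable_dbo:
            return self.moe(hidden_states)
        hidden_states = self.ms_pre_layer(hidden_states)
        hidden_states = self.moe(hidden_states)
        return self.ms_post_layer(hidden_states)
```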

## Qwen Series

### Qwen2/2.5-VL

In the Qwen2-VL model, `AscendQwen2VisionTransformer` pads the attention weights for a performance improvement on NPU devices. In addition, the attention ops in the ViT modules are replaced with `torch_npu._npu_flash_attention_unpad()`.
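
The core idea of the padding is sketched below under assumptions: the helper, shapes, and padded size are illustrative rather than the actual `AscendQwen2VisionTransformer` code, but they show how a fused QKV weight can be zero-padded along the per-head dimension to an NPU-friendly size.

```python
import torch
import torch.nn.functional as F

def pad_qkv_weight(weight: torch.Tensor, num_heads: int, head_dim: int,
                   padded_head_dim: int) -> torch.Tensor:
    """Zero-pad a fused QKV weight of shape (3 * num_heads * head_dim, hidden)
    along the per-head dimension (illustrative only)."""
    hidden = weight.shape[-1]
    w = weight.view(3, num_heads, head_dim, hidden)
    # Pad the head_dim axis (second to last) from head_dim to padded_head_dim.
    w = F.pad(w, (0, 0, 0, padded_head_dim - head_dim))
    return w.reshape(3 * num_heads * padded_head_dim, hidden)

# Example: pad 80 -> 128 so attention kernels see an aligned head size.
qkv = torch.randn(3 * 16 * 80, 1280)
padded = pad_qkv_weight(qkv, num_heads=16, head_dim=80, padded_head_dim=128)
assert padded.shape == (3 * 16 * 128, 1280)
```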

The customization of the Qwen2.5-VL model is very similar to that of Qwen2-VL; find more details in the [code](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/models/qwen2_5_vl.py). In addition, an unpadded version of Qwen2.5-VL is added to vllm-ascend to avoid errors in the verl scenario; set `USE_OPTIMIZED_MODEL=False` to use it.
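
For example, a minimal way to opt into the unpadded variant is to set the environment variable before vLLM loads the model (the model name below is illustrative):

```python
import os

# Select the unpadded Qwen2.5-VL variant, e.g. for the verl scenario.
# Must be set before vLLM imports the vllm-ascend model definitions.
os.environ["USE_OPTIMIZED_MODEL"] = "False"

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct")  # illustrative model name
```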

### Qwen3-MoE

In the Qwen3-MoE model, only `packed_modules_mapping` is added to the modeling class; everything else stays the same as upstream.
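
As a rough sketch, the customization could look like the snippet below; the upstream import path and the mapping entries follow common vLLM conventions and are assumptions, not a copy of the vllm-ascend source.

```python
from vllm.model_executor.models.qwen3_moe import Qwen3MoeForCausalLM

class CustomQwen3MoeForCausalLM(Qwen3MoeForCausalLM):
    # Tell quantization/LoRA code which per-projection weights are fused
    # into a single packed module (entries are the usual vLLM convention).
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "gate_up_proj": ["gate_proj", "up_proj"],
    }
```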