Significant inference speed difference between vllm-ascend and MindIE #1621

@Fly-Pluche

Description

Your current environment

Environment: two machines, each with 8 × Ascend 910B (64 GB) NPUs.

MindIE image: mindie_2.0.T18.B010-800I-A2-py3.11-openeuler24.03-lts-aarch64

vllm-ascend image: 0.8.5rc1-torch_npu2.5.1-cann8.1.rc1-python3.10-oe2203lts

Model weights: DeepSeek-V3-0324-w8a8-modelers (https://modelers.cn/models/Modelers_Park/DeepSeek-V3-0324-w8a8/tree/main)

Describe the problem

We tested inference speed with the same model weights under both frameworks and observed roughly a 10× gap:

vllm-ascend: approximately 1.4 tokens/s
MindIE: approximately 14 tokens/s
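
For reference, the report does not include the benchmark script, so below is a minimal sketch of how single-request decode throughput could be measured with vLLM's offline API. The model path, prompt, max_tokens, and single-node tensor_parallel_size=8 launch are illustrative assumptions (the actual setup spans two nodes).

```python
# Minimal throughput sketch using vLLM's offline generate API.
# Assumptions not in the original report: local weight path, prompt,
# max_tokens, and a single-node tensor-parallel launch.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/DeepSeek-V3-0324-w8a8",  # assumed local weight directory
    tensor_parallel_size=8,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
prompt = "Explain the difference between tensor and pipeline parallelism."

start = time.perf_counter()
outputs = llm.generate([prompt], sampling_params)
elapsed = time.perf_counter() - start

# Count only the generated tokens, then report decode throughput.
generated_tokens = len(outputs[0].outputs[0].token_ids)
print(f"{generated_tokens} tokens in {elapsed:.2f}s "
      f"-> {generated_tokens / elapsed:.2f} tokens/s")
```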
