[Perf] Add new npu_fused_infer_attention_score op to improve perfomance in splitfuse cases and resolve long-seq mask problems #1281

Workflow file for this run

.github/workflows/vllm_ascend_test_pd.yaml at 3be8a40

	#
	# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
	# This file is a part of the vllm-ascend project.
	#
	# Licensed under the Apache License, Version 2.0 (the "License");
	# you may not use this file except in compliance with the License.
	# You may obtain a copy of the License at
	#
	# http://www.apache.org/licenses/LICENSE-2.0
	#
	# Unless required by applicable law or agreed to in writing, software
	# distributed under the License is distributed on an "AS IS" BASIS,
	# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	# See the License for the specific language governing permissions and
	# limitations under the License.
	#
	name: 'e2e test / pd-disaggregation'

	on:
	schedule:
	# Runs at 23:00 UTC (7:00 AM Beijing) every day
	- cron: '0 23 * * *'
	pull_request:
	types: [ labeled ]

	# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
	# declared as "shell: bash -el {0}" on steps that need to be properly activated.
	# It's used to activate ascend-toolkit environment variables.
	defaults:
	run:
	shell: bash -el {0}

	# only 1 job can runs on static-8-01-cards
	concurrency:
	group: static-8-01-cards
	cancel-in-progress: false

	jobs:
	prefilling-decoding-disaggregation:
	# pd-test will be triggered when tag 'pd-test' & 'ready-for-test' or schedule job
	if: ${{ contains(github.event.pull_request.labels..name, 'pd-test') && contains(github.event.pull_request.labels..name, 'ready-for-test') \|\| github.event_name == 'schedule' }}
	strategy:
	matrix:
	vllm_verison: [
	main,
	v0.9.1
	]
	name: vLLM Ascend prefilling decoding disaggregation test
	runs-on: linux-arm64-npu-static-8

	container:
	image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.2.rc1-910b-ubuntu22.04-py3.11
	volumes:
	- /usr/local/dcmi:/usr/local/dcmi
	- /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
	- /usr/local/Ascend/driver/:/usr/local/Ascend/driver/
	# Use self-host cache speed up pip and model download
	- /home/action/.cache:/github/home/.cache/
	options: >-
	--device /dev/davinci0
	--device /dev/davinci1
	--device /dev/davinci_manager
	--device /dev/devmm_svm
	--device /dev/hisi_hdc
	env:
	VLLM_USE_MODELSCOPE: True
	steps:
	- name: Check npu and CANN info
	run: \|
	npu-smi info
	cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info

	- name: Config mirrors
	run: \|
	# keep using tuna's proxy since linux-arm64-npu-static-8 is in another region
	sed -i 's\|ports.ubuntu.com\|mirrors.tuna.tsinghua.edu.cn\|g' /etc/apt/sources.list
	pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
	apt-get update -y
	apt install git -y
	git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/

	- name: Checkout vllm-project/vllm-ascend repo
	uses: actions/checkout@v4

	- name: Install system dependencies
	run: \|
	apt-get -y install `cat packages.txt`
	apt-get -y install gcc g++ cmake libnuma-dev

	- name: Checkout vllm-project/vllm repo
	uses: actions/checkout@v4
	with:
	repository: vllm-project/vllm
	ref: ${{ matrix.vllm_verison }}
	path: ./vllm-empty

	- name: Install vllm-project/vllm from source
	working-directory: ./vllm-empty
	run: \|
	VLLM_TARGET_DEVICE=empty pip install -e .

	- name: Install vllm-project/vllm-ascend
	env:
	PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
	run: \|
	pip install -r requirements-dev.txt
	pip install -v -e .

	- name: Run vllm-project/vllm-ascend PD Disaggregation edge test
	run: \|
	git config --global --add safe.directory/__w/vllm-ascend/vllm-ascend
	bash tests/e2e/pd_disaggreate/run_edge_case_test.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Perf] Add new npu_fused_infer_attention_score op to improve perfomance in splitfuse cases and resolve long-seq mask problems #1281

Workflow file

[Perf] Add new npu_fused_infer_attention_score op to improve perfomance in splitfuse cases and resolve long-seq mask problems #1281

Uh oh!

Jobs

Run details

Workflow file for this run