
Conversation

Contributor

@Mercykid-bash commented Sep 19, 2025

Purpose

This Pull Request enhances the EPLB (Expert Parallelism Load Balancing) system by introducing a novel balancing algorithm: FlashLB.

Motivation

  1. The default algorithm adopts a two-stage greedy strategy:
    a. Replica allotment: Determine the number of expert replicas by minimizing the maximum load per replica (Min Max Replica, MMR).
    b. Replica placement: Distribute replicas across devices by repeatedly assigning the heaviest replica to the least loaded device (Longest Processing Time First, LPT).

    However, this sequential process lacks inter-stage collaborative optimization, often leading to suboptimal load balancing. For example, consider the simple case illustrated below: given 8 logical experts with hotness values of 600, 560, 120, 120, 20, 10, 10, 10, and 2 replica slots per device across 8 devices, the default EPLB algorithm yields a maximum per-device hotness of 232, whereas FlashLB reduces this value to 205 (a code sketch of the baseline follows this list).

  2. The default algorithm relies on the averaged expert hotness over a fixed time window for optimization. While this provides a coarse approximation of the hotness distribution, it fails to capture oscillatory deviations and temporal correlations of expert hotness observed across iterations in real-world scenarios, limiting optimization quality.

  3. The default algorithm periodically regenerates the expert placement table. However, it regenerates the table independently for each layer, and the new table does not account for correlations with the previous one; together, these two factors lead to nearly full-scale expert reassignment.
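
To make the baseline in item 1 concrete, below is a minimal sketch of the two-stage greedy strategy (MMR replica allotment followed by LPT placement) applied to that example. It is an illustrative reconstruction with assumed tie-breaking rules, not the vLLM Ascend EPLB implementation; the hotness values and the 8-device, 2-slot setup are taken from the example above.

```python
# Two-stage greedy baseline: MMR replica allotment, then LPT placement.
hotness = [600, 560, 120, 120, 20, 10, 10, 10]
num_devices, slots_per_device = 8, 2
total_slots = num_devices * slots_per_device  # 16 replica slots in total

# Stage 1: Min Max Replica (MMR) -- repeatedly grant an extra replica to the
# expert whose current per-replica load is the largest.
replicas = [1] * len(hotness)
for _ in range(total_slots - len(hotness)):
    worst = max(range(len(hotness)), key=lambda i: hotness[i] / replicas[i])
    replicas[worst] += 1

# Stage 2: Longest Processing Time first (LPT) -- place replicas, heaviest
# first, onto the least-loaded device that still has a free slot.
replica_loads = sorted(
    (hotness[i] / r for i, r in enumerate(replicas) for _ in range(r)),
    reverse=True)
device_load = [0.0] * num_devices
device_free = [slots_per_device] * num_devices
for load in replica_loads:
    dev = min((d for d in range(num_devices) if device_free[d] > 0),
              key=lambda d: device_load[d])
    device_load[dev] += load
    device_free[dev] -= 1

print(f"max per-device hotness: {max(device_load):.0f}")  # 232 for this example
```

The greedy pipeline commits to replica counts before it knows how they will pack onto the two-slot devices; closing that inter-stage gap is exactly what FlashLB's joint optimization targets (205 vs. 232 in this example).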

FlashLB Algorithm Principle

  1. Joint Optimization
    FlashLB achieves joint optimization of replica allotment and placement through group-based decision-making. Each group gradually determines the replica count and placement for a subset of experts, ensuring that the expected inter-device load balance (considering both deployed and pending expert replicas) is holistically optimized. To attain superior load balancing, FlashLB employs tree search to expand the solution space while integrating pruning and precompilation techniques for acceleration, thereby delivering load balancing that is both high-quality and practically efficient.

  2. Multi-Shot Enhancement
    FlashLB partitions each profiling interval (e.g., 1024 iterations) into consecutive smaller sub-intervals (e.g., 16 iterations), each capturing independent hotness measurements. It then performs multi-shot optimization to co-optimize these sub-intervals simultaneously—enabling adaptation to time-variant expert hotness while enhancing robustness.

  3. Incremental Adjustment
    To reduce the overhead of frequent expert re-deployment, FlashLB introduces an incremental adjustment scheme operating at both inter-layer and intra-layer levels:
    a. Inter-Layer: Hotness variations are tracked at the layer level. Only layers with fluctuations exceeding a predefined threshold trigger re-computation of expert placement, avoiding unnecessary redeployment for stable layers;
    b. Intra-Layer (Optional): A lightweight incremental LPT algorithm (LPT-Incremental) is applied. Instead of recomputing full placement for all experts in a layer, it selectively adjusts only the hottest experts or those with replica count changes, further reducing migration overhead.

    This incremental strategy significantly reduces adjustment costs while maintaining balanced performance across layers and devices; a sketch of the inter-layer trigger is given below.
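
As a concrete illustration of the inter-layer trigger in item 3a, here is a minimal sketch (an assumption for illustration, not the actual FlashLB code; the threshold value and the total-variation drift metric are placeholders): each layer's expert-hotness distribution from the latest profiling window is compared against the distribution its current placement was computed from, and only layers whose drift exceeds the threshold are re-balanced.

```python
import numpy as np

def layers_to_rebalance(prev_hotness: np.ndarray,
                        curr_hotness: np.ndarray,
                        threshold: float = 0.2) -> list:
    """Return indices of layers whose expert-hotness distribution drifted too far.

    prev_hotness, curr_hotness: arrays of shape [num_layers, num_experts], e.g.
    hotness accumulated over the sub-intervals of a profiling window.
    """
    prev = prev_hotness / prev_hotness.sum(axis=1, keepdims=True)
    curr = curr_hotness / curr_hotness.sum(axis=1, keepdims=True)
    drift = 0.5 * np.abs(curr - prev).sum(axis=1)  # total-variation distance per layer
    return [layer for layer, d in enumerate(drift) if d > threshold]

# Example: two layers, four experts; only layer 1 changes noticeably.
prev = np.array([[400., 300., 200., 100.], [250., 250., 250., 250.]])
curr = np.array([[410., 290., 205., 95.], [600., 200., 100., 100.]])
print(layers_to_rebalance(prev, curr))  # -> [1]
```

Stable layers keep their existing placement table, so only the drifted layers incur expert migration.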

Co-author:

Co-authored-by: Skywalker-EP 173723846@qq.com


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces FlashLB, a new and sophisticated load balancing algorithm. The implementation leverages Numba for performance, which is a good choice for the complex numerical computations involved. However, I've identified several critical issues in the implementation that could lead to incorrect behavior, crashes, or instability. These include the use of shared class state, improper error handling that terminates the process, multiple potential division-by-zero errors, and unhandled edge cases that can cause crashes. I have provided detailed comments and suggestions to address these critical problems. Additionally, there's a high-severity issue with a module-level warmup function that causes a significant side effect on import. Addressing these points is crucial for the stability and correctness of the new policy.

FlashLB.par_history = defaultdict(float)
FlashLB.hotness_window = {}

warm_up()
Contributor

high

Calling warm_up() at the module level is a significant side effect that occurs upon import. This can make module loading slow and unpredictable. This function, which seems to be for JIT compilation warmup, should be called explicitly by the application that initializes the load balancing system, rather than being a side effect of importing this module.

Suggested change
warm_up()
# warm_up()

Contributor Author

The warm-up call at the module level is intentionally designed to trigger JIT compilation during import rather than at inference time. This approach ensures that numba's just-in-time compilation overhead is incurred upfront during module loading, preventing unexpected latency spikes during actual inference operations.

While module imports may take slightly longer, this trade-off is necessary to guarantee consistent performance during runtime, where low and predictable latency is critical for the load balancing system's effectiveness. Explicitly deferring this warm-up to application initialization could lead to missed compilation steps and subsequent runtime delays if the initialization sequence is not strictly followed.

The current implementation prioritizes inference-time performance predictability over import speed, which aligns with the requirements of latency-sensitive load balancing operations.

algo.rebalance_experts(expert_tensor, torch.randint(1, 1000, (58, 32, 9)))


warm_up()
Collaborator

Thanks for the contribution. Only one question: should warm_up be called when the module is imported? Is there any way to make it run only when the policy is set?

Contributor Author

Another approach is to incorporate a check within policy_factory.py that triggers the warm_up operation exclusively when the RealTimeLB policy is selected.

This way, the warm_up process will only run if and when the RealTimeLB policy is actually instantiated (i.e., when this specific policy is set), avoiding unnecessary execution for other policy types.

Would this align with your intended implementation? Please let me know if you need further adjustments.

class PolicyFactory:

    @staticmethod
    def generate_policy(policy_type: int, config: DynamicConfig) -> EplbPolicy:
        policy = {
            # Constraint applying Dynamic EPLB policy V2:
            # If there exists a redundant expert, only one redundant expert can be
            # placed on one NPU and its physical expert index must be 0.

            # Applying greedy d2d expert weight update composing
            0: RandomLoadBalance,  # RandomLoadBalance: shuffle last physical expert on NPU 1 and 3
            1: DynamicEplb,  # Dynamic EPLB policy: overall expert replacement based on current moe load
            2: DynamicEplbV2,  # Dynamic EPLB policy V2: expert replacement with constrained number of expert shuffles
            3: FlashLB,  # FlashLB EPLB policy: expert replacement based on Joint Optimization, Multi-Shot Enhancement and Incremental Adjustment
        }
        policy_class = policy.get(policy_type, RandomLoadBalance)
        policy_instance = policy_class(config)
        if policy_type == 3:
            policy_instance.warm_up()
        return policy_instance
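
For reference, here is a hypothetical usage of this factory (an assumption for illustration: it presumes `DynamicConfig` can be constructed with defaults and the policy classes above are importable), showing that the warm-up cost is only paid when the FlashLB policy is actually selected:

```python
config = DynamicConfig()

# policy_type 3 selects FlashLB, so warm_up() (Numba JIT compilation) runs once here.
flash_lb = PolicyFactory.generate_policy(policy_type=3, config=config)

# Any other policy type skips warm_up() entirely.
random_lb = PolicyFactory.generate_policy(policy_type=0, config=config)
```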

sdmyzlp and others added 13 commits September 22, 2025 21:11
…t#3005)

Add a missing barrier when no implicit synchronization by `repeat_interleave`
is available. Otherwise, the `non_blocking=True` copy of `output_splits`
and `input_splits` from NPU may fail to complete before the later
`async_all_to_all` uses them.

### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@ef7eefe

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Mercykid-bash and others added 25 commits September 22, 2025 21:11
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
### What this PR does / why we need it?
Register `AscendSharedFusedMoE` custom op.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`DeepSeek-V2-Lite` is a MoE model with shared experts.

Test:

```bash
vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite \
--trust-remote-code \
--enforce-eager \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95

curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite",
        "messages": [
            {"role": "user", "content": "介绍一下联通公司?"}
        ],
        "stream": false,
        "max_tokens": 100
    }'
```

Output:

```bash
中国联合网络通信集团有限公司(简称“中国联通”)于2009年1月6日在原中国网通和原中国联通的基础上合并组建而成,在国内31个省(自治区、直辖市)和境外多个国家和地区设有分支机构,是中国唯一一家在纽约、香港、上海三地同时上市的电信运营企业,连续多年入选“世界500强企业”。\n\n中国联通主要经营固定通信业务,移动通信业务,国内
```

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@486c559

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
### What this PR does / why we need it?
Increase the doctest timeout to 300s and add time printing. According to the time
print in vllm-project#3045, most of the time is
consumed in `Graph capturing`, so it's fine to increase the doctest timeout.

This PR also adds a time log for each task.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Run `/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`
- CI passed

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@a684c01

Closes: vllm-project#3045

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
vllm-project#3021)

### What this PR does / why we need it?

Some custom models in vllm-ascend define packed_modules_mapping, which
prevents keeping the same model class as the vLLM community. So move these
custom packed_modules_mapping entries to quant utils.py. After this PR, some
custom models can be removed.

### Does this PR introduce _any_ user-facing change?

tested by CI

### How was this patch tested?

tested by CI

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@5089fd7

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
### What this PR does / why we need it?
According to issue [vllm-project#1298](vllm-project#1298), this pull
request adds unit test code for compilation/acl_graph.py.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@f2718d2

---------

Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…ze (vllm-project#2830)

### What this PR does / why we need it?
Fix shape mismatch when testing LLM-Research/Phi-4-mini-instruct accuracy

### Does this PR introduce _any_ user-facing change?

Users can't set a dynamic batch_size or use lm_eval to test accuracy when
using models with sliding_window

### How was this patch tested?
accuracy of LLM-Research/Phi-4-mini-instruct is OK:
```
vllm (pretrained=LLM-Research/Phi-4-mini-instruct,max_model_len=4096,dtype=auto,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8105|±  |0.0108|
|     |       |strict-match    |     5|exact_match|↑  |0.8097|±  |0.0108|
```

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@3c96e7b

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…lm-project#3047)

### What this PR does / why we need it?
1. update expected accuracy for DeepSeek-V2-Lite
2. add batch size

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Accuracy CI passed

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@838d711

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
### What this PR does / why we need it?
This PR prepares for deleting the environment variable
`VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE`, as vllm requires `fullgraph=True`
to run

- Fixes vllm-project/vllm#21834

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
See CI

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@99cc41a

---------

Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…llm-project#2907)

### What this PR does / why we need it?
1. This PR bumps the vllm commit to
vllm-project/vllm@6d8246a
2. fix upstream changes from vllm-project/vllm#24548
(abort multi-modal kwargs), making both vllm main and `v0.10.2` adaptable
3. fix metadata_builder changes introduced by
vllm-project/vllm#23693
4. fix `structured_outputs_config` changes introduced by
vllm-project/vllm#22772
5. fix `moe_config` changes introduced by
vllm-project/vllm#22537

Co-authored-by:  MengqingCao <cmq0113@163.com>
Co-authored-by:  Yikun Jiang <yikunkero@gmail.com>

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@c60e613

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…m-project#3067)

### What this PR does / why we need it?
Bump main to
vllm-project/vllm@c60e613

- Updated imports in `vllm.config` to
`vllm.config.model`(vllm-project/vllm@aed1687)
vllm-project/vllm#25252

- Refactored `vllm_ascend/sample/sampler.py` to use string values for
`logprobs_mode` instead of the `LogprobsMode` enum, simplifying logprobs
mode handling and improving compatibility with recent vLLM changes
(vllm-project/vllm@aed1687)
vllm-project/vllm#25252

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@6d8246a

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…oject#3064)

### What this PR does / why we need it?
1. Refactor CI to reuse the base workflow and enable the main 2-hour trigger
job:
- Extract the e2e test into _e2e_test.yaml
- Reuse _e2e_test in the light / full jobs
- Enable the main 2-hour trigger job

2. Rename e2e test to ascend test to make the action display label clear
3. Re-enable ut coverage, which had failed since
vllm-project@5bcb4c1
and was disabled in
vllm-project@6d8bc38

### Does this PR introduce _any_ user-facing change?
Only developer behavior changes:
- Every job triggers the full test with the vllm release and hash
- Run the full job every 2 hours with vllm main
- e2e light test (30 mins): `lint` (6 mins) ---> ut (10 mins) --->
`v0.10.2 + main / 4 jobs` (15 mins)
- e2e full test (1.5h): `ready label` ---> `v0.10.2 + main / 4 jobs`,
about 1.5h
- scheduled test: every 2 hours ---> `v0.10.2 + main / 4 jobs`, about 1.5h

### How was this patch tested?
CI passed

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@c60e613

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
)

### What this PR does / why we need it?
Follow-up on vllm-project#3064:
1. limit the vllm version to the same hash as mypy
2. fix the vllm version bug for the e2e light test.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
CI passed

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@c60e613

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…and Qwen3NextAttention (vllm-project#3019)

### What this PR does / why we need it?
remove redundant Qwen3NextSparseMoeBlock and Qwen3NextAttention

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
```
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The future of AI is",
    ]

    sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        # model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-30B-A3B",
        model="Qwen/Qwen3-Next-80B-A3B-Instruct",
              tensor_parallel_size=4,
              enforce_eager=True,
              trust_remote_code=True,
              max_model_len=256,
              gpu_memory_utilization=0.7,
              block_size=64,
              )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@9d1c50a

---------

Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…zed deepseek with unquantized MTP layer (vllm-project#3068)

### What this PR does / why we need it?
While running quantized deepseek models with an unquantized MTP layer, free
NPU memory abnormally decreases by `2*HCCL_BUFFSIZE` bytes. This
results from a wasted VRAM buffer allocation caused by calling
`dist.all_to_all_single` without the correct device process group argument.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We ran vllm online serving with quantized deepseek-r1 and an unquantized
MTP layer, and observed that free memory increased without a redundant VRAM
buffer for the HCCL communication op (all_to_all_single).

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@6d8246a

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Bumps [actions/labeler](https://github.com/actions/labeler) from 5 to 6.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@c60e613

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
### What this PR does / why we need it?
Update the format of the accuracy report

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@c60e613

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…ce in splitfuse cases and resolve long-seq mask problems (vllm-project#2962)

### What this PR does / why we need it?
Add a new npu_fused_infer_attention_score op to improve performance in
splitfuse cases and resolve long-seq mask problems.

1. The original op's performance is suboptimal in certain scenarios,
necessitating optimization through the _new op_
(npu_fused_infer_attention_score).
2. For ultra-long sequences (128k), the original operator will allocate
a large attn_mask, which consumes excessive CPU memory. In contrast, the
_new op_ supports a fixed-size compressed mask, effectively resolving
this issue.

NOTE1: The current PR retains the original logic and uses a version
check of the CANN package to determine whether the _new op_ can be
enabled. This ensures no impact on existing users. In future versions,
this version check and the original logic will be deprecated, and the
_new op_ scheduling will be uniformly adopted.
NOTE2: This PR relies on a future CANN version, which is not available
yet.
NOTE3: To enable the new op in chunked prefill, the parameter
additional_config should be set like `--additional-config
'{"ascend_scheduler_config":
{"enabled":true,"enable_chunked_prefill":true}}' \` at least.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@6c5f82e

---------

Signed-off-by: tangtianyi <tangtianyi4@huawei.com>
Signed-off-by: Angazenn <supperccell@163.com>
Co-authored-by: Angazenn <supperccell@163.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
### What this PR does / why we need it?
Follow up on the `UniformTypeKVCacheSpecs` changes introduced by
vllm-project/vllm#25101, which support different
hidden sizes in uniform type kvcache specs.

This also fixes the CI issue about `TypeError: AttentionGroup.__init__()
missing 1 required positional argument: 'kv_cache_spec'`

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Tests passed with existing e2e tests.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@c60e613

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…m-project#2128)

Note: This depends on [vLLM
#25161](vllm-project/vllm#25161) and the
torch_npu release from September 30.

### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA
models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution
graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Capturing the whole model as
one static graph also mitigates the dispatch fluctuations across
devices.
* **Stream/resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in
`torch_npu`, which can deadlock when synchronizing during graph replay —
we’re working on a fix.

There may be other corner cases. This PR is the first in a planned
series; we’ll continue to iterate and address remaining issues in
follow-ups.

This is essentially a port of vllm-project#1503 and vllm-project#1677, but includes two major
changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding
it in the backend, which decouples Full Graph and Piecewise Graph and
could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in
`update_graph_params`; multi-attention models may or may not be fully
supported yet.

### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```

### How was this patch tested?
Tests included.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@9607d5e

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…m-project#3094)

### What this PR does / why we need it?
This PR removes the redundant log prints in register_custom_ops.py
to make the output clearer.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@9607d5e

Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…ject#3001)

### What this PR does / why we need it?
Fix issues mentioned in
vllm-project#2791 and some minor
refactoring.
1. Use Enum instead of string.
2. Avoid setting a new property to forward_context in
AscendFusedMoE.forward().
3. Enabling TokenDispatcherWithMoge.
4. Remove redundant code.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing:
1. Enable/Disable EP
2. Aclgraph & eager

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@9607d5e

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
…ect#3087)

### What this PR does / why we need it?
A new kv_role "kv_both" is added to run mixed deployment scenarios. The
mixed deployment will involve a decode phase, where with_prefill should
be false.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@c60e613

Signed-off-by: fems14 <1804143737@qq.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
@wangxiyuan added the `ready` (read for review) and `ready-for-test` (start test by label for PR) labels on Sep 23, 2025
@wangxiyuan merged commit 29c173a into vllm-project:main on Sep 23, 2025
50 checks passed