
Conversation


@clrs97 clrs97 commented Sep 15, 2025

What this PR does / why we need it?

To cut down the memory usage of large weight matrices, we currently have to choose among the existing linear implementations, each with drawbacks:

  • ReplicatedLinear: Stores the entire matrix on every device, consuming excessive memory.
  • RowParallelLinear: Requires an all_reduce to merge partial results, introducing additional communication overhead and potential accuracy loss. Each token is processed across multiple devices rather than on a single device, which is undesirable in sequence-parallel (SP) scenarios.
  • ...

This PR introduces a new kind of linear operation, LayerShardLinear, which combines the advantages of the existing approaches (see the sketch after this list):

  • It evenly distributes a set of layers with identical structure across devices. Each layer keeps its complete weights on one device, eliminating redundant memory usage, and no merge of partial results is needed.
  • It supports asynchronous broadcasting to prefetch weights for upcoming layers.
  • It preserves the custom process_weights_after_loading() method, making it possible to keep weights in NZ format.
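
To make the idea concrete, here is a minimal sketch of the layer-shard pattern. It is not the PR's implementation: the class name, constructor, and method names below are illustrative, and it assumes a torch.distributed process group that spans all participating ranks. Each rank keeps the full weights of the layers it owns and asynchronously broadcasts a layer's weights to the other ranks before they are needed, so the matmul never waits on communication.

```python
import torch
import torch.distributed as dist


class ShardedLayerWeights:
    """Each rank keeps full weights only for the layers it owns; other
    layers' weights are broadcast from their owner into a scratch buffer."""

    def __init__(self, num_layers: int, out_features: int, in_features: int,
                 group: dist.ProcessGroup, device: torch.device) -> None:
        self.group = group
        rank = dist.get_rank(group)
        world = dist.get_world_size(group)
        # Layer i stays resident on rank i % world_size.
        self.owner = [i % world for i in range(num_layers)]
        self.resident = {
            i: torch.empty(out_features, in_features, device=device)
            for i in range(num_layers) if self.owner[i] == rank
        }
        # Scratch buffer for receiving a non-resident layer's weight.
        # A real implementation would double-buffer so the next layer can be
        # prefetched while the current one is still being used.
        self.scratch = torch.empty(out_features, in_features, device=device)
        self._pending = None  # (layer_idx, async broadcast work handle)

    def prefetch(self, layer_idx: int) -> None:
        """Start an asynchronous broadcast of this layer's weight."""
        buf = self.resident.get(layer_idx, self.scratch)
        # Assumes the group spans all ranks, so group rank == global rank.
        work = dist.broadcast(buf, src=self.owner[layer_idx],
                              group=self.group, async_op=True)
        self._pending = (layer_idx, work)

    def get(self, layer_idx: int) -> torch.Tensor:
        """Wait for the pending prefetch (if any) and return the weight."""
        if self._pending is not None and self._pending[0] == layer_idx:
            self._pending[1].wait()
            self._pending = None
        return self.resident.get(layer_idx, self.scratch)
```

In the PR, reach_layer() roughly plays the role of prefetch() here, and the forward pass of the sharded layer plays the role of get().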

Does this PR introduce any user-facing change?

No

How was this patch tested?

vLLM main: vllm-project/vllm@f4a948f

Co-authored-by: CalvinXKY <kyxiezju@163.com>
Signed-off-by: clrs97 <524936896@qq.com>

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces LayerShardLinear, a new linear operation to reduce memory usage by sharding layers across devices and prefetching weights. The implementation is comprehensive, but there are a few critical issues related to handling small layer clusters and a high-risk monkey-patching pattern that should be addressed to ensure correctness and maintainability.


After loading the model, you must call `post_process_after_loading_for_cluster()` to complete the initialization.

Each time a new layer is reached, you must call `reach_layer()` to prefetch the weights.


Maybe we should add an example of this calling relationship to the tests.
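
A minimal usage sketch of that calling relationship (only post_process_after_loading_for_cluster() and reach_layer() come from this PR; the loader, the loop bounds, and the reach_layer() argument are placeholders):

```python
# Hypothetical driver code, not taken from the PR.
model = load_model(vllm_config)               # assumed loading entry point
post_process_after_loading_for_cluster()      # finalize sharded weights once

for layer_idx in range(start_layer, end_layer):
    reach_layer(layer_idx)                    # start async prefetch of this layer's weights
    hidden_states = model.layers[layer_idx](hidden_states)
```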

return_bias: bool = True,
cluster_name: str,
group: GroupCoordinator,
start_layer: int,


Default values are used for start_layer and end_layer if not set.
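
For illustration only, a signature sketch of how such defaults might look (parameter names come from the diff above; the Optional handling is an assumption, not the PR's actual code):

```python
from typing import Optional

class LayerShardLinear:
    # Signature sketch; other parameters are omitted.
    def __init__(
        self,
        cluster_name: str,
        group: "GroupCoordinator",
        start_layer: Optional[int] = None,  # assumed default: first layer of the cluster
        end_layer: Optional[int] = None,    # assumed default: last layer of the cluster
        return_bias: bool = True,
    ) -> None:
        self.start_layer = start_layer
        self.end_layer = end_layer
```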

clrs97 and others added 2 commits September 16, 2025 15:46
Co-authored-by: CalvinXKY <kyxiezju@163.com>
Signed-off-by: clrs97 <524936896@qq.com>
Co-authored-by: CalvinXKY <kyxiezju@163.com>
Signed-off-by: clrs97 <524936896@qq.com>