
Conversation


@clrs97 clrs97 commented Sep 15, 2025

What this PR does / why we need it?

To cut down the memory usage of large weight matrices, we currently have to choose among the existing linear implementations, each with drawbacks:

  • ReplicatedLinear: Stores the entire matrix on every device, consuming excessive memory.
  • RowParallelLinear: Requires an all_reduce to merge partial results, introducing additional communication overhead and potential accuracy loss. Each token is processed across multiple devices rather than on a single device, which is undesirable in sequence-parallel (SP) scenarios.
  • ...

This PR introduces a new kind of linear operation, LayerShardLinear, which combines the advantages of the existing approaches (see the sketch after this list):

  • It evenly distributes a set of layers with identical structure across devices. Each layer keeps its complete weights on one device, eliminating redundant memory usage, and no merge of partial results is needed.
  • It supports asynchronous broadcasting to prefetch weights for upcoming layers.
  • It preserves the custom process_weights_after_loading() method, making it possible to keep weights in NZ format.
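
To make the idea concrete, here is a minimal sketch of the layer-shard pattern. It is not the PR's implementation: the class name, constructor, and method names below are illustrative, and it assumes a torch.distributed process group that spans all participating ranks. Each rank keeps the full weights of the layers it owns and asynchronously broadcasts a layer's weights to the other ranks before they are needed, so the matmul never waits on communication.

```python
import torch
import torch.distributed as dist


class ShardedLayerWeights:
    """Each rank keeps full weights only for the layers it owns; other
    layers' weights are broadcast from their owner into a scratch buffer."""

    def __init__(self, num_layers: int, out_features: int, in_features: int,
                 group: dist.ProcessGroup, device: torch.device) -> None:
        self.group = group
        rank = dist.get_rank(group)
        world = dist.get_world_size(group)
        # Layer i stays resident on rank i % world_size.
        self.owner = [i % world for i in range(num_layers)]
        self.resident = {
            i: torch.empty(out_features, in_features, device=device)
            for i in range(num_layers) if self.owner[i] == rank
        }
        # Scratch buffer for receiving a non-resident layer's weight.
        # A real implementation would double-buffer so the next layer can be
        # prefetched while the current one is still being used.
        self.scratch = torch.empty(out_features, in_features, device=device)
        self._pending = None  # (layer_idx, async broadcast work handle)

    def prefetch(self, layer_idx: int) -> None:
        """Start an asynchronous broadcast of this layer's weight."""
        buf = self.resident.get(layer_idx, self.scratch)
        # Assumes the group spans all ranks, so group rank == global rank.
        work = dist.broadcast(buf, src=self.owner[layer_idx],
                              group=self.group, async_op=True)
        self._pending = (layer_idx, work)

    def get(self, layer_idx: int) -> torch.Tensor:
        """Wait for the pending prefetch (if any) and return the weight."""
        if self._pending is not None and self._pending[0] == layer_idx:
            self._pending[1].wait()
            self._pending = None
        return self.resident.get(layer_idx, self.scratch)
```

In the PR, reach_layer() roughly plays the role of prefetch() here, and the forward pass of the sharded layer plays the role of get().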

Does this PR introduce any user-facing change?

No

How was this patch tested?

vLLM main: vllm-project/vllm@f4a948f

Co-authored-by: CalvinXKY <kyxiezju@163.com>
Signed-off-by: clrs97 <524936896@qq.com>

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces LayerShardLinear, a new linear operation to reduce memory usage by sharding layers across devices and prefetching weights. The implementation is comprehensive, but there are a few critical issues related to handling small layer clusters and a high-risk monkey-patching pattern that should be addressed to ensure correctness and maintainability.


After loading the model, you must call `post_process_after_loading_for_cluster()` to complete the initialization.

Each time a new layer is reached, you must call `reach_layer()` to prefetch the weights.


Maybe we should add an example of this calling relationship to the tests.
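
A minimal usage sketch of that calling relationship (only post_process_after_loading_for_cluster() and reach_layer() come from this PR; the loader, the loop bounds, and the reach_layer() argument are placeholders):

```python
# Hypothetical driver code, not taken from the PR.
model = load_model(vllm_config)               # assumed loading entry point
post_process_after_loading_for_cluster()      # finalize sharded weights once

for layer_idx in range(start_layer, end_layer):
    reach_layer(layer_idx)                    # start async prefetch of this layer's weights
    hidden_states = model.layers[layer_idx](hidden_states)
```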

return_bias: bool = True,
cluster_name: str,
group: GroupCoordinator,
start_layer: int,


Default values are used for start_layer and end_layer if not set.
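
For illustration only, a signature sketch of how such defaults might look (parameter names come from the diff above; the Optional handling is an assumption, not the PR's actual code):

```python
from typing import Optional

class LayerShardLinear:
    # Signature sketch; other parameters are omitted.
    def __init__(
        self,
        cluster_name: str,
        group: "GroupCoordinator",
        start_layer: Optional[int] = None,  # assumed default: first layer of the cluster
        end_layer: Optional[int] = None,    # assumed default: last layer of the cluster
        return_bias: bool = True,
    ) -> None:
        self.start_layer = start_layer
        self.end_layer = end_layer
```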

clrs97 and others added 2 commits September 16, 2025 15:46
Co-authored-by: CalvinXKY <kyxiezju@163.com>
Signed-off-by: clrs97 <524936896@qq.com>
Co-authored-by: CalvinXKY <kyxiezju@163.com>
Signed-off-by: clrs97 <524936896@qq.com>