[Feat] Add a new kind of linear operation: LayerShardLinear #2931
Conversation
Co-authored-by: CalvinXKY <kyxiezju@163.com> Signed-off-by: clrs97 <524936896@qq.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces `LayerShardLinear`, a new linear operation that reduces memory usage by sharding layers across devices and prefetching weights. The implementation is comprehensive, but there are a few critical issues related to handling small layer clusters and a high-risk monkey-patching pattern that should be addressed to ensure correctness and maintainability.
After loading the model, you must call `post_process_after_loading_for_cluster()` to complete the initialization.

Each time a new layer is reached, you must call `reach_layer()` to prefetch the weights.
Maybe we should add an example of this calling relationship to the tests.
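For reference, here is a minimal sketch of that calling relationship. Only `LayerShardLinear`, `cluster_name`, `group`, `start_layer`/`end_layer`, `post_process_after_loading_for_cluster()`, and `reach_layer()` come from this PR; the import paths, the exact constructor signature, whether the two helpers are free functions or methods, and whether `reach_layer()` takes a layer index are all assumptions made for illustration.

```python
# Sketch only: names not shown in this PR's diff are assumptions.
from vllm.distributed import get_tp_group  # any GroupCoordinator would do here

from vllm_ascend.ops.layer_shard_linear import (  # hypothetical import path
    LayerShardLinear,
    post_process_after_loading_for_cluster,
    reach_layer,
)

num_layers = 4
layers = [
    LayerShardLinear(
        input_size=4096,
        output_size=4096,
        cluster_name="mlp_down_proj",   # layers in the same cluster share shards
        group=get_tp_group(),
        start_layer=0,
        end_layer=num_layers,
    )
    for _ in range(num_layers)
]

# ... load the model weights here ...

# 1) After loading, complete the initialization once per cluster.
post_process_after_loading_for_cluster("mlp_down_proj")

# 2) Before each layer runs, announce it so its weights are prefetched.
def run(hidden_states):
    for idx, layer in enumerate(layers):
        reach_layer(idx)
        hidden_states = layer(hidden_states)
    return hidden_states
```

Something along these lines could serve as the test the comment asks for, with the hypothetical pieces replaced by the PR's actual API.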
return_bias: bool = True,
cluster_name: str,
group: GroupCoordinator,
start_layer: int,
Default values are used for `start_layer` and `end_layer` if they are not set.
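A brief sketch of what that note implies, using the parameter names from the hunk above; the sizes, defaults, and any omitted parameters are assumptions:

```python
# Hypothetical constructor calls; only cluster_name, group, start_layer and
# return_bias appear in the diff above, the rest is assumed for illustration.
explicit = LayerShardLinear(
    input_size=4096,
    output_size=4096,
    cluster_name="attn_o_proj",
    group=get_tp_group(),
    start_layer=0,         # shard layers [0, 32) across the group
    end_layer=32,
    return_bias=False,
)

implicit = LayerShardLinear(
    input_size=4096,
    output_size=4096,
    cluster_name="attn_o_proj",
    group=get_tp_group(),  # start_layer / end_layer fall back to their defaults
)
```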
What this PR does / why we need it?
To cut down the memory usage of large weight matrices, we often rely on various linear operations:

- `ReplicatedLinear`: stores the entire matrix on every device, consuming excessive memory.
- `RowParallelLinear`: requires an `all_reduce` to merge the partial results, introducing additional communication overhead and potential accuracy loss. Each token is handled across multiple devices rather than on a single device, which is undesirable in the SP scenario.

This PR introduces a new kind of linear operation, `LayerShardLinear`, which combines the advantages of the existing approaches:

- Provides a `process_weights_after_loading()` method to make keeping the NZ format possible.
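To make the contrast concrete, here is a small conceptual sketch of layer-granularity sharding with weight prefetching: each rank stores the full weights for a subset of layers and receives the others via broadcast just before they are needed, so no `all_reduce` is required on the outputs. This illustrates the general idea only; the round-robin ownership policy and the broadcast-based prefetch are assumptions, not the code in this PR.

```python
# Conceptual illustration of layer sharding, not the PR's implementation.
import torch
import torch.distributed as dist


def owner_of(layer_idx: int, world_size: int) -> int:
    # One possible policy: assign layers to ranks round-robin.
    return layer_idx % world_size


def forward_all_layers(x, shapes, local_weights, rank, world_size):
    for idx, (out_features, in_features) in enumerate(shapes):
        owner = owner_of(idx, world_size)
        if rank == owner:
            weight = local_weights[idx]  # full matrix stored locally
        else:
            weight = torch.empty(out_features, in_features, device=x.device)
        # "Prefetch": the owning rank broadcasts the full weight to the others.
        dist.broadcast(weight, src=owner)
        # Each rank computes the full output locally; no all_reduce needed.
        x = torch.nn.functional.linear(x, weight)
    return x
```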
Does this PR introduce any user-facing change?
No
How was this patch tested?
vLLM main: vllm-project/vllm@f4a948f