[BugFix] Fix dummy_run memory explosion in eager mode #3132
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
- If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review

This pull request addresses a critical memory explosion issue in the single-operator `dummy_run` by explicitly deleting `self.split_hidden_states` after its use to release memory. The fix is applied in both the `FusedMoEPrepareAndFinalizeWithMC2` and `FusedMoEPrepareAndFinalizeWithAll2All` classes. The change is correct and effectively resolves the memory leak. My review includes a suggestion to refactor the duplicated `finalize` method into a common base class to improve maintainability and reduce the risk of future inconsistencies, which you can consider during the planned subsequent restructuring.
```python
# TODO: It is a quick bugfix for the single-operator memory explosion issue
# that requires further restructuring.
# If the cache is not cleared after `self.split_hidden_states` is created,
# it can lead to the single-operator memory explosion.
del self.split_hidden_states
```
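To illustrate why the `del` matters, here is a minimal sketch of the prepare/finalize lifecycle. The class name mirrors the PR, but the body is a hypothetical simplification (plain Python objects stand in for NPU tensors): the bug was that `prepare` cached split tensors on `self`, and without the explicit `del` the reference survived past `finalize`, so repeated `dummy_run` calls accumulated memory.

```python
class FusedMoEPrepareAndFinalizeSketch:
    """Hypothetical sketch; names follow the PR, logic is simplified."""

    def prepare(self, hidden_states):
        # The real code splits hidden_states across TP ranks and caches
        # the chunks on self; this cached reference is what kept the
        # memory alive between dummy_run iterations.
        self.split_hidden_states = [hidden_states]  # placeholder "split"

    def finalize(self):
        chunks = self.split_hidden_states
        # Quick fix from the PR: drop the cached reference so the
        # allocator can reclaim the memory before the next dummy_run.
        del self.split_hidden_states
        return chunks
```

After `finalize` returns, the instance no longer holds `split_hidden_states`, so nothing pins the tensors across iterations.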
While this change correctly fixes the memory leak, I noticed that the `finalize` method in `FusedMoEPrepareAndFinalizeWithAll2All` is identical to the one in `FusedMoEPrepareAndFinalizeWithMC2`. This code duplication, including the bug fix you are applying, increases maintenance overhead and the risk of introducing inconsistencies in the future. Since you've already noted that further restructuring is required, I recommend abstracting this common `finalize` logic into a shared base class as part of that effort. This will improve maintainability.
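The suggested refactor could look like the following sketch. The class names come from the PR; the base class and `_combine` hook are hypothetical, meant only to show how the duplicated `finalize` (including the cache-clearing fix) would live in one place:

```python
class FusedMoEPrepareAndFinalizeBase:
    """Hypothetical shared base holding the duplicated finalize logic."""

    def finalize(self, output):
        result = self._combine(output)
        # Single home for the bugfix: release the cached split tensors
        # so eager-mode dummy_run cannot accumulate memory.
        if hasattr(self, "split_hidden_states"):
            del self.split_hidden_states
        return result

    def _combine(self, output):
        # Communication-specific combine step, supplied by subclasses.
        raise NotImplementedError


class FusedMoEPrepareAndFinalizeWithMC2(FusedMoEPrepareAndFinalizeBase):
    def _combine(self, output):
        return output  # placeholder for the MC2-specific combine


class FusedMoEPrepareAndFinalizeWithAll2All(FusedMoEPrepareAndFinalizeBase):
    def _combine(self, output):
        return output  # placeholder for the All2All-specific combine
```

With this shape, a future change to the cleanup logic is made once in the base class instead of twice.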
A friendly reminder: you can use "eager mode" instead of "single-operator" (which is not a thing, kinda like google translation lol).
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
```python
# that requires further restructuring.
# If the cache is not cleared after `self.split_hidden_states` is created,
# it can lead to the memory explosion in eager mode.
del self.split_hidden_states
```
Consider not assigning `self.split_hidden_states` during the prepare phase. Instead, use `get_tp_group().allgather()` in the finalize phase, which internally employs `allgather_into_tensor` to eliminate the tensor-move overhead.
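A minimal sketch of the idea behind this suggestion, with plain lists standing in for tensors and a stub standing in for the collective (the helper names here are hypothetical, not vLLM APIs): gathering directly in finalize means nothing needs to be cached on `self` between the two phases, so there is no reference to leak in the first place.

```python
def allgather_stub(world_chunks):
    # Stand-in for a TP-group all-gather: every rank receives the
    # concatenation of all ranks' local chunks.
    return [x for chunk in world_chunks for x in chunk]


def finalize_without_cache(world_chunks):
    # Instead of caching split_hidden_states in prepare() and re-joining
    # it in finalize(), gather directly here. No state lives on self
    # between prepare and finalize, so there is no cache to clear.
    return allgather_stub(world_chunks)
```

The trade-off is that the refactor moves communication into finalize rather than just releasing a reference, which is why the PR ships the `del` as a quick fix first.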
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@b106890

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
Signed-off-by: huangdong2022 <huangdong51@huawei.com>
What this PR does / why we need it?
It is a quick bugfix for the memory explosion issue that requires further refactoring. The `dummy_run` in eager mode may lead to OOM; the reason is that `hidden_states` were not released in time. The PR temporarily resolves the issue by manually clearing the cache, and further refactoring will be conducted subsequently.
Before the modification, the dummy_run's memory showed an accumulation issue.

After modification, it can be observed that the memory is released promptly.

It was also verified that the model responded normally after a single data input.