Conversation

@dragondream-chen (Contributor) commented Sep 23, 2025

What this PR does / why we need it?

This is a quick bugfix for a memory explosion issue; a fuller refactoring will follow.
dummy_run in eager mode can lead to OOM because hidden_states are not released in time.
This PR temporarily resolves the issue by manually clearing the cached tensors; the refactoring will be done in a follow-up.
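
To make the failure mode concrete, here is a minimal sketch of the leak pattern, with hypothetical names (`Finalizer`, `prepare`, `finalize`), not the actual vllm-ascend code:

```python
import torch

class Finalizer:
    def prepare(self, hidden_states: torch.Tensor, tp_size: int) -> torch.Tensor:
        # torch.chunk returns views of hidden_states; caching them on
        # `self` keeps the whole underlying buffer alive after this call.
        self.split_hidden_states = torch.chunk(hidden_states, tp_size, dim=0)
        return self.split_hidden_states[0]

    def finalize(self, local_out: torch.Tensor) -> torch.Tensor:
        result = torch.cat(self.split_hidden_states, dim=0)
        # The fix: drop the cached reference so eager mode frees the
        # buffer here, instead of at the next prepare() call. Without
        # this del, repeated dummy_run iterations each pin a full copy.
        del self.split_hidden_states
        return result
```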

Before the modification, memory accumulated across dummy_run calls.
[image: NPU memory trace showing accumulation: https://github.com/user-attachments/assets/05e2b04c-2f99-4085-9eda-c78b7d9a57b0]

After the modification, the memory is released promptly.
[image: NPU memory trace showing prompt release]

It was also verified that the model responds normally after a single inference request.

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses a critical memory explosion issue in the single-operator dummy_run by explicitly deleting self.split_hidden_states after its use to release memory. The fix is applied in both FusedMoEPrepareAndFinalizeWithMC2 and FusedMoEPrepareAndFinalizeWithAll2All classes. The change is correct and effectively resolves the memory leak. My review includes a suggestion to refactor the duplicated finalize method into a common base class to improve maintainability and reduce the risk of future inconsistencies, which you can consider during the planned subsequent restructuring.

Comment on lines 279 to 283
# TODO: It is a quick bugfix for the single-operator memory explosion issue
# that requires further restructuring.
# If the cache is not cleared after `self.split_hidden_states` is created,
# it can lead to the single-operator memory explosion.
del self.split_hidden_states

Severity: high

While this change correctly fixes the memory leak, I noticed that the finalize method in FusedMoEPrepareAndFinalizeWithAll2All is identical to the one in FusedMoEPrepareAndFinalizeWithMC2. This code duplication, including the bug fix you are applying, increases maintenance overhead and the risk of introducing inconsistencies in the future. Since you've already noted that further restructuring is required, I recommend abstracting this common finalize logic into a shared base class as part of that effort. This will improve maintainability.
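
A minimal sketch of that refactor, assuming both classes share the same combine-then-free logic (the `_combine` hook and everything except `finalize` are illustrative, not the actual vllm-ascend code):

```python
class FusedMoEPrepareAndFinalizeBase:
    def finalize(self, hidden_states):
        output = self._combine(hidden_states)  # subclass-specific communication
        # The quick fix now lives in exactly one place: drop the cached
        # splits so eager mode releases their memory immediately.
        if hasattr(self, "split_hidden_states"):
            del self.split_hidden_states
        return output

    def _combine(self, hidden_states):
        raise NotImplementedError

class FusedMoEPrepareAndFinalizeWithMC2(FusedMoEPrepareAndFinalizeBase):
    def _combine(self, hidden_states):
        ...  # MC2-specific combine

class FusedMoEPrepareAndFinalizeWithAll2All(FusedMoEPrepareAndFinalizeBase):
    def _combine(self, hidden_states):
        ...  # All2All-specific combine
```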

@yiz-liu (Collaborator) commented Sep 23, 2025

A friendly reminder: you can use "eager mode" instead of "single-operator" (which is not a thing, kinda like a Google translation lol).

@dragondream-chen dragondream-chen changed the title [BugFix] Fix single-operator dummy_run memory explosion [BugFix] Fix dummy_run memory explosion in eager mode Sep 23, 2025
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
# TODO: It is a quick bugfix for the memory explosion issue in eager mode
# that requires further restructuring.
# If the cache is not cleared after `self.split_hidden_states` is created,
# it can lead to the memory explosion in eager mode.
del self.split_hidden_states

Consider not assigning self.split_hidden_states during the prepare phase. Instead, use get_tp_group().allgather() in the finalize phase, which internally employs allgather_into_tensor to eliminate tensor move overhead.
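
A minimal sketch of that suggestion, using vLLM's `get_tp_group()`; the gather call is written as `all_gather`, the method name on vLLM's `GroupCoordinator` (the reviewer's `allgather()` may be a vllm-ascend alias), and the class and method names here are illustrative:

```python
from vllm.distributed import get_tp_group

class FusedMoEPrepareAndFinalize:
    def prepare(self, hidden_states):
        tp = get_tp_group()
        # Keep only this rank's shard; nothing is cached on `self`, so no
        # stale reference can pin the full buffer between calls.
        return hidden_states.chunk(tp.world_size, dim=0)[tp.rank_in_group]

    def finalize(self, local_out):
        # Gather the shards from all TP ranks along dim 0; internally a
        # single allgather_into_tensor, avoiding per-rank tensor moves.
        return get_tp_group().all_gather(local_out, dim=0)
```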

@wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels Sep 24, 2025
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
@wangxiyuan wangxiyuan merged commit 07f4710 into vllm-project:main Sep 25, 2025
19 checks passed
huangdong2022 pushed a commit to huangdong2022/vllm-ascend that referenced this pull request Sep 26, 2025
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@b106890

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
Signed-off-by: huangdong2022 <huangdong51@huawei.com>
Labels: module:ops, ready, ready-for-test