-
-
Notifications
You must be signed in to change notification settings - Fork 8.9k
[wip] [Feature] [V1] intermediate logging #21215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
Initial prototype, still need to adjust structure for intermediates_logging.py (add tests) and add a script for comparison report generation cc @simon-mo @yeqcharlotte @houseroad |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new intermediate tensor logging feature, which is a great addition for debugging model behavior. The implementation is comprehensive, covering configuration, argument parsing, engine integration, and the core logging logic with PyTorch hooks.
I've identified a few critical and high-severity issues that should be addressed to ensure the feature is robust and doesn't introduce unintended side effects. These include a potential crash in the hook removal logic, silent exception swallowing, and debug artifacts that could impact production environments.
Once these points are addressed, this will be a solid contribution to the project.
step_hook = model.register_forward_hook(step_fwd) | ||
self.hooks.append(("", None, step_hook)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line appends a 3-element tuple to self.hooks
for the step_hook
. However, the remove_hooks
method expects 4-element tuples when unpacking. This will cause a ValueError: not enough values to unpack
when remove_hooks
is called.
To fix this, you should ensure all tuples in self.hooks
have the same structure. Since step_hook
is a forward hook (a post-hook), it should be the 4th element in the tuple, with the pre-hook being None
.
step_hook = model.register_forward_hook(step_fwd) | |
self.hooks.append(("", None, step_hook)) | |
self.hooks.append(('', None, None, step_hook)) |
from vllm.logger import init_logger | ||
logger = init_logger(__name__) | ||
|
||
self._compiled_module_matches = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if log_call_idx >= 0 and current_call_idx != log_call_idx: | ||
should_log = False | ||
|
||
print(f"{should_log=} for {module_name}, {current_call_idx=}, {log_call_idx=}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.hooks = [] | ||
|
||
path = Path(config.output_run_dir) | ||
path.mkdir(parents=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
path.mkdir(parents=True)
will raise a FileExistsError
if the directory already exists. Since this code might be run multiple times or the directory might be created by another process, you should add exist_ok=True
to prevent this error.
path.mkdir(parents=True) | |
path.mkdir(parents=True, exist_ok=True) |
ab68fd3
to
9252ae5
Compare
This pull request has merge conflicts that must be resolved before it can be |
@@ -0,0 +1,545 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copyright?
more changes Apply suggestions from code review Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> fix tp Signed-off-by: Lu Fang <fanglu@fb.com> add comparison tool tmp add unit test and fix format Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: Lu Fang <fanglu@fb.com>
# Filter steps | ||
if steps: | ||
step_names = [f"step_{step}" for step in steps] | ||
steps_to_include = {step: True for step in step_names} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be a set?
@@ -3952,6 +3953,117 @@ class KVEventsConfig: | |||
""" | |||
|
|||
|
|||
@config | |||
@dataclass | |||
class IntermediateLoggingConfig: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possible for us to provide some default logging config. so users don't have to set every configuration in this dataclass to dump tensors?
@@ -4409,6 +4521,10 @@ class VllmConfig: | |||
"""The configurations for distributed KV cache transfer.""" | |||
kv_events_config: Optional[KVEventsConfig] = None | |||
"""The configurations for event publishing.""" | |||
il_config: Optional[IntermediateLoggingConfig] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: il
for intermediate logging is very unintuitive 😅 maybe just spell out the whole thing
@@ -0,0 +1,550 @@ | |||
# SPDX-License-Identifier: Apache-2.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this file belongs to utils not worker tho
@@ -73,6 +73,8 @@ def __init__(self, | |||
|
|||
# Setup Model. | |||
self.model_executor = executor_class(vllm_config) | |||
self.collective_rpc("register_intermediate_hooks", args=(vllm_config.il_config,)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would we register intermediate hooks by invoking kwargs even when config is None?
Essential Elements of an Effective PR Description Checklist
supported_models.md
andexamples
for a new model.Purpose
Test Plan
Test Result
(Optional) Documentation Update