
Conversation


@zhangruoxu zhangruoxu commented Nov 27, 2025

Purpose

We observed that enabling both use_gzip and dump_self_cuda_time_total in the vLLM torch profiler introduces significant overhead during profiling.

For example, when profiling 10 randomly generated requests (1000 input tokens, 200 output tokens) on an A100 using the Qwen3-32B model, we found that gzip compression of the profiling trace and dumping the CUDA time table take ~68 seconds, dominating the overall profiling time.

The main sources of overhead appear to be:

  1. Gzip compression of the profiling trace file
  2. Generation and dumping of the CUDA time summary table

After disabling these two features, the total profiling dump time is reduced to ~18 seconds.

In many profiling scenarios (e.g., quick performance checks or small-scale experiments), users may not need gzip compression or the CUDA time table. It would therefore be helpful to make these two behaviors individually configurable via environment variables: enabled by default for completeness, but able to be turned off when faster profiling turnaround is preferred. Moreover, gzip compression could potentially be performed asynchronously after the trace is dumped, allowing lower-latency profiling in staging or pre-production environments.
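The asynchronous-compression idea could be sketched with the standard library as follows (a minimal sketch, not part of this patch; `compress_trace_async` is a hypothetical helper, and real integration with the torch profiler's trace handler would need more care):

```python
import gzip
import shutil
import threading

def compress_trace_async(trace_path: str) -> threading.Thread:
    """Gzip an already-dumped trace file on a background thread so the
    profiler's dump path can return without waiting on compression."""
    def _compress() -> None:
        with open(trace_path, "rb") as src, \
                gzip.open(trace_path + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)

    t = threading.Thread(target=_compress, daemon=True)
    t.start()
    return t
```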

This patch proposes adding such configurability so users can selectively disable gzip compression and/or CUDA time table generation when they want a faster and lighter profiling workflow.

Fixes #29564

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which exercises a small, essential subset of the CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀


mergify bot commented Nov 27, 2025

Documentation preview: https://vllm--29568.org.readthedocs.build/en/29568/

@mergify mergify bot added the documentation, nvidia, and v1 labels Nov 27, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces two new environment variables, VLLM_TORCH_PROFILER_USE_GZIP and VLLM_TORCH_PROFILER_DUMP_CUDA_TIME_TOTAL, to make parts of the PyTorch profiler functionality configurable. This allows users to disable gzip compression of profiling traces and the dumping of CUDA time tables, which can help reduce profiling overhead.

The changes are implemented correctly:

  • New environment variables are added in vllm/envs.py with appropriate defaults and parsing logic that is consistent with existing variables.
  • The use_gzip parameter for torch.profiler.tensorboard_trace_handler is now controlled by VLLM_TORCH_PROFILER_USE_GZIP in vllm/profiler/gpu_profiler.py, vllm/v1/engine/async_llm.py, and vllm/v1/worker/xpu_worker.py.
  • The logic for dumping the CUDA time total table in vllm/profiler/gpu_profiler.py is now conditional on the VLLM_TORCH_PROFILER_DUMP_CUDA_TIME_TOTAL flag.
  • Documentation in docs/contributing/profiling.md has been updated to reflect these new options.

The changes are well-contained and correctly implement the intended functionality. I have not found any high or critical issues. The code quality is good.
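To illustrate the gating the review describes, here is a minimal duck-typed sketch (not the actual vllm/profiler/gpu_profiler.py code; `finalize_profile` is a hypothetical name, and in vLLM the gzip flag is forwarded to torch.profiler.tensorboard_trace_handler rather than applied to the file name by hand):

```python
from typing import Optional

def finalize_profile(profiler, use_gzip: bool,
                     dump_cuda_time_total: bool) -> Optional[str]:
    """Conditionally dump the trace and the CUDA time table.

    `profiler` is duck-typed here: anything exposing `export_chrome_trace`
    and `key_averages` (in vLLM, a torch.profiler.profile object).
    Returns the CUDA time table string when requested, else None.
    """
    table = None
    if dump_cuda_time_total:
        # Building this table is one of the two overhead sources the
        # patch makes optional (VLLM_TORCH_PROFILER_DUMP_CUDA_TIME_TOTAL).
        table = profiler.key_averages().table(sort_by="self_cuda_time_total")
    # VLLM_TORCH_PROFILER_USE_GZIP controls whether the trace handler
    # writes a gzipped trace; here we only illustrate the branch.
    suffix = ".json.gz" if use_gzip else ".json"
    profiler.export_chrome_trace("trace" + suffix)
    return table
```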

@zhangruoxu zhangruoxu force-pushed the add_profiler_options branch 7 times, most recently from e2e5b68 to 2efb082 on November 27, 2025 03:57
Signed-off-by: Yifei Zhang <yifei.zhang1992@outlook.com>
@zhangruoxu zhangruoxu changed the title from "Make PyTorch profiler gzip and CUDA time dump configurable (#29564)" to "Make PyTorch profiler gzip and CUDA time dump configurable" Nov 27, 2025
Collaborator

@LucasWilkinson LucasWilkinson left a comment

Makes sense to me (I would definitely use this); thanks!

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 28, 2025
@LucasWilkinson LucasWilkinson enabled auto-merge (squash) November 28, 2025 19:11
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Nov 28, 2025
@LucasWilkinson LucasWilkinson changed the title from "Make PyTorch profiler gzip and CUDA time dump configurable" to "[Misc][Profiling] Make PyTorch profiler gzip and CUDA time dump configurable" Nov 28, 2025

Labels

  • documentation (Improvements or additions to documentation)
  • nvidia
  • ready (ONLY add when PR is ready to merge/full CI is needed)
  • v1

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

[Doc]: Make PyTorch profiler gzip and CUDA time dump configurable

2 participants