[Misc][Profiling] Make PyTorch profiler gzip and CUDA time dump configurable #29568
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a select subset of tests runs automatically. You can ask your reviewers to trigger additional CI tests on top of that subset. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀

Documentation preview: https://vllm--29568.org.readthedocs.build/en/29568/
Code Review
This pull request introduces two new environment variables, VLLM_TORCH_PROFILER_USE_GZIP and VLLM_TORCH_PROFILER_DUMP_CUDA_TIME_TOTAL, to make parts of the PyTorch profiler functionality configurable. This allows users to disable gzip compression of profiling traces and the dumping of CUDA time tables, which can help reduce profiling overhead.
The changes are implemented correctly:

- New environment variables are added in `vllm/envs.py` with appropriate defaults and parsing logic that is consistent with existing variables.
- The `use_gzip` parameter for `torch.profiler.tensorboard_trace_handler` is now controlled by `VLLM_TORCH_PROFILER_USE_GZIP` in `vllm/profiler/gpu_profiler.py`, `vllm/v1/engine/async_llm.py`, and `vllm/v1/worker/xpu_worker.py`.
- The logic for dumping the CUDA time total table in `vllm/profiler/gpu_profiler.py` is now conditional on the `VLLM_TORCH_PROFILER_DUMP_CUDA_TIME_TOTAL` flag.
- Documentation in `docs/contributing/profiling.md` has been updated to reflect these new options.

The changes are well contained and correctly implement the intended functionality. I have not found any high or critical issues. The code quality is good.
Force-pushed e2e5b68 to 2efb082
Signed-off-by: Yifei Zhang <yifei.zhang1992@outlook.com>
Force-pushed 2efb082 to 5eca561
LucasWilkinson left a comment
Makes sense to me (I would definitely use this); thanks!
Purpose
We observed that enabling both use_gzip and dump_self_cuda_time_total in the vLLM torch profiler introduces significant overhead during profiling.
For example, when profiling 10 randomly generated requests (1000 input tokens, 200 output tokens) on an A100 using the Qwen3-32B model, we found that gzip compression of the profiling trace and dumping the CUDA time table take ~68 seconds, dominating the overall profiling time.
The main sources of overhead appear to be gzip compression of the trace file and the generation of the self CUDA time total table.
After disabling these two features, the total profiling dump time is reduced to ~18 seconds.
In many profiling scenarios (e.g., quick performance checks or small-scale experiments), users may not need gzip compression or the CUDA time table. Therefore, it would be helpful to make these two behaviors individually configurable via environment variables: enabled by default for completeness, but optionally turned off when faster profiling turnaround is preferred. Moreover, gzip compression could potentially be performed asynchronously after the trace is dumped, allowing lower-latency profiling in staging or pre-production environments.
This patch proposes adding such configurability so users can selectively disable gzip compression and/or CUDA time table generation when they want a faster and lighter profiling workflow.
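The asynchronous compression idea mentioned above could look roughly like the sketch below; the function name and the background-thread approach are illustrative assumptions, not part of this patch:

```python
import gzip
import shutil
import threading

def compress_trace_async(trace_path: str) -> threading.Thread:
    """Gzip a dumped trace file in a background thread so the
    profiler stop path does not block on compression."""
    def _compress() -> None:
        # Stream-copy the raw trace into a .gz sibling file.
        with open(trace_path, "rb") as src, \
                gzip.open(trace_path + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)

    t = threading.Thread(target=_compress, daemon=True)
    t.start()
    return t
```

The caller could `join()` the returned thread at shutdown if the compressed artifact must exist before the process exits.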
Fixes #29564
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
- `supported_models.md` and `examples` for a new model.