Skip to content

Conversation

jarlsondre
Copy link
Collaborator

@jarlsondre jarlsondre commented Mar 28, 2025

Basic Integration with the py-spy Profiler

The goal of this PR is to add a basic level of integration with the py-spy profiler. This includes:

  • Adding py-spy to the project dependencies
  • CLI functionality for aggregating some basic output
  • Launching py-spy in the SLURM script with support for multi-GPU training
  • CLI functionality for making a Flamegraph

Basic Data Aggregation

The basic data aggregation consists of finding all the final function calls in each stack and sorting them by number of samples, which is a proxy of how much time was spent in them. This is printed as a table and the idea is that you expect most of the time to be spent in the backward and forward calls from PyTorch. If you see something else at the top, then this might suggest a bottleneck.

Example of table output (with and without aggregation):

❯ itwinai generate-py-spy-report --file profiling_data.txt
 name                      | path                       |   line | itwinai_function_name      | itwinai_function_path    |   itwinai_function_line | proportion (n)
---------------------------+----------------------------+--------+----------------------------+--------------------------+-------------------------+------------------
 _engine_run_backward      | torch/autograd/graph.py    |    769 | train_step                 | itwinai/torch/trainer.py |                     692 | 31.74% (908)
 _conv_forward             | torch/nn/modules/conv.py   |    454 | train_step                 | itwinai/torch/trainer.py |                     690 | 8.88% (254)
 _conv_forward             | torch/nn/modules/conv.py   |    454 | train_step                 | itwinai/torch/trainer.py |                     690 | 6.26% (179)
 _max_pool2d               | torch/nn/functional.py     |    796 | train_step                 | itwinai/torch/trainer.py |                     690 | 3.95% (113)
 dropout2d                 | torch/nn/functional.py     |   1387 | train_step                 | itwinai/torch/trainer.py |                     690 | 1.71% (49)
 append_to                 | mlflow/utils/file_utils.py |    505 | log                        | itwinai/loggers.py       |                     636 | 1.40% (40)
 _conv_forward             | torch/nn/modules/conv.py   |    454 | validation_step            | itwinai/torch/trainer.py |                     784 | 1.26% (36)
 _max_pool2d               | torch/nn/functional.py     |    796 | train_step                 | itwinai/torch/trainer.py |                     690 | 1.26% (36)
 _conv_forward             | torch/nn/modules/conv.py   |    454 | validation_step            | itwinai/torch/trainer.py |                     784 | 1.01% (29)
 _call_with_frames_removed | built-in                   |     -1 | exec_pipeline_with_compose | itwinai/cli.py           |                     460 | 0.87% (25)
❯ itwinai generate-py-spy-report --file profiling_data.txt --aggregate-leaf-paths
 name                      | path                       |   line | itwinai_function_name      | itwinai_function_path    |   itwinai_function_line | proportion (n)
---------------------------+----------------------------+--------+----------------------------+--------------------------+-------------------------+------------------
 _engine_run_backward      | torch/autograd/graph.py    |    769 | train_step                 | itwinai/torch/trainer.py |                     692 | 31.74% (908)
 _conv_forward             | torch/nn/modules/conv.py   |    454 | train_step                 | itwinai/torch/trainer.py |                     690 | 15.13% (433)
 _max_pool2d               | torch/nn/functional.py     |    796 | train_step                 | itwinai/torch/trainer.py |                     690 | 5.21% (149)
 _conv_forward             | torch/nn/modules/conv.py   |    454 | validation_step            | itwinai/torch/trainer.py |                     784 | 2.27% (65)
 append_to                 | mlflow/utils/file_utils.py |    505 | log                        | itwinai/loggers.py       |                     636 | 2.17% (62)
 list_all                  | mlflow/utils/file_utils.py |    146 | log                        | itwinai/loggers.py       |                     636 | 1.99% (57)
 _call_with_frames_removed | built-in                   |     -1 | exec_pipeline_with_compose | itwinai/cli.py           |                     460 | 1.82% (52)
 dropout2d                 | torch/nn/functional.py     |   1387 | train_step                 | itwinai/torch/trainer.py |                     690 | 1.71% (49)
 get_single_data           | yaml/constructor.py        |     49 | log                        | itwinai/loggers.py       |                     636 | 1.22% (35)
 isdir                     | <frozen genericpath>       |     42 | log                        | itwinai/loggers.py       |                     636 | 0.98% (28)

@jarlsondre jarlsondre changed the title Integration with the py-spy profiler [DRAFT] Integration with the py-spy profiler Mar 31, 2025
@jarlsondre jarlsondre changed the title [DRAFT] Integration with the py-spy profiler [DRAFT] Basic integration with the py-spy profiler Mar 31, 2025
@jarlsondre jarlsondre self-assigned this Apr 10, 2025
@jarlsondre jarlsondre added the Feature request New feature or request label Apr 10, 2025
@jarlsondre jarlsondre added this to the itwinai 0.3 milestone Apr 10, 2025
@jarlsondre jarlsondre requested review from matbun and lineick April 10, 2025 15:54
@jarlsondre jarlsondre changed the title [DRAFT] Basic integration with the py-spy profiler Basic integration with the py-spy profiler Apr 10, 2025
@jarlsondre jarlsondre marked this pull request as ready for review April 10, 2025 15:55
@jarlsondre jarlsondre removed request for matbun and lineick April 11, 2025 09:15
@jarlsondre jarlsondre marked this pull request as draft April 11, 2025 09:15
@jarlsondre jarlsondre changed the title Basic integration with the py-spy profiler [DRAFT] Basic integration with the py-spy profiler Apr 11, 2025
@lineick
Copy link
Collaborator

lineick commented Apr 28, 2025

Printing multiple lines for identical function calls is not a good idea if one cannot relate them back to the call stack in the file.

I would suggest aggregating identical function calls for the printed overview. Otherwise if there are identifiers per call stack, one can also add those as a column to the overview (but that may be a bit confusing too).

@jarlsondre
Copy link
Collaborator Author

Printing multiple lines for identical function calls is not a good idea if one cannot relate them back to the call stack in the file.

I would suggest aggregating identical function calls for the printed overview. Otherwise if there are identifiers per call stack, one can also add those as a column to the overview (but that may be a bit confusing too).

Added this feature now, also updated the screenshot at the top

@jarlsondre jarlsondre marked this pull request as ready for review May 2, 2025 13:37
@jarlsondre jarlsondre changed the title [DRAFT] Basic integration with the py-spy profiler Basic integration with the py-spy profiler May 2, 2025
@jarlsondre jarlsondre requested review from matbun, lineick and Copilot May 2, 2025 13:37
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds basic integration with the py‑spy profiler into the project. Key changes include:

  • Adding py‑spy and tabulate to the project dependencies.
  • Implementing new CLI commands for generating a py‑spy report and a flamegraph.
  • Integrating py‑spy profiling options into the SLURM script generation flow for multi‑GPU training.

Reviewed Changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
use-cases/mnist/torch/config.yaml Updated configuration flags for GPU data and communication overhead measurement.
src/itwinai/torch/profiling/py_spy_aggregation.py Added functions to parse and aggregate profiling stack traces.
src/itwinai/slurm/utils.py Introduced new arguments (profiling rate and py‑spy flag) for the SLURM job parser.
src/itwinai/slurm/slurm_script_builder.py Extended script builder to optionally prepend the py‑spy record command to the training command.
src/itwinai/slurm/slurm_constants.py Updated directives to include GPU resource options.
src/itwinai/cli.py Added new CLI commands for flamegraph and py‑spy report generation, along with replacing print with typer.echo.
pyproject.toml Added dependencies for py‑spy and tabulate.
licenses/CDDL-1.0.txt Included license information for flamegraph.pl.
.github/workflows/lint.yml Updated linter configuration to ignore Perl files.
Files not reviewed (1)
  • THIRD_PARTY_LICENSES: Language not supported
Comments suppressed due to low confidence (1)

src/itwinai/cli.py:68

  • Ensure that the new py‑spy report generation functionality is adequately covered by unit tests, especially to handle potential edge cases in parsing and data aggregation.
def generate_py_spy_report(

Comment on lines +168 to +173
# Prepending the py-spy profiling command
py_spy_profiling_output = "py_spy_profiling_$SLURM_NODEID.txt"
bash_command = (
f"py-spy record -r {self.profiling_sampling_rate} -s "
f"-o {py_spy_profiling_output} -f raw -- " + bash_command
)
Copy link
Preview

Copilot AI May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider extracting the construction and prepending of the py‑spy profiling command into a helper function to improve readability and ease future modifications.

Suggested change
# Prepending the py-spy profiling command
py_spy_profiling_output = "py_spy_profiling_$SLURM_NODEID.txt"
bash_command = (
f"py-spy record -r {self.profiling_sampling_rate} -s "
f"-o {py_spy_profiling_output} -f raw -- " + bash_command
)
bash_command = self._prepend_py_spy_profiling(bash_command)

Copilot uses AI. Check for mistakes.

@jarlsondre jarlsondre merged commit e015ef4 into main May 2, 2025
9 checks passed
@jarlsondre jarlsondre deleted the py-spy-profiling branch May 2, 2025 14:07
lineick pushed a commit that referenced this pull request May 14, 2025
* add aggregation script to cli

* add py-spy dependency and some param hints

* format with ruff

* formatting

* add flamegraph and cli for flamegraph

* add more advanced error handling for cli application

* remove perl linting

* format cli.py

* use typer.exit instead of return for error

* add slurm option for using py-spy profiling

* add missing arguments to cli.py for slurm generator

* replace print with typer.echo in cli applications

* add docstrings to new CLI functionality

* update error handling in cli

* change profiler output to raw

* add licensing and comments from PR

* aggregate functions

* add relevant itwinai calls

* add percentage to data aggregation

* add aggregation wrt itwinai info as well

* clean up code and add table library

* add tabulate to project dependencies
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants