Basic integration with the py-spy profiler #356

jarlsondre · 2025-03-28T14:17:08Z

Basic Integration with the py-spy Profiler

The goal of this PR is to add a basic level of integration with the py-spy profiler. This includes:

Adding py-spy to the project dependencies
CLI functionality for aggregating some basic output
Launching py-spy in the SLURM script with support for multi-GPU training
CLI functionality for making a Flamegraph

Basic Data Aggregation

The basic data aggregation consists of finding all the final function calls in each stack and sorting them by number of samples, which is a proxy of how much time was spent in them. This is printed as a table and the idea is that you expect most of the time to be spent in the backward and forward calls from PyTorch. If you see something else at the top, then this might suggest a bottleneck.

Example of table output (with and without aggregation):

❯ itwinai generate-py-spy-report --file profiling_data.txt
 name                      | path                       |   line | itwinai_function_name      | itwinai_function_path    |   itwinai_function_line | proportion (n)
---------------------------+----------------------------+--------+----------------------------+--------------------------+-------------------------+------------------
 _engine_run_backward      | torch/autograd/graph.py    |    769 | train_step                 | itwinai/torch/trainer.py |                     692 | 31.74% (908)
 _conv_forward             | torch/nn/modules/conv.py   |    454 | train_step                 | itwinai/torch/trainer.py |                     690 | 8.88% (254)
 _conv_forward             | torch/nn/modules/conv.py   |    454 | train_step                 | itwinai/torch/trainer.py |                     690 | 6.26% (179)
 _max_pool2d               | torch/nn/functional.py     |    796 | train_step                 | itwinai/torch/trainer.py |                     690 | 3.95% (113)
 dropout2d                 | torch/nn/functional.py     |   1387 | train_step                 | itwinai/torch/trainer.py |                     690 | 1.71% (49)
 append_to                 | mlflow/utils/file_utils.py |    505 | log                        | itwinai/loggers.py       |                     636 | 1.40% (40)
 _conv_forward             | torch/nn/modules/conv.py   |    454 | validation_step            | itwinai/torch/trainer.py |                     784 | 1.26% (36)
 _max_pool2d               | torch/nn/functional.py     |    796 | train_step                 | itwinai/torch/trainer.py |                     690 | 1.26% (36)
 _conv_forward             | torch/nn/modules/conv.py   |    454 | validation_step            | itwinai/torch/trainer.py |                     784 | 1.01% (29)
 _call_with_frames_removed | built-in                   |     -1 | exec_pipeline_with_compose | itwinai/cli.py           |                     460 | 0.87% (25)
❯ itwinai generate-py-spy-report --file profiling_data.txt --aggregate-leaf-paths
 name                      | path                       |   line | itwinai_function_name      | itwinai_function_path    |   itwinai_function_line | proportion (n)
---------------------------+----------------------------+--------+----------------------------+--------------------------+-------------------------+------------------
 _engine_run_backward      | torch/autograd/graph.py    |    769 | train_step                 | itwinai/torch/trainer.py |                     692 | 31.74% (908)
 _conv_forward             | torch/nn/modules/conv.py   |    454 | train_step                 | itwinai/torch/trainer.py |                     690 | 15.13% (433)
 _max_pool2d               | torch/nn/functional.py     |    796 | train_step                 | itwinai/torch/trainer.py |                     690 | 5.21% (149)
 _conv_forward             | torch/nn/modules/conv.py   |    454 | validation_step            | itwinai/torch/trainer.py |                     784 | 2.27% (65)
 append_to                 | mlflow/utils/file_utils.py |    505 | log                        | itwinai/loggers.py       |                     636 | 2.17% (62)
 list_all                  | mlflow/utils/file_utils.py |    146 | log                        | itwinai/loggers.py       |                     636 | 1.99% (57)
 _call_with_frames_removed | built-in                   |     -1 | exec_pipeline_with_compose | itwinai/cli.py           |                     460 | 1.82% (52)
 dropout2d                 | torch/nn/functional.py     |   1387 | train_step                 | itwinai/torch/trainer.py |                     690 | 1.71% (49)
 get_single_data           | yaml/constructor.py        |     49 | log                        | itwinai/loggers.py       |                     636 | 1.22% (35)
 isdir                     | <frozen genericpath>       |     42 | log                        | itwinai/loggers.py       |                     636 | 0.98% (28)

src/itwinai/flamegraph.pl

src/itwinai/slurm/utils.py

lineick · 2025-04-28T09:26:26Z

Printing multiple lines for identical function calls is not a good idea if one cannot relate them back to the call stack in the file.

I would suggest aggregating identical function calls for the printed overview. Otherwise if there are identifiers per call stack, one can also add those as a column to the overview (but that may be a bit confusing too).

jarlsondre · 2025-04-29T15:21:53Z

Printing multiple lines for identical function calls is not a good idea if one cannot relate them back to the call stack in the file.

I would suggest aggregating identical function calls for the printed overview. Otherwise if there are identifiers per call stack, one can also add those as a column to the overview (but that may be a bit confusing too).

Added this feature now, also updated the screenshot at the top

Copilot

Pull Request Overview

This PR adds basic integration with the py‑spy profiler into the project. Key changes include:

Adding py‑spy and tabulate to the project dependencies.
Implementing new CLI commands for generating a py‑spy report and a flamegraph.
Integrating py‑spy profiling options into the SLURM script generation flow for multi‑GPU training.

Reviewed Changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
use-cases/mnist/torch/config.yaml	Updated configuration flags for GPU data and communication overhead measurement.
src/itwinai/torch/profiling/py_spy_aggregation.py	Added functions to parse and aggregate profiling stack traces.
src/itwinai/slurm/utils.py	Introduced new arguments (profiling rate and py‑spy flag) for the SLURM job parser.
src/itwinai/slurm/slurm_script_builder.py	Extended script builder to optionally prepend the py‑spy record command to the training command.
src/itwinai/slurm/slurm_constants.py	Updated directives to include GPU resource options.
src/itwinai/cli.py	Added new CLI commands for flamegraph and py‑spy report generation, along with replacing print with typer.echo.
pyproject.toml	Added dependencies for py‑spy and tabulate.
licenses/CDDL-1.0.txt	Included license information for flamegraph.pl.
.github/workflows/lint.yml	Updated linter configuration to ignore Perl files.

Files not reviewed (1)

THIRD_PARTY_LICENSES: Language not supported

Comments suppressed due to low confidence (1)

src/itwinai/cli.py:68

Ensure that the new py‑spy report generation functionality is adequately covered by unit tests, especially to handle potential edge cases in parsing and data aggregation.

def generate_py_spy_report(

Copilot · 2025-05-02T13:38:44Z

src/itwinai/slurm/slurm_script_builder.py

+                # Prepending the py-spy profiling command
+                py_spy_profiling_output = "py_spy_profiling_$SLURM_NODEID.txt"
+                bash_command = (
+                    f"py-spy record -r {self.profiling_sampling_rate} -s "
+                    f"-o {py_spy_profiling_output} -f raw -- " + bash_command
+                )


[nitpick] Consider extracting the construction and prepending of the py‑spy profiling command into a helper function to improve readability and ease future modifications.

Suggested change

# Prepending the py-spy profiling command

py_spy_profiling_output = "py_spy_profiling_$SLURM_NODEID.txt"

bash_command = (

f"py-spy record -r {self.profiling_sampling_rate} -s "

f"-o {py_spy_profiling_output} -f raw -- " + bash_command

)

bash_command = self._prepend_py_spy_profiling(bash_command)

* add aggregation script to cli * add py-spy dependency and some param hints * format with ruff * formatting * add flamegraph and cli for flamegraph * add more advanced error handling for cli application * remove perl linting * format cli.py * use typer.exit instead of return for error * add slurm option for using py-spy profiling * add missing arguments to cli.py for slurm generator * replace print with typer.echo in cli applications * add docstrings to new CLI functionality * update error handling in cli * change profiler output to raw * add licensing and comments from PR * aggregate functions * add relevant itwinai calls * add percentage to data aggregation * add aggregation wrt itwinai info as well * clean up code and add table library * add tabulate to project dependencies

add aggregation script to cli

6b737a2

jarlsondre changed the title ~~Integration with the py-spy profiler~~ [DRAFT] Integration with the py-spy profiler Mar 31, 2025

jarlsondre changed the title ~~[DRAFT] Integration with the py-spy profiler~~ [DRAFT] Basic integration with the py-spy profiler Mar 31, 2025

jarlsondre added 13 commits April 4, 2025 16:12

add py-spy dependency and some param hints

72ff379

format with ruff

41e0f0f

formatting

00178c5

add flamegraph and cli for flamegraph

fa71093

add more advanced error handling for cli application

2a29e45

remove perl linting

b2bb06b

format cli.py

7552826

use typer.exit instead of return for error

a0fed61

add slurm option for using py-spy profiling

ccddd45

add missing arguments to cli.py for slurm generator

aa10ac9

Merge branch 'main' into py-spy-profiling

7be631f

replace print with typer.echo in cli applications

25e50de

add docstrings to new CLI functionality

75db9d5

jarlsondre self-assigned this Apr 10, 2025

jarlsondre added the Feature request New feature or request label Apr 10, 2025

jarlsondre added this to the itwinai 0.3 milestone Apr 10, 2025

update error handling in cli

73c6b22

jarlsondre requested review from matbun and lineick April 10, 2025 15:54

jarlsondre changed the title ~~[DRAFT] Basic integration with the py-spy profiler~~ Basic integration with the py-spy profiler Apr 10, 2025

jarlsondre marked this pull request as ready for review April 10, 2025 15:55

jarlsondre removed request for matbun and lineick April 11, 2025 09:15

jarlsondre marked this pull request as draft April 11, 2025 09:15

jarlsondre changed the title ~~Basic integration with the py-spy profiler~~ [DRAFT] Basic integration with the py-spy profiler Apr 11, 2025

change profiler output to raw

b6c952d

matbun reviewed Apr 14, 2025

View reviewed changes

src/itwinai/flamegraph.pl Outdated Show resolved Hide resolved

src/itwinai/slurm/utils.py Outdated Show resolved Hide resolved

add licensing and comments from PR

9aaa73f

aggregate functions

9f2a21c

jarlsondre added 5 commits April 30, 2025 17:16

add relevant itwinai calls

3958a69

add percentage to data aggregation

9d454a8

add aggregation wrt itwinai info as well

72df970

clean up code and add table library

bc43287

add tabulate to project dependencies

c72e6cb

jarlsondre marked this pull request as ready for review May 2, 2025 13:37

jarlsondre changed the title ~~[DRAFT] Basic integration with the py-spy profiler~~ Basic integration with the py-spy profiler May 2, 2025

jarlsondre requested review from matbun, lineick and Copilot May 2, 2025 13:37

Copilot AI reviewed May 2, 2025

View reviewed changes

matbun approved these changes May 2, 2025

View reviewed changes

jarlsondre merged commit e015ef4 into main May 2, 2025
9 checks passed

jarlsondre deleted the py-spy-profiling branch May 2, 2025 14:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Basic integration with the py-spy profiler #356

Basic integration with the py-spy profiler #356

Uh oh!

jarlsondre commented Mar 28, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

lineick commented Apr 28, 2025

Uh oh!

jarlsondre commented Apr 29, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI May 2, 2025

Uh oh!

Uh oh!

Uh oh!

Basic integration with the py-spy profiler #356

Basic integration with the py-spy profiler #356

Uh oh!

Conversation

jarlsondre commented Mar 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Basic Integration with the py-spy Profiler

Basic Data Aggregation

Uh oh!

Uh oh!

Uh oh!

lineick commented Apr 28, 2025

Uh oh!

jarlsondre commented Apr 29, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI May 2, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jarlsondre commented Mar 28, 2025 •

edited

Loading