
Commit 5c44283

Added "Profile PyTorch" section in README.md (#49)
1 parent 53c2f91 commit 5c44283

File tree

2 files changed: +81 -0 lines changed

tools/unitrace/README.md

Lines changed: 81 additions & 0 deletions
@@ -16,6 +16,7 @@ Intel(R) GPU applications.
- cmake 3.22 or above (cmake versions prior to 3.22 are not fully tested or validated)
- C++ compiler with C++17 support
- Intel(R) oneAPI Base Toolkits
- Python
- Intel(R) MPI (optional)

## Build
@@ -219,6 +220,8 @@ Similarly, one can use **--chrome-dnn-logging** for oneDNN.
The **--ccl-summary-report [-r]** option outputs a CCL call timing summary:

![CCL Call Timing!](/tools/unitrace/doc/images/ccl_summary_report.png)
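A minimal sketch of such a run; the MPI launcher and rank count are assumptions (CCL workloads typically run multi-rank), and the script name is reused from the PyTorch examples below:

```sh
mpiexec -n 2 unitrace --ccl-summary-report python ./rn50.py
```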
If the application is a PyTorch workload, the presence of **--chrome-mpi-logging**, **--chrome-ccl-logging**, or **--chrome-dnn-logging** also enables PyTorch profiling (see the **Profile PyTorch** section for more information).

## Location of Trace Data

By default, all output data are written in the current working directory. However, one can specify a different directory for output:
@@ -302,6 +305,84 @@ python mergetrace.py -o <output-trace-file> <input-trace-file-1> <input-trace-fi

![Multiple MPI Ranks Host-Device Timelines!](/tools/unitrace/doc/images/multipl-ranks-timelines.png)

## Profile PyTorch

To profile PyTorch, you need to enclose the code to be profiled in a `torch.autograd.profiler.emit_itt()` context:

```python
with torch.autograd.profiler.emit_itt():
    ...  # code to be profiled
```

For example:

```python
with torch.autograd.profiler.emit_itt(record_shapes=False):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        data = data.to("xpu")
        target = target.to("xpu")
        with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
            output = model(data)
            loss = criterion(output, target)
        loss.backward()
        optimizer.step()
```

As noted above, one of the options **--chrome-mpi-logging**, **--chrome-ccl-logging**, or **--chrome-dnn-logging** must be present to profile PyTorch. For example:

```sh
unitrace --chrome-kernel-logging --chrome-dnn-logging --chrome-ccl-logging python ./rn50.py
```

![PyTorch Profiling!](/tools/unitrace/doc/images/pytorch.png)

You can use the **PTI_ENABLE_COLLECTION** environment variable to selectively enable or disable profiling:

```python
import os

with torch.autograd.profiler.emit_itt(record_shapes=False):
    os.environ["PTI_ENABLE_COLLECTION"] = "1"
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        data = data.to("xpu")
        target = target.to("xpu")
        with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
            output = model(data)
            loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx == 2:
            os.environ["PTI_ENABLE_COLLECTION"] = "0"
```
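With the environment variable toggled inside the script, the launch presumably also needs the **--conditional-collection** option so that unitrace honors **PTI_ENABLE_COLLECTION**; a sketch of such an invocation, reusing the command shown with the itt-python example below:

```sh
unitrace --chrome-kernel-logging --chrome-dnn-logging --conditional-collection python ./rn50.py
```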
Alternatively, you can use itt-python to do selective profiling. The itt-python package can be installed from conda-forge:

```sh
conda install -c conda-forge --override-channels itt-python
```

```python
import itt

...  # model, optimizer, and data loader set up as above

with torch.autograd.profiler.emit_itt(record_shapes=False):
    itt.resume()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        data = data.to("xpu")
        target = target.to("xpu")
        with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
            output = model(data)
            loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx == 2:
            itt.pause()
```

With `itt.resume()` and `itt.pause()` in place, run unitrace with the **--conditional-collection** option:

```sh
unitrace --chrome-kernel-logging --chrome-dnn-logging --conditional-collection python ./rn50.py
```

## View Large Traces
By default, the memory limit of the internal representation of a trace is 2GB. To view large traces that require more than 2GB of memory, an external trace processor is needed.
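The remainder of this section lies outside the diff hunk; one common setup, assuming the external processor is Perfetto's `trace_processor` (an assumption, not confirmed by this hunk):

```sh
# Serve the trace from a local trace_processor instance over HTTP
# so the Perfetto UI can attach to it instead of loading the file directly.
trace_processor --httpd <trace-file>
```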

tools/unitrace/doc/images/pytorch.png

60.8 KB
