- cmake 3.22 or above (cmake versions prior to 3.22 are not fully tested or validated)
- C++ compiler with C++17 support
- Intel(R) oneAPI Base Toolkits
- Python
- Intel(R) MPI (optional)
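A quick way to sanity-check the toolchain before building is to query the versions directly (a sketch; the compiler and launcher names depend on your environment):

```sh
cmake --version    # expect 3.22 or above
icpx --version     # any C++ compiler with C++17 support works; icpx ships with oneAPI
python --version
mpirun --version   # only needed if the optional Intel(R) MPI support is wanted
```
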
## Build
Similarly, one can use **--chrome-dnn-logging** for oneDNN.

The **--ccl-summary-report [-r]** option outputs a CCL call timing summary:

![CCL Call Timing!](/tools/unitrace/doc/images/ccl_summary_report.png)
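
For instance, the report can be requested directly on the `unitrace` command line (a minimal sketch; the MPI launcher, rank count, and application script are placeholders):

```sh
mpirun -n 2 unitrace --ccl-summary-report python ./app.py
```
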
If the application is a PyTorch workload, the presence of **--chrome-mpi-logging**, **--chrome-ccl-logging**, or **--chrome-dnn-logging** also enables PyTorch profiling (see the **Profile PyTorch** section for more information).

## Location of Trace Data
By default, all output data are written in the current working directory. However, one can specify a different directory for output.

Traces from multiple MPI ranks can be merged with the `mergetrace.py` script:

```sh
python mergetrace.py -o <output-trace-file> <input-trace-file-1> <input-trace-file-2> ...
```

![Multiple MPI Ranks Host-Device Timelines!](/tools/unitrace/doc/images/multipl-ranks-timelines.png)
## Profile PyTorch

To profile PyTorch, you need to enclose the code to be profiled with

```python
with torch.autograd.profiler.emit_itt():
    ......
```

For example:

```python
with torch.autograd.profiler.emit_itt(record_shapes=False):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        data = data.to("xpu")
        target = target.to("xpu")
        with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
            output = model(data)
            loss = criterion(output, target)
        loss.backward()
        optimizer.step()
```

To profile PyTorch, one of the options **--chrome-mpi-logging**, **--chrome-ccl-logging**, or **--chrome-dnn-logging** must be present. For example:

```sh
unitrace --chrome-kernel-logging --chrome-dnn-logging --chrome-ccl-logging python ./rn50.py
```

![PyTorch Profiling!](/tools/unitrace/doc/images/pytorch.png)

You can use the **PTI_ENABLE_COLLECTION** environment variable to selectively enable and disable profiling:

```python
with torch.autograd.profiler.emit_itt(record_shapes=False):
    os.environ["PTI_ENABLE_COLLECTION"] = "1"
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        data = data.to("xpu")
        target = target.to("xpu")
        with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
            output = model(data)
            loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx == 2:
            os.environ["PTI_ENABLE_COLLECTION"] = "0"
```

Alternatively, you can use itt-python to do selective profiling. itt-python can be installed from conda-forge:

```sh
conda install -c conda-forge --override-channels itt-python
```

```python
import itt
......

with torch.autograd.profiler.emit_itt(record_shapes=False):
    itt.resume()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        data = data.to("xpu")
        target = target.to("xpu")
        with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
            output = model(data)
            loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx == 2:
            itt.pause()
```

381
+
382
+ ` ` ` sh
383
+ unitrace --chrome-kernel-logging --chrome-dnn-logging --conditional-collection python ./rn50.py
384
+ ` ` `
385
+
305
386
## View Large Traces

By default, the memory limit of the internal representation of a trace is 2GB. To view large traces that require more than 2GB of memory, an external trace processor is needed.
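
One possible workflow is a sketch assuming the standalone `trace_processor` from the Perfetto project is used as the external trace processor:

```sh
# Fetch the standalone Perfetto trace processor (Linux/macOS).
curl -LO https://get.perfetto.dev/trace_processor
chmod +x trace_processor

# Serve the trace over HTTP (port 9001 by default), then open
# https://ui.perfetto.dev in a browser and attach it to the local
# trace processor to view the trace.
./trace_processor --httpd <trace-file>
```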