Merged
1 change: 1 addition & 0 deletions .github/actions/spelling/allow.txt
@@ -273,6 +273,7 @@
pytorch
quantumespresso
quasiparticles
quickstart
recv
rgw
ripgrep
rocm
10 changes: 10 additions & 0 deletions docs/software/communication/nccl.md
@@ -22,6 +22,16 @@
While the container engine sets these automatically when using the NCCL hook, th

[_Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms_](https://arxiv.org/abs/2507.04786v2) contains detailed information about NCCL algorithms and protocols, which can be helpful for deciding if your application could benefit from an alternative configuration.
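If profiling suggests an alternative configuration is worth testing, NCCL's documented `NCCL_ALGO` and `NCCL_PROTO` environment variables can be used to override the automatic selection. A minimal sketch (the values shown are examples for experimentation, not recommendations for this system):

```bash
# Force a specific algorithm/protocol combination for testing.
# These are documented NCCL variables; the chosen values are illustrative.
export NCCL_ALGO=Tree      # e.g. Ring, Tree
export NCCL_PROTO=Simple   # e.g. LL, LL128, Simple
```

Running with `NCCL_DEBUG=INFO` makes it possible to verify which algorithm and protocol NCCL actually selected for each operation, so a forced configuration can be compared against the default.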

In addition to the above variables, setting `NCCL_NCHANNELS_PER_NET_PEER` can improve point-to-point performance (operations based directly on send/recv):

```bash
export NCCL_NCHANNELS_PER_NET_PEER=4
```

A value of 4 is generally a good compromise: it improves point-to-point performance without affecting the performance of collectives.
Higher values such as 16 or 32 can further improve send/recv performance but may degrade collective performance, so the optimal value depends on the mix of operations an application uses.
The option is undocumented, but [this issue](https://github.com/NVIDIA/nccl/issues/1272) and the paper linked above provide additional details.
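Since the optimal value is workload-dependent, it is worth measuring directly. A minimal sketch of such a sweep, assuming the NVIDIA `nccl-tests` suite is built and a Slurm allocation is available (the binary path and `srun`/test flags are illustrative and should be adapted to your setup):

```bash
# Sweep NCCL_NCHANNELS_PER_NET_PEER and compare point-to-point bandwidth
# using sendrecv_perf from nccl-tests (path is an assumption).
for n in 1 4 16 32; do
    echo "NCCL_NCHANNELS_PER_NET_PEER=$n"
    NCCL_NCHANNELS_PER_NET_PEER=$n \
        srun ./build/sendrecv_perf -b 1M -e 1G -f 2 -g 1
done
```

Repeating the same sweep with a collectives benchmark (e.g. `all_reduce_perf`) shows whether a higher value trades collective performance for send/recv performance in your case.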

!!! warning "NCCL watchdog timeout or hanging process"
    In some cases, still under investigation, NCCL may hang, resulting in a stuck process or a watchdog timeout error.
In this scenario, we recommend disabling Slingshot eager messages with the following workaround: