From 46dc0c72f65261e81522b5ef9f7555dec5b3cef8 Mon Sep 17 00:00:00 2001
From: Alex Omar
Date: Thu, 12 Jun 2025 07:25:39 -0700
Subject: [PATCH 1/4] Add information about pytorch multi gpu setup

---
 docs/source/features/multi_gpu.rst | 56 ++++++++++++++++++++++++------
 1 file changed, 45 insertions(+), 11 deletions(-)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 81f6fd40a88..ce417698bae 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -16,17 +16,48 @@ other workflows.
 Multi-GPU Training
 ------------------
 
-For complex reinforcement learning environments, it may be desirable to scale up training across multiple GPUs.
-This is possible in Isaac Lab through the use of the
-`PyTorch distributed `_ framework or the
-`JAX distributed `_ module respectively.
-
-In PyTorch, the :meth:`torch.distributed` API is used to launch multiple processes of training, where the number of
-processes must be equal to or less than the number of GPUs available. Each process runs on
-a dedicated GPU and launches its own instance of Isaac Sim and the Isaac Lab environment.
-Each process collects its own rollouts during the training process and has its own copy of the policy
-network. During training, gradients are aggregated across the processes and broadcasted back to the process
-at the end of the epoch.
+Isaac Lab supports the following multi-GPU training frameworks:
+
+* `Torchrun `_ through `PyTorch distributed `_
+* `JAX distributed `_
+
+PyTorch Torchrun Implementation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We are using `PyTorch Torchrun `_ to manage multi-GPU
+training. Torchrun manages the distributed training through:
+
+* **Process Management**: Launching one process per GPU, where each process is assigned to a specific GPU.
+* **Script Execution**: Running the same training script (e.g., RL Games trainer) on each process.
+* **Environment Instances**: Each process creates its own instance of the Isaac Lab environment.
+* **Gradient Synchronization**: Aggregating gradients across all processes and broadcasting the synchronized gradients back to each process after each training step.
+
+.. tip::
+    Check out this `3-minute YouTube video from PyTorch `_
+    to understand how Torchrun works.
+
+The key components in this setup are:
+
+* **Torchrun**: Handles process spawning, communication, and gradient synchronization.
+* **RL Games**: The reinforcement learning framework that runs the actual training algorithm.
+* **Isaac Lab**: Provides the simulation environment that each process instantiates independently.
+
+Under the hood, Torchrun uses the `DistributedDataParallel `_
+module to manage the distributed training. When training with multiple GPUs using Torchrun, the following happens:
+
+* Each GPU runs an independent process.
+* Each process executes the full training script.
+* Each process maintains its own:
+  * Isaac Lab environment instance (with *n* parallel environments).
+  * Policy network copy.
+  * Experience buffer for rollout collection.
+* All processes synchronize only for gradient updates.
+
+For a deeper dive into how Torchrun works, check out
+`PyTorch Docs: DistributedDataParallel - Internal Design `_.
+
+JAX Implementation
+^^^^^^^^^^^^^^^^^^
 
 In JAX, since the ML framework doesn't automatically start multiple processes from a single program invocation,
 the skrl library provides a module to start them.
 
 .. image:: ../_static/multi-gpu-rl/a3c-light.svg
     :class: only-light
     :align: center
     :alt: Multi-GPU training paradigm
     :width: 80%
 
@@ -45,6 +76,9 @@ the skrl library provides a module to start them.
 |
 
+Running Multi-GPU Training
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 To train with multiple GPUs, use the following command, where ``--nproc_per_node`` represents the number of available GPUs:
 
 .. tab-set::
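The torchrun/DDP flow this patch documents can be made concrete with a minimal, self-contained sketch. The toy linear ``policy``, the synthetic ``obs`` batch, and the script name ``train_sketch.py`` are illustrative stand-ins rather than Isaac Lab's actual trainer; only the ``torch.distributed`` and ``DistributedDataParallel`` calls are real APIs. Launched with, e.g., ``torchrun --nproc_per_node=2 train_sketch.py``, each process selects its GPU from ``LOCAL_RANK``, builds its own copy of the model, and gets gradient averaging in ``backward()``::

    # Hedged sketch of the pattern torchrun + DDP implement; the linear "policy"
    # and random "obs" batch are placeholders, not Isaac Lab's training script.
    # Launch with e.g.: torchrun --nproc_per_node=2 train_sketch.py
    import os

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP


    def main():
        # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process it spawns.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Every process builds its own copy of the policy on its dedicated GPU.
        policy = nn.Linear(8, 2).to(local_rank)
        # DDP broadcasts the initial weights, then all-reduces gradients in backward().
        policy = DDP(policy, device_ids=[local_rank])
        optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

        for _ in range(10):
            # Stand-in for per-process rollout collection; each rank has its own data.
            obs = torch.randn(32, 8, device=local_rank)
            loss = policy(obs).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()   # gradients are averaged across all ranks here
            optimizer.step()  # so every rank applies the same update

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()

Note that the all-reduce happens inside ``loss.backward()``, which is why the list above says the processes synchronize only for gradient updates.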
From 0b1cdfc2978553b024d97ff4e843f57829afcffd Mon Sep 17 00:00:00 2001
From: Alex Omar
Date: Wed, 18 Jun 2025 12:45:54 -0700
Subject: [PATCH 2/4] Update SKRL docs for JAX

---
 CONTRIBUTORS.md                    |  1 +
 docs/source/features/multi_gpu.rst | 24 ++++++++++++++----------
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
index 90b9befa261..aac263f8747 100644
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@@ -36,6 +36,7 @@ Guidelines for modifications:
 ## Contributors
 
 * Alessandro Assirelli
+* Alex Omar
 * Alice Zhou
 * Amr Mousa
 * Andrej Orsula
diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index ce417698bae..67b60924bc1 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -17,7 +17,6 @@ Multi-GPU Training
 ------------------
 
 Isaac Lab supports the following multi-GPU training frameworks:
-
 * `Torchrun `_ through `PyTorch distributed `_
 * `JAX distributed `_
 
@@ -25,12 +24,13 @@ PyTorch Torchrun Implementation
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 We are using `PyTorch Torchrun `_ to manage multi-GPU
-training. Torchrun manages the distributed training through:
+training. Torchrun manages the distributed training by:
 
 * **Process Management**: Launching one process per GPU, where each process is assigned to a specific GPU.
 * **Script Execution**: Running the same training script (e.g., RL Games trainer) on each process.
 * **Environment Instances**: Each process creates its own instance of the Isaac Lab environment.
-* **Gradient Synchronization**: Aggregating gradients across all processes and broadcasting the synchronized gradients back to each process after each training step.
+* **Gradient Synchronization**: Aggregating gradients across all processes and broadcasting the synchronized
+  gradients back to each process after each training step.
 
 .. tip::
     Check out this `3-minute YouTube video from PyTorch `_
     to understand how Torchrun works.
@@ -45,13 +45,13 @@ The key components in this setup are:
 Under the hood, Torchrun uses the `DistributedDataParallel `_
 module to manage the distributed training. When training with multiple GPUs using Torchrun, the following happens:
 
-* Each GPU runs an independent process.
-* Each process executes the full training script.
+* Each GPU runs an independent process
+* Each process executes the full training script
 * Each process maintains its own:
-  * Isaac Lab environment instance (with *n* parallel environments).
-  * Policy network copy.
-  * Experience buffer for rollout collection.
-* All processes synchronize only for gradient updates.
+  * Isaac Lab environment instance (with *n* parallel environments)
+  * Policy network copy
+  * Experience buffer for rollout collection
+* All processes synchronize only for gradient updates
 
 For a deeper dive into how Torchrun works, check out
 `PyTorch Docs: DistributedDataParallel - Internal Design `_.
@@ -59,7 +59,11 @@ JAX Implementation
 ^^^^^^^^^^^^^^^^^^
 
-In JAX, since the ML framework doesn't automatically start multiple processes from a single program invocation,
+.. tip::
+    JAX is only supported with the skrl library.
+
+With JAX, we are using `skrl.utils.distributed.jax `_.
+Since the ML framework doesn't automatically start multiple processes from a single program invocation,
 the skrl library provides a module to start them.
 
 .. image:: ../_static/multi-gpu-rl/a3c-light.svg
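The JAX side follows the same data-parallel pattern, which can be sketched with plain JAX primitives. This is an illustrative stand-in, not skrl's actual ``skrl.utils.distributed.jax`` launcher or training loop: each device holds a replica of the parameters, computes gradients on its own batch shard, and ``jax.lax.pmean`` averages the gradients so every replica applies an identical update (the analogue of DDP's all-reduce)::

    # Hedged sketch of multi-device gradient synchronization in JAX; the toy
    # quadratic loss and synthetic data are placeholders, not skrl's code.
    import jax
    import jax.numpy as jnp


    def loss_fn(params, x, y):
        # Toy regression loss standing in for a policy-optimization objective.
        return jnp.mean((x @ params - y) ** 2)


    def step(params, x, y):
        grads = jax.grad(loss_fn)(params, x, y)
        # Average gradients across all devices, mirroring DDP's all-reduce.
        grads = jax.lax.pmean(grads, axis_name="devices")
        return params - 0.1 * grads


    n_dev = jax.local_device_count()
    params = jnp.zeros((8,))
    replicated = jnp.stack([params] * n_dev)  # one parameter copy per device
    x = jnp.ones((n_dev, 32, 8))              # per-device batch shards
    y = jnp.ones((n_dev, 32))

    p_step = jax.pmap(step, axis_name="devices")
    replicated = p_step(replicated, x, y)     # identical update on every device

Because the replicas start identical and apply the same averaged gradient, their parameters never diverge, which is the property the A3C-style figure in the docs illustrates.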
From 4c3a2d1adb8a408561947746872069959441631b Mon Sep 17 00:00:00 2001
From: Alex Omar
Date: Mon, 30 Jun 2025 14:17:27 -0700
Subject: [PATCH 3/4] Update RL multi-GPU documentation.

---
 docs/source/features/multi_gpu.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 67b60924bc1..cdf5d9b2f87 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -39,7 +39,7 @@ gradients back to each process after each training step.
 The key components in this setup are:
 
 * **Torchrun**: Handles process spawning, communication, and gradient synchronization.
-* **RL Games**: The reinforcement learning framework that runs the actual training algorithm.
+* **RL Framework**: The reinforcement learning framework that runs the actual training algorithm.
 * **Isaac Lab**: Provides the simulation environment that each process instantiates independently.
 
 Under the hood, Torchrun uses the `DistributedDataParallel `_

From 8304f7f769580b442f42a7b0f5f1cdfb2e970a51 Mon Sep 17 00:00:00 2001
From: Alex Omar
Date: Mon, 30 Jun 2025 14:19:18 -0700
Subject: [PATCH 4/4] Update RL multi-GPU documentation.

---
 docs/source/features/multi_gpu.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index cdf5d9b2f87..220cfdafe9d 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -39,7 +39,7 @@ gradients back to each process after each training step.
 The key components in this setup are:
 
 * **Torchrun**: Handles process spawning, communication, and gradient synchronization.
-* **RL Framework**: The reinforcement learning framework that runs the actual training algorithm.
+* **RL Library**: The reinforcement learning library that runs the actual training algorithm.
 * **Isaac Lab**: Provides the simulation environment that each process instantiates independently.
 
 Under the hood, Torchrun uses the `DistributedDataParallel `_