From 46dc0c72f65261e81522b5ef9f7555dec5b3cef8 Mon Sep 17 00:00:00 2001
From: Alex Omar
Date: Thu, 12 Jun 2025 07:25:39 -0700
Subject: [PATCH 1/4] Add information about pytorch multi gpu setup

---
 docs/source/features/multi_gpu.rst | 56 ++++++++++++++++++++++++------
 1 file changed, 45 insertions(+), 11 deletions(-)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 81f6fd40a88..ce417698bae 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -16,17 +16,48 @@ other workflows.
 Multi-GPU Training
 ------------------
 
-For complex reinforcement learning environments, it may be desirable to scale up training across multiple GPUs.
-This is possible in Isaac Lab through the use of the
-`PyTorch distributed `_ framework or the
-`JAX distributed `_ module respectively.
-
-In PyTorch, the :meth:`torch.distributed` API is used to launch multiple processes of training, where the number of
-processes must be equal to or less than the number of GPUs available. Each process runs on
-a dedicated GPU and launches its own instance of Isaac Sim and the Isaac Lab environment.
-Each process collects its own rollouts during the training process and has its own copy of the policy
-network. During training, gradients are aggregated across the processes and broadcasted back to the process
-at the end of the epoch.
+Isaac Lab supports the following multi-GPU training frameworks:
+
+* `Torchrun `_ through `PyTorch distributed `_
+* `JAX distributed `_
+
+PyTorch Torchrun Implementation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We are using `PyTorch Torchrun `_ to manage multi-GPU
+training. Torchrun manages the distributed training through:
+
+* **Process Management**: Launching one process per GPU, where each process is assigned to a specific GPU.
+* **Script Execution**: Running the same training script (e.g., RL Games trainer) on each process.
+* **Environment Instances**: Each process creates its own instance of the Isaac Lab environment.
+* **Gradient Synchronization**: Aggregating gradients across all processes and broadcasting the synchronized gradients back to each process after each training step.
+
+.. tip::
+    Check out this `3-minute YouTube video from PyTorch `_
+    to understand how Torchrun works.
+
+The key components in this setup are:
+
+* **Torchrun**: Handles process spawning, communication, and gradient synchronization.
+* **RL Games**: The reinforcement learning framework that runs the actual training algorithm.
+* **Isaac Lab**: Provides the simulation environment that each process instantiates independently.
+
+Under the hood, Torchrun uses the `DistributedDataParallel `_
+module to manage the distributed training. When training with multiple GPUs using Torchrun, the following happens:
+
+* Each GPU runs an independent process.
+* Each process executes the full training script.
+* Each process maintains its own:
+  * Isaac Lab environment instance (with *n* parallel environments).
+  * Policy network copy.
+  * Experience buffer for rollout collection.
+* All processes synchronize only for gradient updates.
+
+For a deeper dive into how Torchrun works, check out
+`PyTorch Docs: DistributedDataParallel - Internal Design `_.
+
+JAX Implementation
+^^^^^^^^^^^^^^^^^^
 
 In JAX, since the ML framework doesn't automatically start multiple processes from a single program invocation,
 the skrl library provides a module to start them.
 
 .. image:: ../_static/multi-gpu-rl/a3c-light.svg
     :class: only-light
     :align: center
     :alt: Multi-GPU training paradigm
     :width: 80%
 
@@ -45,6 +76,9 @@ the skrl library provides a module to start them.
 |
 
+Running Multi-GPU Training
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 To train with multiple GPUs, use the following command, where ``--nproc_per_node`` represents the number of available GPUs:
 
 .. tab-set::
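The torchrun/DDP flow this patch documents can be made concrete with a minimal, self-contained sketch. The toy linear ``policy``, the synthetic ``obs`` batch, and the script name ``train_sketch.py`` are illustrative stand-ins rather than Isaac Lab's actual trainer; only the ``torch.distributed`` and ``DistributedDataParallel`` calls are real APIs. Launched with, e.g., ``torchrun --nproc_per_node=2 train_sketch.py``, each process selects its GPU from ``LOCAL_RANK``, builds its own copy of the model, and gets gradient averaging in ``backward()``::

    # Hedged sketch of the pattern torchrun + DDP implement; the linear "policy"
    # and random "obs" batch are placeholders, not Isaac Lab's training script.
    # Launch with e.g.: torchrun --nproc_per_node=2 train_sketch.py
    import os

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP


    def main():
        # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process it spawns.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Every process builds its own copy of the policy on its dedicated GPU.
        policy = nn.Linear(8, 2).to(local_rank)
        # DDP broadcasts the initial weights, then all-reduces gradients in backward().
        policy = DDP(policy, device_ids=[local_rank])
        optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

        for _ in range(10):
            # Stand-in for per-process rollout collection; each rank has its own data.
            obs = torch.randn(32, 8, device=local_rank)
            loss = policy(obs).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()   # gradients are averaged across all ranks here
            optimizer.step()  # so every rank applies the same update

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()

Note that the all-reduce happens inside ``loss.backward()``, which is why the list above says the processes synchronize only for gradient updates.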
From 0b1cdfc2978553b024d97ff4e843f57829afcffd Mon Sep 17 00:00:00 2001
From: Alex Omar
Date: Wed, 18 Jun 2025 12:45:54 -0700
Subject: [PATCH 2/4] Update SKRL docs for JAX

---
 CONTRIBUTORS.md                    |  1 +
 docs/source/features/multi_gpu.rst | 24 ++++++++++++++----------
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
index 90b9befa261..aac263f8747 100644
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@@ -36,6 +36,7 @@ Guidelines for modifications:
 ## Contributors
 
 * Alessandro Assirelli
+* Alex Omar
 * Alice Zhou
 * Amr Mousa
 * Andrej Orsula
diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index ce417698bae..67b60924bc1 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -17,7 +17,6 @@ Multi-GPU Training
 ------------------
 
 Isaac Lab supports the following multi-GPU training frameworks:
-
 * `Torchrun `_ through `PyTorch distributed `_
 * `JAX distributed `_
 
@@ -25,12 +24,13 @@ PyTorch Torchrun Implementation
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 We are using `PyTorch Torchrun `_ to manage multi-GPU
-training. Torchrun manages the distributed training through:
+training. Torchrun manages the distributed training by:
 
 * **Process Management**: Launching one process per GPU, where each process is assigned to a specific GPU.
 * **Script Execution**: Running the same training script (e.g., RL Games trainer) on each process.
 * **Environment Instances**: Each process creates its own instance of the Isaac Lab environment.
-* **Gradient Synchronization**: Aggregating gradients across all processes and broadcasting the synchronized gradients back to each process after each training step.
+* **Gradient Synchronization**: Aggregating gradients across all processes and broadcasting the synchronized
+  gradients back to each process after each training step.
 
 .. tip::
     Check out this `3-minute YouTube video from PyTorch `_
     to understand how Torchrun works.
@@ -45,13 +45,13 @@ The key components in this setup are:
 Under the hood, Torchrun uses the `DistributedDataParallel `_
 module to manage the distributed training. When training with multiple GPUs using Torchrun, the following happens:
 
-* Each GPU runs an independent process.
-* Each process executes the full training script.
+* Each GPU runs an independent process
+* Each process executes the full training script
 * Each process maintains its own:
-  * Isaac Lab environment instance (with *n* parallel environments).
-  * Policy network copy.
-  * Experience buffer for rollout collection.
-* All processes synchronize only for gradient updates.
+  * Isaac Lab environment instance (with *n* parallel environments)
+  * Policy network copy
+  * Experience buffer for rollout collection
+* All processes synchronize only for gradient updates
 
 For a deeper dive into how Torchrun works, check out
 `PyTorch Docs: DistributedDataParallel - Internal Design `_.
@@ -59,7 +59,11 @@ JAX Implementation
 ^^^^^^^^^^^^^^^^^^
 
-In JAX, since the ML framework doesn't automatically start multiple processes from a single program invocation,
+.. tip::
+    JAX is only supported with the skrl library.
+
+With JAX, we are using `skrl.utils.distributed.jax `_.
+Since the ML framework doesn't automatically start multiple processes from a single program invocation,
 the skrl library provides a module to start them.
 
 .. image:: ../_static/multi-gpu-rl/a3c-light.svg
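The JAX side follows the same data-parallel pattern, which can be sketched with plain JAX primitives. This is an illustrative stand-in, not skrl's actual ``skrl.utils.distributed.jax`` launcher or training loop: each device holds a replica of the parameters, computes gradients on its own batch shard, and ``jax.lax.pmean`` averages the gradients so every replica applies an identical update (the analogue of DDP's all-reduce)::

    # Hedged sketch of multi-device gradient synchronization in JAX; the toy
    # quadratic loss and synthetic data are placeholders, not skrl's code.
    import jax
    import jax.numpy as jnp


    def loss_fn(params, x, y):
        # Toy regression loss standing in for a policy-optimization objective.
        return jnp.mean((x @ params - y) ** 2)


    def step(params, x, y):
        grads = jax.grad(loss_fn)(params, x, y)
        # Average gradients across all devices, mirroring DDP's all-reduce.
        grads = jax.lax.pmean(grads, axis_name="devices")
        return params - 0.1 * grads


    n_dev = jax.local_device_count()
    params = jnp.zeros((8,))
    replicated = jnp.stack([params] * n_dev)  # one parameter copy per device
    x = jnp.ones((n_dev, 32, 8))              # per-device batch shards
    y = jnp.ones((n_dev, 32))

    p_step = jax.pmap(step, axis_name="devices")
    replicated = p_step(replicated, x, y)     # identical update on every device

Because the replicas start identical and apply the same averaged gradient, their parameters never diverge, which is the property the A3C-style figure in the docs illustrates.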
From 4c3a2d1adb8a408561947746872069959441631b Mon Sep 17 00:00:00 2001
From: Alex Omar
Date: Mon, 30 Jun 2025 14:17:27 -0700
Subject: [PATCH 3/4] Update RL multi-GPU documentation.

---
 docs/source/features/multi_gpu.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 67b60924bc1..cdf5d9b2f87 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -39,7 +39,7 @@ gradients back to each process after each training step.
 The key components in this setup are:
 
 * **Torchrun**: Handles process spawning, communication, and gradient synchronization.
-* **RL Games**: The reinforcement learning framework that runs the actual training algorithm.
+* **RL Framework**: The reinforcement learning framework that runs the actual training algorithm.
 * **Isaac Lab**: Provides the simulation environment that each process instantiates independently.
 
 Under the hood, Torchrun uses the `DistributedDataParallel `_

From 8304f7f769580b442f42a7b0f5f1cdfb2e970a51 Mon Sep 17 00:00:00 2001
From: Alex Omar
Date: Mon, 30 Jun 2025 14:19:18 -0700
Subject: [PATCH 4/4] Update RL multi-GPU documentation.

---
 docs/source/features/multi_gpu.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index cdf5d9b2f87..220cfdafe9d 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -39,7 +39,7 @@ gradients back to each process after each training step.
 The key components in this setup are:
 
 * **Torchrun**: Handles process spawning, communication, and gradient synchronization.
-* **RL Framework**: The reinforcement learning framework that runs the actual training algorithm.
+* **RL Library**: The reinforcement learning library that runs the actual training algorithm.
 * **Isaac Lab**: Provides the simulation environment that each process instantiates independently.
 
 Under the hood, Torchrun uses the `DistributedDataParallel `_