diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
index fe4b06dd263..efe9450ae5a 100644
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@@ -36,6 +36,7 @@ Guidelines for modifications:
 ## Contributors
 
 * Alessandro Assirelli
+* Alex Omar
 * Alice Zhou
 * Amr Mousa
 * Andrej Orsula
diff --git a/docs/source/features/multi_gpu.rst b/docs/source/features/multi_gpu.rst
index 81f6fd40a88..220cfdafe9d 100644
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -16,19 +16,54 @@ other workflows.
 Multi-GPU Training
 ------------------
 
-For complex reinforcement learning environments, it may be desirable to scale up training across multiple GPUs.
-This is possible in Isaac Lab through the use of the
-`PyTorch distributed <https://pytorch.org/docs/stable/distributed.html>`_ framework or the
-`JAX distributed <https://jax.readthedocs.io/en/latest/jax.distributed.html>`_ module respectively.
-
-In PyTorch, the :meth:`torch.distributed` API is used to launch multiple processes of training, where the number of
-processes must be equal to or less than the number of GPUs available. Each process runs on
-a dedicated GPU and launches its own instance of Isaac Sim and the Isaac Lab environment.
-Each process collects its own rollouts during the training process and has its own copy of the policy
-network. During training, gradients are aggregated across the processes and broadcasted back to the process
-at the end of the epoch.
-
-In JAX, since the ML framework doesn't automatically start multiple processes from a single program invocation,
+Isaac Lab supports the following multi-GPU training frameworks:
+
+* `Torchrun <https://pytorch.org/docs/stable/elastic/run.html>`_ through `PyTorch distributed <https://pytorch.org/docs/stable/distributed.html>`_
+* `JAX distributed <https://jax.readthedocs.io/en/latest/jax.distributed.html>`_
+
+PyTorch Torchrun Implementation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We use `PyTorch Torchrun <https://pytorch.org/docs/stable/elastic/run.html>`_ to manage multi-GPU
+training. Torchrun coordinates the distributed training by:
+
+* **Process Management**: Launching one process per GPU, where each process is assigned to a specific GPU.
+* **Script Execution**: Running the same training script (e.g., the RL Games trainer) on each process.
+* **Environment Instances**: Each process creates its own instance of the Isaac Lab environment.
+* **Gradient Synchronization**: Aggregating gradients across all processes and broadcasting the synchronized
+  gradients back to each process after each training step.
+
+.. tip::
+
+   Check out this `3-minute YouTube video from PyTorch `_
+   to understand how Torchrun works.
+
+The key components in this setup are:
+
+* **Torchrun**: Handles process spawning, communication, and gradient synchronization.
+* **RL Library**: The reinforcement learning library that runs the actual training algorithm.
+* **Isaac Lab**: Provides the simulation environment that each process instantiates independently.
+
+Under the hood, Torchrun uses the `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_
+module to manage distributed training. When training with multiple GPUs using Torchrun, the following happens:
+
+* Each GPU runs an independent process.
+* Each process executes the full training script.
+* Each process maintains its own:
+
+  * Isaac Lab environment instance (with *n* parallel environments)
+  * Policy network copy
+  * Experience buffer for rollout collection
+
+* All processes synchronize only for gradient updates.
+
+For a deeper dive into how Torchrun works, check out
+`PyTorch Docs: DistributedDataParallel - Internal Design <https://pytorch.org/docs/stable/notes/ddp.html#internal-design>`_.
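+
+The snippet below is a minimal, illustrative sketch of this process-per-GPU model. It is not Isaac Lab's actual
+training loop: the rollout data is replaced by random tensors and the policy by a small placeholder network.
+It shows how a Torchrun-launched process reads its rank from the environment, wraps its own policy copy in
+``DistributedDataParallel``, and relies on ``backward()`` to synchronize gradients across GPUs:
+
+.. code-block:: python
+
+    import os
+
+    import torch
+    import torch.distributed as dist
+    from torch.nn.parallel import DistributedDataParallel as DDP
+
+
+    def main():
+        # Torchrun exports RANK, LOCAL_RANK and WORLD_SIZE for each process it spawns.
+        dist.init_process_group(backend="nccl")
+        local_rank = int(os.environ["LOCAL_RANK"])
+        device = torch.device(f"cuda:{local_rank}")
+        torch.cuda.set_device(device)
+
+        # Each process keeps its own copy of the policy on its dedicated GPU.
+        policy = torch.nn.Sequential(
+            torch.nn.Linear(48, 256), torch.nn.ELU(), torch.nn.Linear(256, 12)
+        ).to(device)
+        # DDP all-reduces gradients across all processes during backward().
+        policy = DDP(policy, device_ids=[local_rank])
+        optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
+
+        for _ in range(100):
+            # Stand-in for a rollout batch: in Isaac Lab, each process would instead
+            # collect rollouts from its own environment instance.
+            obs = torch.randn(4096, 48, device=device)
+            loss = policy(obs).pow(2).mean()
+
+            optimizer.zero_grad()
+            loss.backward()  # gradients are synchronized across GPUs here
+            optimizer.step()
+
+        dist.destroy_process_group()
+
+
+    if __name__ == "__main__":
+        main()
+
+Saved as, say, ``ddp_sketch.py`` and launched with ``torchrun --nnodes=1 --nproc_per_node=2 ddp_sketch.py``,
+two such processes train on two GPUs while keeping their policy copies in sync.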
+
+JAX Implementation
+^^^^^^^^^^^^^^^^^^
+
+.. tip::
+
+   JAX is only supported with the skrl library.
+
+With JAX, we use `skrl.utils.distributed.jax <https://skrl.readthedocs.io/en/latest/api/utils/distributed.html>`_.
+Since the ML framework doesn't automatically start multiple processes from a single program invocation,
 the skrl library provides a module to start them.
 
 .. image:: ../_static/multi-gpu-rl/a3c-light.svg
@@ -45,6 +80,9 @@ the skrl library provides a module to start them.
 
 |
 
+Running Multi-GPU Training
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 To train with multiple GPUs, use the following command, where ``--nproc_per_node`` represents the number of available GPUs:
 
 .. tab-set::