###########################################
Training models with billions of parameters
###########################################

Today, large models with billions of parameters are trained with many GPUs across several machines in parallel.
Even a single H100 GPU with 80 GB of VRAM (one of the largest available today) is not enough to train just a 30B parameter model, even with batch size 1 and 16-bit precision (a rough estimate follows below).
The memory consumption for training is generally made up of

1. the model parameters,
2. the layer activations (forward),
3. the gradients (backward),
4. the optimizer states (e.g., Adam has two additional exponential averages per parameter) and
5. model outputs and loss.

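
As a back-of-the-envelope illustration of why this is the case, consider only items 1, 3, and 4 for a 30B parameter model trained in 16-bit precision with Adam (a rough sketch; activations, outputs, and framework overhead are ignored, and the optimizer states are assumed to be kept in 32-bit, as is common):

.. code-block:: python

    # Rough lower bound on training memory for a 30B-parameter model.
    # Activations, model outputs, and framework overhead are NOT included,
    # and many setups additionally keep a 32-bit master copy of the weights.
    num_params = 30e9

    weights_gb = num_params * 2 / 1e9          # 16-bit weights: 2 bytes/param    -> ~60 GB
    grads_gb = num_params * 2 / 1e9            # 16-bit gradients: 2 bytes/param  -> ~60 GB
    adam_states_gb = num_params * 2 * 4 / 1e9  # two 32-bit moments: 8 bytes/param -> ~240 GB

    total_gb = weights_gb + grads_gb + adam_states_gb
    print(f"~{total_gb:.0f} GB before activations, vs. 80 GB on a single H100")
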

|

When the sum of these memory components exceeds the VRAM of a single GPU, regular data-parallel training (DDP) can no longer be employed.
To alleviate this limitation, we need to introduce **Model Parallelism**.


----


**************************
What is Model Parallelism?
**************************

There are different types of model parallelism, each with its own trade-offs.

**Fully Sharded Data Parallelism (FSDP)** shards both model parameters and optimizer states across multiple GPUs, significantly reducing memory usage per GPU.
This method, while highly memory-efficient, involves frequent synchronization between GPUs, introducing communication overhead and complexity in implementation.
FSDP is advantageous when memory constraints are the primary issue, provided there are high-bandwidth interconnects to minimize latency.

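
As a minimal sketch of what this looks like in PyTorch Lightning (``MyLightningModule`` and ``train_dataloader`` are hypothetical placeholders for your own code; the FSDP guide linked below covers the details):

.. code-block:: python

    import lightning as L

    # Hypothetical placeholders: substitute your own LightningModule and dataloader.
    from my_project import MyLightningModule, train_dataloader

    trainer = L.Trainer(
        accelerator="gpu",
        devices=8,               # shard model, gradients, and optimizer states across 8 GPUs
        strategy="fsdp",         # enable Fully Sharded Data Parallelism
        precision="bf16-mixed",  # 16-bit mixed precision to further reduce memory
    )
    trainer.fit(MyLightningModule(), train_dataloader)
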

**Tensor Parallelism (TP)** splits individual tensors across GPUs, enabling fine-grained distribution of computation and memory.
It scales well to a large number of GPUs but requires synchronization of tensor slices after each operation, which adds communication overhead.
TP is most effective with models that have many linear layers (such as LLMs), offering a balance between memory distribution and computational efficiency.

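
To illustrate the principle at the PyTorch level (a sketch only, assuming the tensor-parallel APIs available in recent PyTorch 2.x releases and a process group launched with ``torchrun``; Lightning's own TP integration is covered in the TP guide linked below), the two linear layers of a feed-forward block can be split column-wise and row-wise across the GPUs of a device mesh:

.. code-block:: python

    import os

    import torch
    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module

    class FeedForward(nn.Module):
        def __init__(self, dim=4096, hidden=16384):
            super().__init__()
            self.w1 = nn.Linear(dim, hidden)
            self.w2 = nn.Linear(hidden, dim)

        def forward(self, x):
            return self.w2(self.w1(x).relu())

    # Each process uses its own GPU; run under `torchrun --nproc_per_node=8 ...`.
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    mesh = init_device_mesh("cuda", (8,))  # one tensor-parallel group spanning 8 GPUs

    model = FeedForward().cuda()
    # w1 is split along its output dimension, w2 along its input dimension, so each GPU
    # holds only a 1/8 shard of the weights and a single all-reduce combines the output.
    model = parallelize_module(model, mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()})
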

**Pipeline Parallelism (PP)** divides model layers into segments, each processed by different GPUs, reducing memory load per GPU and minimizing inter-GPU communication to pipeline stage boundaries.
While this reduces communication overhead, it can introduce pipeline bubbles where some GPUs idle, leading to potential inefficiencies.
PP is ideal for deep models with sequential architectures (such as LLMs), though it requires careful management to minimize idle times.

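
Lightning does not offer pipeline parallelism yet (see below), but the idea can be illustrated with plain PyTorch: the model is cut into stages that live on different GPUs, and the batch is split into micro-batches that flow through the stages. A real pipeline schedule would additionally overlap the stages across micro-batches to shrink the idle bubbles; this naive sketch (illustrative layer sizes, two GPUs assumed) processes them sequentially:

.. code-block:: python

    import torch
    import torch.nn as nn

    # Two pipeline stages, each living on its own GPU.
    stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
    stage1 = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU()).to("cuda:1")

    batch = torch.randn(64, 1024, device="cuda:0")
    outputs = []
    # Split the batch into micro-batches and stream them through the stages.
    for micro_batch in batch.chunk(4):
        hidden = stage0(micro_batch)                 # computed on GPU 0
        outputs.append(stage1(hidden.to("cuda:1")))  # computed on GPU 1
    output = torch.cat(outputs)
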

Choosing a model parallelism style involves considering model architecture, hardware interconnects, and training efficiency.
In practice, hybrid approaches combining FSDP, TP, and PP are often used to leverage the strengths of each method while mitigating their weaknesses.

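
A common hybrid layout, for example, applies tensor parallelism inside each machine (where interconnects are fastest) and a data-parallel/FSDP dimension across machines. At the PyTorch level this is usually expressed as a 2D device mesh; the sketch below assumes 2 machines with 4 GPUs each, and the mesh dimension names are illustrative:

.. code-block:: python

    from torch.distributed.device_mesh import init_device_mesh

    # 2 machines x 4 GPUs: FSDP across machines ("dp" dimension) and
    # tensor parallelism within each machine ("tp" dimension).
    mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

    dp_mesh = mesh_2d["dp"]  # hand this to FSDP
    tp_mesh = mesh_2d["tp"]  # hand this to parallelize_module
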

----


***********
Get started
***********

.. raw:: html

    <div class="display-card-container">
    <div class="row">

.. displayitem::
   :header: Fully-Sharded Data Parallel (FSDP)
   :description: Get started training large multi-billion parameter models with minimal code changes
   :col_css: col-md-4
   :button_link: fsdp.html
   :height: 180
   :tag: advanced

.. displayitem::
   :header: Tensor Parallel (TP)
   :description: Learn the principles behind tensor parallelism and how to apply it to your model
   :col_css: col-md-4
   :button_link: tp.html
   :height: 180
   :tag: advanced

.. displayitem::
   :header: 2D Parallel (FSDP + TP)
   :description: Combine Tensor Parallelism with FSDP (2D Parallel) to train efficiently on 100s of GPUs
   :col_css: col-md-4
   :button_link: tp_fsdp.html
   :height: 180
   :tag: advanced

.. displayitem::
   :header: Pipeline Parallelism
   :description: Coming soon
   :col_css: col-md-4
   :height: 180
   :tag: advanced

.. raw:: html

    </div>
    </div>


----


*********************
Parallelisms compared
*********************


**Distributed Data Parallel (DDP)**

.. raw:: html

    <ul class="no-bullets">
        <li>✅ No model code changes required</li>
        <li>✅ Training with very large batch sizes (batch size scales with number of GPUs)</li>
        <li>❗ Model (weights, optimizer state, activations / gradients) must fit into a GPU</li>
    </ul>

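
For reference, enabling DDP in PyTorch Lightning indeed requires no model code changes; a minimal sketch of the baseline that the other methods are compared against:

.. code-block:: python

    import lightning as L

    # Plain data parallelism: every GPU holds a full replica of the model,
    # so the whole model must fit into a single GPU's memory.
    trainer = L.Trainer(accelerator="gpu", devices=8, strategy="ddp")
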

|

**Fully-Sharded Data Parallel (FSDP)**

.. raw:: html

    <ul class="no-bullets">
        <li>✅ No model code changes required</li>
        <li>✅ Training with very large batch sizes (batch size scales with number of GPUs)</li>
        <li>✅ Model (weights, optimizer state, gradients) gets distributed across all GPUs</li>
        <li>❗ A single FSDP layer, when gathered during forward/backward, must fit into the GPU</li>
        <li>❗ Requires some knowledge about model architecture to set configuration options correctly</li>
        <li>❗ Requires very fast networking (multi-node); otherwise, data transfers between GPUs become a bottleneck</li>
    </ul>

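
The configuration knowledge mentioned above mostly concerns which submodules FSDP should wrap and shard as a unit. A sketch of what this can look like with Lightning's ``FSDPStrategy`` (``TransformerBlock`` is a hypothetical placeholder for the repeated layer class in your model, and the exact options may vary between versions):

.. code-block:: python

    import lightning as L
    from lightning.pytorch.strategies import FSDPStrategy

    # Hypothetical placeholder: the block class that repeats throughout your model.
    from my_model import TransformerBlock

    strategy = FSDPStrategy(
        # Wrap and shard each TransformerBlock individually, so only one block
        # at a time needs to be gathered in full during forward/backward.
        auto_wrap_policy={TransformerBlock},
    )
    trainer = L.Trainer(accelerator="gpu", devices=8, strategy=strategy)
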

|

**Tensor Parallel (TP)**

.. raw:: html

    <ul class="no-bullets">
        <li>❗ Model code changes required</li>
        <li>🤔 Fixed global batch size (does not scale with number of GPUs)</li>
        <li>✅ Model (weights, optimizer state, activations) gets distributed across all GPUs</li>
        <li>✅ Parallelizes the computation of layers that are too large to fit onto a single GPU</li>
        <li>❗ Requires lots of knowledge about model architecture to set configuration options correctly</li>
        <li>🤔 Fewer GPU data transfers required, but the transfers don't overlap with computation as they do in FSDP</li>
    </ul>

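
To make the batch-size point concrete (illustrative numbers only): in the data-parallel styles every GPU processes its own micro-batch, so the global batch grows with the number of GPUs, whereas in pure tensor parallelism all GPUs cooperate on one and the same batch.

.. code-block:: python

    num_gpus = 8
    per_gpu_batch_size = 4

    # DDP/FSDP: each GPU gets its own micro-batch -> global batch scales with num_gpus.
    global_batch_data_parallel = per_gpu_batch_size * num_gpus  # 32

    # Pure TP: all GPUs jointly compute the same batch -> global batch stays fixed.
    global_batch_tensor_parallel = per_gpu_batch_size  # 4
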

|

**2D Parallel (FSDP + TP)**

.. raw:: html

    <ul class="no-bullets">
        <li>❗ Model code changes required</li>
        <li>✅ Training with very large batch sizes (batch size scales across data-parallel dimension)</li>
        <li>✅ Model (weights, optimizer state, activations) gets distributed across all GPUs</li>
        <li>✅ Parallelizes the computation of layers that are too large to fit onto a single GPU</li>
        <li>❗ Requires lots of knowledge about model architecture to set configuration options correctly</li>
        <li>✅ Tensor-parallel within machines and FSDP across machines reduces data transfer bottlenecks</li>
    </ul>

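
In recent Lightning releases this 2D layout is configured through a dedicated strategy; the sketch below assumes the ``ModelParallelStrategy`` with ``data_parallel_size``/``tensor_parallel_size`` arguments as described in the 2D Parallel guide linked above, and the exact names may differ in your version. The LightningModule additionally defines how its layers are parallelized in ``configure_model()``; see the guide for a full example.

.. code-block:: python

    import lightning as L
    from lightning.pytorch.strategies import ModelParallelStrategy

    # 16 GPUs arranged as 4 (data-parallel, across machines) x 4 (tensor-parallel, within a machine).
    strategy = ModelParallelStrategy(
        data_parallel_size=4,
        tensor_parallel_size=4,
    )
    trainer = L.Trainer(accelerator="gpu", devices=4, num_nodes=4, strategy=strategy)
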

|

PyTorch Lightning natively supports all of the parallelisms mentioned above through PyTorch; pipeline parallelism (PP) is not yet supported.

|