
Commit c09356d

(10/10) Support 2D Parallelism - Port Fabric docs to PL (#19899)
1 parent 7874cd0 commit c09356d

14 files changed: +795, -12 lines

docs/source-fabric/advanced/model_parallel/tp_fsdp.rst

Lines changed: 1 addition & 2 deletions
@@ -21,7 +21,6 @@ We will start off with the same feed forward example model as in the :doc:`Tenso

.. code-block:: python

-    import torch
     import torch.nn as nn
     import torch.nn.functional as F

@@ -164,7 +163,7 @@ Finally, the tensor parallelism will apply to each group, splitting the sharded
    model = fabric.setup(model)

    # Define the optimizer
-   optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, foreach=True)
+   optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)
    optimizer = fabric.setup_optimizers(optimizer)

    # Define dataset/dataloader

docs/source-fabric/glossary/index.rst

Lines changed: 5 additions & 0 deletions
@@ -19,6 +19,11 @@ Glossary
    <div class="display-card-container">
        <div class="row">

+.. displayitem::
+   :header: 2D Parallelism
+   :button_link: ../advanced/model_parallel/tp_fsdp.html
+   :col_css: col-md-4
+
 .. displayitem::
    :header: Accelerator
    :button_link: ../fundamentals/accelerators.html

docs/source-pytorch/_static/main.css

Lines changed: 10 additions & 0 deletions
@@ -1,3 +1,13 @@
col {
    width: 50% !important;
}
+
+ul.no-bullets {
+    list-style-type: none; /* Remove default bullets */
+    padding-left: 0; /* Remove default padding */
+}
+
+ul.no-bullets li {
+    padding-left: 0.5em;
+    text-indent: -2em;
+}

docs/source-pytorch/accelerators/gpu_advanced.rst

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ For experts pushing the state-of-the-art in model development, Lightning offers
   :header: Train models with billions of parameters
   :description:
   :col_css: col-md-4
-  :button_link: ../advanced/model_parallel.html
+  :button_link: ../advanced/model_parallel/index.html
   :height: 150
   :tag: advanced
docs/source-pytorch/advanced/model_parallel/fsdp.rst

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ The memory consumption for training is generally made up of
|

When the sum of these memory components exceeds the VRAM of a single GPU, regular data-parallel training (DDP) can no longer be employed.
-One of the methods that can alleviate this limitation is called **model-parallel** training, and known as **FSDP** in PyTorch, and in this guide, you will learn how to effectively scale large models with it.
+One of the methods that can alleviate this limitation is called **Fully Sharded Data Parallel (FSDP)**, and in this guide, you will learn how to effectively scale large models with it.


----
docs/source-pytorch/advanced/model_parallel/index.rst

Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
###########################################
Training models with billions of parameters
###########################################

Today, large models with billions of parameters are trained with many GPUs across several machines in parallel.
Even a single H100 GPU with 80 GB of VRAM (one of the biggest today) is not enough to train just a 30B parameter model (even with batch size 1 and 16-bit precision).
The memory consumption for training is generally made up of

1. the model parameters,
2. the layer activations (forward),
3. the gradients (backward),
4. the optimizer states (e.g., Adam has two additional exponential averages per parameter), and
5. the model outputs and loss.

|

When the sum of these memory components exceeds the VRAM of a single GPU, regular data-parallel training (DDP) can no longer be employed.
To alleviate this limitation, we need to introduce **Model Parallelism**.
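
To make these numbers concrete, here is a rough, illustrative estimate (a minimal sketch; the 30B parameter count, byte sizes, and precision choices are assumptions, and activations are ignored entirely):

.. code-block:: python

    # Back-of-the-envelope training memory for a 30B-parameter model.
    # Assumes 16-bit weights and gradients and two FP32 Adam moments per
    # parameter; activations, outputs, and loss are not counted, so the
    # real footprint is even larger.
    num_params = 30e9

    weights_gb = num_params * 2 / 1e9       # 16-bit parameters
    grads_gb = num_params * 2 / 1e9         # 16-bit gradients
    adam_gb = num_params * 2 * 4 / 1e9      # two FP32 states per parameter

    total_gb = weights_gb + grads_gb + adam_gb
    print(f"~{total_gb:.0f} GB before activations")  # ~360 GB, far beyond a single 80 GB H100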


----


**************************
What is Model Parallelism?
**************************

There are different types of model parallelism, each with its own trade-offs.

**Fully Sharded Data Parallelism (FSDP)** shards both model parameters and optimizer states across multiple GPUs, significantly reducing memory usage per GPU.
This method, while highly memory-efficient, involves frequent synchronization between GPUs, introducing communication overhead and complexity in implementation.
FSDP is advantageous when memory constraints are the primary issue, provided there are high-bandwidth interconnects to minimize latency.
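
To give a feel for the workflow (a minimal sketch with a placeholder ``LitModel`` and illustrative device counts; see the FSDP guide linked below for real configuration options), enabling FSDP in PyTorch Lightning is a matter of selecting the strategy on the Trainer:

.. code-block:: python

    import lightning as L
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.utils.data import DataLoader, TensorDataset


    class LitModel(L.LightningModule):
        """Placeholder module; any LightningModule works unchanged with FSDP."""

        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return F.cross_entropy(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.AdamW(self.parameters(), lr=3e-3)


    if __name__ == "__main__":
        dataset = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
        # strategy="fsdp" selects Lightning's FSDPStrategy with default settings
        trainer = L.Trainer(accelerator="cuda", devices=8, strategy="fsdp", max_epochs=1)
        trainer.fit(LitModel(), DataLoader(dataset, batch_size=8))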

**Tensor Parallelism (TP)** splits individual tensors across GPUs, enabling fine-grained distribution of computation and memory.
It scales well to a large number of GPUs but requires synchronization of tensor slices after each operation, which adds communication overhead.
TP is most effective with models that have many linear layers (LLMs), offering a balance between memory distribution and computational efficiency.
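
As an illustration of the idea (a minimal sketch using the raw PyTorch tensor-parallel API that Lightning builds on; the layer names, mesh shape, and sizes are made up for this example, and the Lightning-native integration is covered in the TP guide linked below):

.. code-block:: python

    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module


    class FeedForward(nn.Module):
        """Toy two-layer MLP; the plan below refers to its submodule names."""

        def __init__(self, dim, hidden_dim):
            super().__init__()
            self.w1 = nn.Linear(dim, hidden_dim)
            self.w2 = nn.Linear(hidden_dim, dim)

        def forward(self, x):
            return self.w2(self.w1(x).relu())


    # One mesh dimension spanning 8 GPUs (run under torchrun so each rank has a GPU)
    tp_mesh = init_device_mesh("cuda", (8,))

    # Shard w1 column-wise and w2 row-wise so the large hidden activation stays split
    plan = {"w1": ColwiseParallel(), "w2": RowwiseParallel()}
    model = parallelize_module(FeedForward(8192, 32768), tp_mesh, plan)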

**Pipeline Parallelism (PP)** divides model layers into segments, each processed by different GPUs, reducing memory load per GPU and minimizing inter-GPU communication to pipeline stage boundaries.
While this reduces communication overhead, it can introduce pipeline bubbles where some GPUs idle, leading to potential inefficiencies.
PP is ideal for deep models with sequential architectures (LLMs), though it requires careful management to minimize idle times.

Choosing a model parallelism style involves considering model architecture, hardware interconnects, and training efficiency.
In practice, hybrid approaches combining FSDP, TP, and PP are often used to leverage the strengths of each method while mitigating their weaknesses.
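
For example, a 2D (FSDP + TP) layout can be requested through the ``ModelParallelStrategy``; a minimal sketch, assuming 8 GPUs split into 2 data-parallel groups of 4 tensor-parallel ranks (the model must additionally define its tensor-parallel plan in ``configure_model``, as shown in the 2D Parallel guide linked below):

.. code-block:: python

    import lightning as L
    from lightning.pytorch.strategies import ModelParallelStrategy

    # 8 GPUs arranged as 2 FSDP (data-parallel) groups x 4 tensor-parallel ranks
    strategy = ModelParallelStrategy(data_parallel_size=2, tensor_parallel_size=4)
    trainer = L.Trainer(accelerator="cuda", devices=8, strategy=strategy)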


----


***********
Get started
***********

.. raw:: html

    <div class="display-card-container">
        <div class="row">

.. displayitem::
   :header: Fully-Sharded Data Parallel (FSDP)
   :description: Get started training large multi-billion parameter models with minimal code changes
   :col_css: col-md-4
   :button_link: fsdp.html
   :height: 180
   :tag: advanced

.. displayitem::
   :header: Tensor Parallel (TP)
   :description: Learn the principles behind tensor parallelism and how to apply it to your model
   :col_css: col-md-4
   :button_link: tp.html
   :height: 180
   :tag: advanced

.. displayitem::
   :header: 2D Parallel (FSDP + TP)
   :description: Combine Tensor Parallelism with FSDP (2D Parallel) to train efficiently on 100s of GPUs
   :col_css: col-md-4
   :button_link: tp_fsdp.html
   :height: 180
   :tag: advanced

.. displayitem::
   :header: Pipeline Parallelism
   :description: Coming soon
   :col_css: col-md-4
   :height: 180
   :tag: advanced

.. raw:: html

        </div>
    </div>


----


*********************
Parallelisms compared
*********************

**Distributed Data Parallel (DDP)**

.. raw:: html

    <ul class="no-bullets">
        <li>✅ &nbsp; No model code changes required</li>
        <li>✅ &nbsp; Training with very large batch sizes (batch size scales with number of GPUs)</li>
        <li>❗ &nbsp; Model (weights, optimizer state, activations / gradients) must fit into a GPU</li>
    </ul>

|

**Fully-Sharded Data Parallel (FSDP)**

.. raw:: html

    <ul class="no-bullets">
        <li>✅ &nbsp; No model code changes required</li>
        <li>✅ &nbsp; Training with very large batch sizes (batch size scales with number of GPUs)</li>
        <li>✅ &nbsp; Model (weights, optimizer state, gradients) gets distributed across all GPUs</li>
        <li>❗ &nbsp; A single FSDP layer, when gathered during forward/backward, must fit into the GPU</li>
        <li>❗ &nbsp; Requires some knowledge about model architecture to set configuration options correctly</li>
        <li>❗ &nbsp; Requires very fast networking (multi-node); data transfers between GPUs often become a bottleneck</li>
    </ul>

|

**Tensor Parallel (TP)**

.. raw:: html

    <ul class="no-bullets">
        <li>❗ &nbsp; Model code changes required</li>
        <li>🤔 &nbsp; Fixed global batch size (does not scale with number of GPUs; see the worked example below)</li>
        <li>✅ &nbsp; Model (weights, optimizer state, activations) gets distributed across all GPUs</li>
        <li>✅ &nbsp; Parallelizes the computation of layers that are too large to fit onto a single GPU</li>
        <li>❗ &nbsp; Requires lots of knowledge about model architecture to set configuration options correctly</li>
        <li>🤔 &nbsp; Fewer GPU data transfers required, but the transfers don't overlap with computation like in FSDP</li>
    </ul>

|

**2D Parallel (FSDP + TP)**

.. raw:: html

    <ul class="no-bullets">
        <li>❗ &nbsp; Model code changes required</li>
        <li>✅ &nbsp; Training with very large batch sizes (batch size scales across the data-parallel dimension)</li>
        <li>✅ &nbsp; Model (weights, optimizer state, activations) gets distributed across all GPUs</li>
        <li>✅ &nbsp; Parallelizes the computation of layers that are too large to fit onto a single GPU</li>
        <li>❗ &nbsp; Requires lots of knowledge about model architecture to set configuration options correctly</li>
        <li>✅ &nbsp; Combining tensor parallelism within machines with FSDP across machines reduces data transfer bottlenecks</li>
    </ul>

|

PyTorch Lightning natively supports all of the parallelisms above through PyTorch, with the exception of pipeline parallelism (PP), which is not yet available.
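
To make the batch-size remarks in the comparison concrete, here is a small worked example (the micro-batch size and GPU counts are made up): the global batch size is the per-GPU micro-batch size multiplied by the number of data-parallel replicas, so TP alone does not increase it.

.. code-block:: python

    # Global batch size = per-GPU micro-batch size x number of data-parallel replicas.
    micro_batch_size = 4

    # DDP / FSDP on 32 GPUs: every GPU is a data-parallel replica
    print(micro_batch_size * 32)         # 128

    # 2D parallel on 32 GPUs with tensor-parallel size 8: 4 data-parallel groups
    print(micro_batch_size * (32 // 8))  # 16

    # TP alone: a single data-parallel group, so the global batch size stays fixed
    print(micro_batch_size * 1)          # 4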

|
