From fd280110ab8e2253879045efbf3db7fdee4057c6 Mon Sep 17 00:00:00 2001
From: Anton Alyakin
Date: Wed, 7 Aug 2024 02:21:30 -0400
Subject: [PATCH 1/4] fixed the init_module and deepspeed docs

---
 docs/source-fabric/advanced/model_init.rst | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/docs/source-fabric/advanced/model_init.rst b/docs/source-fabric/advanced/model_init.rst
index f5f76e8aa087b..61dbf00e28fd5 100644
--- a/docs/source-fabric/advanced/model_init.rst
+++ b/docs/source-fabric/advanced/model_init.rst
@@ -69,7 +69,7 @@ When training distributed models with :doc:`FSDP/TP ` or D
 
 .. code-block:: python
 
-    # Recommended for FSDP, TP and DeepSpeed
+    # Recommended for FSDP and TP
     with fabric.init_module(empty_init=True):
         model = GPT3()  # parameters are placed on the meta-device
 
@@ -79,6 +79,18 @@
     optimizer = torch.optim.Adam(model.parameters())
     optimizer = fabric.setup_optimizers(optimizer)
 
+With DeepSpeed Stage 3, the use of :meth:`~lightning.fabric.fabric.Fabric.init_module` context manager is necessesary for the model to be sharded correctly instead of attempted to be put on the GPU in its entirety. Deepspeed, however, requires the models and optimizer to be set up jointly.
+
+.. code-block:: python
+
+    # Required with DeepSpeed Stage 3
+    with fabric.init_module(empty_init=True):
+        model = GPT3()
+
+    optimizer = torch.optim.Adam(model.parameters())
+    model, optimizer = fabric.setup(model, optimizer)
+
+
 .. note::
    Empty-init is experimental and the behavior may change in the future.
    For distributed models, it is required that all user-defined modules that manage parameters implement a ``reset_parameters()`` method (all PyTorch built-in modules have this too).

From fe53b0c0ae8faa55015f680afe3c3178fe19c0dc Mon Sep 17 00:00:00 2001
From: Anton Alyakin
Date: Wed, 7 Aug 2024 02:53:38 -0400
Subject: [PATCH 2/4] extra newline removal

---
 docs/source-fabric/advanced/model_init.rst | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/source-fabric/advanced/model_init.rst b/docs/source-fabric/advanced/model_init.rst
index 61dbf00e28fd5..ce0ffe5c92e8a 100644
--- a/docs/source-fabric/advanced/model_init.rst
+++ b/docs/source-fabric/advanced/model_init.rst
@@ -90,7 +90,6 @@ With DeepSpeed Stage 3, the use of :meth:`~lightning.fabric.fabric.Fabric.init_m
     optimizer = torch.optim.Adam(model.parameters())
     model, optimizer = fabric.setup(model, optimizer)
 
-
 .. note::
    Empty-init is experimental and the behavior may change in the future.
    For distributed models, it is required that all user-defined modules that manage parameters implement a ``reset_parameters()`` method (all PyTorch built-in modules have this too).
From 30e414bf3c916e7093fd7b58d0c38f0dd2716a69 Mon Sep 17 00:00:00 2001
From: Anton Alyakin
Date: Wed, 7 Aug 2024 02:56:52 -0400
Subject: [PATCH 3/4] module_init sentence phrasing

---
 docs/source-fabric/advanced/model_init.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source-fabric/advanced/model_init.rst b/docs/source-fabric/advanced/model_init.rst
index ce0ffe5c92e8a..3cc718eec1da6 100644
--- a/docs/source-fabric/advanced/model_init.rst
+++ b/docs/source-fabric/advanced/model_init.rst
@@ -79,7 +79,7 @@ When training distributed models with :doc:`FSDP/TP ` or D
     optimizer = torch.optim.Adam(model.parameters())
     optimizer = fabric.setup_optimizers(optimizer)
 
-With DeepSpeed Stage 3, the use of :meth:`~lightning.fabric.fabric.Fabric.init_module` context manager is necessesary for the model to be sharded correctly instead of attempted to be put on the GPU in its entirety. Deepspeed, however, requires the models and optimizer to be set up jointly.
+With DeepSpeed Stage 3, the use of :meth:`~lightning.fabric.fabric.Fabric.init_module` context manager is necessesary for the model to be sharded correctly instead of attempted to be put on the GPU in its entirety. Deepspeed requires the models and optimizer to be set up jointly.
 
 .. code-block:: python
 

From f0562025045eb8f2706237691fb64419a23896b4 Mon Sep 17 00:00:00 2001
From: Jirka B
Date: Thu, 3 Apr 2025 12:53:58 -0400
Subject: [PATCH 4/4] typo

---
 docs/source-fabric/advanced/model_init.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source-fabric/advanced/model_init.rst b/docs/source-fabric/advanced/model_init.rst
index 3cc718eec1da6..f7e11f2dc4210 100644
--- a/docs/source-fabric/advanced/model_init.rst
+++ b/docs/source-fabric/advanced/model_init.rst
@@ -79,7 +79,7 @@ When training distributed models with :doc:`FSDP/TP ` or D
     optimizer = torch.optim.Adam(model.parameters())
     optimizer = fabric.setup_optimizers(optimizer)
 
-With DeepSpeed Stage 3, the use of :meth:`~lightning.fabric.fabric.Fabric.init_module` context manager is necessesary for the model to be sharded correctly instead of attempted to be put on the GPU in its entirety. Deepspeed requires the models and optimizer to be set up jointly.
+With DeepSpeed Stage 3, the use of :meth:`~lightning.fabric.fabric.Fabric.init_module` context manager is necessary for the model to be sharded correctly instead of attempted to be put on the GPU in its entirety. Deepspeed requires the models and optimizer to be set up jointly.
 
 .. code-block:: python