Commit de5cae4

state_dict support
1 parent 702cd5e commit de5cae4

File tree: 4 files changed, +57 -7 lines changed

  docs/source/FAQ.rst
  pyproject.toml
  src/torchzero/core/module.py
  src/torchzero/optim/modular.py

docs/source/FAQ.rst (+24 -6)

@@ -225,23 +225,41 @@ There is also a :py:class:`tz.m.WrapClosure<torchzero.modules.WrapClosure>` for
 
 How to save/serialize a modular optimizer?
 ============================================
-TODO
+Please refer to the pytorch docs: https://pytorch.org/tutorials/beginner/saving_loading_models.html.
+
+Like pytorch optimizers, torchzero modular optimizers and modules support :code:`opt.state_dict()` and :code:`opt.load_state_dict()`, which save and load the state dicts of all modules, including nested ones.
+
+So you can use the standard code for saving and loading:
+
+.. code:: python
+
+    torch.save({
+        'model_state_dict': model.state_dict(),
+        'optimizer_state_dict': optimizer.state_dict(),
+        ...
+    }, PATH)
+
+    model = TheModelClass(*args, **kwargs)
+    optimizer = tz.Modular(model.parameters(), *modules)
+
+    checkpoint = torch.load(PATH, weights_only=True)
+    model.load_state_dict(checkpoint['model_state_dict'])
+    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
+
 
 How much overhead does a torchzero modular optimizer have compared to a normal optimizer?
 ==========================================================================================
-A thorough benchmark will be posted to this section very soon. There is no overhead other than what is described below.
-
-Since some optimizers, like Adam, have learning rate baked into the update rule, but we use LR module instead, that requires an extra add operation. Currently if :code:`tz.m.Adam` or :code:`tz.m.Wrap` are directly followed by a :code:`tz.m.LR`, they will be automatically fused (:code:`Wrap` fuses only when wrapped optimizer has an :code:`lr` parameter). However adding LR fusing to all modules with a learning rate is not a priority.
+Some optimizers, like Adam, have the learning rate baked into the update rule, but we use an LR module instead, which requires an extra add operation. Currently, if :code:`tz.m.Adam` or :code:`tz.m.Wrap` is directly followed by a :code:`tz.m.LR`, they are automatically fused to mitigate that (:code:`Wrap` fuses only when the wrapped optimizer has an :code:`lr` parameter). However, adding LR fusing to all modules with a learning rate is not a priority; from what I can tell, this overhead is negligible.
 
 Whenever possible I used `_foreach_xxx <https://pytorch.org/docs/stable/torch.html#foreach-operations>`_ operations. Those operate on all parameters at once instead of using slow python for-loops. This makes the optimizers way quicker, especially with a lot of different parameter tensors. Also, all modules change the update in-place whenever possible.
 
 Is there support for complex-valued parameters?
 =================================================
-Currently no, as I have not made the modules with complex-valued parameters in mind, although some might still work. I do use complex-valued networks so I am looking into adding support. There may actually be a way to support them automatically.
+:code:`tz.m.ViewAsReal()` and :code:`tz.m.ViewAsComplex()` modules will be added soon. This will also make it possible to use custom pytorch optimizers with complex networks (via :code:`tz.m.Wrap`), even if they don't support complex parameters natively.
 
 Is there support for optimized parameters being on different devices?
 ======================================================================
-TODO
+Maybe; I need to test this.
 
 Is there support for FSDP (FullyShardedDataParallel)?
 ======================================================
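An aside on the overhead answer above: the speedup from the `_foreach_xxx <https://pytorch.org/docs/stable/torch.html#foreach-operations>`_ operations it mentions comes from replacing a per-tensor python loop with a single call over the whole list of parameters. A minimal standalone sketch (plain PyTorch, not torchzero code; the tensors here are placeholders):

.. code:: python

    import torch

    params = [torch.zeros(3), torch.zeros(5)]
    update = [torch.ones(3), torch.ones(5)]

    # python loop: one in-place op (and one dispatch) per parameter tensor
    for p, u in zip(params, update):
        p.add_(u, alpha=-0.1)

    # foreach variant: one call applies the op to the whole list, in-place
    torch._foreach_add_(params, update, alpha=-0.1)

With many small parameter tensors, the single foreach call avoids most of the per-tensor python overhead, which is where the speedup claimed in the FAQ comes from.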

pyproject.toml (+1 -1)

@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"
 name = "torchzero"
 description = "Modular optimization library for PyTorch."
 
-version = "0.1.7"
+version = "0.1.8"
 dependencies = [
     "torch",
     "numpy",

src/torchzero/core/module.py (+16)

@@ -212,6 +212,22 @@ def __repr__(self):
         if self._initialized: return super().__repr__()
         return f"uninitialized {self.__class__.__name__}()"
 
+    def state_dict(self):
+        state_dict = {}
+        state_dict['__self__'] = super().state_dict()
+        for k, v in self.children.items():
+            state_dict[k] = v.state_dict()
+        return state_dict
+
+    def load_state_dict(self, state_dict: dict[str, Any]) -> None:
+        super().load_state_dict(state_dict['__self__'])
+        for k, v in self.children.items():
+            if k in state_dict:
+                v.load_state_dict(state_dict[k])
+            else:
+                warnings.warn(f"Tried to load state dict for {k}: {v.__class__.__name__}, but it is not present in state_dict with {list(state_dict.keys()) = }")
+
+
     def set_params(self, params: ParamsT):
         """
         Set parameters to this module. Use this to set per-parameter group settings.
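For orientation, a rough sketch of the nested layout `OptimizerModule.state_dict` produces, assuming the base class returns a standard torch.optim-style dict; the child name 'child' and the inner dict contents are illustrative placeholders:

.. code:: python

    # Illustrative only -- the inner dicts and the child name are placeholders.
    state = {
        '__self__': {'state': {}, 'param_groups': []},      # this module's own state dict
        'child': {                                          # one entry per key in self.children,
            '__self__': {'state': {}, 'param_groups': []},  # nested recursively
        },
    }

    # load_state_dict restores '__self__' first, then walks self.children and
    # warns about any child whose key is missing from the loaded dict.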

src/torchzero/optim/modular.py (+16)

@@ -2,6 +2,7 @@
 import warnings
 from inspect import cleandoc
 import torch
+from typing import Any
 
 from ..core import OptimizerModule, TensorListOptimizer, OptimizationVars, _Chain, _Chainable
 from ..utils.python_tools import flatten
@@ -67,6 +68,21 @@ def __init__(self, params, *modules: _Chainable):
             for hook in module.post_init_hooks:
                 hook(self, module)
 
+    def state_dict(self):
+        state_dict = {}
+        state_dict['__self__'] = super().state_dict()
+        for i, v in enumerate(self.unrolled_modules):
+            state_dict[str(i)] = v.state_dict()
+        return state_dict
+
+    def load_state_dict(self, state_dict: dict[str, Any]) -> None:
+        super().load_state_dict(state_dict['__self__'])
+        for i, v in enumerate(self.unrolled_modules):
+            if str(i) in state_dict:
+                v.load_state_dict(state_dict[str(i)])
+            else:
+                warnings.warn(f"Tried to load state dict for {i}th module: {v.__class__.__name__}, but it is not present in state_dict with {list(state_dict.keys()) = }")
+
     def get_lr_module(self, last=True) -> OptimizerModule:
         """
         Retrieves the module in the chain that controls the learning rate.
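A small usage sketch of the new `Modular.state_dict` / `load_state_dict` round trip. The module chain, filename and constructor arguments below are hypothetical placeholders; the key layout ('__self__' plus one string index per unrolled module) follows from the code above:

.. code:: python

    import torch
    import torch.nn as nn
    import torchzero as tz

    model = nn.Linear(4, 2)
    # hypothetical chain; any chain of modules works the same way
    opt = tz.Modular(model.parameters(), tz.m.Adam(), tz.m.LR(1e-2))

    sd = opt.state_dict()          # keys: '__self__' plus '0', '1', ... per unrolled module
    torch.save(sd, 'opt.pt')

    # load back into an optimizer constructed with the same module chain
    opt2 = tz.Modular(model.parameters(), tz.m.Adam(), tz.m.LR(1e-2))
    opt2.load_state_dict(torch.load('opt.pt', weights_only=True))

Since entries are keyed by position in `unrolled_modules`, the optimizer should be rebuilt with the same module chain before loading; a missing entry only triggers the warning shown in `load_state_dict` above.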
