Description & Motivation
PyTorch Lightning’s current async checkpointing implementation predates PyTorch’s Distributed Checkpoint (DCP) API and now duplicates functionality that is maintained upstream.
This issue proposes evaluating and migrating Lightning’s async checkpoint logic to leverage torch.distributed.checkpoint (DCP), specifically async_save, to:
- Align with upstream PyTorch checkpointing APIs
- Improve robustness and maintainability
- Better support distributed and sharded training setups
- Reduce custom logic that duplicates upstream functionality
Pitch
Use PyTorch DCP's async_save
Alternatives
No response
Additional context
https://docs.pytorch.org/docs/stable/distributed.checkpoint.html#distributed-checkpoint-torch-distributed-checkpoint
cc @lantiga