Description
When using the fused_uvm_caching kernel, saving a distributed checkpoint can fail with a CUDA error caused by host out-of-memory (OOM). The distributed checkpoint writer currently copies the entire UVM tensor to the CPU in a single step, which is highly memory-intensive: https://github.com/pytorch/pytorch/blob/v2.8.0/torch/distributed/checkpoint/filesystem.py#L184. Would it be possible to implement a block-by-block copying strategy to reduce peak memory overhead? A sketch of such a strategy is included after the traceback below.
[rank3]: Traceback (most recent call last): (RANK 14)
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 239, in all_reduce
[rank3]: local_data = map_fun()
[rank3]: ^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/logger.py", line 87, in wrapper
[rank3]: result = func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 368, in write_data
[rank3]: all_writes = storage_writer.write_data(final_local_plan, planner)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/filesystem.py", line 677, in write_data
[rank3]: return self._write_data(planner, file_queue)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/filesystem.py", line 705, in _write_data
[rank3]: _write_files_from_queue(
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/filesystem.py", line 428, in _write_files_from_queue
[rank3]: for tensor, write_item in loader.values():
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/filesystem.py", line 223, in values
[rank3]: self._refill()
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/filesystem.py", line 184, in _refill
[rank3]: tensor = tensor.to(device="cpu", non_blocking=True)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.AcceleratorError: CUDA error: invalid argument
[rank3]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank3]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank3]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
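For reference, here is a minimal sketch of the block-by-block copy being proposed, assuming a plain chunked copy_ loop. The function name copy_to_cpu_in_blocks and the block_numel parameter are hypothetical illustrations, not part of torch.distributed.checkpoint:

```python
import torch

def copy_to_cpu_in_blocks(tensor: torch.Tensor, block_numel: int = 1 << 24) -> torch.Tensor:
    # Hypothetical helper: stage a (possibly UVM-backed) device tensor to CPU
    # in fixed-size blocks instead of one large tensor.to("cpu") transfer.
    flat = tensor.reshape(-1)
    out = torch.empty(flat.numel(), dtype=flat.dtype, device="cpu")
    for start in range(0, flat.numel(), block_numel):
        end = min(start + block_numel, flat.numel())
        # Copy one block at a time, bounding the peak staging/transfer size.
        out[start:end].copy_(flat[start:end])
    return out.reshape(tensor.shape)
```

Something along these lines could replace the single tensor.to(device="cpu", non_blocking=True) call in _refill, so that only one block is in flight at a time.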