Description
When using the fused_uvm_caching kernel, saving a distributed checkpoint can fail with a CUDA error caused by host out-of-memory (OOM). The distributed checkpoint writer currently copies the entire UVM tensor to the CPU in a single step, which is highly memory-intensive: https://github.com/pytorch/pytorch/blob/v2.8.0/torch/distributed/checkpoint/filesystem.py#L184. Would it be possible to implement a block-by-block copying strategy to reduce peak memory overhead? A sketch of such a strategy is included after the traceback below.
[rank3]: Traceback (most recent call last): (RANK 14)
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 239, in all_reduce
[rank3]: local_data = map_fun()
[rank3]: ^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/logger.py", line 87, in wrapper
[rank3]: result = func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 368, in write_data
[rank3]: all_writes = storage_writer.write_data(final_local_plan, planner)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/filesystem.py", line 677, in write_data
[rank3]: return self._write_data(planner, file_queue)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/filesystem.py", line 705, in _write_data
[rank3]: _write_files_from_queue(
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/filesystem.py", line 428, in _write_files_from_queue
[rank3]: for tensor, write_item in loader.values():
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/filesystem.py", line 223, in values
[rank3]: self._refill()
[rank3]: File "/opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/filesystem.py", line 184, in _refill
[rank3]: tensor = tensor.to(device="cpu", non_blocking=True)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.AcceleratorError: CUDA error: invalid argument
[rank3]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank3]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank3]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
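For reference, here is a minimal sketch of the block-by-block copy being proposed, assuming a plain chunked copy_ loop. The function name copy_to_cpu_in_blocks and the block_numel parameter are hypothetical illustrations, not part of torch.distributed.checkpoint:

```python
import torch

def copy_to_cpu_in_blocks(tensor: torch.Tensor, block_numel: int = 1 << 24) -> torch.Tensor:
    # Hypothetical helper: stage a (possibly UVM-backed) device tensor to CPU
    # in fixed-size blocks instead of one large tensor.to("cpu") transfer.
    flat = tensor.reshape(-1)
    out = torch.empty(flat.numel(), dtype=flat.dtype, device="cpu")
    for start in range(0, flat.numel(), block_numel):
        end = min(start + block_numel, flat.numel())
        # Copy one block at a time, bounding the peak staging/transfer size.
        out[start:end].copy_(flat[start:end])
    return out.reshape(tensor.shape)
```

Something along these lines could replace the single tensor.to(device="cpu", non_blocking=True) call in _refill, so that only one block is in flight at a time.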