
Error loading optimizer state with torch.load weights_only=True default in PyTorch 2.6 #3539

Open
@luiz0992

Description

System Info

- `Accelerate` version: 1.6.0
- Platform: Linux-5.10.0-34-cloud-amd64-x86_64-with-glibc2.31
- `accelerate` bash location: /xxx/training/.venv/bin/accelerate
- Python version: 3.12.4
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch SDAA available: False
- PyTorch MUSA available: False
- System RAM: 1842.60 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI-GPU
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Error Message:

  [rank7]: _pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
  [rank7]:        (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
  [rank7]:        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
  [rank7]:        WeightsUnpickler error: Unsupported global: GLOBAL omegaconf.listconfig.ListConfig was not an allowed global by default. Please use `torch.serialization.add_safe_globals([ListConfig])` or the `torch.serialization.safe_globals([ListConfig])` context manager to allowlist this global if you trust this class/function.

Steps to reproduce:

  1. Save a checkpoint with Accelerator.save_state() using an optimizer whose state includes custom (non-tensor) objects, such as omegaconf.listconfig.ListConfig.
  2. Load the checkpoint with Accelerator.load_state() on PyTorch 2.6 or later.

Current behavior:

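load_state() fails with an UnpicklingError. Since PyTorch 2.6, torch.load() defaults to weights_only=True, which restricts unpickling to tensors, primitive types, dictionaries, and types added via torch.serialization.add_safe_globals(), so custom objects such as ListConfig stored in the optimizer state are rejected.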
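The restriction works like an allowlist on the globals a pickle may reference. The idea can be sketched with the standard library's `pickle.Unpickler.find_class`; this is an illustration only, since PyTorch's weights-only unpickler is a separate implementation that also restricts which callables may be invoked. `ALLOWED_GLOBALS`, `AllowlistUnpickler`, and `restricted_loads` are hypothetical names.

```python
import io
import pickle
from collections import UserList

# Hypothetical default allowlist of (module, name) pairs the loader may resolve.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class AllowlistUnpickler(pickle.Unpickler):
    def __init__(self, file, extra_allowed=()):
        super().__init__(file)
        self.allowed = ALLOWED_GLOBALS | set(extra_allowed)

    def find_class(self, module, name):
        # Every class or function referenced by the pickle is resolved here.
        if (module, name) in self.allowed:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"Unsupported global: GLOBAL {module}.{name}")


def restricted_loads(data, extra_allowed=()):
    return AllowlistUnpickler(io.BytesIO(data), extra_allowed).load()


# Plain dicts and floats need no globals; the UserList (stand-in for ListConfig) does.
payload = pickle.dumps({"lr": 1e-3, "betas": UserList([0.9, 0.999])})

try:
    restricted_loads(payload)
    rejected_without_allowlist = False
except pickle.UnpicklingError:
    rejected_without_allowlist = True

# Allowlisting the one class lets the load succeed.
state = restricted_loads(payload, extra_allowed={("collections", "UserList")})
```

Passing `extra_allowed` plays roughly the role that `torch.serialization.safe_globals([...])` plays for `torch.load`.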

Expected behavior

The load_state() method should provide a way to pass custom parameters to the underlying torch.load() calls, specifically to set weights_only=False or to add safe globals when needed.

Proposed solution

Add parameters to the load_state() method that allow passing keyword arguments to torch.load() for each component:

  • optimizer_load_kwargs: passed to the torch.load() call that loads the optimizer state
  • scheduler_load_kwargs: passed to the torch.load() call that loads the scheduler state
  • dataloader_load_kwargs: passed to the torch.load() call that loads the dataloader state
