SpargeAttention error #137

torchrun --nproc_per_node=4 ./test/test_hybrid_attn.py --sp_ulysses_degree 4 --attn_impl sparse_sage --tune_mode


attn_processor is an instance of SparseAttentionMeansim, but it is empty now.
attn_processor.is_sparse is a substate_dict of attn_processor, we will load it.
attn_processor is an instance of SparseAttentionMeansim, but it is empty now.
attn_processor.is_sparse is a substate_dict of attn_processor, we will load it.
attn_processor is an instance of SparseAttentionMeansim, but it is empty now.
attn_processor.is_sparse is a substate_dict of attn_processor, we will load it.
attn_processor is an instance of SparseAttentionMeansim, but it is empty now.
attn_processor.is_sparse is a substate_dict of attn_processor, we will load it.
[rank2]: Traceback (most recent call last):
[rank2]:   File "/file_system/fjr/code/long-context-attention/./test/test_hybrid_attn.py", line 172, in <module>
[rank2]:     load_sparse_attention_state_dict(usp_attn, saved_state_dict, multigpu=True, verbose=True)
[rank2]:   File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/spas_sage_attn-0.1.0-py3.10-linux-x86_64.egg/spas_sage_attn/autotune.py", line 36, in load_sparse_attention_state_dict
[rank2]:     sv= sv.to(device=v.device)
[rank2]:   File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
[rank2]:     raise AttributeError(
[rank2]: AttributeError: 'SparseAttentionMeansim' object has no attribute 'device'
[rank0]: Traceback (most recent call last):
[rank0]:   File "/file_system/fjr/code/long-context-attention/./test/test_hybrid_attn.py", line 172, in <module>
[rank0]:     load_sparse_attention_state_dict(usp_attn, saved_state_dict, multigpu=True, verbose=True)
[rank0]:   File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/spas_sage_attn-0.1.0-py3.10-linux-x86_64.egg/spas_sage_attn/autotune.py", line 36, in load_sparse_attention_state_dict
[rank0]:     sv= sv.to(device=v.device)
[rank0]:   File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
[rank0]:     raise AttributeError(
[rank0]: AttributeError: 'SparseAttentionMeansim' object has no attribute 'device'
[rank1]: Traceback (most recent call last):
[rank1]:   File "/file_system/fjr/code/long-context-attention/./test/test_hybrid_attn.py", line 172, in <module>
[rank1]:     load_sparse_attention_state_dict(usp_attn, saved_state_dict, multigpu=True, verbose=True)
[rank1]:   File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/spas_sage_attn-0.1.0-py3.10-linux-x86_64.egg/spas_sage_attn/autotune.py", line 36, in load_sparse_attention_state_dict
[rank1]:     sv= sv.to(device=v.device)
[rank1]:   File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
[rank1]:     raise AttributeError(
[rank1]: AttributeError: 'SparseAttentionMeansim' object has no attribute 'device'
[rank3]: Traceback (most recent call last):
[rank3]:   File "/file_system/fjr/code/long-context-attention/./test/test_hybrid_attn.py", line 172, in <module>
[rank3]:     load_sparse_attention_state_dict(usp_attn, saved_state_dict, multigpu=True, verbose=True)
[rank3]:   File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/spas_sage_attn-0.1.0-py3.10-linux-x86_64.egg/spas_sage_attn/autotune.py", line 36, in load_sparse_attention_state_dict
[rank3]:     sv= sv.to(device=v.device)
[rank3]:   File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
[rank3]:     raise AttributeError(
[rank3]: AttributeError: 'SparseAttentionMeansim' object has no attribute 'device'
[rank0]:[W407 11:20:12.908119039 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank3]:[W407 11:20:13.636513957 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank2]:[W407 11:20:13.651571147 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank1]:[W407 11:20:13.756881791 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0407 11:20:13.578000 457792 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 457918 closing signal SIGTERM
E0407 11:20:13.742000 457792 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 457917) of binary: /file_system/fjr/miniconda3/envs/xdit/bin/python
Traceback (most recent call last):
  File "/file_system/fjr/miniconda3/envs/xdit/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./test/test_hybrid_attn.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-04-07_11:20:13
  host      : localhost
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 457919)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2025-04-07_11:20:13
  host      : localhost
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 457920)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-04-07_11:20:13
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 457917)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
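
The traceback shows `v.device` being evaluated on a `SparseAttentionMeansim` object and failing inside `torch.nn.Module.__getattr__`, which suggests `v` is a module rather than a tensor at that point. `torch.nn.Module` does not define a `.device` attribute, so the lookup raises. Below is a minimal sketch of that behavior and one possible way to resolve a device for a module. `EmptyProcessor` and `module_device` are hypothetical names used only for illustration; they are not part of spas_sage_attn, and this is not necessarily how the library itself should be fixed.

```python
import itertools

import torch
from torch import nn


class EmptyProcessor(nn.Module):
    """Hypothetical stand-in for a module with no parameters or buffers yet."""


def module_device(module: nn.Module,
                  default: torch.device = torch.device("cpu")) -> torch.device:
    """Return the device of the module's first parameter or buffer.

    Falls back to `default` when the module holds no tensors at all
    (an assumption made for this sketch; a caller may prefer to raise).
    """
    for tensor in itertools.chain(module.parameters(), module.buffers()):
        return tensor.device
    return default


proc = EmptyProcessor()

# nn.Module has no `.device`, so this goes through Module.__getattr__ and raises.
try:
    proc.device
    message = ""
except AttributeError as exc:
    message = str(exc)

print(message)
print(module_device(proc))              # empty module: falls back to the default
print(module_device(nn.Linear(2, 2)))   # device of the first parameter
```

In a multi-GPU run, the fallback would more plausibly be the rank's own CUDA device than the CPU; that choice is left to the caller here.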
