Open
Labels: bug
Description
Describe the bug
ManiSkill3 crashes after env.rollout, when the rollout data is transferred from the GPU to the host (CUDA to CPU):
for _ in tqdm(range(nb_iters), "Evaluation"):
    rollouts = self.eval_env.rollout(
        max_steps=self.env_max_frames_per_traj,
        policy=policy,
        auto_reset=False,
        auto_cast_to_device=False,
        tensordict=tensordict,
    ).to(device="cpu", non_blocking=False)
Traceback (most recent call last):
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
cli.main()
File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
run()
File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
runpy.run_path(target, run_name="__main__")
File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
_run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
exec(code, run_globals)
File "scripts/train_rl.py", line 118, in <module>
main()
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "scripts/train_rl.py", line 107, in main
trainer.train()
File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/trainers/rl_trainer.py", line 90, in train
eval_metrics = self.evaluator.evaluate(
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/evaluation/rl_evaluator.py", line 147, in evaluate
eval_metrics = self.log_eval_metrics(agent, env_step)
File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/evaluation/rl_evaluator.py", line 158, in log_eval_metrics
eval_metrics = self.gather_eval_rollouts_metrics(policy)
File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/evaluation/rl_evaluator.py", line 171, in gather_eval_rollouts_metrics
rollouts = self.eval_env.rollout(
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 10623, in to
tensors = [to(t) for t in tensors]
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 10623, in <listcomp>
tensors = [to(t) for t in tensors]
File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 10595, in to
return tensor.to(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2025-01-30 19:23:26.032] [SAPIEN] [critical] Mem free failed with error code 700!
[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.033] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
CUDA error at /__w/SAPIEN/SAPIEN/3rd_party/sapien-vulkan-2/src/core/buffer.cpp 103: an illegal memory access was encountered
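Since the error message itself warns that CUDA kernel errors are reported asynchronously, the .to() call in the traceback may only be where the fault surfaced, not where it happened. A minimal sketch of how this could be narrowed down, assuming the failing kernel was launched during the rollout: enable blocking launches as the message suggests, and force a synchronization before the copy. The helper name to_cpu_checked and its synchronize parameter are hypothetical and not part of ManiSkill, TorchRL, or tensordict.

```python
import os

# Suggested by the error message itself. It has to be set before CUDA is
# initialised, so this line should run before the first CUDA call
# (safest: before importing torch).
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def to_cpu_checked(data, synchronize):
    """Flush pending GPU work, then copy `data` to the host.

    Hypothetical debugging helper. `data` is anything with a
    `.to(device=..., non_blocking=...)` method (for example the result of
    the rollout), and `synchronize` is a callable that blocks until all
    queued GPU work has finished (for example `torch.cuda.synchronize`).
    """
    try:
        # If an earlier kernel faulted, the error should surface here
        # instead of inside the device-to-host copy.
        synchronize()
    except RuntimeError as err:
        raise RuntimeError(
            "CUDA error surfaced before the device-to-host copy; "
            "the fault is probably in the rollout, not in .to()"
        ) from err
    return data.to(device="cpu", non_blocking=False)
```

In the evaluation loop this would be called as to_cpu_checked(self.eval_env.rollout(...), torch.cuda.synchronize). If the wrapped error is raised, the copy is probably not the culprit and the rollout (simulation step or policy) is the place to look. If the synchronization passes and the copy still fails, the transfer itself is the more likely suspect.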