[BUG] Maniskill3 crashes on D2H transfer after env rollout #2739

@AlexandreBrown

Describe the bug

ManiSkill3 crashes with a CUDA illegal memory access when the output of env.rollout is transferred from device to host (CUDA to CPU).

for _ in tqdm(range(nb_iters), "Evaluation"):
    rollouts = self.eval_env.rollout(
        max_steps=self.env_max_frames_per_traj,
        policy=policy,
        auto_reset=False,
        auto_cast_to_device=False,
        tensordict=tensordict,
    ).to(device="cpu", non_blocking=False)

Traceback (most recent call last):
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
    cli.main()
  File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
    run()
  File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
    return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
  File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
    _run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
  File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
    exec(code, run_globals)
  File "scripts/train_rl.py", line 118, in <module>
    main()
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "scripts/train_rl.py", line 107, in main
    trainer.train()
  File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/trainers/rl_trainer.py", line 90, in train
    eval_metrics = self.evaluator.evaluate(
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/evaluation/rl_evaluator.py", line 147, in evaluate
    eval_metrics = self.log_eval_metrics(agent, env_step)
  File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/evaluation/rl_evaluator.py", line 158, in log_eval_metrics
    eval_metrics = self.gather_eval_rollouts_metrics(policy)
  File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/evaluation/rl_evaluator.py", line 171, in gather_eval_rollouts_metrics
    rollouts = self.eval_env.rollout(
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 10623, in to
    tensors = [to(t) for t in tensors]
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 10623, in <listcomp>
    tensors = [to(t) for t in tensors]
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 10595, in to
    return tensor.to(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[2025-01-30 19:23:26.032] [SAPIEN] [critical] Mem free failed with error code 700!

[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.033] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
CUDA error at /__w/SAPIEN/SAPIEN/3rd_party/sapien-vulkan-2/src/core/buffer.cpp 103: an illegal memory access was encountered
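
The error message notes that CUDA kernel errors can be reported asynchronously, so the `.to(device="cpu")` call in the traceback may not be where the fault actually occurs. A minimal sketch of how one might re-run with `CUDA_LAUNCH_BLOCKING=1`, as the message suggests, to get a more accurate stack trace. The helper names `blocking_cuda_env` and `rerun_with_blocking_launches` are hypothetical, and `scripts/train_rl.py` is taken from the traceback; any Hydra overrides would need to be passed through `extra_args`.

```python
import os
import subprocess
import sys


def blocking_cuda_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    Hypothetical helper: it only sets the variable named in the error message.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_with_blocking_launches(script="scripts/train_rl.py", extra_args=()):
    """Run the script in a child process so the variable is set before CUDA initializes.

    Hypothetical helper: the script path comes from the traceback above.
    """
    cmd = [sys.executable, script, *extra_args]
    return subprocess.run(cmd, env=blocking_cuda_env(), check=False).returncode


# Example (not executed here):
# exit_code = rerun_with_blocking_launches()
```

Running in a child process is one way to make sure the variable is in place before the CUDA context is created; exporting it in the shell before launching the script should have the same effect.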

Labels

bug: Something isn't working
