
RuntimeError: CUDA error: no kernel image is available for execution on the device #20

@xibian1120

Description


When running the code, the following error appears:

```
Traceback (most recent call last):
  File "./train.py", line 95, in <module>
  File "./train.py", line 78, in main
    zero_shot_evaluation(model, val_loaders, opts)
  File "/media/yxl/a/2025191008/VALOR-master/train_utils.py", line 247, in zero_shot_evaluation
    eval_log = validate(model, test_loader, opts, global_step=0, total_step=opts.num_train_steps)
  File "/media/yxl/a/2025191008/VALOR-master/test.py", line 26, in validate
    val_log = validate_single(model, loader, task.split('--')[0], opts, global_step, total_step,task.split('--')[1])
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/media/yxl/a/2025191008/VALOR-master/test.py", line 40, in validate_single
    return validate_cap(model, val_loader, task, opts, global_step, dset_name)
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/media/yxl/a/2025191008/VALOR-master/test.py", line 161, in validate_cap
    evaluation_dict = model(batch, task_str, compute_loss=False)
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/apex/amp/_initialize.py", line 196, in new_fwd
    output = old_fwd(*applier(args, input_caster),
  File "/media/yxl/a/2025191008/VALOR-master/model/pretrain.py", line 135, in forward
    return self.forward_cap(batch, task, compute_loss=compute_loss)
  File "/media/yxl/a/2025191008/VALOR-master/model/pretrain.py", line 726, in forward_cap
    return self.generate_cap(batch, task)
  File "/media/yxl/a/2025191008/VALOR-master/model/pretrain.py", line 930, in generate_cap
    video_input = self.get_multimodal_forward_input_video(video_output)
  File "/media/yxl/a/2025191008/VALOR-master/model/modeling.py", line 490, in get_multimodal_forward_input_video
    video_output =  video_output + self.video_frame_embedding[:,:video_output.shape[1],:].unsqueeze(-2)
RuntimeError: CUDA error: no kernel image is available for execution on the device
  0%|          | 0/1495 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 15399) of binary: /home/yxl/anaconda3/envs/valor_env/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
```

I've tried every fix I could find online, but none of them solved the problem.
My environment:
```
sys.platform linux
Python 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0]
numpy 1.24.4
detectron2 failed to import
detectron2._C not built correctly: No module named 'detectron2'
Compiler ($CXX) c++ (GCC) 7.3.0
CUDA compiler Build cuda_11.1.TC455_06.29190527_0
DETECTRON2_ENV_MODULE
PyTorch 1.9.0+cu111 @/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch
PyTorch debug build False
GPU available True
GPU 0 GeForce RTX 2080 Ti (arch=7.5)
CUDA_HOME /usr/local/cuda
TORCH_CUDA_ARCH_LIST 7.5
Pillow 10.1.0
torchvision 0.10.0+cu111 @/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
cv2 Not found

PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
```
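The environment report already contains enough to check whether PyTorch itself ships a kernel for this GPU. The sketch below parses the NVCC architecture flags listed under "PyTorch built with:" and applies CUDA's compatibility rules: a cubin built for `sm_XY` runs only on GPUs with the same major version and an equal or higher minor version, while PTX (`compute_XY`) can be JIT-compiled for the same or a newer GPU. The helper names `parse_code_targets` and `has_kernel_image` are hypothetical and not part of PyTorch. For the RTX 2080 Ti (compute capability 7.5) the check finds `sm_75`, which suggests the missing kernel image may come from a separately compiled CUDA extension (possibly the apex build that appears in the traceback) rather than from PyTorch's own kernels. Because CUDA errors are reported asynchronously, the line shown in the traceback is not necessarily the kernel that failed; rerunning with `CUDA_LAUNCH_BLOCKING=1` usually points closer to the real one.

```python
import re
from typing import List, Tuple

# NVCC architecture flags copied from the "PyTorch built with:" section above.
TORCH_NVCC_FLAGS = (
    "-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;"
    "-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;"
    "-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;"
    "-gencode;arch=compute_86,code=sm_86"
)


def parse_code_targets(arch_flags: str) -> List[Tuple[str, int]]:
    """Collect compiled targets as (kind, arch) pairs, e.g. ("sm", 75).

    Works on NVCC -gencode flags (the virtual "arch=compute_XY" half is skipped,
    only the "code=..." targets are kept) and on space-separated lists such as
    "sm_37 sm_50 sm_75".
    """
    pattern = r"(?<!arch=)\b(sm|compute)_(\d+)"
    return [(kind, int(num)) for kind, num in re.findall(pattern, arch_flags)]


def has_kernel_image(targets: List[Tuple[str, int]], major: int, minor: int) -> bool:
    """Return True if any target can run on a GPU of compute capability major.minor."""
    device = major * 10 + minor
    for kind, arch in targets:
        # SASS (cubin) built for X.y runs on X.z only when z >= y.
        if kind == "sm" and arch // 10 == major and arch <= device:
            return True
        # PTX can be JIT-compiled by the driver for the same or a newer GPU.
        if kind == "compute" and arch <= device:
            return True
    return False


targets = parse_code_targets(TORCH_NVCC_FLAGS)
print("RTX 2080 Ti (7.5):", has_kernel_image(targets, 7, 5))  # True
print("H100 (9.0):", has_kernel_image(targets, 9, 0))  # False
```

On the affected machine, the same helpers could presumably be fed from the live install, e.g. `has_kernel_image(parse_code_targets(" ".join(torch.cuda.get_arch_list())), *torch.cuda.get_device_capability())`. This only covers PyTorch's own kernels; a compiled extension carries its own architecture list, fixed when it was built.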
