Using PyTorch v1.0.1, I was initially getting this error:
RuntimeError: binary_op(): expected both inputs to be on same device, but input a is on cuda:1 and input b is on cuda:0
After applying the register_buffer fix described here (https://discuss.pytorch.org/t/tensors-are-on-different-gpus/1450/28) in the custom_layers.py file, I was able to get the program to run. GPU memory is being used, but each iteration takes just as long as it does on CPU only.
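
For context, the register_buffer pattern looks roughly like the sketch below. This is a minimal illustration only: `ScaledLayer` and its `scale` tensor are hypothetical names, not the actual code in custom_layers.py. The idea is that a tensor registered as a buffer is moved together with the module's parameters, so it should not be left behind on a different device.

```python
import torch
import torch.nn as nn


class ScaledLayer(nn.Module):
    """Hypothetical layer for illustration; not taken from custom_layers.py."""

    def __init__(self, num_features):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_features))
        # A buffer is part of the module's state, so .to(), .cuda(), and
        # nn.DataParallel replication move it along with the parameters.
        self.register_buffer("scale", torch.full((num_features,), 0.5))

    def forward(self, x):
        # Both operands live on the module's device, so no device mismatch.
        return x * self.weight * self.scale


# Falls back to CPU when no GPU is visible, so the sketch runs anywhere.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
layer = ScaledLayer(4).to(device)
out = layer(torch.ones(2, 4, device=device))
```

By contrast, a plain tensor attribute (for example `self.scale = torch.full(...)` without `register_buffer`) stays on whatever device it was created on, which is one way to end up with the cuda:1 vs. cuda:0 mismatch above.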

Do you have any idea why this would be?