In R2Plus1D_model.py, line 200: https://github.com/jfzhang95/pytorch-video-recognition/blob/ca37de9f69a961f22a821c157e9ccf47a601904d/network/R2Plus1D_model.py#L200

This is actually a 3 × 7 × 7 convolution with padding=(1, 3, 3), not a 1 × 7 × 7 one!
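
For anyone who wants to check the arithmetic, here is a rough sketch of how the effective kernel and padding could be worked out. It assumes (this is my assumption, not something quoted from the file) that the layer is factorized in the usual (2+1)D way: a 1 × 7 × 7 spatial convolution with stride (1, 2, 2) and padding (0, 3, 3), followed by a 3 × 1 × 1 temporal convolution with padding (1, 0, 0). The `compose` helper and all the numbers fed into it are hypothetical and only there for illustration.

```python
def compose(first, second):
    """Combine two stacked convolutions, each given as (kernel, stride, padding)
    tuples over (T, H, W), into the effective receptive-field kernel, stride,
    and padding of the pair."""
    (k1, s1, p1), (k2, s2, p2) = first, second
    # The second kernel steps over the first layer's output, so its extent
    # is scaled by the first layer's stride.
    kernel = tuple(a + (b - 1) * s for a, b, s in zip(k1, k2, s1))
    stride = tuple(a * b for a, b in zip(s1, s2))
    padding = tuple(a + b * s for a, b, s in zip(p1, p2, s1))
    return kernel, stride, padding


# Assumed values for illustration only (not copied from the repository).
spatial = ((1, 7, 7), (1, 2, 2), (0, 3, 3))   # 1 x 7 x 7 spatial conv
temporal = ((3, 1, 1), (1, 1, 1), (1, 0, 0))  # 3 x 1 x 1 temporal conv

effective = compose(spatial, temporal)
print(effective)
```

Under those assumptions the pair covers a 3 × 7 × 7 window with padding (1, 3, 3). This only compares receptive fields: with a normalization or activation layer between the two convolutions, the pair is not numerically identical to a single 3 × 7 × 7 convolution.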