🚀 Feature
Please make the opacus grad_sampler compatible with torch.cat operations in activation functions
Motivation
I've been trying to use the grad_sampler module with networks containing the CReLU activation function. However, CReLU concatenates the output of the layer with its negation and then applies ReLU, thus doubling the effective output size of the layer. This can be very useful and space-saving in networks that tend to develop mirrored filters (see https://arxiv.org/pdf/1603.05201v2.pdf).
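For concreteness, here's a minimal CReLU module of the kind I mean (my own sketch, not anything from opacus):

```python
import torch
import torch.nn as nn

class CReLU(nn.Module):
    # CReLU(x) = ReLU([x, -x]): concatenate x with its negation along the
    # feature dimension, then apply ReLU. This doubles the effective output width.
    def forward(self, x):
        return torch.relu(torch.cat([x, -x], dim=-1))
```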
Furthermore, using the CReLU activation function it is possible to initialize fully connected networks so that they appear linear at initialization (see photo in additional context). This has been shown to be an extremely powerful initialization pattern, allowing fully connected networks with over 200 layers to be trained. That's incredible! Typical fully connected networks often struggle to learn appreciably at 20+ layers (see https://arxiv.org/pdf/1702.08591.pdf).
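For reference, one simple variant of that mirrored ("looks linear") initialization, assuming a Linear layer that directly follows the CReLU module sketched above (the orthogonal base matrix is just one reasonable choice, not necessarily the paper's exact scheme):

```python
import torch
import torch.nn as nn

def looks_linear_init_(linear: nn.Linear):
    # The layer's input width was doubled by CReLU, so pair each weight block W
    # with -W: W @ relu(x) - W @ relu(-x) = W @ x, i.e. the layer acts linearly
    # at initialization.
    out_f, in_f = linear.weight.shape
    assert in_f % 2 == 0, "expects the doubled (post-CReLU) input width"
    w = torch.empty(out_f, in_f // 2)
    nn.init.orthogonal_(w)
    with torch.no_grad():
        linear.weight.copy_(torch.cat([w, -w], dim=1))
        if linear.bias is not None:
            linear.bias.zero_()
```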
Because of the symmetric initialization pattern, the discontinuities in the CReLU activation function are dramatically smaller than in comparable networks with ReLU or other activation functions. I've been studying gradient conditioning and stability in a variety of architectures using opacus, but the grad sampler breaks for activation functions that use torch.cat. In the case of CReLU, weight.grad_sample returns something that is half the size of the weight itself (ignoring the batch dimension).
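Here's a rough repro sketch of the kind of setup where I see this, reusing the CReLU module from the first sketch (layer sizes are arbitrary); the point is just to compare each parameter's shape against its grad_sample shape:

```python
import torch
import torch.nn as nn
from opacus import GradSampleModule

model = GradSampleModule(nn.Sequential(
    nn.Linear(8, 16),
    CReLU(),           # doubles the feature width: 16 -> 32
    nn.Linear(32, 4),
))
x = torch.randn(5, 8)  # batch of 5
model(x).sum().backward()

for name, p in model.named_parameters():
    # expected: p.grad_sample.shape == (batch,) + p.shape
    print(name, tuple(p.shape), tuple(p.grad_sample.shape))
```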
Pitch
Implementing (or fixing) opacus grad_sampler compatibility with torch.cat would allow it to be used with a wider variety of activation functions, including CReLU, which would be really cool (see the motivation section above).
I didn't file this as a bug report because I'm not sure that torch.cat compatibility was ever intentionally implemented.
Alternatives
I can't think of any alternatives