Description
I have encountered a bug that invalidates multi-GPU training: since model initialization is non-deterministic, the model replica stored on each GPU differs from the others, and the replicas diverge during training.
This happens for every `sampled_basis` buffer. Specifically, in the init of `BlocksBasisExpansion` (in https://github.com/QUVA-Lab/escnn/blob/master/escnn/nn/modules/basismanager/basisexpansion_blocks.py) at:

```python
for i_repr in set(in_reprs):
    for o_repr in set(out_reprs):
```

These loops iterate over sets, which makes the order in which the blocks are created random. And since this happens independently on each GPU, the models per GPU end up differing (here `in_reprs` is `{"irrep_1", "irrep_0", "regular"}` and `out_reprs` is `{"regular"}`).
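The root cause is that the iteration order of a Python set is not deterministic across processes (for strings it depends on per-process hash randomization via PYTHONHASHSEED), and each DDP rank is its own process. A minimal sketch of the effect, using only the repr names from this issue (nothing escnn-specific):

```python
# Run this as two separate Python processes (each DDP rank is one): the printed
# order can differ between them, because str hashing is randomized per process.
in_reprs = {"irrep_1", "irrep_0", "regular"}
out_reprs = {"regular"}

order = [(i_repr, o_repr) for i_repr in set(in_reprs) for o_repr in set(out_reprs)]
print(order)
# e.g. one process prints [('irrep_0', 'regular'), ('regular', 'regular'), ('irrep_1', 'regular')]
# while another prints    [('irrep_1', 'regular'), ('irrep_0', 'regular'), ('regular', 'regular')]
```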
Here’s one example of wrong ordering:
On one GPU I had

```
"module.enc.enc_out.basisexpansion.block_expansion_('irrep_0', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('regular', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis",
```

and on another one I had

```
"module.enc.enc_out.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('irrep_0', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('regular', 'regular').sampled_basis",
```
This should be caught by Torch. Unfortunately, only the parameters are checked, while buffers, i.e., values that stay constant through training, are not. And since `sampled_basis` in `BlocksBasisExpansion` is registered as a buffer, the mismatch fails silently.
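Until this is fixed, the mismatch can at least be detected manually. A sketch of such a sanity check (my own workaround, not part of escnn or DDP; it assumes torch.distributed is already initialized):

```python
import torch.distributed as dist

def check_buffer_order(model):
    # Registration order of buffer names on this rank.
    names = [name for name, _ in model.named_buffers()]
    # Gather every rank's list and compare it against rank 0's.
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, names)
    if any(other != gathered[0] for other in gathered):
        raise RuntimeError("Buffer registration order differs across ranks")
```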
Examples of such layers:

```
module.enc.enc_obs.conv.13.basisexpansion.block_expansion_('regular', 'regular').sampled_basis
diff.enc_a.0.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis
```
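A possible fix (just a suggestion on my side, not an existing patch) would be to iterate in a deterministic order instead of relying on set iteration order, e.g. by sorting on the representation names:

```python
# Hypothetical change inside BlocksBasisExpansion.__init__: sort the unique reprs
# so every process builds the block expansions in the same order.
for i_repr in sorted(set(in_reprs), key=lambda r: getattr(r, "name", str(r))):
    for o_repr in sorted(set(out_reprs), key=lambda r: getattr(r, "name", str(r))):
        ...  # build the block expansion for (i_repr, o_repr) as before
```

The `getattr` fallback is only there because I am not sure whether the elements are `Representation` objects or plain name strings at that point in the code.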