
Non-deterministic behavior on multi-GPU run #103

@lrenaux-bdai

I have encountered a bug that invalidates multi-GPU training. Because the initialization of the model is non-deterministic, the model replicas stored on the different GPUs diverge from one another.

This happens for all sampled_basis layers. The cause is in the init of BlocksBasisExpansion (https://github.com/QUVA-Lab/escnn/blob/master/escnn/nn/modules/basismanager/basisexpansion_blocks.py), at:

for i_repr in set(in_reprs):
    for o_repr in set(out_reprs):

Because these loops iterate over sets, the order in which the layers are created is not deterministic. This happens independently on each GPU, so the models on different GPUs end up with different layer orders (here in_reprs is {"irrep_1", "irrep_0", "regular"} and out_reprs is {"regular"}).
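
One possible way to make the iteration order reproducible is to deduplicate the inputs without going through a set. The sketch below is not the library's code: `ordered_unique` and `block_keys` are hypothetical helpers, and the representations are assumed to be plain strings as in the example above. It relies only on the fact that Python dicts preserve insertion order.

```python
def ordered_unique(names):
    """Drop duplicates while keeping first-seen order.

    dict.fromkeys keeps insertion order, so unlike set() the result
    does not depend on the hash values of the items.
    """
    return list(dict.fromkeys(names))


def block_keys(in_reprs, out_reprs):
    """Hypothetical stand-in for the nested loop: list the (in, out) pairs
    in a deterministic order."""
    return [
        (i_repr, o_repr)
        for i_repr in ordered_unique(in_reprs)
        for o_repr in ordered_unique(out_reprs)
    ]


# Assumed inputs, with duplicates as they might appear in a layer's field types.
in_reprs = ["irrep_1", "irrep_0", "regular", "irrep_0"]
out_reprs = ["regular", "regular"]

keys = block_keys(in_reprs, out_reprs)
print(keys)
```

Every process that receives the same input lists produces the same sequence of pairs. Iterating over `sorted(set(in_reprs))` would also work when the items are sortable, such as strings.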

Here’s one example of wrong ordering:
On one GPU I had

"module.enc.enc_out.basisexpansion.block_expansion_('irrep_0', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('regular', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis",

and on another one I had

"module.enc.enc_out.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('irrep_0', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('regular', 'regular').sampled_basis",
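
A likely explanation for the differing orders is Python's per-process string hash randomization: unless PYTHONHASHSEED is fixed, each process hashes strings differently, and the iteration order of a set of strings follows those hashes. The sketch below tries to reproduce this outside of training by launching one interpreter per seed. The `set_order` helper and the chosen seeds are illustrative assumptions, not part of escnn or Torch.

```python
import json
import os
import subprocess
import sys

# Code run in each child interpreter: print the iteration order of the set.
SNIPPET = (
    "import json; "
    "print(json.dumps(list({'irrep_1', 'irrep_0', 'regular'})))"
)


def set_order(seed):
    """Return the set's iteration order in a fresh interpreter whose
    string hashing is seeded with `seed`."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


# Each seed stands in for a separate GPU worker process.
orders = {seed: set_order(seed) for seed in range(1, 9)}
for seed, order in orders.items():
    print(seed, order)
```

The orders printed for different seeds may differ from one another, while repeating a seed gives the same order again. Fixing PYTHONHASHSEED for all workers would hide the symptom, but it does not remove the dependence on set iteration order.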

This should be caught by Torch. Unfortunately, it only checks parameters; buffers, i.e., values that stay constant through training, are not checked. Since the sampled_basis of BlocksBasisExpansion is a buffer, the mismatch fails silently.
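
Until the ordering is fixed, a mismatch like this could be detected with a manual check. The sketch below is an assumption-laden illustration, not an existing Torch feature: `order_fingerprint` and `ranks_agree` are hypothetical helpers that compare the ordered buffer names from each rank. In a real run the names could come from `model.named_buffers()` on each rank and be exchanged with `torch.distributed.all_gather_object`; here they are passed in as plain lists so that the example runs without GPUs.

```python
import hashlib


def order_fingerprint(names):
    """Hash an ordered list of buffer names; a different order gives a different digest."""
    joined = "\n".join(names).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


def ranks_agree(names_per_rank):
    """Return True if every rank registered its buffers in the same order."""
    return len({order_fingerprint(names) for names in names_per_rank}) == 1


prefix = "module.enc.enc_out.basisexpansion.block_expansion_"

# The two orderings reported above, one per GPU.
gpu_a = [
    prefix + "('irrep_0', 'regular').sampled_basis",
    prefix + "('regular', 'regular').sampled_basis",
    prefix + "('irrep_1', 'regular').sampled_basis",
]
gpu_b = [
    prefix + "('irrep_1', 'regular').sampled_basis",
    prefix + "('irrep_0', 'regular').sampled_basis",
    prefix + "('regular', 'regular').sampled_basis",
]

print(ranks_agree([gpu_a, gpu_b]))  # False: same names, different order
print(ranks_agree([gpu_a, gpu_a]))  # True
```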

Examples of such layers:
module.enc.enc_obs.conv.13.basisexpansion.block_expansion_('regular', 'regular').sampled_basis
diff.enc_a.0.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis
