Description
I have encountered a bug that invalidates multi-GPU training: since model initialization is non-deterministic, the model replica stored on each GPU differs from the others, and the replicas diverge during training.
This happens for every `sampled_basis` buffer. Specifically, in the init of `BlocksBasisExpansion` (in https://github.com/QUVA-Lab/escnn/blob/master/escnn/nn/modules/basismanager/basisexpansion_blocks.py) at:

```python
for i_repr in set(in_reprs):
    for o_repr in set(out_reprs):
```

These loops iterate over sets, which makes the order in which the blocks are created random. And since this happens independently on each GPU, the models per GPU end up differing (here `in_reprs` is `{"irrep_1", "irrep_0", "regular"}` and `out_reprs` is `{"regular"}`).
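The root cause is that the iteration order of a Python set is not deterministic across processes (for strings it depends on per-process hash randomization via PYTHONHASHSEED), and each DDP rank is its own process. A minimal sketch of the effect, using only the repr names from this issue (nothing escnn-specific):

```python
# Run this as two separate Python processes (each DDP rank is one): the printed
# order can differ between them, because str hashing is randomized per process.
in_reprs = {"irrep_1", "irrep_0", "regular"}
out_reprs = {"regular"}

order = [(i_repr, o_repr) for i_repr in set(in_reprs) for o_repr in set(out_reprs)]
print(order)
# e.g. one process prints [('irrep_0', 'regular'), ('regular', 'regular'), ('irrep_1', 'regular')]
# while another prints    [('irrep_1', 'regular'), ('irrep_0', 'regular'), ('regular', 'regular')]
```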
Here’s one example of wrong ordering:
On one GPU I had

```
"module.enc.enc_out.basisexpansion.block_expansion_('irrep_0', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('regular', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis",
```

and on another one I had

```
"module.enc.enc_out.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('irrep_0', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('regular', 'regular').sampled_basis",
```
This should be caught by Torch. Unfortunately, only the parameters are checked, while buffers, i.e., values that stay constant through training, are not. And since `sampled_basis` in `BlocksBasisExpansion` is registered as a buffer, the mismatch fails silently.
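Until this is fixed, the mismatch can at least be detected manually. A sketch of such a sanity check (my own workaround, not part of escnn or DDP; it assumes torch.distributed is already initialized):

```python
import torch.distributed as dist

def check_buffer_order(model):
    # Registration order of buffer names on this rank.
    names = [name for name, _ in model.named_buffers()]
    # Gather every rank's list and compare it against rank 0's.
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, names)
    if any(other != gathered[0] for other in gathered):
        raise RuntimeError("Buffer registration order differs across ranks")
```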
Examples of such layers:

```
module.enc.enc_obs.conv.13.basisexpansion.block_expansion_('regular', 'regular').sampled_basis
diff.enc_a.0.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis
```
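A possible fix (just a suggestion on my side, not an existing patch) would be to iterate in a deterministic order instead of relying on set iteration order, e.g. by sorting on the representation names:

```python
# Hypothetical change inside BlocksBasisExpansion.__init__: sort the unique reprs
# so every process builds the block expansions in the same order.
for i_repr in sorted(set(in_reprs), key=lambda r: getattr(r, "name", str(r))):
    for o_repr in sorted(set(out_reprs), key=lambda r: getattr(r, "name", str(r))):
        ...  # build the block expansion for (i_repr, o_repr) as before
```

The `getattr` fallback is only there because I am not sure whether the elements are `Representation` objects or plain name strings at that point in the code.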