Description
Dear @duncanriach,
Thank you for your contributions, work, and guidance towards making TensorFlow deterministic in the recent releases.
Unfortunately, for popular Keras NLP models (BERT), some problems seem to remain (see also the related issue #14 in this repository).
Despite combining the learnings from:
- the "complete recipe" in your slides from gputechconf
- your recently suggested workaround for issues with crossentropy loss
... I still arrive at the following short, non-deterministic Colab notebook example.
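Concretely, the determinism-related setup I am combining in the notebook is along the lines of the following sketch (assuming TF 2.x on a single GPU; the seed value is illustrative, and the crossentropy-loss workaround is applied separately):

```python
# Minimal sketch of the determinism setup (illustrative, TF 2.x on Colab GPU).
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42  # illustrative seed value

# Enable deterministic GPU op implementations where available
# (per the framework-determinism guidance).
os.environ["TF_DETERMINISTIC_OPS"] = "1"
os.environ["TF_CUDNN_DETERMINISTIC"] = "1"

# Seed every source of randomness the training loop touches.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
```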
My results for the sum of the model weights (as computed with a function you had suggested; see the sketch after the table) after training for only 5 steps are as follows (differences are highlighted below):
| Run | Device | Before training | After training |
|---|---|---|---|
| Run 1 | GPU | -641227.5609667897224 | -641237.442**5159916282** |
| Run 2 | GPU | -641227.5609667897224 | -641237.442**3093758523** |
| Run 1 | CPU | -641227.5609667301178 | -641238.1506845243275 |
| Run 2 | CPU | -641227.5609667301178 | -641238.1506845243275 |
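For reference, the weight sum above is computed with a helper along these lines (the name `sum_of_weights` is my own; it simply sums every weight tensor returned by `model.get_weights()`):

```python
import numpy as np

def sum_of_weights(model):
    """Sum of all weight values, used as a quick fingerprint
    to compare model state across runs (illustrative helper)."""
    return float(np.sum([np.sum(w) for w in model.get_weights()]))

# Usage: print the fingerprint before and after model.fit(...)
# print(sum_of_weights(model))
```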
This variance becomes increasingly pronounced the longer the model is trained.
Could you please help identify the source of non-determinism and provide guidance on how we can resolve it?
As transformers is a very popular package (29.1k GitHub stars), I expect that many other people are silently affected by this phenomenon.
Note: As shown above, the same code becomes fully deterministic when running on the Colab CPU runtime.
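(For anyone reproducing this: rather than switching the Colab runtime type, one way to get the identical code running on the CPU is to hide the GPU programmatically, before TensorFlow touches it. A sketch:)

```python
import tensorflow as tf

# Hide all GPUs from TensorFlow so the exact same code runs CPU-only.
# Must be called before any op is placed on the GPU.
tf.config.set_visible_devices([], "GPU")
print(tf.config.get_visible_devices())  # should list only CPU devices
```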