Description
Dear @duncanriach,
Thank you for your contributions, work, and guidance towards making TensorFlow deterministic in the recent releases.
Unfortunately, for popular Keras NLP models (BERT), some problems seem to remain (see also the related issue #14 in this repository).
Despite combining the learnings from:
- the "complete recipe" in your slides from gputechconf
- your recently suggested workaround for issues with crossentropy loss
... I still arrive at the following short, non-deterministic Colab notebook example.
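Concretely, the determinism-related setup I am combining in the notebook is along the lines of the following sketch (assuming TF 2.x on a single GPU; the seed value is illustrative, and the crossentropy-loss workaround is applied separately):

```python
# Minimal sketch of the determinism setup (illustrative, TF 2.x on Colab GPU).
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42  # illustrative seed value

# Enable deterministic GPU op implementations where available
# (per the framework-determinism guidance).
os.environ["TF_DETERMINISTIC_OPS"] = "1"
os.environ["TF_CUDNN_DETERMINISTIC"] = "1"

# Seed every source of randomness the training loop touches.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
```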
My results for the sum of the model weights (as computed with a function you had suggested; see the sketch after the table) after training for only 5 steps are as follows (differences are highlighted below):
| Run | Device | Before training | After training |
|---|---|---|---|
| Run 1 | GPU | -641227.5609667897224 | -641237.442**5159916282** |
| Run 2 | GPU | -641227.5609667897224 | -641237.442**3093758523** |
| Run 1 | CPU | -641227.5609667301178 | -641238.1506845243275 |
| Run 2 | CPU | -641227.5609667301178 | -641238.1506845243275 |
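For reference, the weight sum above is computed with a helper along these lines (the name `sum_of_weights` is my own; it simply sums every weight tensor returned by `model.get_weights()`):

```python
import numpy as np

def sum_of_weights(model):
    """Sum of all weight values, used as a quick fingerprint
    to compare model state across runs (illustrative helper)."""
    return float(np.sum([np.sum(w) for w in model.get_weights()]))

# Usage: print the fingerprint before and after model.fit(...)
# print(sum_of_weights(model))
```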
This variance becomes increasingly pronounced the longer the model is trained.
Could you please help identify the source of non-determinism and provide guidance on how we can resolve it?
As transformers is a very popular package (29.1k GitHub stars), I expect that many other people are silently affected by this phenomenon.
Note: As shown above, the same code becomes fully deterministic when running on the Colab CPU runtime.
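(For anyone reproducing this: rather than switching the Colab runtime type, one way to get the identical code running on the CPU is to hide the GPU programmatically, before TensorFlow touches it. A sketch:)

```python
import tensorflow as tf

# Hide all GPUs from TensorFlow so the exact same code runs CPU-only.
# Must be called before any op is placed on the GPU.
tf.config.set_visible_devices([], "GPU")
print(tf.config.get_visible_devices())  # should list only CPU devices
```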