
Reproducibility issue with transformers (BERT) and tf2.2 #19

Open

Description

@MFreidank

Dear @duncanriach,
Thank you for your contributions, work, and guidance towards making TensorFlow deterministic in recent releases.
Unfortunately, some problems seem to remain for popular Keras NLP models such as BERT (see also the related issue #14 in this repository).

In spite of combining learnings from:

... I still arrive at the following short, non-deterministic Colab notebook example.
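
Roughly, the determinism configuration being combined looks like the sketch below (the values shown are illustrative; the actual notebook may differ slightly):

```python
# Illustrative sketch of the determinism setup being combined (the actual
# notebook may differ; the seed value here is arbitrary).
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42

# Ask TF 2.x for deterministic GPU kernels (recognized from TF 2.1 onwards).
os.environ["TF_DETERMINISTIC_OPS"] = "1"

# Seed every source of randomness before the model is built.
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
```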

My results for the sum of the model weights (computed with a function you had suggested) after training for only 5 steps are as follows (the diverging digits are highlighted below):

| Run | Device | Before training | After training |
| --- | --- | --- | --- |
| Run 1 | GPU | -641227.5609667897224 | -641237.442**5159916282** |
| Run 2 | GPU | -641227.5609667897224 | -641237.442**3093758523** |
| Run 1 | CPU | -641227.5609667301178 | -641238.1506845243275 |
| Run 2 | CPU | -641227.5609667301178 | -641238.1506845243275 |
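
For completeness, the weight-sum check behind the table is essentially the following (my paraphrase of the suggested helper; the function name and print formatting are mine, not the notebook's exact code):

```python
# Paraphrased weight-sum check used for the table above.
import numpy as np

def sum_of_weights(model):
    """Sum every value of every weight tensor in a Keras model."""
    return sum(np.sum(w) for w in model.get_weights())

# Printed with high precision before and after training so that runs can be
# compared digit-for-digit, e.g.:
# print(f"{sum_of_weights(model):.13f}")
```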

This variance becomes increasingly pronounced the longer the model is trained.

Could you please help identify the source of non-determinism and provide guidance on how we can resolve it?

As transformers is a very popular package (29.1k GitHub stars), I expect that many other people are silently affected by this phenomenon.

Note: As shown above, the same code is fully deterministic when run on the Colab CPU runtime.
