
riccardofelluga
Collaborator

Summary

This PR introduces dtype-specific numerical tolerances for Transformer Engine (TE) tests. The goal is more stable tests with fewer false positives caused by precision differences between floating-point data types.
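To make the precision gap concrete, here is a small sketch (not part of this PR) that stores 1/3 in each dtype and measures the relative rounding error against float64. The helper `relative_rounding_error` is hypothetical. Rounding alone costs bfloat16 roughly 2e-3 relative error. That is more than a thousand times the float32 rtol of 1.3e-6, so a single float32-level tolerance fails on lower-precision dtypes even when the computation is correct.

```python
import torch


def relative_rounding_error(value: float, dtype: torch.dtype) -> float:
    """Hypothetical helper: relative error from storing `value` in `dtype`,
    measured against the float64 representation of `value`."""
    exact = torch.tensor(value, dtype=torch.float64)
    rounded = exact.to(dtype).to(torch.float64)  # round-trip through the target dtype
    return ((rounded - exact).abs() / exact.abs()).item()


# Compare each dtype's rounding error of 1/3 with its machine epsilon.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    err = relative_rounding_error(1.0 / 3.0, dtype)
    print(f"{dtype}: relative error {err:.2e}, eps {torch.finfo(dtype).eps:.2e}")
```

Each dtype's error stays within its own rtol from the list below, but the bfloat16 error is far outside the float32 one.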

Key Changes

  • Added te_assert_close function: a new assertion wrapper that picks tolerances automatically from the dtype of the tensors being compared, with a separate setting for Python float scalars
  • Dtype-specific tolerances: Implements tolerances aligned with Transformer Engine's numerical testing specification:
    • float32: rtol=1.3e-6, atol=1e-5
    • float16: rtol=1e-3, atol=1e-5
    • bfloat16: rtol=1.6e-2, atol=1e-5
    • float scalars: rtol=1.3e-6, atol=1e-5
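A minimal sketch of how such a wrapper could work, assuming it forwards to `torch.testing.assert_close` with tolerances looked up from the dtype of `actual`. The helper `pick_tolerances`, the table names, and the fallback to torch's own defaults for unlisted dtypes are hypothetical. The PR's actual `te_assert_close` may have a different signature and may handle mixed or unlisted dtypes differently.

```python
import torch

# Hypothetical tables holding the tolerances listed in this PR.
_DTYPE_TOLS = {
    torch.float32: {"rtol": 1.3e-6, "atol": 1e-5},
    torch.float16: {"rtol": 1e-3, "atol": 1e-5},
    torch.bfloat16: {"rtol": 1.6e-2, "atol": 1e-5},
}
_FLOAT_SCALAR_TOLS = {"rtol": 1.3e-6, "atol": 1e-5}


def pick_tolerances(value) -> dict:
    """Return rtol/atol for `value`, or {} to fall back to torch's defaults."""
    if isinstance(value, float):
        return dict(_FLOAT_SCALAR_TOLS)
    if isinstance(value, torch.Tensor):
        return dict(_DTYPE_TOLS.get(value.dtype, {}))
    return {}


def te_assert_close(actual, expected, **kwargs) -> None:
    """Hypothetical sketch: compare `actual` to `expected`, choosing tolerances
    from the dtype of `actual` unless the caller passes rtol/atol explicitly."""
    if "rtol" not in kwargs and "atol" not in kwargs:
        kwargs.update(pick_tolerances(actual))
    torch.testing.assert_close(actual, expected, **kwargs)


# Usage: one bfloat16 ulp above 1.0 passes under bfloat16 tolerances.
te_assert_close(
    torch.tensor([1.0], dtype=torch.bfloat16),
    torch.tensor([1.0078125], dtype=torch.bfloat16),
)
```

For float32, float16, and bfloat16 these values coincide with the defaults `torch.testing.assert_close` already uses. In this sketch the main behavioral change is for Python float scalars, which torch would otherwise compare with float64-level tolerances (rtol=1e-7, atol=1e-7).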

Collaborator

mattteochen left a comment

lgtm, thank you!

Collaborator

t-vi left a comment

t-vi merged commit a435195 into main on Oct 15, 2025
55 of 68 checks passed
t-vi deleted the te-relax-test-tolerance branch on October 15, 2025 at 08:23


3 participants