supergeorge23
Contributor

Summary:

This Diff:

This diff implements Step 1 of T232701473 by creating EmptyDataloaderDetectorCallback,
a TNT callback that detects consecutive empty training epochs and implements a fail-fast
strategy to surface dataloader issues early.

Callback Feature:

The callback helps identify cases where dataloaders return empty batches, which can
cause confusing downstream issues that manifest as red herrings (e.g., apparent
checkpointing errors that are actually rapid step progression due to empty data).
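
A minimal stand-alone sketch of the detection idea described above, not the code in this diff. It assumes an epoch counts as "empty" when no training step ran during it. The class name `EmptyEpochDetector`, the `threshold` parameter, and the `run_epochs` helper are hypothetical. The hook names mirror TNT's `Callback` hooks, but the class deliberately does not subclass the real `Callback`, so it runs without torchtnt installed.

```python
from typing import Any, Iterable


class EmptyEpochDetector:
    """Illustrative only; not the actual EmptyDataloaderDetectorCallback."""

    def __init__(self, threshold: int = 2) -> None:
        # `threshold` (hypothetical): number of consecutive empty epochs
        # tolerated before the run is failed.
        if threshold <= 0:
            raise ValueError("threshold must be a positive integer")
        self._threshold = threshold
        self._steps_in_epoch = 0
        self._consecutive_empty_epochs = 0

    def on_train_epoch_start(self, state: Any, unit: Any) -> None:
        # Reset the per-epoch step counter.
        self._steps_in_epoch = 0

    def on_train_step_end(self, state: Any, unit: Any) -> None:
        # A step completed, so the dataloader yielded at least one batch.
        self._steps_in_epoch += 1

    def on_train_epoch_end(self, state: Any, unit: Any) -> None:
        if self._steps_in_epoch > 0:
            # A non-empty epoch breaks the streak.
            self._consecutive_empty_epochs = 0
            return
        self._consecutive_empty_epochs += 1
        if self._consecutive_empty_epochs >= self._threshold:
            # Fail fast so the dataloader problem surfaces here rather than
            # as a misleading downstream error.
            raise RuntimeError(
                f"{self._consecutive_empty_epochs} consecutive empty training "
                "epochs detected; the dataloader may be returning no data."
            )


def run_epochs(detector: EmptyEpochDetector, epochs: Iterable[Iterable[Any]]) -> None:
    """Toy loop: each inner iterable stands in for one epoch's dataloader."""
    for batches in epochs:
        detector.on_train_epoch_start(None, None)
        for _ in batches:
            detector.on_train_step_end(None, None)
        detector.on_train_epoch_end(None, None)
```

In this sketch a single empty epoch between non-empty ones is tolerated, while two empty epochs in a row raise with the default `threshold=2`.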

Next Diff:

Add to Mitra's default callbacks (Step 2 of T232701473)

Differential Revision: D79212756

@meta-cla meta-cla bot added the cla signed label Jul 30, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79212756

supergeorge23 added a commit to supergeorge23/tnt that referenced this pull request Aug 1, 2025

supergeorge23 added a commit to supergeorge23/tnt that referenced this pull request Aug 1, 2025
Summary:

# This Diff:
This diff implements Step 1 of T232701473 by creating EmptyDataloaderDetectorCallback,
a TNT callback that detects consecutive empty training epochs and implements a fail-fast
strategy to surface dataloader issues early.

# Callback Feature:
The callback helps identify cases where dataloaders return empty batches, which can
cause confusing downstream issues that manifest as red herrings (e.g., apparent
checkpointing errors that are actually rapid step progression due to empty data).

# Next Diff:
Add to Mitra's default callbacks (Step 2 of T232701473) and enable an e2e test with Mitra

Differential Revision: D79212756

supergeorge23 added a commit to supergeorge23/tnt that referenced this pull request Aug 1, 2025
Summary:
Pull Request resolved: meta-pytorch#1020

@supergeorge23 supergeorge23 force-pushed the export-D79212756 branch 2 times, most recently from 6c43d86 to 4ec1115 Compare August 4, 2025 03:24
supergeorge23 added a commit to supergeorge23/tnt that referenced this pull request Aug 4, 2025

