
fix loss masking #345

Draft · wants to merge 2 commits into main

Conversation

RaymondLi0 (Contributor)

✨ Description

  • Fix the Triton implementation triton_cross_entropy_from_distribution_forward_backward_kernel.

Closes #344

🔍 Type of change

Select all that apply:

  • πŸ› Bug fix (non-breaking change that addresses a specific issue)
  • πŸš€ New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • πŸ“ˆ Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • πŸ› οΈ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • πŸ“¦ Dependency bump (updates dependencies, including Dockerfile or package changes)
  • πŸ“ Documentation change (updates documentation, including new content or typo fixes)
  • πŸ”§ Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Change A
  2. Change B

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • πŸ‹ I have updated the Docker configuration or dependencies, if applicable.
  • πŸ”„ I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

per_token_loss = torch.nn.functional.cross_entropy(
    logits_ if logits_scale_factor == 1 else logits_ * logits_scale_factor, target, reduction="none"
)
loss = (per_token_loss * loss_mask).sum() / loss_mask.sum()
@oleksost (Contributor) commented on Aug 7, 2025:

This can result in NaNs if loss_mask.sum() is 0, which can actually happen in practice in the context of reasoning SFT, where prompts can be very long, or when we do TP and split across the sequence-length dimension.

So maybe better to check something like:


            mask_sum = loss_mask.sum()
            if mask_sum > 0:
                loss = (per_token_loss * loss_mask).sum() / mask_sum
            else:
                # mask_sum can be 0 for inputs containing only prompts
                loss = (per_token_loss * 0.0).mean()  # preserve grads
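
For context, here is a minimal standalone sketch of the failure mode and of the guard above; it is illustrative only (the shapes and names here are made up, not taken from the repo):

import torch

# A fully-masked micro-sequence, e.g. one that contains only prompt tokens.
logits = torch.randn(4, 8, requires_grad=True)
target = torch.randint(0, 8, (4,))
loss_mask = torch.zeros(4)

per_token_loss = torch.nn.functional.cross_entropy(logits, target, reduction="none")

# Unguarded reduction: 0 / 0 produces nan, which would propagate into the gradients.
unguarded = (per_token_loss * loss_mask).sum() / loss_mask.sum()
print(unguarded)  # tensor(nan, grad_fn=...)

# Guarded reduction: zero loss that is still attached to the graph, so backward() stays well-defined.
mask_sum = loss_mask.sum()
if mask_sum > 0:
    loss = (per_token_loss * loss_mask).sum() / mask_sum
else:
    loss = (per_token_loss * 0.0).mean()  # preserve grads (they are exactly zero here)
loss.backward()
print(loss, logits.grad.abs().sum())  # tensor(0.) and all-zero gradients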

@RaymondLi0 (Contributor, Author) commented:

As discussed with @oleksost, to finish the fix we'd also need to properly reduce the loss across micro-sequences, taking the sum of the loss_mask into account.

Now, on second thought:
With the current implementation in main, each token contributes the same amount to the gradient, no matter how many masked tokens its sample contains.
Whereas if we finish this fix and go forward with it, tokens from a sample (sequence) with many masked positions would contribute more to the gradient than tokens from a sample without a loss mask.

The question is whether we want an average of the loss over samples, or over tokens.
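
To make the two options concrete, here is an illustrative sketch (not code from this PR; the tensors are made up) comparing a per-token average with a per-sample average over two micro-sequences, one unmasked and one almost fully masked:

import torch

per_token_loss = torch.tensor([[1.0, 1.0, 1.0, 1.0],
                               [2.0, 2.0, 2.0, 2.0]])
loss_mask = torch.tensor([[1.0, 1.0, 1.0, 1.0],   # no masked positions
                          [0.0, 0.0, 0.0, 1.0]])  # mostly prompt, one supervised token

# Average over tokens: one global sum divided by the total mask sum. Equivalently,
# keep (masked loss sum, mask sum) per micro-sequence and reduce both across
# micro-sequences before the final division.
token_avg = (per_token_loss * loss_mask).sum() / loss_mask.sum()    # (4*1 + 1*2) / 5 = 1.2

# Average over samples: normalize each sequence by its own mask sum, then average sequences.
# The lone supervised token of the second sample now weighs as much as a whole sample.
per_sample = (per_token_loss * loss_mask).sum(dim=1) / loss_mask.sum(dim=1).clamp(min=1)
sample_avg = per_sample.mean()                                       # (1.0 + 2.0) / 2 = 1.5

print(token_avg.item(), sample_avg.item())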

Successfully merging this pull request may close these issues: Fix loss-masking for distillation? (#344)