
Fix GRPO/DAPO/Dr.GRPO documentation: formula corrections and KL divergence clarification #3395


Open · wants to merge 2 commits into main

Conversation

@JenWei0312 (Author) opened this pull request:

This PR clarifies when to use KL divergence (β parameter) in GRPO/DAPO/Dr.GRPO implementations.

The current documentation has several issues:

  1. It implies all loss types include KL divergence, contradicting the DAPO and Dr.GRPO papers
  2. It has inconsistencies in mathematical formulas for GRPO
  3. It lacks guidance on when to set β=0 vs β>0

Changes made:

  • Corrected GRPO formula with proper normalization
  • Clarified that DAPO and Dr.GRPO exclude KL term (β=0)
  • Added a "To KL Or Not To KL" section explaining when KL can be omitted
  • Improved formula presentation by separating KL from base loss term

References:

  • DAPO: https://arxiv.org/abs/2503.14476 (Section 2.3)
  • Dr.GRPO: https://huggingface.co/papers/2503.20783

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


@qgallouedec (Member) left a comment:

Thank you very much for improving the documentation. I still have some concerns about clarity and reading flow; I've left some comments.

@@ -106,7 +106,7 @@ Note that compared to the original formulation in [DeepSeekMath: Pushing the Lim
In the original paper, this formulation is generalized to account for multiple updates after each generation (denoted \\( \mu \\), can be set with `num_iterations` in [`GRPOConfig`]) by leveraging the **clipped surrogate objective**:

$$
- \mathcal{L}_{\text{GRPO}}(\theta) = - \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \left[ \min \left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})} \hat{A}_{i,t}, \, \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right) - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right],
+ \mathcal{L}_{\text{GRPO}}(\theta) = - \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ \min \left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})} \hat{A}_{i,t}, \, \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right) - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right],
@qgallouedec (Member) commented:

This would make the equation inconsistent with the previous one:
[Screenshot: 2025-05-01 at 10 44 49]

@JenWei0312 (Author) replied on May 4, 2025:

You're right that this creates an inconsistency between the formulas. My intention was to make the second formula match the original DeepSeekMath paper's definition of GRPO, which uses sequence-level normalization (-1/G ∑ 1/|oi| ∑).

I noticed that in your implementation, you have three different loss types:

if self.loss_type == "grpo":
    loss = ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)).mean()
elif self.loss_type == "bnpo":
    loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
elif self.loss_type == "dr_grpo":
    loss = (per_token_loss * completion_mask).sum() / (per_token_loss.size(0) * self.max_completion_length)

This suggests that the true GRPO implementation uses the original formula from the paper, while the first formula in the documentation seems to be describing a modified version. Would it make more sense to clarify that the first formula is actually describing the token-level normalization variant ('bnpo') rather than standard GRPO?
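To make the difference concrete, here is a toy sketch (made-up tensors, not the actual trainer code) showing that the three normalizations above can give different loss values on the same batch:

import torch

# Two completions with lengths 2 and 4; masked positions are ignored.
per_token_loss = torch.tensor([[3., 1., 0., 0.],
                               [1., 1., 1., 1.]])
completion_mask = torch.tensor([[1., 1., 0., 0.],
                                [1., 1., 1., 1.]])
max_completion_length = 4

# "grpo": average within each sequence, then across sequences
grpo = ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)).mean()
# "bnpo": divide by the total number of unmasked tokens in the batch
bnpo = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
# "dr_grpo": divide by a constant, batch size * max completion length
dr_grpo = (per_token_loss * completion_mask).sum() / (per_token_loss.size(0) * max_completion_length)

print(grpo.item(), bnpo.item(), dr_grpo.item())  # 1.5, 1.333..., 1.0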

$$

where

$$
l_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\left[\pi_\theta(o_{i,t} \mid q, o_{i,< t})\right]_{\text{no grad}}} \hat{A}_{i,t} - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right].
@qgallouedec (Member) commented:

A bit the same here: I think it creates confusion. We mention the token-level normalisation, but when comparing the equations the KL term is also gone, without any explanation.
[Screenshot: 2025-05-01 at 10 47 48]

@JenWei0312 (Author) replied:

The removal of the KL term in the DAPO and Dr. GRPO formulations is explicitly stated in the original papers. In the DAPO paper (Section 2.3), the authors state: "during training the long-CoT reasoning model, the model distribution can diverge significantly from the initial model, thus this restriction is not necessary. Therefore, we will exclude the KL term from our proposed algorithm."

Similarly, the Dr. GRPO paper states on page 6 that we can "remove the KL term" when using rule-based verifiers. I wanted to make this distinction clear in the documentation since it represents an important theoretical difference between these algorithms.

I also attached both papers.
DAPO.pdf
Dr_GRPO.pdf

Besides differing in how each output's loss is weighted by its length, both DAPO ([DAPO: An Open-Source LLM Reinforcement Learning System at Scale](https://arxiv.org/abs/2503.14476)) and Dr. GRPO ([Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783)) exclude the KL term, i.e., set $\beta = 0$ in $\mathcal{L}_{\text{GRPO}}(\theta)$, because

1) during training of a long-CoT reasoning model, the model distribution can diverge significantly from the initial model, so this restriction is not necessary; and
2) RL-tuning of reasoning models typically employs rule-based verifiers as the reward $r$ ([Tulu 3: Pushing Frontiers in Open Language Model Post-Training](https://arxiv.org/abs/2411.15124)), eliminating the concerns of distributional shift.
@qgallouedec (Member) commented:

I'm not sure how rule-based reward functions prevent distributional shift. Can you elaborate on this?

@JenWei0312 (Author) replied:

Regarding rule-based verifiers preventing distributional shift concerns: The key insight from the Dr. GRPO paper is that rule-based verifiers (like math answer checkers) provide accurate rewards regardless of how the policy distribution changes. Unlike learned reward models which are only accurate near their training distribution, rule-based verifiers are deterministic functions that don't suffer from distribution shift problems. This is why the papers argue it's safe to set β=0 when using rule-based rewards.
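To illustrate what is meant by a rule-based verifier here, a hypothetical example (not TRL's reward-function API, just a sketch of the idea): a reward that string-matches a boxed final answer against a reference is a fixed, deterministic rule, so its accuracy does not degrade as the policy drifts away from the reference model.

import re

def boxed_answer_reward(completions, ground_truths):
    """Toy rule-based verifier: 1.0 if the \\boxed{...} answer matches the reference, else 0.0."""
    rewards = []
    for completion, truth in zip(completions, ground_truths):
        match = re.search(r"\\boxed\{(.*?)\}", completion)
        answer = match.group(1).strip() if match else None
        # The check is a fixed rule on the text, independent of which policy produced it.
        rewards.append(1.0 if answer == truth.strip() else 0.0)
    return rewards

print(boxed_answer_reward([r"... so the answer is \boxed{42}."], ["42"]))  # [1.0]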

@JenWei0312 (Author) commented:

Thank you for your comments! My goal with these changes is to make the documentation accurately reflect the original research papers. There are a few key issues I'm trying to fix:

  1. Formula inconsistency: The original DeepSeekMath GRPO normalizes each sequence by its own length and then averages across sequences (-1/G ∑ 1/|oi| ∑), while DAPO uses batch-level normalization over all tokens (-1/∑|oi| ∑∑). The current documentation mixes these approaches.

  2. KL divergence: Both DAPO and Dr. GRPO papers explicitly state they remove the KL term (β=0). From DAPO paper section 2.3: 'during training the long-CoT reasoning model, the model distribution can diverge significantly from the initial model, thus this restriction is not necessary.' The Dr. GRPO paper similarly states on page 6 that for rule-based verifiers, we can 'remove the KL term.'

  3. Regarding rule-based verifiers preventing distributional shift concerns: Unlike learned reward models which are only accurate near their training distribution, rule-based verifiers (like checking if a math answer is correct) provide accurate rewards regardless of how far the policy drifts from the reference model. This is why both papers argue it's safe to set β=0 in this context.

My changes aim to make the documentation clearly reflect these important differences between the algorithms, which are implemented in the code but not clearly explained in the current documentation.
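If it helps to make these differences actionable, here is a minimal configuration sketch (assuming the `beta`, `loss_type`, and `max_completion_length` arguments of `GRPOConfig`; the exact names and defaults should be checked against the current API, and the β value below is only a placeholder):

from trl import GRPOConfig

# GRPO as in DeepSeekMath: sequence-level normalization, KL penalty kept (beta > 0).
grpo_cfg = GRPOConfig(loss_type="grpo", beta=0.04)

# DAPO-style: token-level (batch) normalization, KL term removed (beta = 0).
dapo_cfg = GRPOConfig(loss_type="bnpo", beta=0.0)

# Dr.GRPO-style: constant normalization by G * max_completion_length, KL term removed (beta = 0).
dr_grpo_cfg = GRPOConfig(loss_type="dr_grpo", beta=0.0, max_completion_length=1024)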

@JenWei0312 (Author) commented:

Hi @qgallouedec

I wanted to follow up on my previous responses about the KL divergence term and formula inconsistencies. I hope my explanations helped clarify my reasoning for the proposed changes.

After reflecting further on the documentation, I'm wondering if a more comprehensive restructuring might address these issues more effectively. Would you be open to an approach that clearly separates the theoretical background from the implementation details? Something like:

Research Papers Background

  • GRPO (DeepSeekMath): Original formulation with sequence-level normalization and KL term
  • DAPO: Token-level normalization without KL term
  • Dr. GRPO: Constant normalization without KL term

Implementation Details

if self.loss_type == "grpo":
    loss = ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)).mean()
elif self.loss_type == "bnpo":
    loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
elif self.loss_type == "dr_grpo":
    loss = (per_token_loss * completion_mask).sum() / (per_token_loss.size(0) * self.max_completion_length)
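As a compact summary for the "Research Papers Background" part above (only restating the normalizations already discussed, with $G$ completions per prompt, $L_{\max}$ denoting `max_completion_length`, and $l_{i,t}$ the per-token loss), each expression corresponds to one branch of the `loss_type` code:

$$
\underbrace{\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} l_{i,t}}_{\texttt{grpo}} \qquad
\underbrace{\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|} l_{i,t}}_{\texttt{bnpo}\ \text{(DAPO)}} \qquad
\underbrace{\frac{1}{G\,L_{\max}}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|} l_{i,t}}_{\texttt{dr\_grpo}}
$$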
