
Fix GRPO/DAPO/Dr.GRPO documentation: formula corrections and KL divergence clarification #3395


Open · wants to merge 2 commits into main

Conversation

@JenWei0312 (Author) opened this pull request:

This PR clarifies when to use KL divergence (β parameter) in GRPO/DAPO/Dr.GRPO implementations.

The current documentation has several issues:

  1. It implies all loss types include KL divergence, contradicting the DAPO and Dr.GRPO papers
  2. It has inconsistencies in mathematical formulas for GRPO
  3. It lacks guidance on when to set β=0 vs β>0

Changes made:

  • Corrected GRPO formula with proper normalization
  • Clarified that DAPO and Dr.GRPO exclude KL term (β=0)
  • Added a "To KL Or Not To KL" section explaining when KL can be omitted
  • Improved formula presentation by separating KL from base loss term

References:

  • DAPO: https://arxiv.org/abs/2503.14476 (Section 2.3)
  • Dr.GRPO: https://huggingface.co/papers/2503.20783

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


@qgallouedec (Member) left a comment:

Thank you very much for improving the documentation. I still have some concerns about clarity and reading flow; I've left some comments.

@@ -106,7 +106,7 @@ Note that compared to the original formulation in [DeepSeekMath: Pushing the Lim
In the original paper, this formulation is generalized to account for multiple updates after each generation (denoted \\( \mu \\), can be set with `num_iterations` in [`GRPOConfig`]) by leveraging the **clipped surrogate objective**:

$$
- \mathcal{L}_{\text{GRPO}}(\theta) = - \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \left[ \min \left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})} \hat{A}_{i,t}, \, \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right) - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right],
+ \mathcal{L}_{\text{GRPO}}(\theta) = - \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ \min \left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})} \hat{A}_{i,t}, \, \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right) - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right],
@qgallouedec (Member) commented:

This would make the equation inconsistent with the previous one:
[Screenshot: 2025-05-01 at 10 44 49]

@JenWei0312 (Author) replied on May 4, 2025:

You're right that this creates an inconsistency between the formulas. My intention was to make the second formula match the original DeepSeekMath paper's definition of GRPO, which uses sequence-level normalization (-1/G ∑ 1/|oi| ∑).

I noticed that in your implementation, you have three different loss types:

if self.loss_type == "grpo":
    loss = ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)).mean()
elif self.loss_type == "bnpo":
    loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
elif self.loss_type == "dr_grpo":
    loss = (per_token_loss * completion_mask).sum() / (per_token_loss.size(0) * self.max_completion_length)

This suggests that the true GRPO implementation uses the original formula from the paper, while the first formula in the documentation seems to be describing a modified version. Would it make more sense to clarify that the first formula is actually describing the token-level normalization variant ('bnpo') rather than standard GRPO?
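To make the difference concrete, here is a toy sketch (made-up tensors, not the actual trainer code) showing that the three normalizations above can give different loss values on the same batch:

import torch

# Two completions with lengths 2 and 4; masked positions are ignored.
per_token_loss = torch.tensor([[3., 1., 0., 0.],
                               [1., 1., 1., 1.]])
completion_mask = torch.tensor([[1., 1., 0., 0.],
                                [1., 1., 1., 1.]])
max_completion_length = 4

# "grpo": average within each sequence, then across sequences
grpo = ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)).mean()
# "bnpo": divide by the total number of unmasked tokens in the batch
bnpo = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
# "dr_grpo": divide by a constant, batch size * max completion length
dr_grpo = (per_token_loss * completion_mask).sum() / (per_token_loss.size(0) * max_completion_length)

print(grpo.item(), bnpo.item(), dr_grpo.item())  # 1.5, 1.333..., 1.0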

$$

where

$$
l_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\left[\pi_\theta(o_{i,t} \mid q, o_{i,< t})\right]_{\text{no grad}}} \hat{A}_{i,t} - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right].
@qgallouedec (Member) commented:

A bit the same here: I think it creates confusion. We mention the token-level normalisation, but when comparing the equations the KL term is also gone, without any explanation.
[Screenshot: 2025-05-01 at 10 47 48]

@JenWei0312 (Author) replied:

The removal of the KL term in the DAPO and Dr. GRPO formulations is explicitly stated in the original papers. In the DAPO paper (Section 2.3), the authors state: "during training the long-CoT reasoning model, the model distribution can diverge significantly from the initial model, thus this restriction is not necessary. Therefore, we will exclude the KL term from our proposed algorithm."

Similarly, the Dr. GRPO paper states on page 6 that we can "remove the KL term" when using rule-based verifiers. I wanted to make this distinction clear in the documentation since it represents an important theoretical difference between these algorithms.

I also attached both papers.
DAPO.pdf
Dr_GRPO.pdf

Besides differing in how each output's loss is weighted by its length, both DAPO ([DAPO: An Open-Source LLM Reinforcement Learning System at Scale](https://arxiv.org/abs/2503.14476)) and Dr. GRPO ([Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783)) exclude the KL term, i.e., set $\beta = 0$ in $\mathcal{L}_{\text{GRPO}}(\theta)$, because

1) during training of a long-CoT reasoning model, the model distribution can diverge significantly from the initial model, so this restriction is not necessary; and
2) RL-tuning of reasoning models typically employs rule-based verifiers as the reward $r$ ([Tulu 3: Pushing Frontiers in Open Language Model Post-Training](https://arxiv.org/abs/2411.15124)), eliminating the concerns of distributional shift.
@qgallouedec (Member) commented:

I'm not sure how rule-based reward functions prevent distributional shift. Can you elaborate on this?

@JenWei0312 (Author) replied:

Regarding rule-based verifiers preventing distributional shift concerns: The key insight from the Dr. GRPO paper is that rule-based verifiers (like math answer checkers) provide accurate rewards regardless of how the policy distribution changes. Unlike learned reward models which are only accurate near their training distribution, rule-based verifiers are deterministic functions that don't suffer from distribution shift problems. This is why the papers argue it's safe to set β=0 when using rule-based rewards.
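To illustrate what is meant by a rule-based verifier here, a hypothetical example (not TRL's reward-function API, just a sketch of the idea): a reward that string-matches a boxed final answer against a reference is a fixed, deterministic rule, so its accuracy does not degrade as the policy drifts away from the reference model.

import re

def boxed_answer_reward(completions, ground_truths):
    """Toy rule-based verifier: 1.0 if the \\boxed{...} answer matches the reference, else 0.0."""
    rewards = []
    for completion, truth in zip(completions, ground_truths):
        match = re.search(r"\\boxed\{(.*?)\}", completion)
        answer = match.group(1).strip() if match else None
        # The check is a fixed rule on the text, independent of which policy produced it.
        rewards.append(1.0 if answer == truth.strip() else 0.0)
    return rewards

print(boxed_answer_reward([r"... so the answer is \boxed{42}."], ["42"]))  # [1.0]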

@JenWei0312 (Author) commented:

Thank you for your comments! My goal with these changes is to make the documentation accurately reflect the original research papers. There are a few key issues I'm trying to fix:

  1. Formula inconsistency: The original DeepSeekMath GRPO normalizes each sequence by its own length and then averages across sequences (-1/G ∑ 1/|oi| ∑), while DAPO uses batch-level normalization over all tokens (-1/∑|oi| ∑∑). The current documentation mixes these approaches.

  2. KL divergence: Both DAPO and Dr. GRPO papers explicitly state they remove the KL term (β=0). From DAPO paper section 2.3: 'during training the long-CoT reasoning model, the model distribution can diverge significantly from the initial model, thus this restriction is not necessary.' The Dr. GRPO paper similarly states on page 6 that for rule-based verifiers, we can 'remove the KL term.'

  3. Regarding rule-based verifiers preventing distributional shift concerns: Unlike learned reward models which are only accurate near their training distribution, rule-based verifiers (like checking if a math answer is correct) provide accurate rewards regardless of how far the policy drifts from the reference model. This is why both papers argue it's safe to set β=0 in this context.

My changes aim to make the documentation clearly reflect these important differences between the algorithms, which are implemented in the code but not clearly explained in the current documentation.
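If it helps to make these differences actionable, here is a minimal configuration sketch (assuming the `beta`, `loss_type`, and `max_completion_length` arguments of `GRPOConfig`; the exact names and defaults should be checked against the current API, and the β value below is only a placeholder):

from trl import GRPOConfig

# GRPO as in DeepSeekMath: sequence-level normalization, KL penalty kept (beta > 0).
grpo_cfg = GRPOConfig(loss_type="grpo", beta=0.04)

# DAPO-style: token-level (batch) normalization, KL term removed (beta = 0).
dapo_cfg = GRPOConfig(loss_type="bnpo", beta=0.0)

# Dr.GRPO-style: constant normalization by G * max_completion_length, KL term removed (beta = 0).
dr_grpo_cfg = GRPOConfig(loss_type="dr_grpo", beta=0.0, max_completion_length=1024)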

@JenWei0312 (Author) commented:

Hi @qgallouedec

I wanted to follow up on my previous responses about the KL divergence term and formula inconsistencies. I hope my explanations helped clarify my reasoning for the proposed changes.

After reflecting further on the documentation, I'm wondering if a more comprehensive restructuring might address these issues more effectively. Would you be open to an approach that clearly separates the theoretical background from the implementation details? Something like:

Research Papers Background

  • GRPO (DeepSeekMath): Original formulation with sequence-level normalization and KL term
  • DAPO: Token-level normalization without KL term
  • Dr. GRPO: Constant normalization without KL term

Implementation Details

if self.loss_type == "grpo":
    loss = ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)).mean()
elif self.loss_type == "bnpo":
    loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
elif self.loss_type == "dr_grpo":
    loss = (per_token_loss * completion_mask).sum() / (per_token_loss.size(0) * self.max_completion_length)
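As a compact summary for the "Research Papers Background" part above (only restating the normalizations already discussed, with $G$ completions per prompt, $L_{\max}$ denoting `max_completion_length`, and $l_{i,t}$ the per-token loss), each expression corresponds to one branch of the `loss_type` code:

$$
\underbrace{\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} l_{i,t}}_{\texttt{grpo}} \qquad
\underbrace{\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|} l_{i,t}}_{\texttt{bnpo}\ \text{(DAPO)}} \qquad
\underbrace{\frac{1}{G\,L_{\max}}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|} l_{i,t}}_{\texttt{dr\_grpo}}
$$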
