Fix GRPO/DAPO/Dr.GRPO documentation: formula corrections and KL divergence clarification #3395
base: main
Conversation
This PR clarifies when to use KL divergence (the β parameter) in GRPO/DAPO/Dr.GRPO implementations. The current documentation has several issues:

1. It implies all loss types include KL divergence, contradicting the DAPO and Dr.GRPO papers.
2. It has inconsistencies in the mathematical formulas for GRPO.
3. It lacks guidance on when to set β=0 vs β>0.

Changes made:

- Corrected the GRPO formula with proper normalization
- Clarified that DAPO and Dr.GRPO exclude the KL term (β=0)
- Added a "To KL Or Not To KL" section explaining when KL can be omitted
- Improved formula presentation by separating the KL term from the base loss term

References:

- DAPO: https://arxiv.org/abs/2503.14476 (Section 2.3)
- Dr.GRPO: https://huggingface.co/papers/2503.20783
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thank you very much for improving the documentation. I still have some concerns about the clarity and fluidity of the text; I've left some comments.
```diff
@@ -106,7 +106,7 @@ Note that compared to the original formulation in [DeepSeekMath: Pushing the Lim
 In the original paper, this formulation is generalized to account for multiple updates after each generation (denoted \\( \mu \\), can be set with `num_iterations` in [`GRPOConfig`]) by leveraging the **clipped surrogate objective**:

 $$
-\mathcal{L}_{\text{GRPO}}(\theta) = - \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \left[ \min \left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})} \hat{A}_{i,t}, \, \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right) - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right],
+\mathcal{L}_{\text{GRPO}}(\theta) = - \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ \min \left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})} \hat{A}_{i,t}, \, \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right) - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right],
```
You're right that this creates an inconsistency between the formulas. My intention was to make the second formula match the original DeepSeekMath paper's definition of GRPO, which uses sequence-level normalization (-1/G ∑ 1/|oi| ∑).
I noticed that in your implementation, you have three different loss types:
```python
if self.loss_type == "grpo":
    # Sequence-level normalization: mean over completions of per-completion token means, i.e. (1/G) Σ_i (1/|o_i|) Σ_t
    loss = ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)).mean()
elif self.loss_type == "bnpo":
    # Token-level normalization: total loss over all valid tokens divided by the total token count, i.e. (1/Σ_i |o_i|) Σ_i Σ_t
    loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
elif self.loss_type == "dr_grpo":
    # Dr. GRPO: normalize by a constant (G × max_completion_length), removing the length-dependent bias
    loss = (per_token_loss * completion_mask).sum() / (per_token_loss.size(0) * self.max_completion_length)
```
This suggests that the true GRPO implementation uses the original formula from the paper, while the first formula in the documentation seems to be describing a modified version. Would it make more sense to clarify that the first formula is actually describing the token-level normalization variant ('bnpo') rather than standard GRPO?
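To make the difference concrete, here is a small self-contained sketch using dummy tensors (an illustration, not a quote from the TRL source): the `grpo` branch reproduces the per-sequence normalization $\frac{1}{G}\sum_i \frac{1}{|o_i|}\sum_t$ from the paper, while the `bnpo` branch reproduces the global token normalization $\frac{1}{\sum_i |o_i|}\sum_{i,t}$ of the first formula, and the two diverge as soon as completion lengths differ:

```python
import torch

# Two completions with different lengths: |o_1| = 2 valid tokens, |o_2| = 4.
per_token_loss = torch.tensor([[1.0, 1.0, 0.0, 0.0],
                               [2.0, 2.0, 2.0, 2.0]])
completion_mask = torch.tensor([[1.0, 1.0, 0.0, 0.0],
                                [1.0, 1.0, 1.0, 1.0]])

# "grpo": mean over completions of per-completion means -> (1/G) Σ_i (1/|o_i|) Σ_t
grpo = ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)).mean()

# "bnpo": global token mean -> (1/Σ_i |o_i|) Σ_i Σ_t
bnpo = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)

print(grpo.item())  # 1.5   = (1.0 + 2.0) / 2
print(bnpo.item())  # ~1.667 = (2 + 8) / 6 -> differs once lengths differ
```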
where

$$
l_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\left[\pi_\theta(o_{i,t} \mid q, o_{i,< t})\right]_{\text{no grad}}} \hat{A}_{i,t} - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right].
$$
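As a side note on the $[\,\cdot\,]_{\text{no grad}}$ notation: one common way to realize this ratio is the detach trick, sketched below in plain PyTorch (not a quote from the TRL source). The ratio evaluates to exactly 1 in the forward pass but still carries the $\nabla_\theta \log \pi_\theta$ gradient, so scaling it by $\hat{A}_{i,t}$ yields the desired policy-gradient term:

```python
import torch

logits = torch.randn(1, 5, requires_grad=True)             # toy per-token logits
per_token_logp = torch.log_softmax(logits, dim=-1)[0, 2]   # log pi_theta(o_t | ...)

# Ratio pi_theta / [pi_theta]_no-grad: value is exactly 1, gradient is that of the log-prob.
ratio = torch.exp(per_token_logp - per_token_logp.detach())
advantage = torch.tensor(0.7)

loss = -(ratio * advantage)
loss.backward()
print(ratio.item())   # 1.0
print(logits.grad)    # non-zero: the gradient of -A * log pi_theta flows to the policy
```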
The removal of the KL term in the DAPO and Dr. GRPO formulations is explicitly stated in the original papers. In the DAPO paper (Section 2.3), the authors state: "during training the long-CoT reasoning model, the model distribution can diverge significantly from the initial model, thus this restriction is not necessary. Therefore, we will exclude the KL term from our proposed algorithm."
Similarly, the Dr. GRPO paper states on page 6 that we can "remove the KL term" when using rule-based verifiers. I wanted to make this distinction clear in the documentation since it represents an important theoretical difference between these algorithms.
I also attached both papers.
DAPO.pdf
Dr_GRPO.pdf
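For the practical takeaway, here is a minimal sketch of what this could look like in configuration, assuming `GRPOConfig` exposes `beta`, `loss_type`, and `num_iterations` as discussed in this thread (the values and output path are illustrative only):

```python
from trl import GRPOConfig

# Sketch only: a DAPO / Dr. GRPO style setup that drops the KL penalty.
config = GRPOConfig(
    output_dir="grpo-no-kl",   # illustrative path
    beta=0.0,                  # exclude the KL term, as argued in DAPO Sec. 2.3 and Dr. GRPO
    loss_type="dr_grpo",       # constant-length normalization branch shown above
    num_iterations=1,          # mu = 1: a single update per generation batch
)
```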
Besides how we weight each output's loss by its length, both DAPO ([DAPO: An Open-Source LLM Reinforcement Learning System at Scale](https://arxiv.org/abs/2503.14476)) and Dr. GRPO ([Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783)) exclude the KL term, i.e., set $\beta = 0$ in $\mathcal{L}_{\text{GRPO}}(\theta)$, because

1) during training of the long-CoT reasoning model, the model distribution can diverge significantly from the initial model, so this restriction is not necessary; and furthermore,
2) RL-tuning of reasoning models typically employs rule-based verifiers as the reward $r$ ([Tulu 3: Pushing Frontiers in Open Language Model Post-Training](https://arxiv.org/abs/2411.15124)), eliminating concerns about distributional shift.
I'm not sure how rule-based reward functions prevent distributional shift. Can you elaborate on this?
Regarding rule-based verifiers preventing distributional shift concerns: The key insight from the Dr. GRPO paper is that rule-based verifiers (like math answer checkers) provide accurate rewards regardless of how the policy distribution changes. Unlike learned reward models which are only accurate near their training distribution, rule-based verifiers are deterministic functions that don't suffer from distribution shift problems. This is why the papers argue it's safe to set β=0 when using rule-based rewards.
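For illustration only, a rule-based verifier of the kind these papers refer to might look like the sketch below (the function name and answer format are hypothetical, not TRL API). It is a deterministic function of the completion text, so a given completion always receives the same reward no matter how far the policy has drifted from the reference model:

```python
import re

def math_answer_reward(completion: str, ground_truth: str) -> float:
    """Hypothetical rule-based verifier: the reward depends only on the text produced,
    not on the policy that produced it, so it cannot go 'off-distribution'."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    predicted = match.group(1).strip() if match else None
    return 1.0 if predicted == ground_truth.strip() else 0.0

# The same completion always gets the same reward, regardless of the training step.
assert math_answer_reward(r"... so the answer is \boxed{42}", "42") == 1.0
```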
Thank you for your comments! My goal with these changes is to make the documentation accurately reflect the original research papers. There are a few key issues I'm trying to fix: the implication that every loss type includes the KL term, the inconsistencies between the GRPO formulas, and the missing guidance on when to set β = 0 versus β > 0. My changes aim to make the documentation clearly reflect these important differences between the algorithms, which are implemented in the code but not clearly explained in the current documentation.
Hi @qgallouedec, I wanted to follow up on my previous responses about the KL divergence term and the formula inconsistencies. I hope my explanations helped clarify my reasoning for the proposed changes. After reflecting further on the documentation, I'm wondering if a more comprehensive restructuring might address these issues more effectively. Would you be open to an approach that clearly separates the theoretical background from the implementation details? Something like:

**Research Papers Background**

**Implementation Details**

```python
if self.loss_type == "grpo":
    loss = ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)).mean()
elif self.loss_type == "bnpo":
    loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
elif self.loss_type == "dr_grpo":
    loss = (per_token_loss * completion_mask).sum() / (per_token_loss.size(0) * self.max_completion_length)
```
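To tie the two parts together, here is a hedged sketch of how the β-weighted KL penalty could enter the per-token loss before any of these normalizations. Variable names are assumed from the snippet above and the tensors are toy values; this illustrates one way β = 0 removes the KL term, consistent with the formulas above, and is not a claim about the exact TRL code path:

```python
import torch

# Toy shapes: G = 2 completions, T = 4 tokens each (illustrative only).
ratio = torch.ones(2, 4, requires_grad=True)       # pi_theta / pi_theta_old per token
clipped_ratio = ratio.clamp(1 - 0.2, 1 + 0.2)      # clip(ratio, 1 - eps, 1 + eps)
advantages = torch.randn(2, 1).expand(2, 4)        # A_hat_{i,t}, constant per completion
per_token_kl = torch.zeros(2, 4)                   # stand-in for D_KL[pi_theta || pi_ref] per token
beta = 0.0                                         # DAPO / Dr. GRPO: no KL term

per_token_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)
if beta != 0.0:
    # Only when beta > 0 does the reference-model KL enter the objective.
    per_token_loss = per_token_loss + beta * per_token_kl
```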