Conversation

@Muennighoff
Muennighoff and others added 8 commits October 7, 2025 11:02
Added humanline logic for log ratio calculation and clipping.
Added a new callback to synchronize the model with a reference model during training.
Added HumanlineSyncRefModelCallback to support humanline synchronization.
Added humanline support for DPO and KTO.
@kashif kashif self-requested a review October 15, 2025 08:26
@lewtun
Member

lewtun commented Oct 15, 2025

Thanks @Muennighoff ! Do you have some datasets we can test this on?

@Muennighoff
Author

Thanks! For GRPO on math, we used just MATH500 (this repo: https://github.com/kawine/open-r1-humanline). Maybe @sijial430 can comment regarding KTO/DPO.

@sijial430

For instruction following, we mainly use princeton-nlp/llama3-ultrafeedback-armorm (repo: https://github.com/ContextualAI/HALOs/tree/research). Thanks!

@kawine
Contributor

kawine commented Oct 19, 2025

To provide some more context: all our math experiments were run in HF's open-r1 repo by subclassing GRPOTrainer. All the major changes are in this file, aside from adding some new arguments to configs.py (prefixed with 'humanline') and adding some recipes to the recipes/ folder. The results, as discussed in the paper, are:

[Plots: accuracy_reward and format_reward over training]

Our instruction-following experiments were done in the research branch of the HALOs repo.
We are working on some runs to ensure that the offline -> offline+humanline improvements like the one below for gemma2-27B also hold up in TRL.

[Screenshot: offline vs. offline+humanline comparison for gemma2-27B]

kl = (policy_KL_logps - reference_KL_logps).mean().detach()
if self.humanline:
    policy_KL_logps.clamp_(min=self.humanline_log_eps_P, max=self.humanline_log_eps_R)
    reference_KL_logps.clamp_(min=self.humanline_log_eps_P, max=self.humanline_log_eps_R)
Contributor
@sijial430 This is incorrect. The logp ratios should be clamped, not the raw logps.
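A minimal, dependency-free sketch of the fix being suggested here, assuming the reviewer's intent: compute the per-token log ratio (policy minus reference) first, and clamp that ratio, not the raw log-probabilities. The function name `humanline_kl` and its plain-list signature are hypothetical, introduced only for illustration; the PR itself operates on torch tensors.

```python
def humanline_kl(policy_KL_logps, reference_KL_logps,
                 log_eps_P, log_eps_R, humanline=True):
    """Hypothetical sketch: clamp the log *ratios*, not the raw logps."""
    # Per-token log ratio between policy and reference.
    ratios = [p - r for p, r in zip(policy_KL_logps, reference_KL_logps)]
    if humanline:
        # Humanline clipping is applied to the ratio itself.
        ratios = [min(max(x, log_eps_P), log_eps_R) for x in ratios]
    return sum(ratios) / len(ratios)
```

With clipping enabled, ratios outside [log_eps_P, log_eps_R] are pulled back to the bounds before averaging, which is what the snippet above fails to do when it clamps `policy_KL_logps` and `reference_KL_logps` directly.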


@staticmethod
def get_batch_logps(
    self,
Contributor
@sijial430 get_batch_logps should be kept a static method. humanline should be passed as an argument.
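A sketch of this suggestion, assuming the reviewer's intent: keep `get_batch_logps` a `@staticmethod` (no `self`) and plumb `humanline` through as an explicit argument. The class name, the plain-list input, and the unused-flag placement are assumptions for illustration only; in the PR the method works on torch tensors, and the humanline clipping itself is applied to log ratios downstream, as noted in the comment above.

```python
class TrainerSketch:
    """Illustrative only; not TRL's actual API."""

    @staticmethod
    def get_batch_logps(batch_per_token_logps, humanline=False):
        # Sum per-token log-probabilities into one logp per sequence.
        # `humanline` arrives as an explicit argument rather than being
        # read from `self`, so the method can stay a @staticmethod; the
        # humanline-specific clipping happens on log ratios elsewhere.
        return [sum(seq) for seq in batch_per_token_logps]
```

Keeping the method static also keeps it trivially testable, since it depends only on its arguments and not on trainer state.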

@kashif
Collaborator

kashif commented Oct 22, 2025

@Muennighoff will you open another PR or just fix this one?

@sijial430

@kashif Thanks! We will open a new PR soon.
