Recalculate Returns and Advantages After Callback to Ensure Reward Consistency (common/on_policy_algorithm.py)#2000
Open
mhyrzt wants to merge 4 commits intoDLR-RM:masterfrom
Open
Recalculate Returns and Advantages After Callback to Ensure Reward Consistency (common/on_policy_algorithm.py)#2000mhyrzt wants to merge 4 commits intoDLR-RM:masterfrom
mhyrzt wants to merge 4 commits intoDLR-RM:masterfrom
Conversation
Added a second call to `rollout_buffer.compute_returns_and_advantage()` after invoking the `on_rollout_end()` callback. This adjustment accounts for any potential modifications to the rewards that might occur within custom callback logic. By recalculating returns and advantages, we ensure that the training process reflects the most up-to-date reward structure, improving accuracy and stability of the advantage estimates during the policy update step. This is especially useful when using callbacks that manipulate rewards for custom reward shaping or augmentation.
Author
|
Hi again, Before I begin, I just wanted to apologize for any headaches and for being a bit of a noob since I rarely contribute to projects. I’ve just run $ make commit-checks
# Sort imports
ruff check --select I stable_baselines3/ tests/ docs/conf.py setup.py --fix
All checks passed!
# Reformat using black
black stable_baselines3/ tests/ docs/conf.py setup.py
All done! ✨ 🍰 ✨
96 files left unchanged.
mypy stable_baselines3/ tests/ docs/conf.py setup.py
Success: no issues found in 94 source files
# Stop the build if there are Python syntax errors or undefined names
# See https://www.flake8rules.com/
ruff check stable_baselines3/ tests/ docs/conf.py setup.py --select=E9,F63,F7,F82 --output-format=full
All checks passed!
# exit-zero treats all errors as warnings.
ruff check stable_baselines3/ tests/ docs/conf.py setup.py --exit-zero --output-format=concise
All checks passed!Although I’ve added my username to Also, I haven’t written any tests for the suggested bug fix yet. If you think it's necessary, I'll write them ASAP. |
Member
this step is important to discuss the issue and see if a fix/feature is needed or not or if there are better alternatives. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hello There, SB3 Team! 👋
I hope this message finds you well. First and foremost, I want to extend my gratitude for the fantastic work you’ve done on the Stable Baselines3 library. It's truly a remarkable tool and has been invaluable for my projects.
I apologize in advance for any inconsistencies or if I have missed any items in the checklist. Your understanding and guidance are greatly appreciated.
In this pull request, I’ve added a second call to
rollout_buffer.compute_returns_and_advantage()after invoking theon_rollout_end()callback. This adjustment accounts for any potential modifications to the rewards that might occur within custom callback logic. By recalculating returns and advantages, we ensure that the training process reflects the most up-to-date reward structure, improving accuracy and stability of the advantage estimates during the policy update step.This is especially useful when using callbacks that manipulate rewards for custom reward shaping or augmentation.
Description
Recalculated returns and advantages after the
on_rollout_endcallback to handle any reward transformations or manipulations performed by custom callbacks. This ensures that the rollout buffer reflects the updated rewards when computing advantages for the policy update step.Previously,
rollout_buffer.compute_returns_and_advantage()was only called before the callback. If any rewards were modified by the callback duringon_rollout_end(), the advantages would not reflect these changes. By adding a second call tocompute_returns_and_advantage(), we ensure that the correct values are used for training.Code Changes
Motivation and Context
Why is this change required?
This change ensures that if a callback modifies rewards during the
on_rollout_end()phase, the computed returns and advantages are updated accordingly. Without recalculating these values, any changes to the rewards made by the callback would not be reflected in the advantage estimates used for policy updates, potentially leading to suboptimal training performance.By recalculating returns and advantages, we ensure the most accurate representation of the updated rewards, improving stability and effectiveness in training, especially when using reward shaping techniques or custom callbacks.
Types of changes
Checklist
make format(required)make check-codestyleandmake lint(required)make pytestandmake typeboth pass. (required)make doc(required)Note: You can run most of the checks using
make commit-checks.Note: we are using a maximum length of 127 characters per line