🕊️ Migration PPOv2 -> PPO #2174

Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
While working on this, I noticed that most of our documentation and notebooks are based on PPO and may be outdated, relying on older datasets, models, references, etc. I believe updating all of our examples and guides falls outside the scope of this PR. I recommend the following:
@@ -1,4 +1,4 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
This one is just the renaming of `ppov2_config`; the diff view makes no sense.
@@ -11,1271 +11,53 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
This one is just `tests/test_ppov2_trainer.py` renamed; the diff view makes no sense.
LGTM apart from a comment about the examples and one change.
Thanks for the migration @qgallouedec ! Overall LGTM with some nits about the docs since we can take this opportunity to nuke a bunch of outdated stuff
horizon: float = 10000.0
gamma: float = 1.0
lam: float = 0.95
exp_name: str = os.path.basename(__file__)[: -len(".py")]
Perhaps it's a bug in the GitHub diff, but this looks repeated from above.
Co-authored-by: Edward Beeching <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: lewtun <[email protected]>
Would it be possible to keep the "old" PPO code around until the new trainer achieves feature parity (in particular peft and arbitrary reward support) with the old one?
I would also like the old PPO code kept around. I was previously using the old `ppo_trainer.step()` function to provide my own reward values for each query/response pair, but there doesn't seem to be an equivalent in PPOv2, unless I'm missing some other way of doing this?
You should use `trl==0.11`.
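For context, here is a minimal sketch of the older reward-injection workflow being discussed, assuming `trl==0.11` (the last release that ships the old `PPOTrainer.step()` API); the checkpoint, prompts, and reward values below are placeholders:

```python
# pip install trl==0.11
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Placeholder policy model with a value head, as required by the old PPOTrainer.
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(
    config=PPOConfig(batch_size=2, mini_batch_size=1),
    model=model,
    tokenizer=tokenizer,
)

queries = [
    tokenizer("The movie was", return_tensors="pt").input_ids.squeeze(0),
    tokenizer("The food tasted", return_tensors="pt").input_ids.squeeze(0),
]
responses = [
    ppo_trainer.generate(
        q, return_prompt=False, max_new_tokens=16, pad_token_id=tokenizer.eos_token_id
    ).squeeze(0)
    for q in queries
]

# Arbitrary, user-computed rewards: one scalar tensor per query/response pair.
rewards = [torch.tensor(1.0), torch.tensor(-0.5)]

stats = ppo_trainer.step(queries, responses, rewards)
```

The relevant part is the third argument to `step()`: the rewards are supplied by the caller, which is the "arbitrary reward support" the comments above are asking about.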
What does this PR do?
Follows #2016
Fixes # (issue)
It basically:
- removes the old `PPO` implementation (including trainer, config, tests, examples, but not all, see my comment)
- renames `PPOv2` to `PPO` (in trainer, config, examples, tests, doc, etc.)
- keeps deprecated `PPOv2` aliases for `PPO` (see the import sketch after this list)
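For illustration, a minimal before/after import sketch of what the rename means for downstream code; this is an assumption based on the naming in the diff (the `v2` suffix dropping from the trainer and config classes), not something spelled out explicitly in the PR description:

```python
# Before this PR (trl <= 0.11): the refactored trainer is exposed under the "v2" names.
from trl import PPOv2Config, PPOv2Trainer

# After this PR: the same trainer and config are exposed under the plain PPO names.
from trl import PPOConfig, PPOTrainer
```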
I've done my best, but it may break some code. To put things in perspective, PPO isn't as popular as it used to be: at the time of writing, the last PPO model created on the Hub was two weeks ago.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.