[Feature Request] independently configurable learning rates for actor and critic #338
Comments
Seems like in the original Silver paper they use different learning rates in the application, too:
Since the original DDPG paper used this approach, I think this should be included here as well down the line; however, I am not sure about including it outside DDPG: TD3 did not use this as far as I am aware (@araffin ?) and tbh I have never seen this type of trickery before (do share references if other algorithms use this). Shared parameters also become a problem here, and I am not sure what the behaviour would be like. Quickly reading through the DDPG paper, I did not see a note on whether they shared parameters between actors and critics, so I assume they had separate networks.
Hello,
so you are proposing that for DDPG, TD3 and SAC?
I have mixed feelings about that. On one hand, the request seems reasonable. On the other hand, this would add complexity (we would need to support a schedule for each of them) and it won't necessarily help your hyperparameter optimization, as it would make the search space wider.
The search space does not get that much more complex. I would have a "base learning rate" and an "actor learning rate scale" that changes the base learning rate for one of the networks by a factor. Modern black-box optimizers like Optuna are quite efficient at learning that. For research it is a bit of a problem to use stable-baselines, as I cannot reproduce the work of other publications; most of the literature does exactly that. TD3 is nice, but using a different algorithm is less comparable. To be totally honest, I have not looked that much into TD3 though. The pointer to the contrib part is definitely interesting, and I totally understand if you don't deem it useful for the purpose of this library. However, I might need to switch to Acme then... Thanks for your fast reply in any case.
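For concreteness, a minimal sketch of what that two-parameter search space could look like with Optuna; train_and_evaluate is a hypothetical stand-in for the actual training and evaluation loop, and the value ranges are only illustrative:

    import optuna

    def objective(trial):
        # Base learning rate shared by both networks, sampled on a log scale.
        base_lr = trial.suggest_float("base_lr", 1e-5, 1e-3, log=True)
        # Multiplicative factor applied to the actor only (often < 1, so the actor learns more slowly).
        actor_lr_scale = trial.suggest_float("actor_lr_scale", 0.1, 1.0, log=True)

        critic_lr = base_lr
        actor_lr = base_lr * actor_lr_scale

        # Hypothetical helper: trains an agent with these rates and returns the mean episode return.
        return train_and_evaluate(actor_lr=actor_lr, critic_lr=critic_lr)

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)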
TD3 is literally DDPG + tricks, so it will give you at least as good results as DDPG and usually better (or much better).
To sum up the discussion: you would like to have two different learning rates for performance and comparison reasons. But if you really need two different learning rates, you can simply define a custom DDPG class that overrides the _update_learning_rate method (stable-baselines3/stable_baselines3/common/base_class.py, lines 245 to 259 in 65100a4),
which is called here: stable-baselines3/stable_baselines3/td3/td3.py, lines 129 to 130 in 65100a4.
That would look like:

    from stable_baselines3 import DDPG

    class CustomDDPG(DDPG):
        def __init__(self, policy, env, *args, actor_lr=1e-5, critic_lr=1e-4, **kwargs):
            super().__init__(policy, env, *args, **kwargs)
            self.actor_lr = actor_lr
            self.critic_lr = critic_lr

        def _update_learning_rate(self, optimizers):
            actor_optimizer, critic_optimizer = optimizers
            ...  # update the learning rates using different values if needed
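For reference, one possible complete version of that sketch (untested), assuming constant, non-scheduled learning rates and using the existing update_learning_rate helper from stable_baselines3.common.utils:

    from stable_baselines3 import DDPG
    from stable_baselines3.common.utils import update_learning_rate

    class CustomDDPG(DDPG):
        def __init__(self, policy, env, *args, actor_lr=1e-5, critic_lr=1e-4, **kwargs):
            super().__init__(policy, env, *args, **kwargs)
            self.actor_lr = actor_lr
            self.critic_lr = critic_lr

        def _update_learning_rate(self, optimizers):
            # DDPG/TD3 call this with [actor.optimizer, critic.optimizer].
            actor_optimizer, critic_optimizer = optimizers
            # Apply a separate constant learning rate to each optimizer
            # (supporting schedules would need one schedule per network instead).
            update_learning_rate(actor_optimizer, self.actor_lr)
            update_learning_rate(critic_optimizer, self.critic_lr)

Usage would then be, for example, CustomDDPG("MlpPolicy", env, actor_lr=1e-4, critic_lr=1e-3).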
Thanks for the help. Just one more general note: you call it DDPG, but it's actually a variant of the original implementation. That is a little misleading in my opinion. Of course it's perfectly valid to do so, but maybe it would be generally useful if the documentation clarified which exact design changes were made with respect to the source publication. I am still not convinced, but no one prevents me from forking, hence I don't think further discussion would be a valuable use of time, as I fully understand your standpoint :)
Re-opening, as it might be helpful to be able to tune the two learning rates separately.
Hi, I want to use different learning rates for the policy and value networks in PPO. Can I not do this because they share the optimizer? If so, can I change the optimizer so they can, or would you not recommend this? I'm using SB3 1.6.2, fyi.
yes
You can give it a try (you will need to fork SB3 for that); please report any results here ;) EDIT: it will probably not work if the actor and critic share weights.
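For background, a plain PyTorch optimizer can already hold separate learning rates via parameter groups; the SB3-specific difficulty is that the policy and value networks may share layers and that the built-in learning rate update sets a single value for all parameter groups, which is why a fork is needed. A toy illustration (not SB3 code):

    import torch as th
    import torch.nn as nn

    # Two toy networks standing in for the policy and value heads.
    actor = nn.Linear(4, 2)
    critic = nn.Linear(4, 1)

    # One optimizer, two parameter groups, each with its own learning rate.
    optimizer = th.optim.Adam(
        [
            {"params": actor.parameters(), "lr": 1e-4},
            {"params": critic.parameters(), "lr": 1e-3},
        ]
    )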
Hello! I came to this issue exactly for this, and I totally agree that a note in the docs about this change with respect to the original DDPG would prevent people from thinking that the same paper hyperparameters are being used (as indicated in #1562 (comment)). Is this the only difference from the paper hyperparameters? Also, I'd like to point out why it would be useful in my case to include this change in DDPG. I need to use DDPG in a paper, and I'd like to use the original hyperparameters, but I need DDPG and not TD3 because I want to use the Wolpertinger architecture, which is built on DDPG in the original paper (https://arxiv.org/pdf/1512.07679.pdf). Is this feature request still under consideration (even though some DDPG hyperparameters have recently been changed, #1785)?
yes, but help from contributors is needed.
I think so.
🚀 Feature
Independently configurable learning rates for actor and critic in actor-critic-style algorithms.
Motivation
In the literature, the actor is often configured to learn more slowly, so that the critic's responses are more reliable. At the very least, it would be nice if I could allow my hyperparameter optimizer to decide which learning rates to use for the actor and the critic.
Pitch
stable-baselines3/stable_baselines3/ddpg/ddpg.py, lines 12 to 26 in 65100a4
Additional context
https://spinningup.openai.com/en/latest/algorithms/ddpg.html#documentation-pytorch-version