
[Feature Request] independently configurable learning rates for actor and critic #338

Open
stheid opened this issue Mar 3, 2021 · 11 comments
Labels
enhancement (New feature or request), help wanted (Help from contributors is welcomed)

Comments

@stheid
Contributor

stheid commented Mar 3, 2021

🚀 Feature

independently configurable learning rates for actor and critic in AC-style algorithms

Motivation

In the literature the actor is often configured to learn more slowly, so that the critic's responses are more reliable. At the very least it would be nice if I could allow my hyperparameter optimizer to decide which learning rates it wants to use for the actor and the critic.

Pitch

class DDPG(TD3):
    """
    Deep Deterministic Policy Gradient (DDPG).

    Deterministic Policy Gradient: http://proceedings.mlr.press/v32/silver14.pdf
    DDPG Paper: https://arxiv.org/abs/1509.02971
    Introduction to DDPG: https://spinningup.openai.com/en/latest/algorithms/ddpg.html

    Note: we treat DDPG as a special case of its successor TD3.

    :param policy: The policy model to use (MlpPolicy, CnnPolicy, ...)
    :param env: The environment to learn from (if registered in Gym, can be str)
    :param learning_rate: learning rate for adam optimizer,
        the same learning rate will be used for all networks (Q-Values, Actor and Value function)
        it can be a function of the current progress remaining (from 1 to 0)
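
For concreteness, the request is for something along these lines (actor_learning_rate / critic_learning_rate are hypothetical parameter names, not an existing SB3 API; they would fall back to the shared learning_rate when omitted):

# Hypothetical interface sketch, not an existing SB3 API
model = DDPG(
    "MlpPolicy",
    "Pendulum-v1",
    learning_rate=1e-3,         # shared default, as today
    actor_learning_rate=1e-4,   # hypothetical: used for the actor optimizer
    critic_learning_rate=1e-3,  # hypothetical: used for the critic (Q-network) optimizer
)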

Additional context

https://spinningup.openai.com/en/latest/algorithms/ddpg.html#documentation-pytorch-version

@stheid stheid added the enhancement (New feature or request) label Mar 3, 2021
@Webbah

Webbah commented Mar 3, 2021

Seems like in the original Silver paper they use different learning rates in the application, too:

[screenshot of the experiment details in the DDPG paper: learning rates of 1e-4 for the actor and 1e-3 for the critic]

https://arxiv.org/abs/1509.02971 p.11

@araffin araffin changed the title [Feature Request] request title [Feature Request] independently configurable learning rates for actor and critic Mar 3, 2021
@Miffyli
Collaborator

Miffyli commented Mar 3, 2021

Since the original DDPG paper used this approach, I think this should be included here as well down the line. However, I am not sure about including it outside DDPG: TD3 did not use this as far as I am aware (@araffin ?) and tbh I have never seen this type of trickery before (do share references if other algorithms use this).

Shared parameters also become a problem here, and I am not sure what the behaviour would be like. Quickly reading through the DDPG paper, I did not see a note on whether they shared parameters between the actor and critic, so I assume they had separate networks.

@araffin
Member

araffin commented Mar 4, 2021

Hello,

independently configurable learning rates for actor and critic in AC-style algorithms

so you are proposing that for DDPG, TD3 and SAC?
(it does not apply to PPO/A2C as the value and policy share the same optimizer)

In the literature the actor is often configured to learn more slowly, so that the critic's responses are more reliable.
I could allow my hyperparameter optimizer to decide which learning rates it wants to use for the actor and the critic.

I have mixed feelings about that.
If your remark is only about DDPG, then I can only highly recommend using TD3 (DDPG + some tricks to stabilize it), which integrates "delayed" updates exactly for that purpose.
You can also take a look at SAC or, even better, TQC, which is the current state-of-the-art algorithm for continuous control and is implemented in our contrib repo: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

On one hand, the request seems reasonable. On the other hand, this would add complexity (we would need to support a schedule for each of them) and won't necessarily help your hyperparameter optimization, as it would make the search space wider.

@stheid
Contributor Author

stheid commented Mar 5, 2021

The search space does not get that much more complex. I would have a "base learning rate" and an "actor learning rate scale" to change the base learning rate for one of the networks by a factor.

Modern black-box optimizers like Optuna are quite efficient at learning that.
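
A rough sketch of that parameterization with Optuna (the bounds and the train_and_evaluate helper are made up for illustration):

import optuna


def train_and_evaluate(actor_lr: float, critic_lr: float) -> float:
    # Placeholder: train the agent with these two learning rates
    # and return its mean episode reward.
    return 0.0


def objective(trial: optuna.Trial) -> float:
    base_lr = trial.suggest_float("base_lr", 1e-5, 1e-2, log=True)
    # Scale the base learning rate for the actor only
    actor_lr_scale = trial.suggest_float("actor_lr_scale", 0.05, 1.0, log=True)
    return train_and_evaluate(actor_lr=base_lr * actor_lr_scale, critic_lr=base_lr)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)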

For research it is a bit of a problem to use stable-baselines, as I cannot reproduce the work of other publications, since most of the literature does exactly that. TD3 is nice, but using a different algorithm is less comparable. To be totally honest, I have not looked that much into TD3 though.

The pointer to the contrib repo is definitely interesting, and I totally understand if you don't deem this useful for the purpose of this library. However, I might need to switch to Acme then....

Thanks for your fast reply in any case.

@araffin
Member

araffin commented Mar 5, 2021

I cannot reproduce the work of other publications, since most of the literature does exactly that. TD3 is nice, but using a different algorithm is less comparable.

TD3 is literally DDPG + tricks, so it will give you results at least as good as DDPG and usually better (or much better).
So if you use TD3 for your baselines (when reproducing previous work), I would say it is still a fair comparison.

The pointer to the contrib repo is definitely interesting, and I totally understand if you don't deem this useful for the purpose of this library. However, I might need to switch to Acme then....

To sum up the discussion: you would like to have two different learning rates, for performance and comparison reasons.
My main concern is that such decoupling of learning rates is usually not needed, especially with the most recent algorithms (DDPG was published in 2015), and may therefore add unnecessary complexity to our codebase.

But if you really need two different learning rates, you can simply define a custom DDPG class that overrides that method:

def _update_learning_rate(self, optimizers: Union[List[th.optim.Optimizer], th.optim.Optimizer]) -> None:
    """
    Update the optimizers learning rate using the current learning rate schedule
    and the current progress remaining (from 1 to 0).

    :param optimizers:
        An optimizer or a list of optimizers.
    """
    # Log the current learning rate
    logger.record("train/learning_rate", self.lr_schedule(self._current_progress_remaining))
    if not isinstance(optimizers, list):
        optimizers = [optimizers]
    for optimizer in optimizers:
        update_learning_rate(optimizer, self.lr_schedule(self._current_progress_remaining))

called here:

# Update learning rate according to lr schedule
self._update_learning_rate([self.actor.optimizer, self.critic.optimizer])

that would look like:

class CustomDDPG(DDPG):
    def __init__(self, policy, env, *args, actor_lr=1e-5, critic_lr=1e-4, **kwargs):
        super().__init__(policy, env, *args, **kwargs)
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr

    def _update_learning_rate(self, optimizers):
        actor_optimizer, critic_optimizer = optimizers
        ...  # update the learning rate using different values if needed
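
Filled in, that override could look as follows (a sketch assuming constant per-network learning rates and the update_learning_rate helper from stable_baselines3.common.utils; the learning-rate logging done by the base method is omitted):

from stable_baselines3 import DDPG
from stable_baselines3.common.utils import update_learning_rate


class CustomDDPG(DDPG):
    def __init__(self, policy, env, *args, actor_lr=1e-5, critic_lr=1e-4, **kwargs):
        super().__init__(policy, env, *args, **kwargs)
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr

    def _update_learning_rate(self, optimizers):
        # DDPG/TD3 pass [self.actor.optimizer, self.critic.optimizer] (see the call site above)
        actor_optimizer, critic_optimizer = optimizers
        # Apply constant, independent learning rates instead of the shared schedule
        update_learning_rate(actor_optimizer, self.actor_lr)
        update_learning_rate(critic_optimizer, self.critic_lr)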

@stheid
Contributor Author

stheid commented Mar 5, 2021

Thanks for the help.

Just one more general note: you call it DDPG, but it's actually a variant of the original implementation. This is a little misleading in my opinion. Of course it's perfectly valid to do so, but maybe it would be generally useful if the documentation clarified which exact design changes were made with respect to the source publication.

I am still not convinced, but no one prevents me from forking, so I don't think further discussion would be a valuable use of time, as I fully understand your standpoint :)

@stheid stheid closed this as completed Mar 5, 2021
@araffin araffin reopened this Jun 21, 2023
@araffin
Member

araffin commented Jun 21, 2023

Re-opening as it might be helpful to be able to tune qf_learning_rate as done in https://github.com/araffin/sbx.
We need to see how much complexity it adds though.

@jake321southall

jake321southall commented Nov 22, 2023

Hello,

independently configurable learning rates for actor and critic in AC-style algorithms

so you are proposing that for DDPG, TD3 and SAC? (it does not apply to PPO/A2C as the value and policy share the same optimizer)

Hi, I want to use different learning rates for the policy and value networks in PPO. Can I not do this because they share the optimizer? If so, can I change the optimizer so they can, or would you not recommend this? I'm using SB3 1.6.2, fyi.

@araffin
Member

araffin commented Nov 29, 2023

can I not do this because they share the optimizer?

yes

If so, can I change the optimizer so they can or would you not recommend this?

you can give it a try (you will need to fork SB3 for that), please report any result here ;)

EDIT: it will probably not work if the actor and critic share weights
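
For illustration only (plain PyTorch, not the SB3 API): when the policy and value networks do not share weights, a single optimizer can still assign them different learning rates via parameter groups, which is roughly what a fork would need to set up:

import torch as th
import torch.nn as nn

# Toy policy and value networks with no shared weights
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

# One optimizer, two parameter groups with independent learning rates
optimizer = th.optim.Adam(
    [
        {"params": policy_net.parameters(), "lr": 3e-4},
        {"params": value_net.parameters(), "lr": 1e-3},
    ]
)

With shared weights, the shared parameters would fall into both groups (which PyTorch rejects), so this approach indeed would not work out of the box in that case.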

@cgr71ii

cgr71ii commented Jan 31, 2024

Seems like in the original Silver paper they use different learning rates in the application, too:

Just one more general note: you call it DDPG, but it's actually a variant of the original implementation. This is a little misleading in my opinion. Of course it's perfectly valid to do so, but maybe it would be generally useful if the documentation clarified which exact design changes were made with respect to the source publication.

Hello! I came to this issue for exactly this reason, and I totally agree that a note in the docs about this difference from the original DDPG would prevent people from thinking that the same paper hyperparameters are being used (as indicated in #1562 (comment)). Is this the only difference from the paper hyperparameters?

Also, I'd like to point out why it would be useful in my case to include this change in DDPG. I need to use DDPG in a paper, and I'd like to use the original hyperparameters, but I need DDPG and not TD3 because I want to use the Wolpertinger architecture, which is built on DDPG in the original paper (https://arxiv.org/pdf/1512.07679.pdf).

Is this feature request still under consideration (even though some DDPG hyperparameters have recently been changed #1785)?

@araffin
Member

araffin commented Feb 2, 2024

Is this feature request still under consideration

yes, but help from contributors is needed.
Otherwise, you can have a look at SBX: https://github.com/araffin/sbx

Is this the only difference from the paper hyperparameters?

I think so.
The rest is defined in the RL Zoo (action noise, and obs normalization).
