
[Feature Request] independently configurable learning rates for actor and critic #338

Open
stheid opened this issue Mar 3, 2021 · 11 comments
Labels
enhancement (New feature or request), help wanted (Help from contributors is welcomed)

Comments

@stheid
Contributor

stheid commented Mar 3, 2021

🚀 Feature

independently configurable learning rates for actor and critic in AC-style algorithms

Motivation

In the literature the actor is often configured to learn more slowly, so that the critic's responses are more reliable. At the very least it would be nice if I could allow my hyperparameter optimizer to decide which learning rates it wants to use for the actor and the critic.

Pitch

class DDPG(TD3):
    """
    Deep Deterministic Policy Gradient (DDPG).

    Deterministic Policy Gradient: http://proceedings.mlr.press/v32/silver14.pdf
    DDPG Paper: https://arxiv.org/abs/1509.02971
    Introduction to DDPG: https://spinningup.openai.com/en/latest/algorithms/ddpg.html

    Note: we treat DDPG as a special case of its successor TD3.

    :param policy: The policy model to use (MlpPolicy, CnnPolicy, ...)
    :param env: The environment to learn from (if registered in Gym, can be str)
    :param learning_rate: learning rate for adam optimizer,
        the same learning rate will be used for all networks (Q-Values, Actor and Value function)
        it can be a function of the current progress remaining (from 1 to 0)
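
For concreteness, the request is for something along these lines (actor_learning_rate / critic_learning_rate are hypothetical parameter names, not an existing SB3 API; they would fall back to the shared learning_rate when omitted):

# Hypothetical interface sketch, not an existing SB3 API
model = DDPG(
    "MlpPolicy",
    "Pendulum-v1",
    learning_rate=1e-3,         # shared default, as today
    actor_learning_rate=1e-4,   # hypothetical: used for the actor optimizer
    critic_learning_rate=1e-3,  # hypothetical: used for the critic (Q-network) optimizer
)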

Additional context

https://spinningup.openai.com/en/latest/algorithms/ddpg.html#documentation-pytorch-version

@stheid stheid added the enhancement (New feature or request) label Mar 3, 2021
@Webbah

Webbah commented Mar 3, 2021

Seems like in the original Silver paper they use different learning rates in the application, too:

[screenshot of the experiment details in the DDPG paper: learning rates of 1e-4 for the actor and 1e-3 for the critic]

https://arxiv.org/abs/1509.02971 p.11

@araffin araffin changed the title [Feature Request] request title [Feature Request] independently configurable learning rates for actor and critic Mar 3, 2021
@Miffyli
Collaborator

Miffyli commented Mar 3, 2021

Since the original DDPG paper used this approach, I think this should be included here as well down the line. However, I am not sure about including it outside DDPG: TD3 did not use this as far as I am aware (@araffin ?) and tbh I have never seen this type of trickery before (do share references if other algorithms use this).

Shared parameters also become a problem here, and I am not sure what the behaviour would be like. Quickly reading through the DDPG paper, I did not see a note on whether they shared parameters between the actor and critic, so I assume they had separate networks.

@araffin
Member

araffin commented Mar 4, 2021

Hello,

independently configurable learning rates for actor and critic in AC-style algorithms

so you are proposing that for DDPG, TD3 and SAC?
(it does not apply to PPO/A2C as the value and policy share the same optimizer)

In the literature the actor is often configured to learn more slowly, so that the critic's responses are more reliable.
I could allow my hyperparameter optimizer to decide which learning rates it wants to use for the actor and the critic.

I have mixed feelings about that.
If your remark is only about DDPG, then I can only highly recommend using TD3 (DDPG + some tricks to stabilize it), which integrates "delayed" updates exactly for that purpose.
You can also take a look at SAC or, even better, TQC, which is the current state-of-the-art algorithm for continuous control and is implemented in our contrib repo: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

On one hand, the request seems reasonable. On the other hand, this would add complexity (we would need to support a schedule for each of them) and won't necessarily help your hyperparameter optimization, as it would make the search space wider.

@stheid
Contributor Author

stheid commented Mar 5, 2021

The search space does not get that much more complex. I would have a "base learning rate" and an "actor learning rate scale" to change the base learning rate for one of the networks by a factor.

Modern black-box optimizers like Optuna are quite efficient at learning that.
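
A rough sketch of that parameterization with Optuna (the bounds and the train_and_evaluate helper are made up for illustration):

import optuna


def train_and_evaluate(actor_lr: float, critic_lr: float) -> float:
    # Placeholder: train the agent with these two learning rates
    # and return its mean episode reward.
    return 0.0


def objective(trial: optuna.Trial) -> float:
    base_lr = trial.suggest_float("base_lr", 1e-5, 1e-2, log=True)
    # Scale the base learning rate for the actor only
    actor_lr_scale = trial.suggest_float("actor_lr_scale", 0.05, 1.0, log=True)
    return train_and_evaluate(actor_lr=base_lr * actor_lr_scale, critic_lr=base_lr)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)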

For research it is a bit of a problem to use stable-baselines, as I cannot reproduce the work of other publications, since most of the literature does exactly that. TD3 is nice, but using a different algorithm is less comparable. To be totally honest, I have not looked that much into TD3 though.

The pointer to the contrib repo is definitely interesting, and I totally understand if you don't deem this useful for the purpose of this library. However, I might need to switch to Acme then....

Thanks for your fast reply in any case.

@araffin
Member

araffin commented Mar 5, 2021

I cannot reproduce the work of other publications, since most of the literature does exactly that. TD3 is nice, but using a different algorithm is less comparable.

TD3 is literally DDPG + tricks, so it will give you results at least as good as DDPG and usually better (or much better).
So if you use TD3 for your baselines (when reproducing previous work), I would say it is still a fair comparison.

The pointer to the contrib repo is definitely interesting, and I totally understand if you don't deem this useful for the purpose of this library. However, I might need to switch to Acme then....

To sum up the discussion: you would like to have two different learning rates, for performance and comparison reasons.
My main concern is that such decoupling of learning rates is usually not needed, especially with the most recent algorithms (DDPG was published in 2015), and may therefore add unnecessary complexity to our codebase.

But if you really need two different learning rates, you can simply define a custom DDPG class that overrides that method:

def _update_learning_rate(self, optimizers: Union[List[th.optim.Optimizer], th.optim.Optimizer]) -> None:
    """
    Update the optimizers learning rate using the current learning rate schedule
    and the current progress remaining (from 1 to 0).

    :param optimizers:
        An optimizer or a list of optimizers.
    """
    # Log the current learning rate
    logger.record("train/learning_rate", self.lr_schedule(self._current_progress_remaining))
    if not isinstance(optimizers, list):
        optimizers = [optimizers]
    for optimizer in optimizers:
        update_learning_rate(optimizer, self.lr_schedule(self._current_progress_remaining))

called here:

# Update learning rate according to lr schedule
self._update_learning_rate([self.actor.optimizer, self.critic.optimizer])

that would look like:

class CustomDDPG(DDPG):
    def __init__(self, policy, env, *args, actor_lr=1e-5, critic_lr=1e-4, **kwargs):
        super().__init__(policy, env, *args, **kwargs)
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr

    def _update_learning_rate(self, optimizers):
        actor_optimizer, critic_optimizer = optimizers
        ...  # update the learning rate using different values if needed
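
Filled in, that override could look as follows (a sketch assuming constant per-network learning rates and the update_learning_rate helper from stable_baselines3.common.utils; the learning-rate logging done by the base method is omitted):

from stable_baselines3 import DDPG
from stable_baselines3.common.utils import update_learning_rate


class CustomDDPG(DDPG):
    def __init__(self, policy, env, *args, actor_lr=1e-5, critic_lr=1e-4, **kwargs):
        super().__init__(policy, env, *args, **kwargs)
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr

    def _update_learning_rate(self, optimizers):
        # DDPG/TD3 pass [self.actor.optimizer, self.critic.optimizer] (see the call site above)
        actor_optimizer, critic_optimizer = optimizers
        # Apply constant, independent learning rates instead of the shared schedule
        update_learning_rate(actor_optimizer, self.actor_lr)
        update_learning_rate(critic_optimizer, self.critic_lr)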

@stheid
Contributor Author

stheid commented Mar 5, 2021

Thanks for the help.

Just one more general note: you call it DDPG, but it's actually a variant of the original implementation. This is a little misleading in my opinion. Of course it's perfectly valid to do so, but maybe it would be generally useful if the documentation clarified which exact design changes were made with respect to the source publication.

I am still not convinced, but no one prevents me from forking, so I don't think further discussion would be a valuable use of time, as I fully understand your standpoint :)

@stheid stheid closed this as completed Mar 5, 2021
@araffin araffin reopened this Jun 21, 2023
@araffin
Member

araffin commented Jun 21, 2023

Re-opening as it might be helpful to be able to tune qf_learning_rate as done in https://github.com/araffin/sbx.
We need to see how much complexity it adds though.

@jake321southall

jake321southall commented Nov 22, 2023

Hello,

independently configurable learning rates for actor and critic in AC-style algorithms

so you are proposing that for DDPG, TD3 and SAC? (it does not apply to PPO/A2C as the value and policy share the same optimizer)

Hi, I want to use different learning rates for the policy and value networks in PPO. Can I not do this because they share the optimizer? If so, can I change the optimizer so they can, or would you not recommend this? I'm using SB3 1.6.2, fyi.

@araffin
Member

araffin commented Nov 29, 2023

can I not do this because they share the optimizer?

yes

If so, can I change the optimizer so they can or would you not recommend this?

you can give it a try (you will need to fork SB3 for that), please report any result here ;)

EDIT: it will probably not work if the actor and critic share weights
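
For illustration only (plain PyTorch, not the SB3 API): when the policy and value networks do not share weights, a single optimizer can still assign them different learning rates via parameter groups, which is roughly what a fork would need to set up:

import torch as th
import torch.nn as nn

# Toy policy and value networks with no shared weights
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

# One optimizer, two parameter groups with independent learning rates
optimizer = th.optim.Adam(
    [
        {"params": policy_net.parameters(), "lr": 3e-4},
        {"params": value_net.parameters(), "lr": 1e-3},
    ]
)

With shared weights, the shared parameters would fall into both groups (which PyTorch rejects), so this approach indeed would not work out of the box in that case.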

@cgr71ii

cgr71ii commented Jan 31, 2024

Seems like in the original Silver paper they use different learning rates in the application, too:

Just one more general note: you call it DDPG, but it's actually a variant of the original implementation. This is a little misleading in my opinion. Of course it's perfectly valid to do so, but maybe it would be generally useful if the documentation clarified which exact design changes were made with respect to the source publication.

Hello! I came to this issue for exactly this reason, and I totally agree that a note in the docs about this difference from the original DDPG would prevent people from thinking that the same paper hyperparameters are being used (as indicated in #1562 (comment)). Is this the only difference from the paper hyperparameters?

Also, I'd like to point out why it would be useful in my case to include this change in DDPG. I need to use DDPG in a paper, and I'd like to use the original hyperparameters, but I need DDPG and not TD3 because I want to use the Wolpertinger architecture, which is built on DDPG in the original paper (https://arxiv.org/pdf/1512.07679.pdf).

Is this feature request still under consideration (even though some DDPG hyperparameters have recently been changed #1785)?

@araffin
Member

araffin commented Feb 2, 2024

Is this feature request still under consideration

yes, but help from contributors is needed.
Otherwise, you can have a look at SBX: https://github.com/araffin/sbx

Is this the only difference from the paper hyperparameters?

I think so.
The rest is defined in the RL Zoo (action noise, and obs normalization).
