
[Bug]: Proposal to Adjust Default n_steps for PPO for Better Outcomes with Small Environments #1846

Closed
5 tasks done
Ian-Sy-Zhang opened this issue Feb 22, 2024 · 6 comments
Labels
custom gym env (Issue related to Custom Gym Env), question (Further information is requested)

Comments

@Ian-Sy-Zhang

🐛 Bug

Hello Stable-Baselines3 Team,

I recently encountered an issue while training an agent on the FrozenLake environment using PPO's default settings. The agent showed no signs of learning, which prompted me to look into the stable_baselines3/ppo/ppo.py source code. I discovered that the default n_steps is set to 2048. This high value means that, in a single-environment setup, the PPO model requires 2048 steps' worth of data before it starts training.
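As a rough illustration (a sketch using gymnasium's FrozenLake-v1 rather than my actual setup), the threshold can be read straight off a freshly constructed model:

from stable_baselines3 import PPO

# Fresh PPO model with default hyperparameters (FrozenLake-v1 used only for illustration).
model = PPO("MlpPolicy", "FrozenLake-v1")

# PPO collects n_steps * n_envs transitions per rollout before each round of updates,
# so with a single environment the first gradient update cannot happen before 2048 steps.
print(model.n_steps)                 # 2048 by default
print(model.n_steps * model.n_envs)  # 2048 with a single environment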

As a beginner, I find Stable-Baselines3 incredibly user-friendly. However, I had generally only adjusted batch_size and learning_rate and was not aware of the impact of n_steps. This oversight led to some unsuccessful training attempts with small sample sizes using the default PPO parameters.

My breakthrough came when I reduced n_steps to 16, which significantly improved the training results. This experience led me to look at similar parameters in other algorithms, such as DQN. I noticed that in pull request #1785, the learning_starts parameter in dqn.py was initially 50000 and was later reduced to 100 after it was recognized as too high compared to other algorithms. This change matches the improved training outcomes I observed on my small sample data when using the default settings of the newer DQN model.
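For comparison, a one-line sketch of the analogous knob on the DQN side (learning_starts is the argument in question; the value below is just the new default mentioned above):

from stable_baselines3 import DQN

# learning_starts plays a role similar to n_steps: DQN fills the replay buffer with this
# many transitions before its first gradient update.
model = DQN("MlpPolicy", "FrozenLake-v1", learning_starts=100)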

Considering these observations, I suggest revising the default n_steps value in ppo.py from 2048 to a smaller number. I believe this would be more beginner-friendly and could improve the usability of Stable-Baselines3, particularly for those working with smaller environments.
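In the meantime, the workaround amounts to passing a smaller n_steps explicitly; a sketch with illustrative values, again on FrozenLake-v1 rather than my own environment:

from stable_baselines3 import PPO

# Explicitly shrinking the rollout (and the batch size to match) so that updates
# start early even in a small, single-environment setting.
model = PPO("MlpPolicy", "FrozenLake-v1", n_steps=16, batch_size=16)
model.learn(total_timesteps=5_000)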

Thank you for considering this adjustment.

Best regards,
Shiyu

To Reproduce

import bug_lib as BL
import subprocess
import os
import sys

sys.path.insert(0, './training_scripts/')
from stable_baselines3 import DQN, PPO, A2C, SAC
from stable_baselines3.common.callbacks import BaseCallback

import training_scripts.Env


def get_PPO_Model(env, model_path=os.path.join('RLTesting', 'logs', 'ppo.zip')):
    if os.path.isfile(model_path):
        print("loading existing model")
        model = PPO.load(model_path, env=env)
    else:
        print("creating new model")
        model = PPO('MlpPolicy', env)
        # new_logger = configure(folder="logs", format_strings=["stdout", "log", "csv", "tensorboard"])
        # model.set_logger(new_logger)
    return model


'''
note here: max_steps=200 so that it's less than the default value of 2048; maybe making 2048 a small value will be better
'''
def train_PPO_model(model, max_steps=200, model_path=os.path.join('RLTesting', 'logs', 'ppo.zip')):
    vec_env = model.get_env()
    vec_env.reset()
    vec_env.render()
    model.learn(max_steps)
    action_state_list = vec_env.envs[0].get_state_action_pairs()
    model.save(model_path)
    vec_env.close()
    return action_state_list


if __name__ == '__main__':
    env = training_scripts.Env.EnvWrapper()
    env.reset()
    model = get_PPO_Model(env=env)
    result = train_PPO_model(model=model)

Relevant log output / Error message

With max_steps=200, which is less than the default n_steps of 2048, the model ends up untrained. Perhaps making the 2048 default a smaller value would be better.

System Info

No response

Checklist

  • My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • I have provided a minimal and working example to reproduce the bug
  • I've used the markdown code blocks for both code and stack traces.
Ian-Sy-Zhang added the bug (Something isn't working) label on Feb 22, 2024
@Ian-Sy-Zhang
Author

Sorry about the formatting of "To Reproduce"; here it is again, properly formatted:

import bug_lib as BL
import subprocess
import os
import sys

sys.path.insert(0, './training_scripts/')
from stable_baselines3 import DQN, PPO, A2C, SAC
from stable_baselines3.common.callbacks import BaseCallback

import training_scripts.Env


def get_PPO_Model(env, model_path=os.path.join('RLTesting', 'logs', 'ppo.zip')):
    if os.path.isfile(model_path):
        print("loading existing model")
        model = PPO.load(model_path, env=env)
    else:
        print("creating new model")
        model = PPO('MlpPolicy', env)
        # new_logger = configure(folder="logs", format_strings=["stdout", "log", "csv", "tensorboard"])
        # model.set_logger(new_logger)
    return model


'''
note here: max_steps=200 so that it's less than the default value of 2048; maybe making 2048 a small value will be better
'''
def train_PPO_model(model, max_steps=200, model_path=os.path.join('RLTesting', 'logs', 'ppo.zip')):
    vec_env = model.get_env()
    vec_env.reset()
    vec_env.render()
    model.learn(max_steps)
    action_state_list = vec_env.envs[0].get_state_action_pairs()
    model.save(model_path)
    vec_env.close()
    return action_state_list


if __name__ == '__main__':
    env = training_scripts.Env.EnvWrapper()
    env.reset()
    model = get_PPO_Model(env=env)
    result = train_PPO_model(model=model)

araffin added the question (Further information is requested) and custom gym env (Issue related to Custom Gym Env) labels and removed the bug (Something isn't working) label on Feb 22, 2024
@araffin
Member

araffin commented Feb 23, 2024

Hello,

not being aware of the impact of n_steps

from our readme: "Note: Despite its simplicity of use, Stable Baselines3 (SB3) assumes you have some knowledge about Reinforcement Learning (RL). You should not utilize this library without some practice. To that extent, we provide good resources in the documentation to get started with RL."

I suggest revising the default n_steps value in ppo.py from 2048 to a smaller number.

The current default PPO values work well in many different scenarios, and compared to DQN, changing the default would probably cause more harm than good.
In your case, the fact that PPO works better with a small number of steps probably means that it would also have worked with the default, just with many more iterations (see the "Tips and Tricks" section of our documentation).

PS: thanks for the code, but it is neither minimal nor working... (see the link in the checklist for the explanation)
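A minimal, self-contained sketch of the kind of reproduction and training budget being suggested here (assuming gymnasium's FrozenLake-v1 in place of the custom EnvWrapper; the timestep budget is only illustrative):

import gymnasium as gym
from stable_baselines3 import PPO

# With the default n_steps=2048, getting several rollout/update cycles requires a
# total_timesteps budget that is many multiples of 2048, not a few hundred steps.
env = gym.make("FrozenLake-v1", is_slippery=False)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)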

@Ian-Sy-Zhang
Author

Hello:

I still think it's a similar problem to pull request #1785; you cannot assume everyone using Stable-Baselines3 is familiar with every line of your documentation.

I believe that unreasonable parameter settings can lead to users wasting their time, so perhaps adjusting the default to a smaller value, or providing a prompt such as 'Your model is actually still warming up and not yet training', would be more user-friendly.

Ignore the demo code; any call to model.learn(max_steps) with max_steps < 2048 (the default n_steps) will cause this problem.
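Concretely, a sketch of what any such call looks like (FrozenLake-v1 standing in for my environment):

from stable_baselines3 import PPO

model = PPO("MlpPolicy", "FrozenLake-v1")  # default n_steps=2048
model.learn(total_timesteps=200)           # budget smaller than n_steps:
                                           # at most one rollout/update cycle happens,
                                           # so the policy is effectively untrained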

Regards

@Ian-Sy-Zhang
Author

To be precise, check #1760

@Ian-Sy-Zhang
Author

I trained the PPO model for 300 rounds (each round has 1-80 steps). The plot below uses n_steps=16:

[figure: accuracy per round with n_steps=16]

The X axis is the round index and the Y axis is accuracy; the red line is a linear regression over the training run.

The following plot uses the default parameter, n_steps=2048. It shows that the agent didn't learn anything.

[figure: accuracy per round with the default n_steps=2048]

I think a user-friendly hint such as 'your model is only warming up' or 'your model has not been trained' would help; or you could simply set n_steps to a smaller number, as in #1760.

@Ian-Sy-Zhang
Author

I found that this problem does have something to do with my environment, because I make the agent stop learning when it falls into the lake; so if I set n_steps to 2048, in my environment it will never learn. But if I set it to 16, the results benefit from more iterations.
Thanks for your help, and again, I think a warning message or hint like "the model didn't get trained" would really help.
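In the meantime, something like the following user-side wrapper would give the kind of hint I mean (hypothetical; learn_with_warning is not part of Stable-Baselines3):

from stable_baselines3 import PPO

def learn_with_warning(model, total_timesteps, **kwargs):
    # Hypothetical user-side wrapper: warn when the budget is too small for PPO
    # to complete more than one rollout of n_steps * n_envs transitions.
    threshold = model.n_steps * model.n_envs
    if total_timesteps < threshold:
        print(
            f"Warning: total_timesteps={total_timesteps} < n_steps * n_envs = {threshold}; "
            "the policy will receive at most one round of updates."
        )
    return model.learn(total_timesteps, **kwargs)

model = PPO("MlpPolicy", "FrozenLake-v1")
learn_with_warning(model, total_timesteps=200)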
