[Question] Does the environment restart from scratch when reloading model ? #1643
Comments
Hello Flo =),
I think your problem lies in the env: the CartPole env doesn't have a fixed number of timesteps per episode, so you would need to seed it halfway through to get the same result. Here is an example with Pendulum (where the number of timesteps per episode is fixed, so we can stop the data collection exactly after an episode and not in the middle of one). This is actually a duplicate of #435 (comment), I think (and the following might work with any env, not only ones with a fixed number of timesteps per episode).

```python
import os

import gymnasium as gym

from stable_baselines3 import PPO

# Each episode is always 200 steps with Pendulum
# we choose total_timesteps = 200 * k
# and n_steps = 200 * k
env_id = "Pendulum-v1"
seed = 42
n_timesteps_per_episode = 200
total_timesteps = 12 * n_timesteps_per_episode

kwargs = dict(
    n_steps=200,
    batch_size=100,
    n_epochs=1,
    policy_kwargs=dict(
        net_arch=[64],
    ),
)

tensorboard_log = "./results/"
os.makedirs(tensorboard_log, exist_ok=True)


def create_env():
    return gym.make(env_id, render_mode="rgb_array")


obs = create_env().reset(seed=0)[0]


def full_train():
    env = create_env()
    model = PPO(
        "MlpPolicy",
        env,
        tensorboard_log=tensorboard_log,
        seed=seed,
        **kwargs,
    )
    model.learn(total_timesteps, progress_bar=True)
    # Seed the env halfway to have the same behavior as when loading
    # a checkpoint
    model.set_env(create_env())
    model.set_random_seed(seed)
    model.learn(total_timesteps, progress_bar=True, reset_num_timesteps=False)
    print(model.predict(obs))
    print(model.predict(obs, deterministic=True))


def train_first_part():
    env = create_env()
    model = PPO(
        "MlpPolicy",
        env,
        tensorboard_log=tensorboard_log,
        seed=seed,
        **kwargs,
    )
    model.learn(total_timesteps, progress_bar=True)
    model.save("./results/checkpoint")


def train_second_part():
    env = create_env()
    model = PPO.load("./results/checkpoint")
    model.set_env(env)
    model.set_random_seed(seed)
    model.learn(total_timesteps, progress_bar=True, reset_num_timesteps=False)
    print(model.predict(obs))
    print(model.predict(obs, deterministic=True))


if __name__ == "__main__":
    full_train()
    train_first_part()
    train_second_part()
```
EDIT: you can take a look at #597 on why this is needed.
PS: you can use ```python fences to have code highlighting in markdown ;)
Hi Antonin :D Thanks for the reply and sorry for the delay! In the example you provided, `full_train` is already divided into two parts. So what I understand is that a single training run of 2*n timesteps (with no reseeding in the middle) cannot be exactly equivalent to two successive n-timestep trainings (in the case where the environment is recreated between the two).
PS: not sure I get the importance of `set_env`. I thought it was a function that checks the environment and potentially vectorizes it.
yes, but if you do quantitative experiments, this should not be an issue.
oh, my answer to that one apparently got lost...
and see `stable_baselines3/common/base_class.py`, lines 505 to 508 at commit `1cd6ae4`.
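For readers without the inline snippet: below is a simplified sketch of what `set_env` roughly does when you hand a freshly created env to a loaded model, reconstructed from memory rather than copied from the lines referenced above; `set_env_sketch` is just an illustrative name, not an SB3 function.

```python
from stable_baselines3.common.utils import check_for_correct_spaces


def set_env_sketch(model, env):
    # Simplified illustration only, not the actual SB3 source:
    # make sure the new env matches the spaces the loaded policy expects
    check_for_correct_spaces(env, model.observation_space, model.action_space)
    # wrap into a VecEnv (and Monitor) if it is not one already
    env = model._wrap_env(env, model.verbose)
    # dropping the cached observation forces env.reset() at the start
    # of the next learn() call
    model._last_obs = None
    model.n_envs = env.num_envs
    model.env = env
```

In other words, calling `set_env` is not only a spaces check: it also makes sure the next `learn()` starts from a freshly reset (and, if reseeded, reproducible) environment.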
❓ Question
Hello,
I have a question about saving and loading a model, and more specifically about the environment.
When reloading a saved model from a checkpoint and continuing training, the environment restarts from the beginning. If a random seed is given to the model during training, this seed is saved and used again when training resumes. As a result, the environment restarts from this random seed, so the same episodes are seen again when training is restarted.
I also noticed that the random seed of the algorithm is reset when reloading a model. In other words, when loading a model, the randomness is reset to the initial random seed.
So, when doing the whole training at once, I don't get the same results as when I stop and restart it (I'm using PPO, so there is no replay buffer), as in issue #326.
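To illustrate the point about the seed being saved and restored, here is a minimal check (a sketch only, assuming a checkpoint was previously saved at the example path used below):

```python
from stable_baselines3 import PPO

# The seed given at construction time (PPO(..., seed=...)) is stored with the model,
# so load() brings it back and re-seeds the RNGs from that same value.
model = PPO.load("./results/checkpoint")  # example path, adjust to your checkpoint
print(model.seed)  # prints the seed the model was originally created with

# Re-applying it explicitly makes the resumed run start from the same RNG state:
model.set_random_seed(model.seed)
```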
Am I missing something? Is there a way to restart the environment from where it was when the checkpoint was saved?
I provide a minimal code example below, where CartPole is first trained for 50k steps in one go, and then the training is split into two runs of 25k steps each. The training is identical during the first 25k steps and then diverges (see the image below; I display the policy gradient loss because the divergence is very clear there, but the reward curve also diverges from 25k steps onwards).
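For reference, a rough sketch of the kind of script described above (not the original code, which is not included here; the env id, seed, paths and timestep counts are placeholders):

```python
import os

import gymnasium as gym

from stable_baselines3 import PPO

os.makedirs("./results", exist_ok=True)


def make_model():
    # same seed and env for both runs so they can be compared
    return PPO("MlpPolicy", gym.make("CartPole-v1"), seed=0)


# Run 1: 50k steps in one go
model = make_model()
model.learn(50_000)

# Run 2: 25k steps, save a checkpoint, reload it, then train 25k more
model = make_model()
model.learn(25_000)
model.save("./results/cartpole_25k")

model = PPO.load("./results/cartpole_25k")
model.set_env(gym.make("CartPole-v1"))
model.learn(25_000, reset_num_timesteps=False)
```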