
[Question] Does the environment restart from scratch when reloading a model? #1643

Closed · 4 tasks done
Florence-C opened this issue Aug 9, 2023 · 3 comments
Labels: duplicate (This issue or pull request already exists), question (Further information is requested)

Comments


Florence-C commented Aug 9, 2023

❓ Question

Hello,
I have a question about saving and loading a model, and more specifically about the environment.

When reloading a saved model from a checkpoint and continuing training, the environment restarts from the beginning. If a random seed is given to the model during training, this seed is saved and used again when training resumes. However, the environment restarts from that seed, so the same episodes are seen when training is restarted.
I also noticed that the random seed of the algorithm is reset when reloading a model. In other words, when loading a model, the randomness is reset to the initial random seed.
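
For illustration, here is a minimal sketch (the save path is arbitrary) of that behavior: the seed passed at construction is stored in the checkpoint and re-applied when the model is loaded.

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, seed=42)
model.save("/tmp/ppo_seed_check")  # arbitrary path

# Loading restores the saved attributes (including the seed)
# and re-seeds the algorithm with it when the model is set up.
reloaded = PPO.load("/tmp/ppo_seed_check", env=gym.make("CartPole-v1"))
print(reloaded.seed)  # 42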

So, when doing a whole training run at once, I don't get the same results as when I stop and restart it (I'm using PPO, so there's no replay buffer), as in issue #326.

Am I missing something? Is there a way to restart the environment from where it was when the model was saved at the checkpoint?

I provide a minimal code example below, where CartPole is first trained for 50k steps in one run, and then the training is split into two 25k-step runs. The training is identical during the first 25k steps and then diverges (see image below; I display the policy gradient loss because the divergence is very clear there, but the reward curve also diverges from 25k steps onward).

[Figure: TensorBoard curves of the policy gradient loss, identical for the first 25k steps and diverging afterwards]

from stable_baselines3 import PPO
import gymnasium as gym
import os

def train_1():

    tensorboard_log = './results/debug/'

    os.makedirs(tensorboard_log, exist_ok=True)

    env = gym.make('CartPole-v1', render_mode="rgb_array")
    model = PPO('MlpPolicy', env, tensorboard_log=tensorboard_log, seed=42)
    model.learn(50000)
    model.save('./results/debug/model_train1_50k')


def train_2():

    tensorboard_log = './results/debug/'

    os.makedirs(tensorboard_log, exist_ok=True)

    env = gym.make('CartPole-v1', render_mode="rgb_array")
    model = PPO('MlpPolicy', env, tensorboard_log=tensorboard_log, seed=42)
    model.learn(25000)
    model.save('./results/debug/model_train2_25k')


def train_3():

    tensorboard_log = './results/debug/'
    os.makedirs(tensorboard_log, exist_ok=True)
    env = gym.make('CartPole-v1', render_mode="rgb_array")
    model = PPO.load('./results/debug/model_train2_25k', env=env)
    model.learn(25000, reset_num_timesteps=False)
    model.save('./results/debug/model_train3_50k')


if __name__ == '__main__':
    train_1()
    train_2()
    train_3()


Florence-C added the question label on Aug 9, 2023
araffin added the Maintainers on vacation label on Aug 9, 2023
araffin removed the Maintainers on vacation label on Aug 20, 2023
araffin (Member) commented Aug 20, 2023

Hello Flo =),

So, when doing a whole training at once, I don't get the same results as when I stop and restart it

I think your problem lies in the env: the CartPole env doesn't have a fixed number of timesteps per episode, so you would need to seed it halfway through to get the same result.

Here is an example with Pendulum (where the number of timesteps per episode is fixed, so we can stop data collection exactly at the end of an episode rather than in the middle of one).

This is actually a duplicate of #435 (comment) I think.

(and the following might work with any env, not only those with a fixed number of timesteps per episode)

import os

import gymnasium as gym
from stable_baselines3 import PPO

# Each episode is always 200 steps with Pendulum,
# so we choose total_timesteps and n_steps to be multiples of 200
# (here n_steps = 200 and total_timesteps = 12 * 200)
env_id = "Pendulum-v1"
seed = 42
n_timesteps_per_episode = 200
total_timesteps = 12 * n_timesteps_per_episode
kwargs = dict(
    n_steps=200,
    batch_size=100,
    n_epochs=1,
    policy_kwargs=dict(
        net_arch=[64],
    ),
)


tensorboard_log = "./results/"

os.makedirs(tensorboard_log, exist_ok=True)

def create_env():
    return gym.make(env_id, render_mode="rgb_array")


# Fixed observation used to compare the final policies of the different runs
obs = create_env().reset(seed=0)[0]


def full_train():
    env = create_env()

    model = PPO(
        "MlpPolicy",
        env,
        tensorboard_log=tensorboard_log,
        seed=seed,
        **kwargs,
    )
    model.learn(total_timesteps, progress_bar=True)
    # Seed the env halfway to have the same behavior as when loading
    # a checkpoint
    model.set_env(create_env())
    model.set_random_seed(seed)
    model.learn(total_timesteps, progress_bar=True, reset_num_timesteps=False)

    print(model.predict(obs))
    print(model.predict(obs, deterministic=True))


def train_first_part():
    env = create_env()

    model = PPO(
        "MlpPolicy",
        env,
        tensorboard_log=tensorboard_log,
        seed=seed,
        **kwargs,
    )
    model.learn(total_timesteps, progress_bar=True)
    model.save("./results/checkpoint")


def train_second_part():
    env = create_env()

    model = PPO.load("./results/checkpoint")
    model.set_env(env)
    model.set_random_seed(seed)
    model.learn(total_timesteps, progress_bar=True, reset_num_timesteps=False)
    print(model.predict(obs))
    print(model.predict(obs, deterministic=True))


if __name__ == "__main__":
    full_train()
    train_first_part()
    train_second_part()

EDIT: you can take a look at #597 to see why we need set_env()

PS: you can use ```python fences to get code highlighting in markdown ;)

araffin added the duplicate label on Aug 20, 2023
Florence-C (Author) commented

Hi Antonin :D

Thanks for the reply and sorry for the delay !

In the example you provided, full_train is already divided into two parts. So what I understand is that a single training run of 2*n timesteps (with no reseeding in the middle) cannot be exactly equivalent to two successive n-step training runs (in the case where the environment is recreated between the two).
There is no way to restart the environment from where it was when the model was saved, if it was not anticipated beforehand, am I correct?

PS: not sure I get the importance of set_env. I thought it was a function that checks the environment and potentially vectorizes it.

araffin (Member) commented Sep 17, 2023

There is no way to restart the environment from where it was when the model was saved, if it was not anticipated beforehand, am I correct?

Yes, but if you are doing quantitative experiments, this should not be an issue.

PS: not sure I get the importance of set_env. I thought it was a function that checks the environment and potentially vectorizes it.

oh, my answer to that one apparently got lost...

def set_env(self, env: GymEnv, force_reset: bool = True) -> None:

and

# Discard `_last_obs`, this will force the env to reset before training
# See issue https://github.com/DLR-RM/stable-baselines3/issues/597
if force_reset:
    self._last_obs = None
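
In other words, after set_env() the next learn() call starts by resetting the environment. A minimal sketch of the loading pattern, reusing the checkpoint path from the example above:

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")
model = PPO.load("./results/checkpoint")
# set_env() discards _last_obs (force_reset=True by default),
# so the following learn() call begins with a fresh env.reset()
model.set_env(env)
model.learn(200, reset_num_timesteps=False)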

araffin closed this as completed on Oct 2, 2023