
reward shaping in Atari #3

Open
merv801 opened this issue Dec 14, 2019 · 4 comments


merv801 commented Dec 14, 2019

Hello. I have run your algorithm on Pong twice, for about 3k steps each: once with clip_rewards=True and once with clip_rewards=False.
With clip_rewards=False it did not make much progress, but with clip_rewards=True the results look like yours.
I thought that in the Pong environment clip_rewards should have no effect, because the rewards are already 0, +1, or -1.
Do you have any idea what the cause is?
Thanks


zplizzi commented Dec 14, 2019

Hm, that is strange. The only place that clip_rewards is applied is here:

if self.args.clip_rewards:
    # clip reward to one of {-1, 0, 1}
    step_reward = np.sign(step_reward)

As you say, in Pong the rewards are already in this set, so it should have no effect. My guess is that there is just some random variation between runs that caused the behavior you see. I would try running them again.
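
To make the no-op concrete, here is a minimal standalone sketch (not code from the repo): np.sign leaves values that are already -1, 0, or +1 unchanged, so the clip can only matter when a raw reward falls outside that set.

import numpy as np

# Pong-style rewards are already in {-1, 0, 1}; np.sign returns them unchanged.
pong_rewards = np.array([-1.0, 0.0, 1.0])
print(np.sign(pong_rewards))   # [-1.  0.  1.]

# Rewards with larger magnitude (as in many other Atari games) would actually be clipped.
big_rewards = np.array([-200.0, 0.0, 10.0])
print(np.sign(big_rewards))    # [-1.  0.  1.]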


zplizzi commented Dec 14, 2019

Actually, that isn't strictly true: step_reward is the sum of rewards over steps_to_skip timesteps. But I don't think it's possible to get multiple rewards within a few steps in Pong, so this still shouldn't matter.
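
For context, the frame-skip accumulation looks roughly like the sketch below. This is a hypothetical illustration based on the description above, not the repo's actual code; env, steps_to_skip, and the clip_rewards flag are assumed names, and env.step follows the old gym (obs, reward, done, info) convention.

import numpy as np

def skipped_step(env, action, steps_to_skip, clip_rewards):
    # Repeat one action for steps_to_skip frames and sum the rewards.
    # Hypothetical sketch; the real implementation in this repo may differ.
    step_reward = 0.0
    obs, done, info = None, False, {}
    for _ in range(steps_to_skip):
        obs, reward, done, info = env.step(action)
        step_reward += reward  # two scoring events inside one skip would sum to +/-2
        if done:
            break
    if clip_rewards:
        # Clipping the *summed* reward is the only place the two settings could differ.
        step_reward = np.sign(step_reward)
    return obs, step_reward, done, info

As noted above, scoring events in Pong are far enough apart that the summed reward over one skip should still be in {-1, 0, 1}, so the clip remains a no-op there.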


merv801 commented Dec 15, 2019

Thanks for your response.
I ran it again with clip_rewards=False and this time it is working well, so the first run was indeed random variation.
Still, it seems quite strange; I didn't expect to see such a difference between two runs, since I have heard that PPO is relatively stable. (The first time, the agent got stuck in the -8 to 2 reward range.)


zplizzi commented Dec 15, 2019

Yeah, it's possible that the hyperparameters I tested with aren't great (I didn't tune them at all), or maybe it would work more reliably with frame stacking (#2). But RL generally has a good deal of variation even with the more stable algorithms, so who knows.
