Implement PPO-clip and PPO-penalty on Atari, which is the only open source of PPO-penalty

ChengTsang/PPO-clip-and-PPO-penalty-on-Atari-Domain

PPO-clip-and-PPO-penalty-on-Atari-Domain

Overview

This repo is based on Spinning Up, and I am sincerely grateful to its authors.

This repo does two things:

  • Implement PPO-penalty. Neither Spinning Up nor OpenAI Baselines implements PPO-penalty. Although PPO-penalty does not perform as well as PPO-clip, the algorithm is still meaningful as a baseline.
  • Implement PPO on the Atari domain. If you read the Spinning Up code carefully, or run it, you will find that it does not work on the Atari domain, because the input is not flattened into a vector.
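
As a rough illustration of the flattening issue mentioned above, the sketch below reshapes a batch of image observations into vectors before they would be fed to an MLP. The frame shape (210, 160, 3) is the usual raw Atari frame size in gym, and `flatten_obs` is a hypothetical helper, not a function from this repo or from Spinning Up.

```python
import numpy as np

def flatten_obs(obs_batch):
    """Reshape a batch of observations (N, ...) into (N, D) float32 vectors."""
    obs_batch = np.asarray(obs_batch, dtype=np.float32)
    return obs_batch.reshape(obs_batch.shape[0], -1)

# Assumed raw Atari frame shape: 210 x 160 pixels, 3 colour channels.
frames = np.zeros((4, 210, 160, 3), dtype=np.uint8)
flat = flatten_obs(frames)
print(flat.shape)  # (4, 100800)
```

Flattening discards the spatial layout of the image, so this is only the simplest way to make image input compatible with a fully-connected network.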

Advantages

  • This may be the only open-source implementation of PPO-penalty.
  • The program is very easy to configure.
  • The code is more readable than Baselines and more suitable for beginners.

References

Blog

Background

PPO is motivated by the same question as TRPO: how can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse? Where TRPO tries to solve this problem with a complex second-order method, PPO is a family of first-order methods that use a few other tricks to keep new policies close to old. PPO methods are significantly simpler to implement, and empirically seem to perform at least as well as TRPO.

There are two primary variants of PPO: PPO-Penalty and PPO-Clip.

PPO-Penalty approximately solves a KL-constrained update like TRPO, but penalizes the KL-divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it's scaled appropriately.
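
The automatic adjustment of the penalty coefficient can be sketched as follows. The rule and its constants (1.5 and 2) are the heuristic described in the original PPO paper; whether this repo uses exactly these values is an assumption, and `update_kl_coef` is a hypothetical name.

```python
def update_kl_coef(beta, kl, kl_target):
    """Adapt the KL penalty coefficient after a policy update."""
    if kl < kl_target / 1.5:
        return beta / 2.0  # policy moved too little: weaken the penalty
    if kl > kl_target * 1.5:
        return beta * 2.0  # policy moved too far: strengthen the penalty
    return beta            # KL is close enough to the target: keep beta

beta = 1.0
for measured_kl in [0.001, 0.05, 0.01]:
    beta = update_kl_coef(beta, measured_kl, kl_target=0.01)
    print(beta)  # prints 0.5, then 1.0, then 1.0
```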

PPO-Clip doesn't have a KL-divergence term in the objective and doesn't have a constraint at all. Instead, it relies on specialized clipping in the objective function to remove incentives for the new policy to get far from the old policy.
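
A minimal sketch of the clipped objective, written in plain Python for clarity rather than in TensorFlow. The function name and the default `clip_ratio=0.2` are assumptions for illustration, not taken from this repo.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, clip_ratio=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
        # Taking the minimum removes any gain from pushing the ratio
        # outside the interval [1 - clip_ratio, 1 + clip_ratio].
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Two samples: ratio 1.5 with advantage +1, ratio 0.5 with advantage -1.
value = ppo_clip_objective([math.log(1.5), math.log(0.5)], [0.0, 0.0], [1.0, -1.0])
print(round(value, 6))  # 0.2
```

In the example, the first sample contributes 1.2 instead of 1.5 and the second contributes -0.8 instead of -0.5, so in both cases the objective gives no reward for moving the ratio further from 1.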

Here, we'll focus only on PPO-Clip (the primary variant used at OpenAI).

Quick Facts

  • PPO is an on-policy algorithm.
  • PPO can be used for environments with either discrete or continuous action spaces.
  • The Spinning Up implementation of PPO supports parallelization with MPI.

Installation Dependencies

  • cloudpickle==0.5.2
  • gym>=0.10.8
  • matplotlib
  • numpy
  • pandas
  • scipy
  • tensorflow>=1.8.0
  • tqdm

How to Install

	conda install tensorflow
	conda install gym
	conda install numpy 

mpi4py: to install on Unix, click here; to install on Windows, click here.

How to Run

  • Quick start
	python ppo.py
  • Run in parallel (-np sets the number of processes; beware of running out of memory!)
	mpiexec -np 4 python ppo.py

Algorithm

The details of PPO are the same as in the original paper; see my blog for more. Pseudo-code is shown below.

PPO-clip and PPO-penalty's objective functions are below:
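
For reference, the two objectives as defined in the original PPO paper can be written as follows. The notation ($r_t$, $\hat{A}_t$, $\epsilon$, $\beta$) follows the paper and may not match the variable names used in this repo's code.

```latex
% Probability ratio between the new and old policies
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}

% PPO-clip objective
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\;
    \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]

% PPO-penalty objective
L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[ r_t(\theta)\,\hat{A}_t
    - \beta\, \mathrm{KL}\left[ \pi_{\theta_\text{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t) \right] \right]
```

Here $\hat{A}_t$ is the advantage estimate at timestep $t$, $\epsilon$ is the clipping range, and $\beta$ is the adaptive KL penalty coefficient.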

Details

Most settings are the same as in the PPO paper. Details follow:

  • Network structure: we used a fully-connected MLP with two hidden layers of 64 units and tanh nonlinearities, outputting the mean of a Gaussian distribution with variable standard deviations, following [Sch+15b; Dua+16]. We don’t share parameters between the policy and value function (so coefficient c1 is irrelevant), and we don’t use an entropy bonus.

  • Parameters in Detail

    Parameters on Mujoco and Atari
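
The network structure described above can be sketched in NumPy as a forward pass only. The observation and action sizes, the weight scale, and the initial log standard deviation are assumptions for illustration; the repo itself builds its networks in TensorFlow.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Create (weights, bias) pairs for consecutive layer sizes."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def gaussian_policy_mean(params, obs):
    """Forward pass: tanh on hidden layers, linear output layer (the mean)."""
    x = obs
    for w, b in params[:-1]:
        x = np.tanh(x @ w + b)
    w, b = params[-1]
    return x @ w + b

rng = np.random.default_rng(0)
obs_dim, act_dim = 8, 2            # hypothetical sizes
params = init_mlp([obs_dim, 64, 64, act_dim], rng)
log_std = np.full(act_dim, -0.5)   # state-independent log std (assumed initial value)

obs = rng.standard_normal((5, obs_dim))
mean = gaussian_policy_mean(params, obs)
# Sample actions from the Gaussian defined by the mean and std.
action = mean + np.exp(log_std) * rng.standard_normal(mean.shape)
print(mean.shape, action.shape)    # (5, 2) (5, 2)
```

Because the policy and value function do not share parameters, a value network would be a second, separate MLP of the same shape with a single output unit.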
