This repo is based on Spinning Up, to which I am sincerely grateful. It adds the following:
- Implement PPO-Penalty. You will find that neither Spinning Up nor Baselines implements PPO-Penalty. Although PPO-Penalty's results are not as good as PPO-Clip's, the algorithm is still meaningful as a good baseline.
- Implement the PPO algorithm on the Atari domain. (If you read Spinning Up carefully, or run its program, you will find that it doesn't work on Atari, because the input observations aren't flattened.)
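The Atari fix above amounts to flattening image observations before feeding them to an MLP. A minimal sketch, assuming raw Atari frames of shape (210, 160, 3); the function name is illustrative, not from this repo:

```python
import numpy as np

def flatten_obs(obs):
    # Atari observations are uint8 images, e.g. shape (210, 160, 3).
    # An MLP expects a flat float vector, so cast and flatten first.
    return np.asarray(obs, dtype=np.float32).ravel()
```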
Advantages
- This may be the only open-source implementation of PPO-Penalty.
- This program is very easy to configure.
- This code is readable, more readable than Baselines, and more suitable for beginners.
References
- Proximal Policy Optimization Algorithms, Schulman et al. 2017
- Emergence of Locomotion Behaviours in Rich Environments, Heess et al. 2017
- High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al. 2016
Blog
- my blog on PPO
- mpi4py blog
PPO is motivated by the same question as TRPO: how can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse? Where TRPO tries to solve this problem with a complex second-order method, PPO is a family of first-order methods that use a few other tricks to keep new policies close to old. PPO methods are significantly simpler to implement, and empirically seem to perform at least as well as TRPO.
There are two primary variants of PPO: PPO-Penalty and PPO-Clip.
PPO-Penalty approximately solves a KL-constrained update like TRPO, but penalizes the KL-divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it's scaled appropriately.
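The adaptive penalty coefficient described above follows a simple rule from the PPO paper (Schulman et al. 2017): if the measured KL is well below the target, shrink the penalty; if well above, grow it. A minimal sketch (function and parameter names are illustrative):

```python
def update_beta(beta, kl, kl_target, factor=1.5, scale=2.0):
    # Adaptive KL penalty rule from the PPO paper:
    # - KL much smaller than target -> penalty too strong, halve beta
    # - KL much larger than target  -> penalty too weak, double beta
    if kl < kl_target / factor:
        beta /= scale
    elif kl > kl_target * factor:
        beta *= scale
    return beta
```

The thresholds 1.5 and the scale factor 2 are the heuristic values used in the paper; the algorithm is not very sensitive to them.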
PPO-Clip has neither a KL-divergence term in the objective nor a constraint at all. Instead, it relies on specialized clipping in the objective function to remove incentives for the new policy to move far from the old one.
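The clipping trick above can be sketched in a few lines of NumPy; this is a batch-averaged version of the clipped surrogate, with illustrative function names (`ratio` is the probability ratio between new and old policies, `adv` the advantage estimates):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    # min(r*A, clip(r, 1-eps, 1+eps)*A), averaged over the batch.
    # The min removes the incentive to push the ratio outside [1-eps, 1+eps].
    return np.mean(np.minimum(ratio * adv,
                              np.clip(ratio, 1 - eps, 1 + eps) * adv))
```

For a positive advantage the objective caps the gain once the ratio exceeds 1+eps; for a negative advantage it caps the gain once the ratio drops below 1-eps.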
Spinning Up focuses only on PPO-Clip (the primary variant used at OpenAI); this repo implements both variants.
- PPO is an on-policy algorithm.
- PPO can be used for environments with either discrete or continuous action spaces.
- The Spinning Up implementation of PPO supports parallelization with MPI.
- cloudpickle==0.5.2
- gym>=0.10.8
- matplotlib
- numpy
- pandas
- scipy
- tensorflow>=1.8.0
- tqdm
```
conda install tensorflow
conda install gym
conda install numpy
```
To install mpi4py on Unix, click here; for mpi4py on Windows, click here.
- Quick start

```
python ppo.py
```

- Run in parallel (`-np` sets the number of processes; watch out for OOM!)

```
mpiexec -np 4 python ppo.py
```
The details of PPO are the same as in the original paper; you can have a look at my blog to learn more. Pseudo-code is shown below.
The objective functions of PPO-Clip and PPO-Penalty are shown below:
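Written out in LaTeX, with $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ the probability ratio and $\hat{A}_t$ the advantage estimate, the two objectives from the paper are:

```latex
L^{\mathrm{CLIP}}(\theta) =
  \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\;
  \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t\big)\right]

L^{\mathrm{KLPEN}}(\theta) =
  \hat{\mathbb{E}}_t\!\left[r_t(\theta)\hat{A}_t
  - \beta\,\mathrm{KL}\!\left[\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t),\,
  \pi_\theta(\cdot \mid s_t)\right]\right]
```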
Most settings are the same as in the PPO paper; details are as follows:
- Network Structure: we used a fully-connected MLP with two hidden layers of 64 units and tanh nonlinearities, outputting the mean of a Gaussian distribution with variable standard deviations, following [Sch+15b; Dua+16]. We don't share parameters between the policy and value function (so coefficient c1 is irrelevant), and we don't use an entropy bonus.
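The network described above can be sketched in NumPy to make the shapes concrete. This is a forward-pass illustration only (random untrained weights, illustrative names), not the repo's TensorFlow implementation:

```python
import numpy as np

def mlp_gaussian_policy(obs, hidden=(64, 64), act_dim=2, seed=0):
    # Two hidden layers of 64 tanh units output the Gaussian mean;
    # log-stds are separate state-independent variables, as described above.
    rng = np.random.default_rng(seed)
    x = obs
    for h in hidden:
        W = rng.normal(scale=0.1, size=(x.shape[-1], h))
        x = np.tanh(x @ W)
    W_mu = rng.normal(scale=0.1, size=(x.shape[-1], act_dim))
    mu = x @ W_mu                      # mean of the Gaussian, shape (batch, act_dim)
    log_std = np.full(act_dim, -0.5)   # trainable in the real implementation
    return mu, log_std
```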
- Parameters in Detail
Parameters on Mujoco and Atari