diff --git a/Alessandro_Pomponio_Bowling_PPO.html b/Alessandro_Pomponio_Bowling_PPO.html
deleted file mode 100644
index 1490549..0000000
--- a/Alessandro_Pomponio_Bowling_PPO.html
+++ /dev/null
@@ -1,17722 +0,0 @@

Playing Bowling with Proximal Policy Optimization (PPO)

This notebook was created by Alessandro Pomponio for the Autonomous and Adaptive Systems course, held by Prof. Mirco Musolesi at the University of Bologna.


Install prerequisites

Since this notebook was written on Google Colab, we need to install some prerequisites if we want to render outputs.


Analyzing the environment

We start by taking a look at how the Bowling environment is set up and implemented in OpenAI Gym.


References:


The observation space is given by the video feed: an 8-bit 210x160 RGB image, with values in the range 0-255.
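As a quick check, a minimal sketch along these lines prints the observation space (the environment id "Bowling-v0" and the classic Gym API are assumptions of this sketch):

```python
import gym

# Assumed environment id; variants such as "BowlingNoFrameskip-v4" behave similarly.
env = gym.make("Bowling-v0")

# Prints a Box space describing the 8-bit 210x160 RGB video feed.
print(env.observation_space)
```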


We can perform 6 actions:


The actions have the following meanings:
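A short sketch of how both pieces of information can be queried; `get_action_meanings` is exposed by Gym's Atari environments through the underlying ALE:

```python
# Discrete(6): six possible actions.
print(env.action_space)

# Human-readable action labels, e.g. NOOP, FIRE, UP, DOWN, UPFIRE, DOWNFIRE.
print(env.unwrapped.get_action_meanings())
```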


Setting up the environment

Reading the report "Game Playing with Deep Q-Learning using OpenAI Gym" by Robert Chuchro and Deepak Gupta, we find a few suggestions that can help us preprocess the observations provided by the environment.


From the report we take the following suggestions, which are implemented in the sections below: converting the observation to greyscale, denoising it, normalizing it, resizing it, and stacking consecutive frames.


In addition to what was done in the report, we can try to crop the observation to include only the bowling lane. This might further speed up training by making the network focus only on what is important.


We will implement all these preprocessing steps as gym.ObservationWrapper subclasses, following the templates shown in Alexander Van de Kleut's blog and in the official OpenAI Gym repository.


Cropping the observation

We will start by cropping the observation, since this is the one step without a predefined wrapper: we must figure out the crop boundaries on our own.


To make this process more "visual", we will import matplotlib and use it to show the observation in the notebook. We will render the environment as an rgb_array to allow for plotting.

[Output: a rendered frame of the Bowling environment]

A first look at the picture tells us that the bowling lane starts a little after 100 pixels and ends a little before 175 pixels (vertically). After some testing (running the game and rendering frames), we are able to crop the observation vertically with the following values:
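The exact boundaries used in the original cell are not shown here; a sketch with assumed values consistent with the description above (a 65-pixel-tall slice containing the lane) could look like this:

```python
import matplotlib.pyplot as plt

# Assumed crop boundaries: rows 105 to 170 (65 rows), keeping only the lane.
CROP_TOP = 105
CROP_BOTTOM = 170

observation = env.reset()
cropped = observation[CROP_TOP:CROP_BOTTOM, :, :]

plt.imshow(cropped)
plt.show()
print(cropped.shape)  # (65, 160, 3)
```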


This has reduced the observation from (210, 160, 3) to (65, 160, 3), decreasing the screen size by ~70%.


We then define our wrapper:
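A minimal sketch of such a wrapper (the default boundaries are the assumed values from above):

```python
import gym
import numpy as np


class CropObservation(gym.ObservationWrapper):
    """Crops the observation vertically so that only the bowling lane is kept."""

    def __init__(self, env, top=105, bottom=170):  # assumed boundaries
        super().__init__(env)
        self.top = top
        self.bottom = bottom
        old_shape = env.observation_space.shape
        new_shape = (bottom - top,) + old_shape[1:]
        self.observation_space = gym.spaces.Box(
            low=0, high=255, shape=new_shape, dtype=np.uint8
        )

    def observation(self, observation):
        return observation[self.top:self.bottom, ...]
```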


Converting the observation to greyscale

This functionality is part of the predefined wrappers in OpenAI Gym. We will therefore use the GrayScaleObservation wrapper, which can be found here: https://github.com/openai/gym/blob/master/gym/wrappers/gray_scale_observation.py


Denoising the observation - Adaptive Gaussian Thresholding

This type of denoising may or may not be useful in our case, as the environment does not have any type of background noise. We will implement it as a wrapper anyway, using OpenCV's function: https://docs.opencv.org/3.4/d7/d4d/tutorial_py_thresholding.html
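A sketch of such a wrapper built around OpenCV's adaptiveThreshold, assuming it is applied after the greyscale conversion (the function requires a single-channel 8-bit image); the block size and constant below are arbitrary choices, not values from the original notebook:

```python
import cv2
import gym


class AdaptiveThresholdObservation(gym.ObservationWrapper):
    """Applies adaptive Gaussian thresholding to a greyscale observation."""

    def __init__(self, env, block_size=11, constant=2):
        super().__init__(env)
        self.block_size = block_size
        self.constant = constant

    def observation(self, observation):
        # The observation must be a single-channel 8-bit image at this point.
        return cv2.adaptiveThreshold(
            observation,
            255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,
            self.block_size,
            self.constant,
        )
```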


Normalizing the observation

We know that the state is represented by 8-bit values, from 0 to 255. To normalize it, we simply divide the observation by 255.
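A minimal sketch of this wrapper; note that it changes the dtype to float32 and the value range to [0, 1]:

```python
import gym
import numpy as np


class NormalizeObservation(gym.ObservationWrapper):
    """Rescales 8-bit pixel values to the [0, 1] range."""

    def __init__(self, env):
        super().__init__(env)
        self.observation_space = gym.spaces.Box(
            low=0.0, high=1.0, shape=env.observation_space.shape, dtype=np.float32
        )

    def observation(self, observation):
        return np.asarray(observation, dtype=np.float32) / 255.0
```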


Resizing the observation

In the original paper, the authors shrunk the observation from its original size down to 80x80. Since cropping already removes about 70% of the original pixels, we might be able to reduce this size even further.


Again, this feature is pre-packed in OpenAI Gym, in the ResizeObservation wrapper, which can be found here: https://github.com/openai/gym/blob/master/gym/wrappers/resize_observation.py


Stacking frames

To stack frames, we will use another one of the provided wrappers in Gym, FrameStack, which can be found here: https://github.com/openai/gym/blob/master/gym/wrappers/frame_stack.py


Preprocessing settings

For ease of use, we condense here all the preprocessing settings:
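The original settings cell is not shown; a plausible set of constants, matching the choices discussed in this notebook (the thresholding flag and the number of stacked frames are assumptions), could be:

```python
# Assumed preprocessing settings, matching the discussion above.
CROP_TOP = 105            # assumed crop boundaries (65 rows in total)
CROP_BOTTOM = 170
USE_THRESHOLDING = False  # the environment has no background noise
RESIZE_SHAPE = 80         # observations are resized to 80x80
NUM_STACKED_FRAMES = 4    # assumed number of consecutive frames stacked together
```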


Preprocessing results

Finally, let us look at the results of our preprocessing:

[Output: the preprocessed 80x80 observation]

Note that, in the end, we kept the image resolution at 80x80: at lower resolutions (such as 50x50), the dimensions and shapes of the pins and the ball were heavily altered, as we can see in the picture below:

[Output: the observation resized to a lower resolution, with distorted pin and ball shapes]

Get environment function

We also create a utility function that builds the environment with the settings above. The wrappers have been re-ordered to make sure everything works as expected.
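A sketch of such a function, applying the wrappers defined above in an order that keeps each step's input assumptions valid (crop and greyscale on the raw 8-bit frame, optional thresholding while the image is still 8-bit, then resize, normalization and frame stacking):

```python
from gym.wrappers import FrameStack, GrayScaleObservation, ResizeObservation


def get_env():
    env = gym.make("Bowling-v0")                    # assumed environment id
    env = CropObservation(env, CROP_TOP, CROP_BOTTOM)
    env = GrayScaleObservation(env)                 # single-channel 8-bit frame
    if USE_THRESHOLDING:
        env = AdaptiveThresholdObservation(env)
    env = ResizeObservation(env, RESIZE_SHAPE)      # 80x80
    env = NormalizeObservation(env)                 # float32 in [0, 1]
    env = FrameStack(env, NUM_STACKED_FRAMES)       # stack of consecutive frames
    return env
```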


Proximal Policy Optimization

From "Proximal Policy Optimization Algorithms" by Schulman et al:


[Proximal Policy Optimization algorithms are] a new family of policy gradient methods [...] which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. [They] have some of the benefits of trust region policy optimization (TRPO) but they are much simpler to implement, more general, and have better sample complexity (empirically).


This family of algorithms has been proposed due to the lack of simple, efficient, scalable and robust policy gradient algorithms for reinforcement learning. At that time, in fact, the most common algorithms used were Q-learning (which struggled in problems with continuous action spaces, and was generally poorly understood) and Trust Region Policy Optimization (which is complicated and not compatible with architectures that include noise or parameter sharing).


PPO attempts to achieve the same data efficiency and reliable performance of TRPO, while using only first-order optimization. This is achieved by means of a novel objective with clipped probability ratios, which forms a pessimistic estimate (i.e., a lower bound) of the performance of the policy. Policy optimization is obtained by alternating between sampling data from the policy and performing several epochs of optimization on the sampled data.


Hyperparameters

We will use the hyperparameters suggested for Atari games in the paper. Note that the Adam stepsize and the clipping parameter $\epsilon$ are multiplied by a value $\alpha$, which is linearly annealed from 1 to 0 over the course of learning.


In addition, for the sake of simplicity, we will write a PPO implementation that uses only one actor (the paper uses 8 in parallel in its Atari experiments).
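The Atari settings from the paper, adapted to a single actor, can be collected as constants along the following lines:

```python
# PPO hyperparameters for Atari from the paper; alpha is annealed
# linearly from 1 to 0 over the course of training.
HORIZON = 128            # T: timesteps collected per actor before each update
ADAM_STEPSIZE = 2.5e-4   # multiplied by alpha
NUM_EPOCHS = 3           # optimization epochs per update
MINIBATCH_SIZE = 32      # the paper uses 32 * 8 with 8 actors; we use one actor
GAMMA = 0.99             # discount factor
GAE_LAMBDA = 0.95        # GAE parameter
NUM_ACTORS = 1           # the paper uses 8 parallel actors
CLIP_EPSILON = 0.1       # multiplied by alpha
VF_COEFF = 1.0           # c1, value-function loss coefficient
ENTROPY_COEFF = 0.01     # c2, entropy bonus coefficient
```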


Clipped Surrogate Objective

In order to limit the size of the policy update, TRPO maximizes a "surrogate" objective function subject to a constraint on the KL-divergence between the old and the new policy.


In theory, TRPO should use a penalty instead of a hard constraint; choosing a penalty coefficient $\beta$ that performs well across different problems, however, is very difficult. The authors of PPO therefore took TRPO's surrogate objective:


$L^{CPI}(\theta) = \hat{\mathbb{E}}_t \Big[\frac{\pi_{\theta}(a_t \vert s_t)}{\pi_{\theta_{old}}(a_t \vert s_t)} \hat{A}_t \Big] = \hat{\mathbb{E}}_t \Big[ r_t(\theta) \hat{A}_t \Big]$


and modified it to penalize changes to the policy that move $r_t(\theta)$ away from 1 (since, without constraints, maximizing $L^{CPI}$ would lead to an excessively large policy update). They propose the following:


$L^{CLIP}(\theta)=\hat{\mathbb{E}}_t \bigg[\min\left(r_t(\theta)\hat{A}_t, clip\left(r_t(\theta), 1-\epsilon, 1+\epsilon \right) \hat{A}_t\right)\bigg]$


This new objective clips the probability ratio to remove the incentive for moving $r_t$ outside the interval around 1 delimited by $\epsilon$ and returns the minimum between the unclipped and the clipped objective, effectively acting as a lower bound. In practice, this leads to ignoring the change in probability ratio when it would make the objective improve, only including it when it makes it worse.
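A minimal TensorFlow sketch of this objective, working on log-probabilities so that the probability ratio is computed as an exponentiated difference:

```python
import tensorflow as tf


def clipped_surrogate_objective(new_log_probs, old_log_probs, advantages, clip_epsilon):
    """L^CLIP: the clipped surrogate objective (to be maximized)."""
    # r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)
    ratio = tf.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = tf.clip_by_value(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
    # Take the minimum, i.e. a pessimistic (lower-bound) estimate.
    return tf.reduce_mean(tf.minimum(unclipped, clipped))
```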


Advantages

The term $\hat{A}_t$ that we saw in the previous formulas is an advantage estimator: a quantity expressing how much better or worse an action turned out to be compared to our current value estimates.


The authors of this paper propose a truncated version of generalized advantage estimation, adapting what was proposed in "Asynchronous methods for deep reinforcement learning" by Mnih et al.


$\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \dots + (\gamma\lambda)^{T-t+1}\delta_{T-1}$


where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$


Note: the formula above does not explicitly check whether the state at $t+1$ is a terminal state; we will have to handle episode boundaries explicitly in the implementation.
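A NumPy sketch of the truncated GAE computation; the per-step `dones` flags (an assumption of this sketch) zero out both the bootstrapped value term and the accumulated advantage at episode boundaries, which is the terminal-state check mentioned above, while `last_value` is the critic's estimate for the state following the final collected timestep:

```python
import numpy as np


def compute_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Truncated generalized advantage estimation over T collected timesteps."""
    advantages = np.zeros(len(rewards), dtype=np.float32)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        non_terminal = 1.0 - dones[t]
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V(s_{t+1}) = 0 if terminal.
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        gae = delta + gamma * lam * non_terminal * gae
        advantages[t] = gae
    return advantages
```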


Since we will have to compute the discounted returns as well, we will create a function for it, too:
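A sketch of the discounted-returns computation under the same assumptions (per-step `dones` flags and a bootstrapped `last_value`); these returns will serve as the value-function targets $V_t^{targ}$:

```python
def discounted_returns(rewards, dones, last_value, gamma=0.99):
    """Bootstrapped discounted returns over T collected timesteps."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running_return = last_value
    for t in reversed(range(len(rewards))):
        # Reset the bootstrap at episode boundaries.
        running_return = rewards[t] + gamma * running_return * (1.0 - dones[t])
        returns[t] = running_return
    return returns
```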


Loss function

We will be using a neural network architecture that shares parameters between the policy and the value function; this requires us to use a loss function that combines the policy surrogate and a value function error term. This objective can further be augmented by adding an entropy bonus to ensure sufficient exploration.


Combining these terms, we obtain the following objective, which is (approximately) maximized each iteration:


$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t \big[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_{\theta}](s_t) \big]$


_where $c_1$ and $c_2$ are coefficients, $S$ denotes an entropy bonus, and $L_t^{VF}$ is a squared-error loss $\left( V_{\theta}(s_t) - V_t^{targ} \right)^2$._
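A sketch of the combined objective in TensorFlow; since optimizers minimize, the function returns the negated objective, and the value targets are assumed to be the discounted returns computed above:

```python
def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, clip_epsilon, c1=1.0, c2=0.01):
    """Negative of L^{CLIP+VF+S}, suitable for gradient descent."""
    clip_term = clipped_surrogate_objective(
        new_log_probs, old_log_probs, advantages, clip_epsilon
    )
    value_loss = tf.reduce_mean(tf.square(returns - values))  # L^{VF}
    objective = clip_term - c1 * value_loss + c2 * entropy
    return -objective
```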


Algorithm

The authors provide a pseudocode implementation of PPO with fixed-length trajectory segments:

for iteration = 1, 2, ... do
  for actor = 1, 2, ..., N do
    Run policy \pi_{\theta_{old}} in environment for T timesteps
    Compute advantage estimates \hat{A}_1, ..., \hat{A}_T
  end for
  Optimize surrogate L wrt \theta, with K epochs and minibatch size M \le NT
  \theta_{old} <- \theta
end for

We will first define the neural network we will use as a function approximator, along with a few utility functions to run parts of the algorithm.


Neural network model

For the neural network we will refer to the supplementary material of the paper "Asynchronous Methods for Deep Reinforcement Learning" by Mnih et al., available on arXiv.


The implementation that follows is also inspired by examples from the Keras documentation.


Parameters

We set the network parameters according to our configuration.


Model

We define a utility function to get a new network for ease of use.
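A sketch of such a function with the feedforward architecture from the A3C paper (a convolutional layer with 16 8x8 filters and stride 4, one with 32 4x4 filters and stride 2, then a 256-unit dense layer), with a policy head producing action logits and a linear value head; the 80x80x4 input shape reflects our preprocessing choices:

```python
from tensorflow import keras
from tensorflow.keras import layers


def get_model(input_shape=(80, 80, 4), num_actions=6):
    """Builds the shared-parameter actor-critic network."""
    inputs = keras.Input(shape=input_shape)
    x = layers.Conv2D(16, 8, strides=4, activation="relu")(inputs)
    x = layers.Conv2D(32, 4, strides=2, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    # Policy head: action logits (softmax is applied where probabilities are needed).
    logits = layers.Dense(num_actions)(x)
    # Value head: a single scalar state-value estimate.
    value = layers.Dense(1)(x)
    return keras.Model(inputs=inputs, outputs=[logits, value])
```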


Playing for T timesteps

As we said before, PPO runs the policy for $T$ timesteps (where $T$ is much less than the episode length) and uses the collected samples for an update.


Here we define a utility function to play for $T$ timesteps:
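A sketch of such a function under the assumptions above; the hypothetical `to_network_input` helper converts the frame-stacked observation (channel-first LazyFrames) into the channels-last float32 array expected by the Keras model, and the old 4-tuple Gym step API is assumed:

```python
def to_network_input(obs):
    array = np.asarray(obs, dtype=np.float32)
    # Depending on the wrappers, the shape may be (4, 80, 80) or (4, 80, 80, 1);
    # collapse to (4, 80, 80) and move the stack axis last: (80, 80, 4).
    array = array.reshape(array.shape[0], array.shape[1], array.shape[2])
    return np.moveaxis(array, 0, -1)


def play_for_t_timesteps(env, model, current_obs, horizon):
    """Runs the current policy for `horizon` timesteps and collects samples."""
    observations, actions, rewards, values, log_probs, dones = [], [], [], [], [], []
    obs = current_obs
    for _ in range(horizon):
        net_input = np.expand_dims(to_network_input(obs), axis=0)
        logits, value = model(net_input)
        probabilities = tf.nn.softmax(logits)[0].numpy()
        action = np.random.choice(len(probabilities), p=probabilities)

        next_obs, reward, done, _ = env.step(action)

        observations.append(to_network_input(obs))
        actions.append(action)
        rewards.append(reward)
        values.append(float(value[0, 0]))
        log_probs.append(np.log(probabilities[action] + 1e-10))
        dones.append(float(done))

        obs = env.reset() if done else next_obs

    # Bootstrap value for the state following the last collected timestep.
    _, last_value = model(np.expand_dims(to_network_input(obs), axis=0))
    return obs, (np.array(observations, dtype=np.float32), np.array(actions),
                 np.array(rewards, dtype=np.float32), np.array(values, dtype=np.float32),
                 np.array(log_probs, dtype=np.float32), np.array(dones, dtype=np.float32),
                 float(last_value[0, 0]))
```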


Generating minibatches

The PPO update is run on minibatches of size minibatch_size, so we create a generator function that yields them from the data we collected.
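A sketch of the generator, assuming the collected data has already been turned into per-timestep arrays (observations, actions, returns, advantages, old log-probabilities):

```python
def minibatch_generator(data, minibatch_size):
    """Yields shuffled minibatches from the collected trajectory data."""
    observations, actions, returns, advantages, old_log_probs = data
    indices = np.random.permutation(len(actions))
    for start in range(0, len(actions), minibatch_size):
        batch = indices[start:start + minibatch_size]
        yield (observations[batch], actions[batch], returns[batch],
               advantages[batch], old_log_probs[batch])
```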


PPO update

The PPO update is run on the data obtained while playing and is repeated for num_epochs.
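A sketch of the update step, tying together the pieces above; normalizing the advantages per minibatch is a common implementation choice added here, not something prescribed by the paper:

```python
def ppo_update(model, optimizer, data, num_epochs=3, minibatch_size=32,
               clip_epsilon=0.1, c1=1.0, c2=0.01):
    """Runs several epochs of clipped-surrogate optimization on the collected data."""
    for _ in range(num_epochs):
        for obs, actions, returns, advantages, old_log_probs in minibatch_generator(data, minibatch_size):
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
            with tf.GradientTape() as tape:
                logits, values = model(obs)
                log_probabilities = tf.nn.log_softmax(logits)
                indices = tf.stack(
                    [tf.range(tf.shape(logits)[0]), tf.cast(actions, tf.int32)], axis=1
                )
                new_log_probs = tf.gather_nd(log_probabilities, indices)
                entropy = -tf.reduce_mean(
                    tf.reduce_sum(tf.nn.softmax(logits) * log_probabilities, axis=-1)
                )
                loss = ppo_loss(new_log_probs, old_log_probs, advantages,
                                tf.squeeze(values, axis=-1), returns,
                                entropy, clip_epsilon, c1, c2)
            gradients = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(gradients, model.trainable_variables))
```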


Run

After implementing the PPO algorithm, we must now instantiate the network and run the training loop.


Parameters

We initially set a reward threshold of 50, as reaching it is a good indication that the algorithm is working. It is also close to the value reported in the paper before the agent started having issues and undoing what it had learned up to that point.
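The original parameters cell is not shown; a sketch of the run parameters and of the objects the training loop needs (the iteration cap is an assumed value, used below to anneal $\alpha$ linearly):

```python
REWARD_THRESHOLD = 50     # average episode reward at which we stop training
MAX_ITERATIONS = 10_000   # assumed cap on PPO updates, also used for annealing

env = get_env()
model = get_model(input_shape=(RESIZE_SHAPE, RESIZE_SHAPE, NUM_STACKED_FRAMES),
                  num_actions=env.action_space.n)
optimizer = keras.optimizers.Adam(learning_rate=ADAM_STEPSIZE)
```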


Run and train
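The original training cell is not reproduced here; below is a sketch of what the loop might look like given the functions defined above. The simple reward bookkeeping, the use of the iteration cap to anneal the stepsize and the clipping parameter, and the stopping criterion are all assumptions of this sketch (the original cell also wrapped the loop in a try/except statement, which is omitted here).

```python
obs = env.reset()
episode_rewards, current_episode_reward = [], 0.0

for iteration in range(MAX_ITERATIONS):
    # Linearly anneal the Adam stepsize and the clipping parameter.
    alpha = 1.0 - iteration / MAX_ITERATIONS
    optimizer.learning_rate.assign(ADAM_STEPSIZE * alpha)  # assumes a tf.keras optimizer

    # Collect T timesteps of experience with the current policy.
    obs, (observations, actions, rewards, values, log_probs, dones, last_value) = \
        play_for_t_timesteps(env, model, obs, HORIZON)

    # Compute advantages and value targets, then run the PPO update.
    advantages = compute_advantages(rewards, values, dones, last_value, GAMMA, GAE_LAMBDA)
    returns = discounted_returns(rewards, dones, last_value, GAMMA)
    ppo_update(model, optimizer,
               (observations, actions, returns, advantages, log_probs),
               NUM_EPOCHS, MINIBATCH_SIZE, CLIP_EPSILON * alpha, VF_COEFF, ENTROPY_COEFF)

    # Rough episode-reward bookkeeping to check the stopping threshold.
    for reward, done in zip(rewards, dones):
        current_episode_reward += reward
        if done:
            episode_rewards.append(current_episode_reward)
            current_episode_reward = 0.0
    if len(episode_rewards) >= 10 and np.mean(episode_rewards[-10:]) >= REWARD_THRESHOLD:
        print(f"Average reward over the last 10 episodes: {np.mean(episode_rewards[-10:]):.1f}")
        break
```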


Iterative improvements

Now that we have reached the target we had set, we save the model and iteratively raise the threshold to see how much we can improve before things start to go wrong.


Note: with #train we mean the code within the try/except statement in the previous cell.


Let us try to improve the average reward from 50 to 60.


Let us now try to reach 65.


The agent is performing very well; we can now try to reach 70.


The training was very quick thanks to the performance of the agent. Let us now try to reach 75.


Conclusions


Unfortunately, we have hit a wall. Since training has gone on for almost two days and we had previously reached a good level of performance, we can be happy with the result. Let us see how well the agent performs by loading the previously saved model and making it play a few episodes.
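A sketch of such an evaluation loop under the same assumptions; the checkpoint name is hypothetical, and Keras `model.save`/`load_model` is assumed to have been used for the intermediate saves:

```python
evaluation_model = keras.models.load_model("ppo_bowling_checkpoint")  # assumed checkpoint name

for episode in range(5):
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:
        logits, _ = evaluation_model(np.expand_dims(to_network_input(obs), axis=0))
        action = int(tf.argmax(logits[0]))          # act greedily during evaluation
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    print(f"Episode {episode + 1}: total reward {total_reward}")
```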


The agent plays pretty well, apart from certain instances where it does not even seem to try to hit the pins.


In the future, it may be worth trying to modify the reward function so that it penalizes the agent for the number of pins left standing after each throw, pushing it to attempt a strike every time. Additionally, it might be beneficial to end the episode after a strike or after the second throw. This could shorten the learning process and make the agent focus on every single throw.


References and inspirations

This work was based on and inspired by:

- "Proximal Policy Optimization Algorithms" by Schulman et al.
- "Asynchronous Methods for Deep Reinforcement Learning" by Mnih et al.
- "Game Playing with Deep Q-Learning using OpenAI Gym" by Robert Chuchro and Deepak Gupta
- Alexander Van de Kleut's blog on OpenAI Gym wrappers
- The official OpenAI Gym repository and documentation
- The Keras documentation

Inspiration taken from the official documentation for OpenAI Gym, Keras, NumPy, OpenCV, etc. is linked in the code or right before it.
