From 5cc77099ab011bba985f71c080f5c10ca3f27f80 Mon Sep 17 00:00:00 2001 From: Max Lapan Date: Mon, 5 Aug 2019 12:06:07 +0300 Subject: [PATCH] Ptan intro tutorial --- README.md | 5 +- .../.ipynb_checkpoints/intro-checkpoint.ipynb | 1798 +++++++++++++++++ docs/intro.ipynb | 1798 +++++++++++++++++ 3 files changed, 3600 insertions(+), 1 deletion(-) create mode 100644 docs/.ipynb_checkpoints/intro-checkpoint.ipynb create mode 100644 docs/intro.ipynb diff --git a/README.md b/README.md index 75b9375..b4d1dd0 100644 --- a/README.md +++ b/README.md @@ -44,5 +44,8 @@ pip install opencv-python ## Documentation -Is not yet written :(. But has planned :) Before it has happend, there are random pieces of lib discussions which could be useful: +* [Ptan introduction](docs/intro.ipynb) + +Random pieces of information + * `ExperienceSource` vs `ExperienceSourceFirstLast`: https://github.com/Shmuma/ptan/issues/17#issuecomment-489584115 diff --git a/docs/.ipynb_checkpoints/intro-checkpoint.ipynb b/docs/.ipynb_checkpoints/intro-checkpoint.ipynb new file mode 100644 index 0000000..7069392 --- /dev/null +++ b/docs/.ipynb_checkpoints/intro-checkpoint.ipynb @@ -0,0 +1,1798 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Ptan intro\n", + "\n", + "[PTAN](https://github.com/Shmuma/ptan) (abbrevation of `PyTorch AgentNet`) is a small library I wrote to simplify RL experiments. It tries to keep the balance between two extremes:\n", + "\n", + "1. import lib, then write one line to train the DQN (very vivid example is [OpenAI baselines project](https://github.com/openai/baselines/))\n", + "2. implement everything from scratch\n", + "\n", + "First approach is very inflexible. It works good when you're using the library the way it supposed to be used. 
But if you want to do something fancy, you quickly find yourself hacking the lib and fighting with constraints imposed by the author rather than solving the problem you want to solve.\n", + "\n", + "The second extreme gives you *too much freedom* and requires implementing replay buffers and trajectory handling over and over again, which is error-prone, boring and inefficient.\n", + "\n", + "Several years ago I got tired of writing replay buffers and decided to implement something in between: not \"the universal RL lib\", but a set of classes to avoid writing boilerplate code.\n", + "\n", + "I used ptan to implement all the [examples for the \"Deep Reinforcement Learning Hands-On\" book](https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/), which cover all the major DRL methods, including DQN, A3C, all the tricks from the Rainbow paper, DDPG, D4PG, PPO, TRPO, ACKTR and AlphaGo Zero.\n", + "\n", + "## High-level overview\n", + "\n", + "At a high level, ptan provides the following entities:\n", + "\n", + "* `Agent`: class which knows how to convert a batch of observations to a batch of actions to be executed. It can contain optional state, in case you need to track some info between consecutive actions in one episode (for example, noise params for Ornstein–Uhlenbeck exploration). Normally, you can use [already existing Agent classes](https://github.com/Shmuma/ptan/blob/master/ptan/agent.py) or write your own subclass of `BaseAgent`.\n", + "* `ActionSelector`: small piece of logic which knows how to choose the action from some output of the network. Works in tandem with `Agent`: https://github.com/Shmuma/ptan/blob/master/ptan/actions.py\n", + "* `ExperienceSource` and variations: using an `Agent` instance and a gym environment object, provides information about the trajectories of episodes. In the simplest form it is one single $(a, r, s')$ transition at a time, but functionality goes beyond this. 
Source file is https://github.com/Shmuma/ptan/blob/master/ptan/experience.py\n", + "* `ExperienceSourceBuffer` and friends: replay buffers with various characteristics. Includes a simple replay buffer and two versions of prioritized replay buffers\n", + "* various utility classes, like `TargetNet` (both discrete and continuous), wrappers for time-series preprocessing (used for tracking training progress in TensorBoard)\n", + "* wrappers for Gym environments, for example, wrappers for Atari games (copy-pasted from OpenAI baselines with some tweaks): https://github.com/Shmuma/ptan/blob/master/ptan/common/wrappers.py\n", + "\n", + "And that's basically it. The total amount of source is just ~1500 lines of Python, which makes it possible to master in a couple of hours.\n", + "\n", + "Below I'm going to demonstrate how ptan can be used to simplify the implementation of RL methods." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Installation\n", + "\n", + "We'll need gym and the opencv python bindings. And pytorch, of course" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting package metadata: done\n", + "Solving environment: done\n", + "\n", + "\n", + "==> WARNING: A newer version of conda exists. 
<==\n", + " current version: 4.6.2\n", + " latest version: 4.7.10\n", + "\n", + "Please update conda by running\n", + "\n", + " $ conda update -n base -c defaults conda\n", + "\n", + "\n", + "\n", + "# All requested packages already installed.\n", + "\n", + "Requirement already satisfied: ptan in /Users/shmuma/work/ptan (0.5)\n", + "Requirement already satisfied: torch==1.1.0 in /anaconda3/envs/ptan/lib/python3.7/site-packages (from ptan) (1.1.0)\n", + "Requirement already satisfied: gym in /anaconda3/envs/ptan/lib/python3.7/site-packages (from ptan) (0.12.1)\n", + "Requirement already satisfied: atari-py in /anaconda3/envs/ptan/lib/python3.7/site-packages (from ptan) (0.2.6)\n", + "Requirement already satisfied: numpy in /anaconda3/envs/ptan/lib/python3.7/site-packages (from ptan) (1.16.2)\n", + "Requirement already satisfied: opencv-python in /anaconda3/envs/ptan/lib/python3.7/site-packages (from ptan) (4.1.0.25)\n", + "Requirement already satisfied: pyglet>=1.2.0 in /anaconda3/envs/ptan/lib/python3.7/site-packages (from gym->ptan) (1.3.2)\n", + "Requirement already satisfied: six in /anaconda3/envs/ptan/lib/python3.7/site-packages (from gym->ptan) (1.12.0)\n", + "Requirement already satisfied: scipy in /anaconda3/envs/ptan/lib/python3.7/site-packages (from gym->ptan) (1.2.1)\n", + "Requirement already satisfied: requests>=2.0 in /anaconda3/envs/ptan/lib/python3.7/site-packages (from gym->ptan) (2.21.0)\n", + "Requirement already satisfied: future in /anaconda3/envs/ptan/lib/python3.7/site-packages (from pyglet>=1.2.0->gym->ptan) (0.17.1)\n", + "Requirement already satisfied: urllib3<1.25,>=1.21.1 in /anaconda3/envs/ptan/lib/python3.7/site-packages (from requests>=2.0->gym->ptan) (1.24.1)\n", + "Requirement already satisfied: idna<2.9,>=2.5 in /anaconda3/envs/ptan/lib/python3.7/site-packages (from requests>=2.0->gym->ptan) (2.8)\n", + "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /anaconda3/envs/ptan/lib/python3.7/site-packages (from 
requests>=2.0->gym->ptan) (3.0.4)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /anaconda3/envs/ptan/lib/python3.7/site-packages (from requests>=2.0->gym->ptan) (2019.6.16)\n" + ] + } + ], + "source": [ + "!conda install pytorch torchvision -c pytorch\n", + "!pip install ptan" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Imports" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import ptan\n", + "import gym\n", + "import numpy as np\n", + "from typing import List, Any, Optional, Tuple\n", + "\n", + "import torch\n", + "import torch.nn as nn\n", + "import torch.nn.functional as F\n", + "import torch.optim as optim\n", + "import matplotlib.pylab as plt\n", + "\n", + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Action selector\n", + "\n", + "https://github.com/Shmuma/ptan/blob/master/ptan/actions.py\n", + "\n", + "Helps to go from network output to concrete action values. Most common cases:\n", + "* Argmax: commonly used by Q-value methods, when the network predicts Q-values for set of actions and the desired action is the action with the largest Q\n", + "* Policy-based: network outputs the probability distribution (in form of logits or normalized distribution) and action need to be sampled from this distribution. Used commonly by PG-methods\n", + "\n", + "Action selector is used by the `Agent`, and rarely need to be customized (but you have this option). 
Concrete classes which could be used:\n", + "* [`ArgmaxActionSelector`](https://github.com/Shmuma/ptan/blob/master/ptan/actions.py#L12): applies `argmax` on the second axis of passed tensor (matrix is assumed)\n", + "* [`ProbabilityActionSelector`](https://github.com/Shmuma/ptan/blob/master/ptan/actions.py#L36): samples from probability distribution of discrete set of actions\n", + "* [`EpsilonGreedyActionSelector`](https://github.com/Shmuma/ptan/blob/master/ptan/actions.py#L21): has parameter $\\epsilon$ which specifies the probability of random action to be taken. \n", + "\n", + "All the classes assume numpy arrays to be passed to them\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 1, 2, 3],\n", + " [ 1, -1, 0]])" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "q_vals = np.array([[1, 2, 3], [1, -1, 0]])\n", + "q_vals" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([2, 0])" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "selector = ptan.actions.ArgmaxActionSelector()\n", + "selector(q_vals)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([2, 0])" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=0.0)\n", + "selector(q_vals)\n", + "# have to be the same result, as episilon is 0 (no random actions)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1, 1])" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=1.0)\n", + "selector(q_vals)\n", + "# will be random" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[2 2 1]\n", + "[1 2 1]\n", + "[1 2 0]\n", + "[1 2 1]\n", + "[1 2 1]\n", + "[1 2 1]\n", + "[0 2 0]\n", + "[1 2 0]\n", + "[1 2 1]\n", + "[1 2 0]\n" + ] + } + ], + "source": [ + "# here we sample from probability distribution (have to be normalized)\n", + "selector = ptan.actions.ProbabilityActionSelector()\n", + "for _ in range(10):\n", + " acts = selector(np.array([\n", + " [0.1, 0.8, 0.1],\n", + " [0.0, 0.0, 1.0],\n", + " [0.5, 0.5, 0.0]\n", + " ]))\n", + " print(acts)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Agent class\n", + "\n", + "`Agent` is class which knows how to convert observations into actions. There are three most common approaches:\n", + "* **Q-function**: NN predicts Q-values for actions, the $argmax Q(s)$ is the action\n", + "* **Policy-based**: NN predicts probability distribution over actions $\\pi(s)$, you sample from this distribution and get the action to do\n", + "* **Continuous control**: NN predits the $\\mu(s)$ of continuous control parameters and the output is your actions to execute.\n", + "\n", + "Third case is trivial, two first approached is implemented in `ptan` to be reused without any coding: [`DQNAgent`](https://github.com/Shmuma/ptan/blob/master/ptan/agent.py#L55) and [`PolicyAgent`](https://github.com/Shmuma/ptan/blob/master/ptan/agent.py#L104).\n", + "\n", + "But in reality, it is often needed to implement your own agent, some of the reasons:\n", + "* You have fancy architecture of the net -- mixture of continuous and discrete action space, have multi-modal observations (text and pixels, for example)\n", + "* You want to use non-standard exploration strategies, for example Ornstein–Uhlenbeck process (very popular exploration 
strategy in continuous control domain)\n", + "* You have a POMDP environment and your decisions are not fully defined by observations, but by some internal agent state (which is also the case for Ornstein–Uhlenbeck)\n", + "\n", + "All those cases are easily supported by subclassing the `BaseAgent` class; in TextWorld's tutorial we'll do exactly this.\n", + "\n", + "Below is an example of how the provided `DQNAgent` and `PolicyAgent` can be used." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## DQNAgent\n", + "\n", + "Suppose we have an NN which produces Q-values from observations. `DQNAgent` takes a batch of observations as input (as a numpy array), applies the network to them to get Q-values, then uses the provided `ActionSelector` to convert Q-values to indices of actions.\n", + "\n", + "Below is a small example. For simplicity, our network always produces the same output for the input batch" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "class Net(nn.Module):\n", + " def __init__(self, actions: int):\n", + " super(Net, self).__init__()\n", + " self.actions = actions\n", + " \n", + " def forward(self, x):\n", + " # we always produce diagonal tensor of shape (batch_size, actions)\n", + " return torch.eye(x.size()[0], self.actions)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "net = Net(actions=3)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "tensor([[1., 0., 0.],\n", + " [0., 1., 0.]])" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "net(torch.zeros(2, 10))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So, let's use a simple $argmax$ policy to begin with. The agent will return actions corresponding to the 1s in the net output." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "selector = ptan.actions.ArgmaxActionSelector()\n", + "agent = ptan.agent.DQNAgent(dqn_model=net, action_selector=selector, device=\"cpu\")\n", + "# note that you need to tell agent are you using GPU or not by passing device, by default it equals to \"cpu\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can pass the agent some observations (which will be ignored as our example is trivial), the output will be the actions according to NN output." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([0, 1]), [None, None])" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "agent(torch.zeros(2, 5))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The output from the agent is a tuple with two components:\n", + "1. numpy array with actions to be executed -- in our case of discrete actions, they are indices\n", + "2. list with agent's internal state. This is used for stateful agents, and is a list of None in our case. As our agent is stateless, you can ignore it" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's try to make the agent with epsilon-greedy exploration strategy. For this, we need just pass a different action selector and that's done." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=1.0)\n", + "agent = ptan.agent.DQNAgent(dqn_model=net, action_selector=selector)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As epsilon is 1, all the actions will be random, regardless of network's output" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1, 0, 1, 1, 0, 2, 2, 1, 0, 0])" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "agent(torch.zeros(10, 5))[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But we can change the epsilon value on the fly, which is very handy during the training, when we supposed to anneal epsilon over time." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([2, 1, 2, 2, 0, 1, 2, 0, 0, 0])" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "selector.epsilon = 0.5\n", + "agent(torch.zeros(10, 5))[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0, 1, 2, 0, 0, 0, 0, 0, 0, 0])" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "selector.epsilon = 0.1\n", + "agent(torch.zeros(10, 5))[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## PolicyAgent\n", + "\n", + "`PolicyAgent` expects the network to produce policy distribution over discrete set of actions. Policy distribution could be either logits (unnormalized) or normalized distribution. 
In practice you should always use logits to improve stability.\n", + "\n", + "Let's reimplement the sample above, but now the network will produce unnormalized policy scores (logits)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "class Net(nn.Module):\n", + " def __init__(self, actions: int):\n", + " super(Net, self).__init__()\n", + " self.actions = actions\n", + " \n", + " def forward(self, x):\n", + " # Now we produce the tensor with first two actions having the same logit scores\n", + " res = torch.zeros((x.size()[0], self.actions), dtype=torch.float32)\n", + " res[:, 0] = 1\n", + " res[:, 1] = 1\n", + " return res" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "tensor([[1., 1., 0., 0., 0.],\n", + " [1., 1., 0., 0., 0.],\n", + " [1., 1., 0., 0., 0.],\n", + " [1., 1., 0., 0., 0.],\n", + " [1., 1., 0., 0., 0.],\n", + " [1., 1., 0., 0., 0.]])" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "net = Net(actions=5)\n", + "net(torch.zeros(6, 10))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we need to use `ProbabilityActionSelector`. Also note the argument `apply_softmax=True`, which tells the agent that the output is not normalized." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "selector = ptan.actions.ProbabilityActionSelector()\n", + "agent = ptan.agent.PolicyAgent(model=net, action_selector=selector, apply_softmax=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can pass agent observations (fake, as before) and get some actions. 
The agent, as before, returns a tuple with actions and internal state, which will be ignored" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([4, 0, 4, 0, 0, 0])" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "agent(torch.zeros(6, 5))[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Please note that softmax assigns non-zero probabilities to zero logits, so actions 2-4 could still be sampled (but less likely than 0 and 1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Experience source\n", + "\n", + "The `Agent` abstraction described above allows us to implement environment communications in a generic way. This communication happens in the form of trajectories, produced by applying the agent's actions to a gym environment.\n", + "\n", + "At a high level, Experience source classes take the agent instance and the environment and provide you with step-by-step data from the trajectories. Functionality of those classes includes:\n", + "1. support of multiple environments being communicated with at the same time. This allows efficient GPU utilization, as a batch of observations is processed by the agent at once.\n", + "2. trajectory could be preprocessed and presented in a convenient form for further training. For example, there is an implementation of sub-trajectory rollouts, which is convenient for DQN and n-step DQN, when we're not interested in intermediate steps in n-step subtrajectories, only in first and last observations + total reward for the subtrajectory.\n", + "3. support of vectorized environments from OpenAI Universe\n", + "\n", + "So, the experience source classes act as a \"magic black box\" hiding the environment interaction and trajectory handling complexities from the library user. 
But the overall ptan philosophy is to be flexible and extensible, so, if you want, you can subclass one of the existing classes or implement your own version if necessary. \n", + "\n", + "These classes are provided by the library:\n", + "* [`ExperienceSource`](https://github.com/Shmuma/ptan/blob/master/ptan/experience.py#L18): using the agent and the set of environments, produces n-step subtrajectories with all intermediate steps.\n", + "* [`ExperienceSourceFirstLast`](https://github.com/Shmuma/ptan/blob/master/ptan/experience.py#L161): the same as `ExperienceSource`, but instead of the full subtrajectory (with all steps) keeps only the first and last steps with proper reward accumulation in between. This can save lots of memory in case of N-step DQN or A2C rollouts.\n", + "* [`ExperienceSourceRollouts`](https://github.com/Shmuma/ptan/blob/master/ptan/experience.py#L200): follows the A3C rollouts scheme described in Mnih's paper about Atari games.\n", + "\n", + "All the classes are written to be efficient both in terms of CPU and memory, which is not very important for toy problems, but might become an issue when you want to solve Atari games, keeping 10M samples in replay buffer using commodity hardware.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Toy gym environment\n", + "\n", + "For demonstration purposes, we'll implement a very simple gym environment with a small predictable observation state to show how Experience source classes work" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "class ToyEnv(gym.Env):\n", + " \"\"\"\n", + " Environment with observation 0..4 and actions 0..2\n", + " Observations are rotated sequentially mod 5, reward is equal to given action.\n", + " Episodes have a fixed length of 10\n", + " \"\"\"\n", + " def __init__(self):\n", + " super(ToyEnv, self).__init__()\n", + " self.observation_space = gym.spaces.Discrete(n=5)\n", + " 
self.action_space = gym.spaces.Discrete(n=3)\n", + " self.step_index = 0\n", + " \n", + " def reset(self):\n", + " self.step_index = 0\n", + " return self.step_index\n", + " \n", + " def step(self, action):\n", + " is_done = self.step_index == 10\n", + " if is_done:\n", + " return self.step_index % self.observation_space.n, 0.0, is_done, {}\n", + " self.step_index += 1\n", + " return self.step_index % self.observation_space.n, float(action), self.step_index == 10, {}" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "env = ToyEnv()\n", + "env.reset()" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(1, 1.0, False, {})" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "env.step(1)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(2, 2.0, False, {})" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "env.step(2)" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(3, 0.0, False, {})\n", + "(4, 0.0, False, {})\n", + "(0, 0.0, False, {})\n", + "(1, 0.0, False, {})\n", + "(2, 0.0, False, {})\n", + "(3, 0.0, False, {})\n", + "(4, 0.0, False, {})\n", + "(0, 0.0, True, {})\n", + "(0, 0.0, True, {})\n", + "(0, 0.0, True, {})\n" + ] + } + ], + "source": [ + "for _ in range(10):\n", + " r = env.step(0)\n", + " print(r)" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 26, + 
"metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "env.reset()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll also need the agent which always generates fixed action" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "class DullAgent(ptan.agent.BaseAgent):\n", + " def __init__(self, action: int):\n", + " self.action = action\n", + " \n", + " def __call__(self, observations: List[Any], state: Optional[List] = None) -> Tuple[List[int], Optional[List]]:\n", + " return [self.action for _ in observations], state" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "([1, 1], None)" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "agent = DullAgent(action=1)\n", + "agent([1, 2])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ExperienceSource class\n", + "\n", + "Generates chunks of trajectories of the given length.\n", + "\n", + "Constructor arguments:\n", + "* gym environment to be use (could be the list of environments or one single environment)\n", + "* the agent\n", + "* `steps_count=2`: the length of sub-trajectories to be generated\n", + "* `steps_delta=1`: step in subtrajectories\n", + "* `vectorized=False`: if true, environment is OpenAI Universe vectorized environment (more about them in MiniWoB tutorial)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "env = ToyEnv()\n", + "agent = DullAgent(action=1)\n", + "exp_source = ptan.experience.ExperienceSource(env=env, agent=agent, steps_count=2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "All experience source classes are providing standard python's iterator interface, so, you can just iterate over them to get sub-trajectories." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n" + ] + } + ], + "source": [ + "for exp in exp_source:\n", + " print(exp)\n", + " break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The result is a tuple of length `steps_count` (in our case we requested sub-trajectories of length 2). Every entry is a namedtuple object with the following fields:\n", + "* state: state we observed before taking the action\n", + "* action: action we've done\n", + "* reward: immediate reward we've got from env\n", + "* done: was the episode done or not" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))\n", + "(Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, 
done=True))\n", + "(Experience(state=4, action=1, reward=1.0, done=True),)\n" + ] + } + ], + "source": [ + "for exp in exp_source:\n", + " print(exp)\n", + " if exp[0].done:\n", + " break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Please note that partial trajectories are also returned, so we can handle the end of episodes properly.\n", + "\n", + "At the end of an episode, the environment is reset automatically, so we don't need to worry about it, just keep iterating:" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))\n", + "(Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=True))\n", + "(Experience(state=4, action=1, reward=1.0, done=True),)\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, 
reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))\n", + "(Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n" + ] + } + ], + "source": [ + "for idx, exp in enumerate(exp_source):\n", + " print(exp)\n", + " if idx > 15:\n", + " break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "That's very convenient, especially in cases when we have several environments running in parallel (several instances of the same Atari game, for example).\n", + "\n", + "Let's increase length of our experience chunks." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [], + "source": [ + "exp_source = ptan.experience.ExperienceSource(env=env, agent=agent, steps_count=4)" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=True))\n", + "(Experience(state=2, action=1, reward=1.0, 
done=False), Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=True))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=True))\n", + "(Experience(state=4, action=1, reward=1.0, done=True),)\n" + ] + } + ], + "source": [ + "for exp in exp_source:\n", + "    print(exp)\n", + "    if exp[0].done:\n", + "        break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we're getting subtrajectories of length 4, followed by the shorter final pieces of the trajectory when the episode ends.\n", + "\n", + "Let's give several environments to the experience source." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [], + "source": [ + "exp_source = ptan.experience.ExperienceSource(env=[ToyEnv(), ToyEnv()], agent=agent, steps_count=2)" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))\n", + "(Experience(state=4, 
action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))\n", + "(Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=True))\n" + ] + } + ], + "source": [ + "for idx, exp in enumerate(exp_source):\n", + "    print(exp)\n", + "    if idx > 15:\n", + "        break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now our environments are iterated in a round-robin fashion, giving us access to trajectories from both environments step by step." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ExperienceSourceFirstLast\n", + "\n", + "The `ExperienceSource` class gives us full subtrajectories of the given length as a tuple of $(s, a, r)$ objects. The next state $s'$ is returned in the next tuple, which is not always convenient.\n", + "\n", + "For example, in DQN training we want to have the whole tuple $(s, a, r, s')$ at once to do the 1-step Bellman approximation during training. 
In addition, some extensions of DQN, like n-step DQN, might want to collapse longer sequences of observations into (first-state, action, total-reward-for-n-steps, state-after-step-n).\n", + "\n", + "To support this in a generic way, a simple subclass of `ExperienceSource` is implemented: `ExperienceSourceFirstLast`. It accepts almost the same constructor arguments, but returns different data." + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [], + "source": [ + "exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=1.0, steps_count=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n" + ] + } + ], + "source": [ + "for idx, exp in enumerate(exp_source):\n", + "    print(exp)\n", + "    if idx > 10:\n", + "        break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now it returns a single object on every iteration, which is again a namedtuple with the following fields:\n", + "* `state`: the state we used to decide which action to take\n", 
+ "* `action`: the action we took at this step\n", + "* `reward`: the partial accumulated reward for `steps_count` steps (in our case, `steps_count=1`, so it is equal to the immediate reward)\n", + "* `last_state`: the state we got after executing the action. If the episode ends, we have `None` here\n", + "\n", + "This data is much more convenient for DQN training, as we can apply the Bellman approximation to it directly.\n", + "\n", + "Let's check the result with a larger number of steps." + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [], + "source": [ + "exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=1.0, steps_count=2)" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ExperienceFirstLast(state=0, action=1, reward=2.0, last_state=2)\n", + "ExperienceFirstLast(state=1, action=1, reward=2.0, last_state=3)\n", + "ExperienceFirstLast(state=2, action=1, reward=2.0, last_state=4)\n", + "ExperienceFirstLast(state=3, action=1, reward=2.0, last_state=0)\n", + "ExperienceFirstLast(state=4, action=1, reward=2.0, last_state=1)\n", + "ExperienceFirstLast(state=0, action=1, reward=2.0, last_state=2)\n", + "ExperienceFirstLast(state=1, action=1, reward=2.0, last_state=3)\n", + "ExperienceFirstLast(state=2, action=1, reward=2.0, last_state=4)\n", + "ExperienceFirstLast(state=3, action=1, reward=2.0, last_state=None)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)\n", + "ExperienceFirstLast(state=0, action=1, reward=2.0, last_state=2)\n", + "ExperienceFirstLast(state=1, action=1, reward=2.0, last_state=3)\n" + ] + } + ], + "source": [ + "for idx, exp in enumerate(exp_source):\n", + "    print(exp)\n", + "    if idx > 10:\n", + "        break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So, now we're collapsing two steps on every iteration, accumulating the reward over both steps 
(that's why reward=2.0 for most of the samples).\n", + "\n", + "The more interesting samples are at the end of the episode:\n", + "```\n", + "ExperienceFirstLast(state=3, action=1, reward=2.0, last_state=None)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)\n", + "```\n", + "\n", + "Because the episode ends, those samples have `last_state=None`, and the reward is accumulated only over the remaining tail of the episode (a single step for the very last sample). Such tiny details are very easy to get wrong if you do all the trajectory handling yourself." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Experience source buffers\n", + "\n", + "In DQN we rarely deal with immediate experience samples, as they are heavily correlated, which leads to instability in training.\n", + "\n", + "Normally, we have a large replay buffer which is populated with experience pieces. The buffer is then sampled (randomly or with priority weights) to get the training batch. A replay buffer normally has a maximum capacity, so old samples are pushed out when the buffer reaches that limit.\n", + "\n", + "There are several implementation tricks here, which become extremely important when you need to deal with large problems:\n", + "* how to efficiently sample from a large buffer\n", + "* how to push old samples out of the buffer\n", + "* in the case of a prioritized buffer, how to maintain and handle priorities in the most efficient way\n", + "\n", + "All this becomes quite a non-trivial task if you want to solve Atari games, keeping 10-100M samples where every sample is an image from the game. A small mistake can lead to a 10-100x memory increase and major slowdowns of the training process.\n", + "\n", + "Ptan provides several variants of replay buffers, which integrate simply with the `ExperienceSource` and `Agent` machinery. 
Normally, all you need to do is ask the buffer to pull a new sample from the source and then sample the training batch.\n", + "\n", + "Provided classes:\n", + "* [`ExperienceReplayBuffer`](https://github.com/Shmuma/ptan/blob/master/ptan/experience.py#L327): simple replay buffer of predefined size with uniform sampling\n", + "* [`PrioReplayBufferNaive`](https://github.com/Shmuma/ptan/blob/master/ptan/experience.py#L371): simple, but not very efficient prioritized replay buffer implementation. Sampling complexity is O(n), which might become an issue with large buffers\n", + "* [`PrioritizedReplayBuffer`](https://github.com/Shmuma/ptan/blob/master/ptan/experience.py#L414): uses segment trees for sampling, which makes the code cryptic, but gives O(log(n)) sampling complexity\n", + "\n", + "Below is an example of the simple replay buffer. If you want, you can find an example of `PrioritizedReplayBuffer` usage in the examples for chapter 7 of my book: https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/master/Chapter07/05_dqn_prio_replay.py" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [], + "source": [ + "env = ToyEnv()\n", + "agent = DullAgent(action=1)\n", + "exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=1.0, steps_count=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [], + "source": [ + "buffer = ptan.experience.ExperienceReplayBuffer(exp_source, buffer_size=100)" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(buffer)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "All replay buffers provide the following interface:\n", + "* the Python iterator interface, to walk over all the samples in the buffer\n", + "* method 
`populate(N)`, to get N samples from the experience source and put them into the buffer\n", + "* method `sample(N)`, to get a batch of N experience objects\n", + "\n", + "So, the normal training loop for DQN is an endless repetition of the following steps:\n", + "1. call `buffer.populate(1)` to get a fresh sample from the environment\n", + "2. `batch = buffer.sample(BATCH_SIZE)` to get a batch from the buffer\n", + "3. calculate the loss on the sampled batch\n", + "4. backpropagate\n", + "5. repeat until convergence (hopefully)\n", + "\n", + "Everything else happens automatically: environment resets, sub-trajectory handling, buffer size maintenance, etc." + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + 
"ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, 
reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n" + ] + } + ], + "source": [ + "for step in range(10):\n", + "    buffer.populate(1)\n", + "    # if the buffer is still too small, skip training\n", + "    if len(buffer) < 5:\n", + "        continue\n", + "    batch = buffer.sample(4)\n", + "    print(\"Train time, %d batch samples:\" % len(batch))\n", + "    for s in batch:\n", + "        print(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Monitoring the training\n", + "\n", + "Normally, when we're running the training process, we want to keep an eye on several metrics to check how well our method is doing. The minimal set of things to watch includes:\n", + "* training loss (several loss components in the case of A2C, for example)\n", + "* values predicted by the network (in the case of DQN)\n", + "* statistics about episode rewards (to check that our agent improves over time)\n", + "* statistics about the length of the episode, as this is often a proxy for reward\n", + "\n", + "The first two items are calculated in the training loop, but the other two are not that easy to get. If we're implementing everything from scratch, we need to track the current episode and, when it ends, record the total reward and length.\n", + "\n", + "Ptan simplifies this by providing a method in the experience source which returns this information in one call. Method `pop_rewards_steps()` returns a list in which each entry describes an episode that has completed since the last call to the method. If no episodes have completed between the calls, an empty list is returned.\n", + "\n", + "Every item is a tuple of (total_reward, total_steps).\n", + "\n", + "So, the only thing you need to do to monitor the training progress is to periodically call `pop_rewards_steps()` in the training loop and handle the results (printing to the console, sending to TensorBoard, or whatever)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [], + "source": [ + "r = exp_source.pop_rewards_steps()" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[(10.0, 10)]" + ] + }, + "execution_count": 49, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "r" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We've had one episode completed so far: it got a total reward of 10.0 in 10 steps. Calling the method again returns an empty list, as no new episodes have finished since the previous call." + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[]" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "exp_source.pop_rewards_steps()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Other tools\n", + "\n", + "There are several smaller things which could be useful, like [`TargetNet`](https://github.com/Shmuma/ptan/blob/master/ptan/agent.py#L79), which allows you to keep a copy of the model weights and synchronize them from time to time (which greatly helps the stability of DQN training), or a [set of utils](https://github.com/Shmuma/ptan/blob/master/ptan/common/utils.py) to smooth time series for better visualisation of training progress.\n", + "\n", + "There are also [PyTorch Ignite bindings](https://github.com/Shmuma/ptan/blob/master/ptan/ignite.py) which integrate ptan with the Ignite framework:\n", + "\n", + "* install end-of-episode hooks: `EpisodeEvents.EPISODE_COMPLETED`\n", + "* handle situations when the reward reaches a boundary: `EpisodeEvents.BOUND_REWARD_REACHED`\n", + "* measure the performance of the training process: `EpisodeFPSHandler`\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Simple CartPole solver\n", + "\n", + "Below is a very simple DQN version applied to CartPole, just to demonstrate how all 
things fit together in real life." + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [], + "source": [ + "class Net(nn.Module):\n", + "    def __init__(self, obs_size, hidden_size, n_actions):\n", + "        super(Net, self).__init__()\n", + "        self.net = nn.Sequential(\n", + "            nn.Linear(obs_size, hidden_size),\n", + "            nn.ReLU(),\n", + "            nn.Linear(hidden_size, n_actions)\n", + "        )\n", + "\n", + "    def forward(self, x):\n", + "        # CartPole returns double-precision observations rather than standard floats, hence the cast\n", + "        return self.net(x.float())" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [], + "source": [ + "BATCH_SIZE = 64\n", + "REPLAY_SIZE = 1000\n", + "LR = 1e-3\n", + "GAMMA = 0.9\n", + "EPS_DECAY = 0.995" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [], + "source": [ + "env = gym.make(\"CartPole-v0\")" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Net(\n", + "  (net): Sequential(\n", + "    (0): Linear(in_features=4, out_features=64, bias=True)\n", + "    (1): ReLU()\n", + "    (2): Linear(in_features=64, out_features=2, bias=True)\n", + "  )\n", + ")" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "net = Net(obs_size=env.observation_space.shape[0], hidden_size=64, n_actions=env.action_space.n)\n", + "optimizer = optim.Adam(net.parameters(), LR)\n", + "net" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [], + "source": [ + "action_selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=1.0)\n", + "agent = ptan.agent.DQNAgent(net, action_selector)\n", + "exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=GAMMA)\n", + "buffer = ptan.experience.ExperienceReplayBuffer(exp_source, buffer_size=REPLAY_SIZE)" + ] + }, + { + 
"cell_type": "code", + "execution_count": 56, + "metadata": {}, + "outputs": [], + "source": [ + "@torch.no_grad()\n", + "def unpack_batch(batch: List[ptan.experience.ExperienceFirstLast], net: nn.Module, gamma: float):\n", + "    states = []\n", + "    actions = []\n", + "    rewards = []\n", + "    done_masks = []\n", + "    last_states = []\n", + "    for exp in batch:\n", + "        states.append(exp.state)\n", + "        actions.append(exp.action)\n", + "        rewards.append(exp.reward)\n", + "        done_masks.append(exp.last_state is None)\n", + "        if exp.last_state is None:\n", + "            # placeholder state for finished episodes, its Q-value is zeroed below\n", + "            last_states.append(exp.state)\n", + "        else:\n", + "            last_states.append(exp.last_state)\n", + "\n", + "    states_v = torch.tensor(states)\n", + "    actions_v = torch.tensor(actions)\n", + "    rewards_v = torch.tensor(rewards)\n", + "    last_states_v = torch.tensor(last_states)\n", + "    last_state_q_v = net(last_states_v)\n", + "    best_last_q_v = torch.max(last_state_q_v, dim=1)[0]\n", + "    best_last_q_v[done_masks] = 0.0\n", + "    # Bellman target: r + gamma * max_a Q(s', a)\n", + "    return states_v, actions_v, best_last_q_v * gamma + rewards_v" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "28: episode done, reward=27.000, steps=27, epsilon=1.00\n", + "43: episode done, reward=15.000, steps=15, epsilon=1.00\n", + "66: episode done, reward=23.000, steps=23, epsilon=1.00\n", + "85: episode done, reward=19.000, steps=19, epsilon=1.00\n", + "98: episode done, reward=13.000, steps=13, epsilon=1.00\n", + "111: episode done, reward=13.000, steps=13, epsilon=1.00\n", + "166: episode done, reward=55.000, steps=55, epsilon=1.00\n", + "191: episode done, reward=25.000, steps=25, epsilon=1.00\n", + "212: episode done, reward=21.000, steps=21, epsilon=0.94\n", + "231: episode done, reward=19.000, steps=19, epsilon=0.86\n", + "248: episode done, reward=17.000, steps=17, epsilon=0.79\n", + "266: episode done, reward=18.000, steps=18, epsilon=0.72\n", + "282: episode done, reward=16.000, steps=16, 
epsilon=0.66\n", + "298: episode done, reward=16.000, steps=16, epsilon=0.61\n", + "325: episode done, reward=27.000, steps=27, epsilon=0.53\n", + "337: episode done, reward=12.000, steps=12, epsilon=0.50\n", + "349: episode done, reward=12.000, steps=12, epsilon=0.47\n", + "362: episode done, reward=13.000, steps=13, epsilon=0.44\n", + "371: episode done, reward=9.000, steps=9, epsilon=0.42\n", + "387: episode done, reward=16.000, steps=16, epsilon=0.39\n", + "397: episode done, reward=10.000, steps=10, epsilon=0.37\n", + "408: episode done, reward=11.000, steps=11, epsilon=0.35\n", + "416: episode done, reward=8.000, steps=8, epsilon=0.34\n", + "436: episode done, reward=20.000, steps=20, epsilon=0.31\n", + "445: episode done, reward=9.000, steps=9, epsilon=0.29\n", + "458: episode done, reward=13.000, steps=13, epsilon=0.27\n", + "467: episode done, reward=9.000, steps=9, epsilon=0.26\n", + "476: episode done, reward=9.000, steps=9, epsilon=0.25\n", + "489: episode done, reward=13.000, steps=13, epsilon=0.23\n", + "501: episode done, reward=12.000, steps=12, epsilon=0.22\n", + "519: episode done, reward=18.000, steps=18, epsilon=0.20\n", + "533: episode done, reward=14.000, steps=14, epsilon=0.19\n", + "543: episode done, reward=10.000, steps=10, epsilon=0.18\n", + "553: episode done, reward=10.000, steps=10, epsilon=0.17\n", + "562: episode done, reward=9.000, steps=9, epsilon=0.16\n", + "571: episode done, reward=9.000, steps=9, epsilon=0.16\n", + "581: episode done, reward=10.000, steps=10, epsilon=0.15\n", + "593: episode done, reward=12.000, steps=12, epsilon=0.14\n", + "602: episode done, reward=9.000, steps=9, epsilon=0.13\n", + "613: episode done, reward=11.000, steps=11, epsilon=0.13\n", + "627: episode done, reward=14.000, steps=14, epsilon=0.12\n", + "640: episode done, reward=13.000, steps=13, epsilon=0.11\n", + "653: episode done, reward=13.000, steps=13, epsilon=0.10\n", + "664: episode done, reward=11.000, steps=11, epsilon=0.10\n", + "679: 
episode done, reward=15.000, steps=15, epsilon=0.09\n", + "700: episode done, reward=21.000, steps=21, epsilon=0.08\n", + "725: episode done, reward=25.000, steps=25, epsilon=0.07\n", + "736: episode done, reward=11.000, steps=11, epsilon=0.07\n", + "746: episode done, reward=10.000, steps=10, epsilon=0.06\n", + "756: episode done, reward=10.000, steps=10, epsilon=0.06\n", + "766: episode done, reward=10.000, steps=10, epsilon=0.06\n", + "779: episode done, reward=13.000, steps=13, epsilon=0.05\n", + "794: episode done, reward=15.000, steps=15, epsilon=0.05\n", + "807: episode done, reward=13.000, steps=13, epsilon=0.05\n", + "818: episode done, reward=11.000, steps=11, epsilon=0.05\n", + "828: episode done, reward=10.000, steps=10, epsilon=0.04\n", + "837: episode done, reward=9.000, steps=9, epsilon=0.04\n", + "846: episode done, reward=9.000, steps=9, epsilon=0.04\n", + "856: episode done, reward=10.000, steps=10, epsilon=0.04\n", + "865: episode done, reward=9.000, steps=9, epsilon=0.04\n", + "875: episode done, reward=10.000, steps=10, epsilon=0.03\n", + "886: episode done, reward=11.000, steps=11, epsilon=0.03\n", + "897: episode done, reward=11.000, steps=11, epsilon=0.03\n", + "909: episode done, reward=12.000, steps=12, epsilon=0.03\n", + "934: episode done, reward=25.000, steps=25, epsilon=0.03\n", + "947: episode done, reward=13.000, steps=13, epsilon=0.02\n", + "961: episode done, reward=14.000, steps=14, epsilon=0.02\n", + "974: episode done, reward=13.000, steps=13, epsilon=0.02\n", + "986: episode done, reward=12.000, steps=12, epsilon=0.02\n", + "1018: episode done, reward=32.000, steps=32, epsilon=0.02\n", + "1060: episode done, reward=42.000, steps=42, epsilon=0.01\n", + "1079: episode done, reward=19.000, steps=19, epsilon=0.01\n", + "1131: episode done, reward=52.000, steps=52, epsilon=0.01\n", + "1175: episode done, reward=44.000, steps=44, epsilon=0.01\n", + "1275: episode done, reward=100.000, steps=100, epsilon=0.00\n", + "1309: episode 
done, reward=34.000, steps=34, epsilon=0.00\n", + "1339: episode done, reward=30.000, steps=30, epsilon=0.00\n", + "1405: episode done, reward=66.000, steps=66, epsilon=0.00\n", + "1500: episode done, reward=95.000, steps=95, epsilon=0.00\n", + "1535: episode done, reward=35.000, steps=35, epsilon=0.00\n", + "1556: episode done, reward=21.000, steps=21, epsilon=0.00\n", + "1580: episode done, reward=24.000, steps=24, epsilon=0.00\n", + "1625: episode done, reward=45.000, steps=45, epsilon=0.00\n", + "1654: episode done, reward=29.000, steps=29, epsilon=0.00\n", + "1684: episode done, reward=30.000, steps=30, epsilon=0.00\n", + "1714: episode done, reward=30.000, steps=30, epsilon=0.00\n", + "1748: episode done, reward=34.000, steps=34, epsilon=0.00\n", + "1774: episode done, reward=26.000, steps=26, epsilon=0.00\n", + "1804: episode done, reward=30.000, steps=30, epsilon=0.00\n", + "1831: episode done, reward=27.000, steps=27, epsilon=0.00\n", + "1848: episode done, reward=17.000, steps=17, epsilon=0.00\n", + "1864: episode done, reward=16.000, steps=16, epsilon=0.00\n", + "1894: episode done, reward=30.000, steps=30, epsilon=0.00\n", + "1922: episode done, reward=28.000, steps=28, epsilon=0.00\n", + "1954: episode done, reward=32.000, steps=32, epsilon=0.00\n", + "1982: episode done, reward=28.000, steps=28, epsilon=0.00\n", + "2038: episode done, reward=56.000, steps=56, epsilon=0.00\n", + "2072: episode done, reward=34.000, steps=34, epsilon=0.00\n", + "2172: episode done, reward=100.000, steps=100, epsilon=0.00\n", + "2264: episode done, reward=92.000, steps=92, epsilon=0.00\n", + "2294: episode done, reward=30.000, steps=30, epsilon=0.00\n", + "2328: episode done, reward=34.000, steps=34, epsilon=0.00\n", + "2382: episode done, reward=54.000, steps=54, epsilon=0.00\n", + "2420: episode done, reward=38.000, steps=38, epsilon=0.00\n", + "2469: episode done, reward=49.000, steps=49, epsilon=0.00\n", + "2523: episode done, reward=54.000, steps=54, epsilon=0.00\n", 
+ "2547: episode done, reward=24.000, steps=24, epsilon=0.00\n", + "2573: episode done, reward=26.000, steps=26, epsilon=0.00\n", + "2606: episode done, reward=33.000, steps=33, epsilon=0.00\n", + "2620: episode done, reward=14.000, steps=14, epsilon=0.00\n", + "2646: episode done, reward=26.000, steps=26, epsilon=0.00\n", + "2666: episode done, reward=20.000, steps=20, epsilon=0.00\n", + "2698: episode done, reward=32.000, steps=32, epsilon=0.00\n", + "2738: episode done, reward=40.000, steps=40, epsilon=0.00\n", + "2779: episode done, reward=41.000, steps=41, epsilon=0.00\n", + "2822: episode done, reward=43.000, steps=43, epsilon=0.00\n", + "2880: episode done, reward=58.000, steps=58, epsilon=0.00\n", + "2936: episode done, reward=56.000, steps=56, epsilon=0.00\n" + ] + } + ], + "source": [ + "step = 0\n", + "losses = []\n", + "rewards = []\n", + "\n", + "while True:\n", + "    step += 1\n", + "    buffer.populate(1)\n", + "    solved = False\n", + "    for reward, steps in exp_source.pop_rewards_steps():\n", + "        print(\"%d: episode done, reward=%.3f, steps=%d, epsilon=%.2f\" % (\n", + "            step, reward, steps, action_selector.epsilon))\n", + "        rewards.append(reward)\n", + "        solved = reward > 150\n", + "    if solved:\n", + "        print(\"Congrats!\")\n", + "        break\n", + "    if len(buffer) < 200:\n", + "        continue\n", + "    batch = buffer.sample(BATCH_SIZE)\n", + "    states_v, actions_v, tgt_q_v = unpack_batch(batch, net, GAMMA)\n", + "    optimizer.zero_grad()\n", + "    q_v = net(states_v)\n", + "    q_v = q_v.gather(1, actions_v.unsqueeze(-1)).squeeze(-1)\n", + "    loss_v = F.mse_loss(q_v, tgt_q_v)\n", + "    loss_v.backward()\n", + "    optimizer.step()\n", + "    losses.append(loss_v.item())\n", + "    action_selector.epsilon *= EPS_DECAY\n", + "    if step > 3000:\n", + "        break" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plt.plot(losses);" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plt.plot(rewards);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Of course, the hyperparameters should be tuned and a target network would improve stability, but you get the idea :)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs/intro.ipynb b/docs/intro.ipynb new file mode 100644 index 0000000..7069392 --- /dev/null +++ b/docs/intro.ipynb @@ -0,0 +1,1798 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Ptan intro\n", + "\n", + "[PTAN](https://github.com/Shmuma/ptan) (an abbreviation of `PyTorch AgentNet`) is a small library I wrote to simplify RL experiments. It tries to strike a balance between two extremes:\n", + "\n", + "1. import a library, then write one line to train a DQN (a vivid example is the [OpenAI baselines project](https://github.com/openai/baselines/))\n", + "2. implement everything from scratch\n", + "\n", + "The first approach is very inflexible. It works well when you're using the library the way it is supposed to be used. 
But if you want to do something fancy, you quickly find yourself hacking the lib and fighting with constraints imposed by the author rather than solving the problem you want to solve.\n", + "\n", + "Second extreme gives you *too much freedom* and requires implementing replay buffers and trajectory handling over and over again, which is error-prone, boring and inefficient.\n", + "\n", + "Several years ago I got tired of writing replay buffers and decided to implement something in between: not \"the universal RL lib\", but a set of classes to avoid writing boilerplate code.\n", + "\n", + "I used ptan to implement all the [examples for the \"Deep Reinforcement Learning Hands-On\" book](https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/), which cover all the major DRL methods, including DQN, A3C, all the tricks from the Rainbow paper, DDPG, D4PG, PPO, TRPO, ACKTR and AlphaGo Zero.\n", + "\n", + "## High-level overview\n", + "\n", + "At a high level, ptan provides the following entities:\n", + "\n", + "* `Agent`: class which knows how to convert a batch of observations to a batch of actions to be executed. It can contain optional state, in case you need to track some info between consecutive actions in one episode (for example, noise params for Ornstein–Uhlenbeck exploration). Normally, you can use [already existing Agent classes](https://github.com/Shmuma/ptan/blob/master/ptan/agent.py) or write your own subclass of `BaseAgent`.\n", + "* `ActionSelector`: small piece of logic which knows how to choose the action from some output of the network. Works in tandem with `Agent`: https://github.com/Shmuma/ptan/blob/master/ptan/actions.py\n", + "* `ExperienceSource` and variations: using the `Agent` instance and a gym environment object, provides information about the trajectories from episodes. In the simplest form it is one single $(a, r, s')$ transition at a time, but functionality goes beyond this. 
Source file is https://github.com/Shmuma/ptan/blob/master/ptan/experience.py\n", + "* `ExperienceSourceBuffer` and friends: replay buffers with various characteristics. Includes a simple replay buffer and two versions of prioritized replay buffers\n", + "* various utility classes, like `TargetNet` (both discrete and continuous), wrappers for time-series preprocessing (used for tracking training progress in TensorBoard)\n", + "* wrappers for Gym environments, for example, wrappers for Atari games (copy-pasted from OpenAI baselines with some tweaks): https://github.com/Shmuma/ptan/blob/master/ptan/common/wrappers.py\n", + "\n", + "And that's basically it. The total amount of source is just ~1500 lines of Python, which makes it possible to master in a couple of hours.\n", + "\n", + "Below I'm going to demonstrate how ptan can be used to simplify the implementation of RL methods." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Installation\n", + "\n", + "We'll need gym and the OpenCV Python bindings. And PyTorch, of course" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting package metadata: done\n", + "Solving environment: done\n", + "\n", + "\n", + "==> WARNING: A newer version of conda exists. 
<==\n", + " current version: 4.6.2\n", + " latest version: 4.7.10\n", + "\n", + "Please update conda by running\n", + "\n", + " $ conda update -n base -c defaults conda\n", + "\n", + "\n", + "\n", + "# All requested packages already installed.\n", + "\n", + "Requirement already satisfied: ptan in /Users/shmuma/work/ptan (0.5)\n", + "Requirement already satisfied: torch==1.1.0 in /anaconda3/envs/ptan/lib/python3.7/site-packages (from ptan) (1.1.0)\n", + "Requirement already satisfied: gym in /anaconda3/envs/ptan/lib/python3.7/site-packages (from ptan) (0.12.1)\n", + "Requirement already satisfied: atari-py in /anaconda3/envs/ptan/lib/python3.7/site-packages (from ptan) (0.2.6)\n", + "Requirement already satisfied: numpy in /anaconda3/envs/ptan/lib/python3.7/site-packages (from ptan) (1.16.2)\n", + "Requirement already satisfied: opencv-python in /anaconda3/envs/ptan/lib/python3.7/site-packages (from ptan) (4.1.0.25)\n", + "Requirement already satisfied: pyglet>=1.2.0 in /anaconda3/envs/ptan/lib/python3.7/site-packages (from gym->ptan) (1.3.2)\n", + "Requirement already satisfied: six in /anaconda3/envs/ptan/lib/python3.7/site-packages (from gym->ptan) (1.12.0)\n", + "Requirement already satisfied: scipy in /anaconda3/envs/ptan/lib/python3.7/site-packages (from gym->ptan) (1.2.1)\n", + "Requirement already satisfied: requests>=2.0 in /anaconda3/envs/ptan/lib/python3.7/site-packages (from gym->ptan) (2.21.0)\n", + "Requirement already satisfied: future in /anaconda3/envs/ptan/lib/python3.7/site-packages (from pyglet>=1.2.0->gym->ptan) (0.17.1)\n", + "Requirement already satisfied: urllib3<1.25,>=1.21.1 in /anaconda3/envs/ptan/lib/python3.7/site-packages (from requests>=2.0->gym->ptan) (1.24.1)\n", + "Requirement already satisfied: idna<2.9,>=2.5 in /anaconda3/envs/ptan/lib/python3.7/site-packages (from requests>=2.0->gym->ptan) (2.8)\n", + "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /anaconda3/envs/ptan/lib/python3.7/site-packages (from 
requests>=2.0->gym->ptan) (3.0.4)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /anaconda3/envs/ptan/lib/python3.7/site-packages (from requests>=2.0->gym->ptan) (2019.6.16)\n" + ] + } + ], + "source": [ + "!conda install pytorch torchvision -c pytorch\n", + "!pip install ptan" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Imports" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import ptan\n", + "import gym\n", + "import numpy as np\n", + "from typing import List, Any, Optional, Tuple\n", + "\n", + "import torch\n", + "import torch.nn as nn\n", + "import torch.nn.functional as F\n", + "import torch.optim as optim\n", + "import matplotlib.pylab as plt\n", + "\n", + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Action selector\n", + "\n", + "https://github.com/Shmuma/ptan/blob/master/ptan/actions.py\n", + "\n", + "Helps to go from network output to concrete action values. Most common cases:\n", + "* Argmax: commonly used by Q-value methods, when the network predicts Q-values for a set of actions and the desired action is the action with the largest Q\n", + "* Policy-based: the network outputs a probability distribution (in the form of logits or a normalized distribution) and the action needs to be sampled from this distribution. Commonly used by PG-methods\n", + "\n", + "The action selector is used by the `Agent` and rarely needs to be customized (but you have this option). 
Concrete classes which can be used:\n", + "* [`ArgmaxActionSelector`](https://github.com/Shmuma/ptan/blob/master/ptan/actions.py#L12): applies `argmax` on the second axis of the passed tensor (matrix is assumed)\n", + "* [`ProbabilityActionSelector`](https://github.com/Shmuma/ptan/blob/master/ptan/actions.py#L36): samples from a probability distribution over a discrete set of actions\n", + "* [`EpsilonGreedyActionSelector`](https://github.com/Shmuma/ptan/blob/master/ptan/actions.py#L21): has a parameter $\\epsilon$ which specifies the probability of taking a random action.\n", + "\n", + "All the classes expect numpy arrays to be passed to them\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 1,  2,  3],\n", + "       [ 1, -1,  0]])" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "q_vals = np.array([[1, 2, 3], [1, -1, 0]])\n", + "q_vals" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([2, 0])" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "selector = ptan.actions.ArgmaxActionSelector()\n", + "selector(q_vals)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([2, 0])" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=0.0)\n", + "selector(q_vals)\n", + "# must be the same result, as epsilon is 0 (no random actions)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1, 1])" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=1.0)\n", + "selector(q_vals)\n", + "# will be random" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[2 2 1]\n", + "[1 2 1]\n", + "[1 2 0]\n", + "[1 2 1]\n", + "[1 2 1]\n", + "[1 2 1]\n", + "[0 2 0]\n", + "[1 2 0]\n", + "[1 2 1]\n", + "[1 2 0]\n" + ] + } + ], + "source": [ + "# here we sample from probability distribution (have to be normalized)\n", + "selector = ptan.actions.ProbabilityActionSelector()\n", + "for _ in range(10):\n", + " acts = selector(np.array([\n", + " [0.1, 0.8, 0.1],\n", + " [0.0, 0.0, 1.0],\n", + " [0.5, 0.5, 0.0]\n", + " ]))\n", + " print(acts)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Agent class\n", + "\n", + "`Agent` is class which knows how to convert observations into actions. There are three most common approaches:\n", + "* **Q-function**: NN predicts Q-values for actions, the $argmax Q(s)$ is the action\n", + "* **Policy-based**: NN predicts probability distribution over actions $\\pi(s)$, you sample from this distribution and get the action to do\n", + "* **Continuous control**: NN predits the $\\mu(s)$ of continuous control parameters and the output is your actions to execute.\n", + "\n", + "Third case is trivial, two first approached is implemented in `ptan` to be reused without any coding: [`DQNAgent`](https://github.com/Shmuma/ptan/blob/master/ptan/agent.py#L55) and [`PolicyAgent`](https://github.com/Shmuma/ptan/blob/master/ptan/agent.py#L104).\n", + "\n", + "But in reality, it is often needed to implement your own agent, some of the reasons:\n", + "* You have fancy architecture of the net -- mixture of continuous and discrete action space, have multi-modal observations (text and pixels, for example)\n", + "* You want to use non-standard exploration strategies, for example Ornstein–Uhlenbeck process (very popular exploration 
strategy in continuous control domain)\n", + "* You have a POMDP environment and your decisions are not fully defined by observations, but by some internal agent state (which is also the case for Ornstein–Uhlenbeck)\n", + "\n", + "All those cases are easily supported by subclassing the `BaseAgent` class; in the TextWorld tutorial we'll do exactly this.\n", + "\n", + "Below are examples of how the provided `DQNAgent` and `PolicyAgent` can be used." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## DQNAgent\n", + "\n", + "Suppose we have a NN which produces Q-values from observations. `DQNAgent` takes a batch of observations as input (as a numpy array), applies the network to them to get Q-values, then uses the provided `ActionSelector` to convert Q-values to indices of actions.\n", + "\n", + "Below is a small example. For simplicity, our network always produces the same output for the input batch" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "class Net(nn.Module):\n", + "    def __init__(self, actions: int):\n", + "        super(Net, self).__init__()\n", + "        self.actions = actions\n", + "\n", + "    def forward(self, x):\n", + "        # we always produce a diagonal tensor of shape (batch_size, actions)\n", + "        return torch.eye(x.size()[0], self.actions)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "net = Net(actions=3)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "tensor([[1., 0., 0.],\n", + "        [0., 1., 0.]])" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "net(torch.zeros(2, 10))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So, let's use a simple $argmax$ policy to begin with. The agent will return actions corresponding to the 1s in the net output." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "selector = ptan.actions.ArgmaxActionSelector()\n", + "agent = ptan.agent.DQNAgent(dqn_model=net, action_selector=selector, device=\"cpu\")\n", + "# note that you need to tell the agent whether you are using a GPU by passing device; by default it is \"cpu\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can pass the agent some observations (which will be ignored as our example is trivial); the output will be the actions according to the NN output." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([0, 1]), [None, None])" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "agent(torch.zeros(2, 5))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The output from the agent is a tuple with two components:\n", + "1. numpy array with actions to be executed -- in our case of discrete actions, they are indices\n", + "2. list with the agent's internal state. This is used for stateful agents, and is a list of None in our case. As our agent is stateless, you can ignore it" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's make an agent with an epsilon-greedy exploration strategy. For this, we just need to pass a different action selector, and that's it." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=1.0)\n", + "agent = ptan.agent.DQNAgent(dqn_model=net, action_selector=selector)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As epsilon is 1, all the actions will be random, regardless of the network's output" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1, 0, 1, 1, 0, 2, 2, 1, 0, 0])" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "agent(torch.zeros(10, 5))[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But we can change the epsilon value on the fly, which is very handy during training, when we are supposed to anneal epsilon over time." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([2, 1, 2, 2, 0, 1, 2, 0, 0, 0])" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "selector.epsilon = 0.5\n", + "agent(torch.zeros(10, 5))[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0, 1, 2, 0, 0, 0, 0, 0, 0, 0])" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "selector.epsilon = 0.1\n", + "agent(torch.zeros(10, 5))[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## PolicyAgent\n", + "\n", + "`PolicyAgent` expects the network to produce a policy distribution over a discrete set of actions. The policy distribution can be either logits (unnormalized) or a normalized distribution. 
In practice you should always use logits to improve stability.\n", + "\n", + "Let's reimplement the sample above, but now the network will produce policy logits" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "class Net(nn.Module):\n", + "    def __init__(self, actions: int):\n", + "        super(Net, self).__init__()\n", + "        self.actions = actions\n", + "\n", + "    def forward(self, x):\n", + "        # Now we produce the tensor with first two actions having the same logit scores\n", + "        res = torch.zeros((x.size()[0], self.actions), dtype=torch.float32)\n", + "        res[:, 0] = 1\n", + "        res[:, 1] = 1\n", + "        return res" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "tensor([[1., 1., 0., 0., 0.],\n", + "        [1., 1., 0., 0., 0.],\n", + "        [1., 1., 0., 0., 0.],\n", + "        [1., 1., 0., 0., 0.],\n", + "        [1., 1., 0., 0., 0.],\n", + "        [1., 1., 0., 0., 0.]])" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "net = Net(actions=5)\n", + "net(torch.zeros(6, 10))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we need to use `ProbabilityActionSelector`. Also note the argument `apply_softmax=True`, which tells the agent that the output is not normalized." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "selector = ptan.actions.ProbabilityActionSelector()\n", + "agent = ptan.agent.PolicyAgent(model=net, action_selector=selector, apply_softmax=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can pass the agent observations (fake, as before) and get some actions. 
The agent, as before, returns a tuple with actions and internal state, which will be ignored" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([4, 0, 4, 0, 0, 0])" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "agent(torch.zeros(6, 5))[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Please note that softmax assigns non-zero probabilities to zero logits, so actions 2-4 can still be sampled (but are less likely than 0 and 1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Experience source\n", + "\n", + "The `Agent` abstraction described above allows us to implement environment communication in a generic way. This communication happens in the form of trajectories, produced by applying the agent's actions to a gym environment.\n", + "\n", + "At a high level, experience source classes take the agent instance and the environment and provide you with step-by-step data from the trajectories. The functionality of those classes includes:\n", + "1. support for multiple environments being communicated with at the same time. This allows efficient GPU utilization, as a batch of observations is processed by the agent at once.\n", + "2. a trajectory can be preprocessed and presented in a form convenient for further training. For example, there is an implementation of sub-trajectory rollouts, which is convenient for DQN and n-step DQN, when we're not interested in intermediate steps in n-step subtrajectories, only in the first and last observations plus the total reward for the subtrajectory.\n", + "3. support for vectorized environments from OpenAI Universe\n", + "\n", + "So, the experience source classes act as a \"magic black box\" hiding the environment interaction and trajectory handling complexities from the library user. 
But the overall ptan philosophy is to be flexible and extensible, so, if you want, you can subclass one of the existing classes or implement your own version if necessary.\n", + "\n", + "The following classes are provided by the library:\n", + "* [`ExperienceSource`](https://github.com/Shmuma/ptan/blob/master/ptan/experience.py#L18): using the agent and a set of environments, produces n-step subtrajectories with all intermediate steps.\n", + "* [`ExperienceSourceFirstLast`](https://github.com/Shmuma/ptan/blob/master/ptan/experience.py#L161): the same as `ExperienceSource`, but instead of the full subtrajectory (with all steps) keeps only the first and last steps, with proper reward accumulation in between. This can save lots of memory in the case of n-step DQN or A2C rollouts.\n", + "* [`ExperienceSourceRollouts`](https://github.com/Shmuma/ptan/blob/master/ptan/experience.py#L200): follows the A3C rollout scheme described in Mnih's paper about Atari games.\n", + "\n", + "All the classes are written to be efficient both in terms of CPU and memory, which is not very important for toy problems, but might become an issue when you want to solve Atari games, keeping 10M samples in the replay buffer on commodity hardware.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Toy gym environment\n", + "\n", + "For demonstration purposes, we'll implement a very simple gym environment with a small predictable observation state to show how the experience source classes work" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "class ToyEnv(gym.Env):\n", + "    \"\"\"\n", + "    Environment with observations 0..4 and actions 0..2\n", + "    Observations are rotated sequentially mod 5, reward is equal to the given action.\n", + "    Episodes have a fixed length of 10\n", + "    \"\"\"\n", + "    def __init__(self):\n", + "        super(ToyEnv, self).__init__()\n", + "        self.observation_space = gym.spaces.Discrete(n=5)\n", + "        
self.action_space = gym.spaces.Discrete(n=3)\n", + "        self.step_index = 0\n", + "\n", + "    def reset(self):\n", + "        self.step_index = 0\n", + "        return self.step_index\n", + "\n", + "    def step(self, action):\n", + "        is_done = self.step_index == 10\n", + "        if is_done:\n", + "            return self.step_index % self.observation_space.n, 0.0, is_done, {}\n", + "        self.step_index += 1\n", + "        return self.step_index % self.observation_space.n, float(action), self.step_index == 10, {}" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "env = ToyEnv()\n", + "env.reset()" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(1, 1.0, False, {})" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "env.step(1)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(2, 2.0, False, {})" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "env.step(2)" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(3, 0.0, False, {})\n", + "(4, 0.0, False, {})\n", + "(0, 0.0, False, {})\n", + "(1, 0.0, False, {})\n", + "(2, 0.0, False, {})\n", + "(3, 0.0, False, {})\n", + "(4, 0.0, False, {})\n", + "(0, 0.0, True, {})\n", + "(0, 0.0, True, {})\n", + "(0, 0.0, True, {})\n" + ] + } + ], + "source": [ + "for _ in range(10):\n", + "    r = env.step(0)\n", + "    print(r)" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 26, + 
"metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "env.reset()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll also need the agent which always generates fixed action" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "class DullAgent(ptan.agent.BaseAgent):\n", + " def __init__(self, action: int):\n", + " self.action = action\n", + " \n", + " def __call__(self, observations: List[Any], state: Optional[List] = None) -> Tuple[List[int], Optional[List]]:\n", + " return [self.action for _ in observations], state" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "([1, 1], None)" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "agent = DullAgent(action=1)\n", + "agent([1, 2])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ExperienceSource class\n", + "\n", + "Generates chunks of trajectories of the given length.\n", + "\n", + "Constructor arguments:\n", + "* gym environment to be use (could be the list of environments or one single environment)\n", + "* the agent\n", + "* `steps_count=2`: the length of sub-trajectories to be generated\n", + "* `steps_delta=1`: step in subtrajectories\n", + "* `vectorized=False`: if true, environment is OpenAI Universe vectorized environment (more about them in MiniWoB tutorial)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "env = ToyEnv()\n", + "agent = DullAgent(action=1)\n", + "exp_source = ptan.experience.ExperienceSource(env=env, agent=agent, steps_count=2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "All experience source classes are providing standard python's iterator interface, so, you can just iterate over them to get sub-trajectories." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n" + ] + } + ], + "source": [ + "for exp in exp_source:\n", + "    print(exp)\n", + "    break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The result is a tuple of length `steps_count` (in our case we requested sub-trajectories of length 2). Every entry is a namedtuple object with the following fields:\n", + "* state: state we observed before taking the action\n", + "* action: action we took\n", + "* reward: immediate reward we got from the env\n", + "* done: whether the episode was done" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))\n", + "(Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, 
done=True))\n", + "(Experience(state=4, action=1, reward=1.0, done=True),)\n" + ] + } + ], + "source": [ + "for exp in exp_source:\n", + "    print(exp)\n", + "    if exp[0].done:\n", + "        break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Please note that partial trajectories are also returned, so we can handle the end of episodes properly.\n", + "\n", + "At the end of an episode, the environment is reset automatically, so we don't need to bother about it; just keep iterating:" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))\n", + "(Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=True))\n", + "(Experience(state=4, action=1, reward=1.0, done=True),)\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, 
reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))\n", + "(Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n" + ] + } + ], + "source": [ + "for idx, exp in enumerate(exp_source):\n", + "    print(exp)\n", + "    if idx > 15:\n", + "        break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "That's very convenient, especially in cases when we have several environments running in parallel (several instances of the same Atari game, for example).\n", + "\n", + "Let's increase the length of our experience chunks." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [], + "source": [ + "exp_source = ptan.experience.ExperienceSource(env=env, agent=agent, steps_count=4)" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=True))\n", + "(Experience(state=2, action=1, reward=1.0, 
done=False), Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=True))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=True))\n", + "(Experience(state=4, action=1, reward=1.0, done=True),)\n" + ] + } + ], + "source": [ + "for exp in exp_source:\n", + "    print(exp)\n", + "    if exp[0].done:\n", + "        break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we're getting subtrajectories of up to 4 steps, including the shorter final pieces of the trajectory.\n", + "\n", + "Let's give several environments to the experience source." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [], + "source": [ + "exp_source = ptan.experience.ExperienceSource(env=[ToyEnv(), ToyEnv()], agent=agent, steps_count=2)" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))\n", + "(Experience(state=4, 
action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))\n", + "(Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))\n", + "(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=True))\n" + ] + } + ], + "source": [ + "for idx, exp in enumerate(exp_source):\n", + "    print(exp)\n", + "    if idx > 15:\n", + "        break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now our environments are iterated in a round-robin fashion, giving us access to trajectories from both environments step by step." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ExperienceSourceFirstLast\n", + "\n", + "Class `ExperienceSource` provides us with full subtrajectories of the given length as a tuple of $(s, a, r)$ objects. The next state $s'$ is returned in the next tuple, which is not always convenient.\n", + "\n", + "For example, in DQN training, we want to have tuples $(s, a, r, s')$ at once to do 1-step Bellman approximation during the training. 
In addition, some extensions of DQN, like n-step DQN, might want to collapse longer sequences of observations into (first-state, action, total-reward-for-n-steps, state-after-step-n).\n", + "\n", + "To support this in a generic way, a simple subclass of `ExperienceSource` is implemented: `ExperienceSourceFirstLast`. It accepts almost the same arguments in the constructor, but returns different data." + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [], + "source": [ + "exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=1.0, steps_count=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n" + ] + } + ], + "source": [ + "for idx, exp in enumerate(exp_source):\n", + "    print(exp)\n", + "    if idx > 10:\n", + "        break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now it returns a single object on every iteration, which is again a namedtuple with the following fields:\n", + "* `state`: the state which we used to decide on the action to take\n", 
+ "* `action`: the action we took at this step\n", + "* `reward`: partial accumulated reward for `steps_count` steps (in our case, `steps_count=1`, so it is equal to the immediate reward)\n", + "* `last_state`: the state we got after executing the action. If our episode ends, we have `None` here\n", + "\n", + "This data is much more convenient for DQN training, as we can apply the Bellman approximation to it directly.\n", + "\n", + "Let's check the result with a larger number of steps." + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [], + "source": [ + "exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=1.0, steps_count=2)" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ExperienceFirstLast(state=0, action=1, reward=2.0, last_state=2)\n", + "ExperienceFirstLast(state=1, action=1, reward=2.0, last_state=3)\n", + "ExperienceFirstLast(state=2, action=1, reward=2.0, last_state=4)\n", + "ExperienceFirstLast(state=3, action=1, reward=2.0, last_state=0)\n", + "ExperienceFirstLast(state=4, action=1, reward=2.0, last_state=1)\n", + "ExperienceFirstLast(state=0, action=1, reward=2.0, last_state=2)\n", + "ExperienceFirstLast(state=1, action=1, reward=2.0, last_state=3)\n", + "ExperienceFirstLast(state=2, action=1, reward=2.0, last_state=4)\n", + "ExperienceFirstLast(state=3, action=1, reward=2.0, last_state=None)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)\n", + "ExperienceFirstLast(state=0, action=1, reward=2.0, last_state=2)\n", + "ExperienceFirstLast(state=1, action=1, reward=2.0, last_state=3)\n" + ] + } + ], + "source": [ + "for idx, exp in enumerate(exp_source):\n", + "    print(exp)\n", + "    if idx > 10:\n", + "        break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So, now we're collapsing two steps on every iteration and accumulating the reward over them 
(that's why our reward=2.0 for most of the samples).\n", + "\n", + "More interesting samples are at the end of the episode:\n", + "```\n", + "ExperienceFirstLast(state=3, action=1, reward=2.0, last_state=None)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)\n", + "```\n", + "\n", + "As the episode ends, we have `last_state=None` in those samples, and in addition the tail of the episode is handled: the last sample accumulates the reward only over the single step that remains. Those tiny details are very easy to get wrong if you're doing all the trajectory handling yourself." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Experience source buffers\n", + "\n", + "In DQN we rarely deal with immediate experience samples, as they are heavily correlated, which leads to instability in training.\n", + "\n", + "Normally, we have a large replay buffer, which is populated with experience pieces. Then the buffer is sampled (randomly or with priority weights) to get the training batch. A replay buffer normally has a maximum capacity, so old samples are pushed out when the buffer reaches the limit.\n", + "\n", + "There are several implementation tricks here, which become extremely important when you need to deal with large problems:\n", + "* how to efficiently sample from a large buffer\n", + "* how to push old samples out of the buffer\n", + "* in the case of a prioritized buffer, how priorities need to be maintained and handled in the most efficient way.\n", + "\n", + "All this becomes quite a non-trivial task if you want to solve Atari games, keeping 10-100M samples where every sample is an image from the game. A small mistake can lead to a 10-100x memory increase and major slowdowns of the training process.\n", + "\n", + "Ptan provides several variants of replay buffers, which provide simple integration with the `ExperienceSource` and `Agent` machinery. 
Normally, all you need to do is ask the buffer to pull a new sample from the source and then sample the training batch.\n", + "\n", + "Provided classes:\n", + "* [`ExperienceReplayBuffer`](https://github.com/Shmuma/ptan/blob/master/ptan/experience.py#L327): simple replay buffer of predefined size with uniform sampling\n", + "* [`PrioReplayBufferNaive`](https://github.com/Shmuma/ptan/blob/master/ptan/experience.py#L371): simple, but not very efficient prioritized replay buffer implementation. Complexity of sampling is O(n), which might become an issue with large buffers\n", + "* [`PrioritizedReplayBuffer`](https://github.com/Shmuma/ptan/blob/master/ptan/experience.py#L414): uses segment trees for sampling, which makes the code cryptic, but gives O(log(n)) sampling complexity.\n", + "\n", + "Below is an example of the simple replay buffer. If you want, you can find examples of `PrioritizedReplayBuffer` usage in the examples for chapter 7 of my book: https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/master/Chapter07/05_dqn_prio_replay.py" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [], + "source": [ + "env = ToyEnv()\n", + "agent = DullAgent(action=1)\n", + "exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=1.0, steps_count=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [], + "source": [ + "buffer = ptan.experience.ExperienceReplayBuffer(exp_source, buffer_size=100)" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(buffer)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "All replay buffers provide the following interface:\n", + "* Python iterator interface to walk over all the samples in the buffer\n", + "* method 
`populate(N)`, to get N samples from the experience source and put into the buffer\n", + "* method `sample(N)`, to get the batch of N experience objects\n", + "\n", + "So, the normal training loop for DQN looks like infinite repetition of the following steps:\n", + "1. call `buffer.populate(1)` to get fresh sample from the environment\n", + "2. `batch = buffer.sample(BATCH_SIZE)` to get the batch from buffer\n", + "3. calculate the loss on the sampled batch\n", + "4. backpropagate\n", + "5. repeat until convergence (hopefully)\n", + "\n", + "All the rest is happening automatically -- reset of the environment, sub-trajectories handling, buffer size maintenance, etc." + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + 
"ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)\n", + "Train time, 4 batch samples:\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)\n", + "ExperienceFirstLast(state=1, action=1, 
reward=1.0, last_state=2)\n", + "ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)\n" + ] + } + ], + "source": [ + "for step in range(10):\n", + "    buffer.populate(1)\n", + "    # if the buffer is still too small, do nothing\n", + "    if len(buffer) < 5:\n", + "        continue\n", + "    batch = buffer.sample(4)\n", + "    print(\"Train time, %d batch samples:\" % len(batch))\n", + "    for s in batch:\n", + "        print(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Monitoring the training\n", + "\n", + "Normally, if we're running the training process, we want to keep an eye on several metrics to check how well our method is doing. A minimal set of things to watch includes:\n", + "* training loss (several loss components in case of A2C, for example)\n", + "* values predicted by the network (in case of DQN)\n", + "* statistics about episode rewards (to check that our agent improves over time)\n", + "* statistics about the length of the episode, as this is often a proxy for reward\n", + "\n", + "The first two items are calculated in the training loop, but the remaining two values are not that easy to get. If we're implementing everything from scratch, we need to track the current episode and, when it ends, record the total reward and length.\n", + "\n", + "Ptan simplifies this by providing a method in the experience source which returns this information in one call. Method `pop_rewards_steps()` returns a list, where each entry describes an episode which has completed since the last call to the method. If no episodes have completed between the calls, an empty list is returned.\n", + "\n", + "Every item is a tuple with (total_reward, total_steps).\n", + "\n", + "So, the only thing you need to do to monitor the training progress is to periodically call `pop_rewards_steps()` in the training loop and handle the results (printing to the console, sending to TensorBoard, or whatever)."
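To make clear what bookkeeping `pop_rewards_steps()` saves you, here is a rough plain-Python sketch of tracking episode totals by hand. The `EpisodeTracker` class and its `step()` method are hypothetical names invented for this illustration; only the `pop_rewards_steps()` name and the `(total_reward, total_steps)` tuple format come from the text above, and ptan's real implementation may well differ.

```python
class EpisodeTracker:
    """Hypothetical helper: accumulates reward and length per episode."""

    def __init__(self):
        self.cur_reward = 0.0
        self.cur_steps = 0
        self.finished = []  # (total_reward, total_steps) of completed episodes

    def step(self, reward, done):
        """Register one environment step."""
        self.cur_reward += reward
        self.cur_steps += 1
        if done:
            self.finished.append((self.cur_reward, self.cur_steps))
            self.cur_reward = 0.0
            self.cur_steps = 0

    def pop_rewards_steps(self):
        """Return episodes completed since the previous call and forget them."""
        result = self.finished
        self.finished = []
        return result


tracker = EpisodeTracker()
# simulate one 10-step episode with reward 1.0 per step, then 3 steps of the next one
for i in range(13):
    tracker.step(reward=1.0, done=(i == 9))

first = tracker.pop_rewards_steps()
second = tracker.pop_rewards_steps()
print(first)
print(second)
```

The first call reports the one finished episode, and the second call returns an empty list because the next episode is still in progress.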
+ ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [], + "source": [ + "r = exp_source.pop_rewards_steps()" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[(10.0, 10)]" + ] + }, + "execution_count": 49, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "r" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We've had one episode completed so far; it got a reward of 10.0 and the total number of steps was 10." + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[]" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "exp_source.pop_rewards_steps()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Other tools\n", + "\n", + "There are several smaller things which could be useful, like [`TargetNet`](https://github.com/Shmuma/ptan/blob/master/ptan/agent.py#L79), which allows you to keep a copy of model weights and synchronize them from time to time (which is important for the stability of DQN training), or a [set of utils](https://github.com/Shmuma/ptan/blob/master/ptan/common/utils.py) to smooth time series for better training progress visualisation.\n", + "\n", + "There are [PyTorch Ignite bindings](https://github.com/Shmuma/ptan/blob/master/ptan/ignite.py) which implement integration of ptan with the Ignite framework:\n", + "\n", + "* install end of episode hooks `EpisodeEvents.EPISODE_COMPLETED`\n", + "* handle situations when reward reaches boundary `EpisodeEvents.BOUND_REWARD_REACHED`\n", + "* measure performance of the training process: `EpisodeFPSHandler`\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Simple CartPole solver\n", + "\n", + "Below is a very simple DQN version applied to CartPole, just to demonstrate how all 
things fit together in real life." + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [], + "source": [ + "class Net(nn.Module):\n", + "    def __init__(self, obs_size, hidden_size, n_actions):\n", + "        super(Net, self).__init__()\n", + "        self.net = nn.Sequential(\n", + "            nn.Linear(obs_size, hidden_size),\n", + "            nn.ReLU(),\n", + "            nn.Linear(hidden_size, n_actions)\n", + "        )\n", + "\n", + "    def forward(self, x):\n", + "        # CartPole returns double (float64) observations rather than standard floats, hence the cast here\n", + "        return self.net(x.float())" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [], + "source": [ + "BATCH_SIZE = 64\n", + "REPLAY_SIZE = 1000\n", + "LR = 1e-3\n", + "GAMMA = 0.9\n", + "EPS_DECAY = 0.995" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [], + "source": [ + "env = gym.make(\"CartPole-v0\")" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Net(\n", + "  (net): Sequential(\n", + "    (0): Linear(in_features=4, out_features=64, bias=True)\n", + "    (1): ReLU()\n", + "    (2): Linear(in_features=64, out_features=2, bias=True)\n", + "  )\n", + ")" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "net = Net(obs_size=env.observation_space.shape[0], hidden_size=64, n_actions=env.action_space.n)\n", + "optimizer = optim.Adam(net.parameters(), LR)\n", + "net" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [], + "source": [ + "action_selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=1.0)\n", + "agent = ptan.agent.DQNAgent(net, action_selector)\n", + "exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=GAMMA)\n", + "buffer = ptan.experience.ExperienceReplayBuffer(exp_source, buffer_size=REPLAY_SIZE)" + ] + }, + { + 
"cell_type": "code", + "execution_count": 56, + "metadata": {}, + "outputs": [], + "source": [ + "@torch.no_grad()\n", + "def unpack_batch(batch: List[ptan.experience.ExperienceFirstLast], net: nn.Module, gamma: float):\n", + "    states = []\n", + "    actions = []\n", + "    rewards = []\n", + "    done_masks = []\n", + "    last_states = []\n", + "    for exp in batch:\n", + "        states.append(exp.state)\n", + "        actions.append(exp.action)\n", + "        rewards.append(exp.reward)\n", + "        done_masks.append(exp.last_state is None)\n", + "        if exp.last_state is None:\n", + "            last_states.append(exp.state)\n", + "        else:\n", + "            last_states.append(exp.last_state)\n", + "\n", + "    states_v = torch.tensor(states)\n", + "    actions_v = torch.tensor(actions)\n", + "    rewards_v = torch.tensor(rewards)\n", + "    last_states_v = torch.tensor(last_states)\n", + "    last_state_q_v = net(last_states_v)\n", + "    best_last_q_v = torch.max(last_state_q_v, dim=1)[0]\n", + "    best_last_q_v[done_masks] = 0.0\n", + "    # Bellman target: r + gamma * max Q(s', a'), with zero future value for terminal states\n", + "    return states_v, actions_v, best_last_q_v * gamma + rewards_v" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "28: episode done, reward=27.000, steps=27, epsilon=1.00\n", + "43: episode done, reward=15.000, steps=15, epsilon=1.00\n", + "66: episode done, reward=23.000, steps=23, epsilon=1.00\n", + "85: episode done, reward=19.000, steps=19, epsilon=1.00\n", + "98: episode done, reward=13.000, steps=13, epsilon=1.00\n", + "111: episode done, reward=13.000, steps=13, epsilon=1.00\n", + "166: episode done, reward=55.000, steps=55, epsilon=1.00\n", + "191: episode done, reward=25.000, steps=25, epsilon=1.00\n", + "212: episode done, reward=21.000, steps=21, epsilon=0.94\n", + "231: episode done, reward=19.000, steps=19, epsilon=0.86\n", + "248: episode done, reward=17.000, steps=17, epsilon=0.79\n", + "266: episode done, reward=18.000, steps=18, epsilon=0.72\n", + "282: episode done, reward=16.000, steps=16, 
epsilon=0.66\n", + "298: episode done, reward=16.000, steps=16, epsilon=0.61\n", + "325: episode done, reward=27.000, steps=27, epsilon=0.53\n", + "337: episode done, reward=12.000, steps=12, epsilon=0.50\n", + "349: episode done, reward=12.000, steps=12, epsilon=0.47\n", + "362: episode done, reward=13.000, steps=13, epsilon=0.44\n", + "371: episode done, reward=9.000, steps=9, epsilon=0.42\n", + "387: episode done, reward=16.000, steps=16, epsilon=0.39\n", + "397: episode done, reward=10.000, steps=10, epsilon=0.37\n", + "408: episode done, reward=11.000, steps=11, epsilon=0.35\n", + "416: episode done, reward=8.000, steps=8, epsilon=0.34\n", + "436: episode done, reward=20.000, steps=20, epsilon=0.31\n", + "445: episode done, reward=9.000, steps=9, epsilon=0.29\n", + "458: episode done, reward=13.000, steps=13, epsilon=0.27\n", + "467: episode done, reward=9.000, steps=9, epsilon=0.26\n", + "476: episode done, reward=9.000, steps=9, epsilon=0.25\n", + "489: episode done, reward=13.000, steps=13, epsilon=0.23\n", + "501: episode done, reward=12.000, steps=12, epsilon=0.22\n", + "519: episode done, reward=18.000, steps=18, epsilon=0.20\n", + "533: episode done, reward=14.000, steps=14, epsilon=0.19\n", + "543: episode done, reward=10.000, steps=10, epsilon=0.18\n", + "553: episode done, reward=10.000, steps=10, epsilon=0.17\n", + "562: episode done, reward=9.000, steps=9, epsilon=0.16\n", + "571: episode done, reward=9.000, steps=9, epsilon=0.16\n", + "581: episode done, reward=10.000, steps=10, epsilon=0.15\n", + "593: episode done, reward=12.000, steps=12, epsilon=0.14\n", + "602: episode done, reward=9.000, steps=9, epsilon=0.13\n", + "613: episode done, reward=11.000, steps=11, epsilon=0.13\n", + "627: episode done, reward=14.000, steps=14, epsilon=0.12\n", + "640: episode done, reward=13.000, steps=13, epsilon=0.11\n", + "653: episode done, reward=13.000, steps=13, epsilon=0.10\n", + "664: episode done, reward=11.000, steps=11, epsilon=0.10\n", + "679: 
episode done, reward=15.000, steps=15, epsilon=0.09\n", + "700: episode done, reward=21.000, steps=21, epsilon=0.08\n", + "725: episode done, reward=25.000, steps=25, epsilon=0.07\n", + "736: episode done, reward=11.000, steps=11, epsilon=0.07\n", + "746: episode done, reward=10.000, steps=10, epsilon=0.06\n", + "756: episode done, reward=10.000, steps=10, epsilon=0.06\n", + "766: episode done, reward=10.000, steps=10, epsilon=0.06\n", + "779: episode done, reward=13.000, steps=13, epsilon=0.05\n", + "794: episode done, reward=15.000, steps=15, epsilon=0.05\n", + "807: episode done, reward=13.000, steps=13, epsilon=0.05\n", + "818: episode done, reward=11.000, steps=11, epsilon=0.05\n", + "828: episode done, reward=10.000, steps=10, epsilon=0.04\n", + "837: episode done, reward=9.000, steps=9, epsilon=0.04\n", + "846: episode done, reward=9.000, steps=9, epsilon=0.04\n", + "856: episode done, reward=10.000, steps=10, epsilon=0.04\n", + "865: episode done, reward=9.000, steps=9, epsilon=0.04\n", + "875: episode done, reward=10.000, steps=10, epsilon=0.03\n", + "886: episode done, reward=11.000, steps=11, epsilon=0.03\n", + "897: episode done, reward=11.000, steps=11, epsilon=0.03\n", + "909: episode done, reward=12.000, steps=12, epsilon=0.03\n", + "934: episode done, reward=25.000, steps=25, epsilon=0.03\n", + "947: episode done, reward=13.000, steps=13, epsilon=0.02\n", + "961: episode done, reward=14.000, steps=14, epsilon=0.02\n", + "974: episode done, reward=13.000, steps=13, epsilon=0.02\n", + "986: episode done, reward=12.000, steps=12, epsilon=0.02\n", + "1018: episode done, reward=32.000, steps=32, epsilon=0.02\n", + "1060: episode done, reward=42.000, steps=42, epsilon=0.01\n", + "1079: episode done, reward=19.000, steps=19, epsilon=0.01\n", + "1131: episode done, reward=52.000, steps=52, epsilon=0.01\n", + "1175: episode done, reward=44.000, steps=44, epsilon=0.01\n", + "1275: episode done, reward=100.000, steps=100, epsilon=0.00\n", + "1309: episode 
done, reward=34.000, steps=34, epsilon=0.00\n", + "1339: episode done, reward=30.000, steps=30, epsilon=0.00\n", + "1405: episode done, reward=66.000, steps=66, epsilon=0.00\n", + "1500: episode done, reward=95.000, steps=95, epsilon=0.00\n", + "1535: episode done, reward=35.000, steps=35, epsilon=0.00\n", + "1556: episode done, reward=21.000, steps=21, epsilon=0.00\n", + "1580: episode done, reward=24.000, steps=24, epsilon=0.00\n", + "1625: episode done, reward=45.000, steps=45, epsilon=0.00\n", + "1654: episode done, reward=29.000, steps=29, epsilon=0.00\n", + "1684: episode done, reward=30.000, steps=30, epsilon=0.00\n", + "1714: episode done, reward=30.000, steps=30, epsilon=0.00\n", + "1748: episode done, reward=34.000, steps=34, epsilon=0.00\n", + "1774: episode done, reward=26.000, steps=26, epsilon=0.00\n", + "1804: episode done, reward=30.000, steps=30, epsilon=0.00\n", + "1831: episode done, reward=27.000, steps=27, epsilon=0.00\n", + "1848: episode done, reward=17.000, steps=17, epsilon=0.00\n", + "1864: episode done, reward=16.000, steps=16, epsilon=0.00\n", + "1894: episode done, reward=30.000, steps=30, epsilon=0.00\n", + "1922: episode done, reward=28.000, steps=28, epsilon=0.00\n", + "1954: episode done, reward=32.000, steps=32, epsilon=0.00\n", + "1982: episode done, reward=28.000, steps=28, epsilon=0.00\n", + "2038: episode done, reward=56.000, steps=56, epsilon=0.00\n", + "2072: episode done, reward=34.000, steps=34, epsilon=0.00\n", + "2172: episode done, reward=100.000, steps=100, epsilon=0.00\n", + "2264: episode done, reward=92.000, steps=92, epsilon=0.00\n", + "2294: episode done, reward=30.000, steps=30, epsilon=0.00\n", + "2328: episode done, reward=34.000, steps=34, epsilon=0.00\n", + "2382: episode done, reward=54.000, steps=54, epsilon=0.00\n", + "2420: episode done, reward=38.000, steps=38, epsilon=0.00\n", + "2469: episode done, reward=49.000, steps=49, epsilon=0.00\n", + "2523: episode done, reward=54.000, steps=54, epsilon=0.00\n", 
+ "2547: episode done, reward=24.000, steps=24, epsilon=0.00\n", + "2573: episode done, reward=26.000, steps=26, epsilon=0.00\n", + "2606: episode done, reward=33.000, steps=33, epsilon=0.00\n", + "2620: episode done, reward=14.000, steps=14, epsilon=0.00\n", + "2646: episode done, reward=26.000, steps=26, epsilon=0.00\n", + "2666: episode done, reward=20.000, steps=20, epsilon=0.00\n", + "2698: episode done, reward=32.000, steps=32, epsilon=0.00\n", + "2738: episode done, reward=40.000, steps=40, epsilon=0.00\n", + "2779: episode done, reward=41.000, steps=41, epsilon=0.00\n", + "2822: episode done, reward=43.000, steps=43, epsilon=0.00\n", + "2880: episode done, reward=58.000, steps=58, epsilon=0.00\n", + "2936: episode done, reward=56.000, steps=56, epsilon=0.00\n" + ] + } + ], + "source": [ + "step = 0\n", + "losses = []\n", + "rewards = []\n", + "\n", + "while True:\n", + "    step += 1\n", + "    buffer.populate(1)\n", + "    solved = False\n", + "    for reward, steps in exp_source.pop_rewards_steps():\n", + "        print(\"%d: episode done, reward=%.3f, steps=%d, epsilon=%.2f\" % (\n", + "            step, reward, steps, action_selector.epsilon))\n", + "        rewards.append(reward)\n", + "        solved = reward > 150\n", + "    if solved:\n", + "        print(\"Congrats!\")\n", + "        break\n", + "    if len(buffer) < 200:\n", + "        continue\n", + "    batch = buffer.sample(BATCH_SIZE)\n", + "    states_v, actions_v, tgt_q_v = unpack_batch(batch, net, GAMMA)\n", + "    optimizer.zero_grad()\n", + "    q_v = net(states_v)\n", + "    q_v = q_v.gather(1, actions_v.unsqueeze(-1)).squeeze(-1)\n", + "    loss_v = F.mse_loss(q_v, tgt_q_v)\n", + "    loss_v.backward()\n", + "    optimizer.step()\n", + "    losses.append(loss_v.item())\n", + "    action_selector.epsilon *= EPS_DECAY\n", + "    if step > 3000:\n", + "        break" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plt.plot(losses);" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plt.plot(rewards);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Of course, hyperparams should be tuned, target network will improve stability, but you've got the idea :)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}