diff --git a/lessons/6-Other/22-DeepRL/CartPole-RL-PyTorch.ipynb b/lessons/6-Other/22-DeepRL/CartPole-RL-PyTorch.ipynb new file mode 100644 index 00000000..f574cd45 --- /dev/null +++ b/lessons/6-Other/22-DeepRL/CartPole-RL-PyTorch.ipynb @@ -0,0 +1,490 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Training RL to do Cartpole Balancing\n", + "\n", + "This notebook is part of the [AI for Beginners Curriculum](http://aka.ms/ai-beginners). It was inspired by the [official PyTorch tutorial](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html) and [this Cartpole PyTorch implementation](https://github.com/yc930401/Actor-Critic-pytorch).\n", + "\n", + "In this example, we will use RL to train a model to balance a pole on a cart that can move left and right along a horizontal axis. We will use the [OpenAI Gym](https://www.gymlibrary.ml/) environment to simulate the pole.\n", + "\n", + "> **Note**: You can run this lesson's code locally (e.g. from Visual Studio Code), in which case the simulation will open in a new window. When running the code online, you may need to make some tweaks to the code, as described [here](https://towardsdatascience.com/rendering-openai-gym-envs-on-binder-and-google-colab-536f99391cc7).\n", + "\n", + "We will start by making sure Gym is installed:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "!{sys.executable} -m pip install gym" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's create the CartPole environment and see how to work with it. 
An environment has the following properties:\n", + "\n", + "* **Action space** is the set of possible actions that we can perform at each step of the simulation\n", + "* **Observation space** is the space of observations that we can make" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import gym\n", + "\n", + "env = gym.make(\"CartPole-v1\")\n", + "\n", + "print(f\"Action space: {env.action_space}\")\n", + "print(f\"Observation space: {env.observation_space}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's see how the simulation works. The following loop runs the simulation until `env.step` returns the termination flag `done`. We will choose actions at random using `env.action_space.sample()`, which means the experiment will probably fail very quickly (the CartPole environment terminates when the cart position or the pole angle goes outside certain limits).\n", + "\n", + "> The simulation will open in a new window. You can run the code several times and see how it behaves." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "env.reset()\n", + "\n", + "done = False\n", + "total_reward = 0\n", + "while not done:\n", + "    env.render()\n", + "    obs, rew, done, info = env.step(env.action_space.sample())\n", + "    total_reward += rew\n", + "    print(f\"{obs} -> {rew}\")\n", + "print(f\"Total reward: {total_reward}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can see that each observation contains 4 numbers. They are:\n", + "- Position of cart\n", + "- Velocity of cart\n", + "- Angle of pole\n", + "- Rotation rate of pole\n", + "\n", + "`rew` is the reward we receive at each step. You can see that in the CartPole environment you are rewarded 1 point for each simulation step, and the goal is to maximize total reward, i.e. 
the time CartPole is able to balance without falling.\n", + "\n", + "During reinforcement learning, our goal is to train a **policy** $\\pi$ that, for each state $s$, tells us which action $a$ to take, so essentially $a = \\pi(s)$.\n", + "\n", + "If you want a probabilistic solution, you can think of the policy as returning a set of probabilities for each action, i.e. $\\pi(a|s)$ is the probability that we should take action $a$ in state $s$.\n", + "\n", + "## Policy Gradient Method\n", + "\n", + "In the simplest RL algorithm, called **Policy Gradient**, we will train a neural network to predict the next action." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import torch\n", + "\n", + "num_inputs = 4\n", + "num_actions = 2\n", + "\n", + "model = torch.nn.Sequential(\n", + "    torch.nn.Linear(num_inputs, 128, bias=False, dtype=torch.float32),\n", + "    torch.nn.ReLU(),\n", + "    torch.nn.Linear(128, num_actions, bias=False, dtype=torch.float32),\n", + "    torch.nn.Softmax(dim=1)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will train the network by running many experiments and updating our network after each run. 
Let's define a function that will run the experiment and return the results (the so-called **trace**) - all states, actions (and their recommended probabilities), and rewards:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def run_episode(max_steps_per_episode=10000, render=False):\n", + "    states, actions, probs, rewards = [],[],[],[]\n", + "    state = env.reset()\n", + "    for _ in range(max_steps_per_episode):\n", + "        if render:\n", + "            env.render()\n", + "        action_probs = model(torch.from_numpy(np.expand_dims(state,0)))[0]\n", + "        action = np.random.choice(num_actions, p=np.squeeze(action_probs.detach().numpy()))\n", + "        nstate, reward, done, info = env.step(action)\n", + "        if done:\n", + "            break\n", + "        states.append(state)\n", + "        actions.append(action)\n", + "        probs.append(action_probs.detach().numpy())\n", + "        rewards.append(reward)\n", + "        state = nstate\n", + "    return np.vstack(states), np.vstack(actions), np.vstack(probs), np.vstack(rewards)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can run one episode with the untrained network and observe that the total reward (i.e. the length of the episode) is very low:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s, a, p, r = run_episode()\n", + "print(f\"Total reward: {np.sum(r)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One of the tricky aspects of the policy gradient algorithm is the use of **discounted rewards**. The idea is that we compute the vector of total future rewards at each step of the game, discounting rewards that arrive later by a coefficient $\\gamma$. 
We also normalize the resulting vector, because we will use it as a weight to affect our training:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "eps = 0.0001\n", + "\n", + "def discounted_rewards(rewards,gamma=0.99,normalize=True):\n", + "    ret = []\n", + "    s = 0\n", + "    for r in rewards[::-1]:\n", + "        s = r + gamma * s\n", + "        ret.insert(0, s)\n", + "    if normalize:\n", + "        ret = (ret-np.mean(ret))/(np.std(ret)+eps)\n", + "    return ret" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's do the actual training! We will run 300 episodes, and at each episode we will do the following:\n", + "\n", + "1. Run the experiment and collect the trace\n", + "1. Calculate the difference (`gradients`) between the actions taken and the predicted probabilities. The smaller the difference, the more confident we are that we took the right action.\n", + "1. Calculate discounted rewards and multiply gradients by discounted rewards, so that steps with higher rewards have more effect on the final result than lower-rewarded ones\n", + "1. Expected target actions for our neural network are taken partly from the predicted probabilities during the run, and partly from the calculated gradients. We will use the `alpha` parameter to determine to what extent gradients and rewards are taken into account - this is called the *learning rate* of the reinforcement algorithm.\n", + "1. 
Finally, we train our network on states and expected actions, and repeat the process" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "optimizer = torch.optim.Adam(model.parameters(), lr=0.01)\n", + "\n", + "def train_on_batch(x, y):\n", + "    x = torch.from_numpy(x)\n", + "    y = torch.from_numpy(y)\n", + "    optimizer.zero_grad()\n", + "    predictions = model(x)\n", + "    loss = -torch.mean(torch.log(predictions) * y)\n", + "    loss.backward()\n", + "    optimizer.step()\n", + "    return loss" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "alpha = 1e-4\n", + "\n", + "history = []\n", + "for epoch in range(300):\n", + "    states, actions, probs, rewards = run_episode()\n", + "    one_hot_actions = np.eye(2)[actions.T][0]\n", + "    gradients = one_hot_actions-probs\n", + "    dr = discounted_rewards(rewards)\n", + "    gradients *= dr\n", + "    target = alpha*np.vstack([gradients])+probs\n", + "    train_on_batch(states,target)\n", + "    history.append(np.sum(rewards))\n", + "    if epoch%100==0:\n", + "        print(f\"{epoch} -> {np.sum(rewards)}\")\n", + "\n", + "plt.plot(history)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's run the episode with rendering to see the result:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "_ = run_episode(render=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Hopefully, you can see that the pole can now balance pretty well!\n", + "\n", + "## Actor-Critic Model\n", + "\n", + "The Actor-Critic model is a further development of policy gradients, in which we build a neural network to learn both the policy and the estimated rewards. 
The network will have two outputs (or you can view it as two separate networks):\n", + "* **Actor** recommends the action to take by giving us the probability distribution over actions for the given state, as in the policy gradient model\n", + "* **Critic** estimates what the reward would be from those actions. It returns the total estimated future reward from the given state.\n", + "\n", + "Let's define such a model:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from itertools import count\n", + "import torch.nn.functional as F" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", + "env = gym.make(\"CartPole-v1\")\n", + "\n", + "state_size = env.observation_space.shape[0]\n", + "action_size = env.action_space.n\n", + "lr = 0.0001\n", + "\n", + "class Actor(torch.nn.Module):\n", + "    def __init__(self, state_size, action_size):\n", + "        super(Actor, self).__init__()\n", + "        self.state_size = state_size\n", + "        self.action_size = action_size\n", + "        self.linear1 = torch.nn.Linear(self.state_size, 128)\n", + "        self.linear2 = torch.nn.Linear(128, 256)\n", + "        self.linear3 = torch.nn.Linear(256, self.action_size)\n", + "\n", + "    def forward(self, state):\n", + "        output = F.relu(self.linear1(state))\n", + "        output = F.relu(self.linear2(output))\n", + "        output = self.linear3(output)\n", + "        distribution = torch.distributions.Categorical(F.softmax(output, dim=-1))\n", + "        return distribution\n", + "\n", + "\n", + "class Critic(torch.nn.Module):\n", + "    def __init__(self, state_size, action_size):\n", + "        super(Critic, self).__init__()\n", + "        self.state_size = state_size\n", + "        self.action_size = action_size\n", + "        self.linear1 = torch.nn.Linear(self.state_size, 128)\n", + "        self.linear2 = torch.nn.Linear(128, 256)\n", + "        self.linear3 = torch.nn.Linear(256, 1)\n", + "\n", + "    def 
forward(self, state):\n", + "        output = F.relu(self.linear1(state))\n", + "        output = F.relu(self.linear2(output))\n", + "        value = self.linear3(output)\n", + "        return value" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We need to slightly modify our `discounted_rewards` and `run_episode` functions:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def discounted_rewards(next_value, rewards, masks, gamma=0.99):\n", + "    R = next_value\n", + "    returns = []\n", + "    for step in reversed(range(len(rewards))):\n", + "        R = rewards[step] + gamma * R * masks[step]\n", + "        returns.insert(0, R)\n", + "    return returns\n", + "\n", + "def run_episode(actor, critic, n_iters):\n", + "    optimizerA = torch.optim.Adam(actor.parameters())\n", + "    optimizerC = torch.optim.Adam(critic.parameters())\n", + "    for iter in range(n_iters):\n", + "        # Buffers for the current episode\n", + "        log_probs = []\n", + "        values = []\n", + "        rewards = []\n", + "        masks = []\n", + "        entropy = 0\n", + "        state = env.reset()\n", + "\n", + "        for i in count():\n", + "            env.render()\n", + "            state = torch.FloatTensor(state).to(device)\n", + "            dist, value = actor(state), critic(state)\n", + "\n", + "            action = dist.sample()\n", + "            next_state, reward, done, _ = env.step(action.cpu().numpy())\n", + "\n", + "            log_prob = dist.log_prob(action).unsqueeze(0)\n", + "            entropy += dist.entropy().mean()\n", + "\n", + "            log_probs.append(log_prob)\n", + "            values.append(value)\n", + "            rewards.append(torch.tensor([reward], dtype=torch.float, device=device))\n", + "            masks.append(torch.tensor([1-done], dtype=torch.float, device=device))\n", + "\n", + "            state = next_state\n", + "\n", + "            if done:\n", + "                print('Iteration: {}, Score: {}'.format(iter, i))\n", + "                break\n", + "\n", + "\n", + "        next_state = torch.FloatTensor(next_state).to(device)\n", + "        next_value = critic(next_state)\n", + "        returns = discounted_rewards(next_value, rewards, masks)\n", + "\n", 
+ "        log_probs = torch.cat(log_probs)\n", + "        returns = torch.cat(returns).detach()\n", + "        values = torch.cat(values)\n", + "\n", + "        advantage = returns - values\n", + "\n", + "        actor_loss = -(log_probs * advantage.detach()).mean()\n", + "        critic_loss = advantage.pow(2).mean()\n", + "\n", + "        optimizerA.zero_grad()\n", + "        optimizerC.zero_grad()\n", + "        actor_loss.backward()\n", + "        critic_loss.backward()\n", + "        optimizerA.step()\n", + "        optimizerC.step()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we will run the main training loop. We train the networks manually, by computing the appropriate loss functions and updating the network parameters:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "actor = Actor(state_size, action_size).to(device)\n", + "critic = Critic(state_size, action_size).to(device)\n", + "run_episode(actor, critic, n_iters=100)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, let's close the environment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "env.close()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Takeaway\n", + "\n", + "We have seen two RL algorithms in this demo: a simple policy gradient and the more sophisticated actor-critic. You can see that those algorithms operate with abstract notions of state, action and reward - thus they can be applied to very different environments.\n", + "\n", + "Reinforcement learning allows us to learn the best strategy to solve the problem just by looking at the final reward. The fact that we do not need labelled datasets allows us to repeat simulations many times to optimize our models. However, there are still many challenges in RL, which you can learn about if you decide to focus more on this interesting area of AI. 
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.10.4 64-bit", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.4" + }, + "orig_nbformat": 4, + "vscode": { + "interpreter": { + "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" + } + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/lessons/6-Other/22-DeepRL/CartPole-RL-TF.ipynb b/lessons/6-Other/22-DeepRL/CartPole-RL-TF.ipynb index 5afa5a35..6315a1ce 100644 --- a/lessons/6-Other/22-DeepRL/CartPole-RL-TF.ipynb +++ b/lessons/6-Other/22-DeepRL/CartPole-RL-TF.ipynb @@ -17,23 +17,19 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Requirement already satisfied: gym in c:\\winapp\\miniconda3\\envs\\py38\\lib\\site-packages (0.23.1)\n", - "Collecting pygame\n", - " Downloading pygame-2.1.2-cp38-cp38-win_amd64.whl (8.4 MB)\n", - "Requirement already satisfied: importlib-metadata>=4.10.0 in c:\\winapp\\miniconda3\\envs\\py38\\lib\\site-packages (from gym) (4.11.3)\n", - "Requirement already satisfied: cloudpickle>=1.2.0 in c:\\winapp\\miniconda3\\envs\\py38\\lib\\site-packages (from gym) (2.0.0)\n", - "Requirement already satisfied: gym-notices>=0.0.4 in c:\\winapp\\miniconda3\\envs\\py38\\lib\\site-packages (from gym) (0.0.6)\n", - "Requirement already satisfied: numpy>=1.18.0 in c:\\winapp\\miniconda3\\envs\\py38\\lib\\site-packages (from gym) (1.22.3)\n", - "Requirement already satisfied: zipp>=0.5 in c:\\winapp\\miniconda3\\envs\\py38\\lib\\site-packages (from importlib-metadata>=4.10.0->gym) (3.6.0)\n", - "Installing collected packages: pygame\n", - "Successfully installed pygame-2.1.2\n" + 
"Defaulting to user installation because normal site-packages is not writeable\n", + "Requirement already satisfied: gym in /home/leo/.local/lib/python3.10/site-packages (0.25.0)\n", + "Requirement already satisfied: pygame in /home/leo/.local/lib/python3.10/site-packages (2.1.2)\n", + "Requirement already satisfied: gym-notices>=0.0.4 in /home/leo/.local/lib/python3.10/site-packages (from gym) (0.0.7)\n", + "Requirement already satisfied: cloudpickle>=1.2.0 in /home/leo/.local/lib/python3.10/site-packages (from gym) (2.1.0)\n", + "Requirement already satisfied: numpy>=1.18.0 in /usr/lib/python3/dist-packages (from gym) (1.21.5)\n" ] } ], @@ -54,7 +50,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 2, "metadata": {}, "outputs": [ { @@ -64,10 +60,22 @@ "Action space: Discrete(2)\n", "Observation space: Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)\n" ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/leo/.local/lib/python3.10/site-packages/gym/core.py:329: DeprecationWarning: \u001b[33mWARN: Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.\u001b[0m\n", + " deprecation(\n", + "/home/leo/.local/lib/python3.10/site-packages/gym/wrappers/step_api_compatibility.py:39: DeprecationWarning: \u001b[33mWARN: Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. 
This will be the default behaviour in future.\u001b[0m\n", + " deprecation(\n" + ] } ], "source": [ "import gym\n", + "import pygame\n", + "import tqdm\n", "\n", "env = gym.make(\"CartPole-v1\")\n", "\n", @@ -86,33 +94,58 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 3, "metadata": {}, "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/leo/.local/lib/python3.10/site-packages/gym/core.py:57: DeprecationWarning: \u001b[33mWARN: You are calling render method, but you didn't specified the argument render_mode at environment initialization. To maintain backward compatibility, the environment will render in human mode.\n", + "If you want to render in human mode, initialize the environment in this way: gym.make('EnvName', render_mode='human') and don't call the render method.\n", + "See here for more information: https://www.gymlibrary.ml/content/api/\u001b[0m\n", + " deprecation(\n" + ] + }, { "name": "stdout", "output_type": "stream", "text": [ - "[-0.04825775 -0.21918646 -0.00796685 0.27188525] -> 1.0\n", - "[-0.05264147 -0.4141938 -0.00252914 0.5620448 ] -> 1.0\n", - "[-0.06092535 -0.21903647 0.00871175 0.26856613] -> 1.0\n", - "[-0.06530608 -0.41428167 0.01408308 0.56398404] -> 1.0\n", - "[-0.07359171 -0.60959834 0.02536276 0.8610703 ] -> 1.0\n", - "[-0.08578368 -0.4148308 0.04258416 0.57646877] -> 1.0\n", - "[-0.09408029 -0.610523 0.05411354 0.88225687] -> 1.0\n", - "[-0.10629076 -0.4161761 0.07175867 0.6070649 ] -> 1.0\n", - "[-0.11461428 -0.22212696 0.08389997 0.33781925] -> 1.0\n", - "[-0.11905681 -0.0282929 0.09065636 0.07272854] -> 1.0\n", - "[-0.11962268 0.16542032 0.09211093 -0.19003159] -> 1.0\n", - "[-0.11631427 -0.03089041 0.08831029 0.13022853] -> 1.0\n", - "[-0.11693208 -0.22715913 0.09091487 0.44941387] -> 1.0\n", - "[-0.12147526 -0.42344147 0.09990314 0.76931363] -> 1.0\n", - "[-0.12994409 -0.61978614 0.11528942 1.0916848 ] -> 1.0\n", - "[-0.14233981 -0.81622297 0.13712311 1.4182041 ] -> 
1.0\n", - "[-0.15866427 -1.01275 0.1654872 1.7504154 ] -> 1.0\n", - "[-0.17891927 -1.2093195 0.2004955 2.0896728 ] -> 1.0\n", - "[-0.20310566 -1.4058217 0.24228896 2.4370732 ] -> 1.0\n", - "Total reward: 19.0\n" + "[ 0.00425272 -0.19994313 0.00917169 0.34113726] -> 1.0\n", + "[ 0.00025386 -0.00495286 0.01599443 0.05136059] -> 1.0\n", + "[ 1.5480528e-04 1.8993615e-01 1.7021643e-02 -2.3623335e-01] -> 1.0\n", + "[ 0.00395353 0.38481084 0.01229698 -0.5234989 ] -> 1.0\n", + "[ 0.01164974 0.18951797 0.001827 -0.22696657] -> 1.0\n", + "[ 0.0154401 0.38461378 -0.00271233 -0.51907265] -> 1.0\n", + "[ 0.02313238 0.5797738 -0.01309379 -0.812609 ] -> 1.0\n", + "[ 0.03472786 0.38483363 -0.02934597 -0.5240733 ] -> 1.0\n", + "[ 0.04242453 0.580356 -0.03982743 -0.8258571 ] -> 1.0\n", + "[ 0.05403165 0.38580072 -0.05634458 -0.54596174] -> 1.0\n", + "[ 0.06174766 0.19151384 -0.06726381 -0.27155042] -> 1.0\n", + "[ 0.06557794 -0.00258703 -0.07269482 -0.00081817] -> 1.0\n", + "[ 0.0655262 -0.19659522 -0.07271118 0.26807207] -> 1.0\n", + "[ 0.0615943 -0.00051497 -0.06734974 -0.04662942] -> 1.0\n", + "[ 0.061584 0.19550486 -0.06828233 -0.3597784 ] -> 1.0\n", + "[ 0.06549409 0.00141663 -0.0754779 -0.08938391] -> 1.0\n", + "[ 0.06552242 -0.19254686 -0.07726558 0.17856352] -> 1.0\n", + "[ 0.06167149 0.00359088 -0.0736943 -0.1374588 ] -> 1.0\n", + "[ 0.0617433 0.19968675 -0.07644348 -0.45245075] -> 1.0\n", + "[ 0.06573704 0.3958018 -0.0854925 -0.7682167 ] -> 1.0\n", + "[ 0.07365308 0.20195423 -0.10085683 -0.50361156] -> 1.0\n", + "[ 0.07769216 0.0083876 -0.11092906 -0.24433874] -> 1.0\n", + "[ 0.07785992 -0.18498953 -0.11581583 0.01139782] -> 1.0\n", + "[ 0.07416012 0.01158649 -0.11558788 -0.31546465] -> 1.0\n", + "[ 0.07439185 0.20814891 -0.12189718 -0.64224803] -> 1.0\n", + "[ 0.07855483 0.01491799 -0.13474214 -0.3903015 ] -> 1.0\n", + "[ 0.07885319 -0.17806001 -0.14254816 -0.14295265] -> 1.0\n", + "[ 0.07529199 0.01878517 -0.14540721 -0.47699296] -> 1.0\n", + "[ 0.07566769 -0.17401667 
-0.15494707 -0.23344138] -> 1.0\n", + "[ 0.07218736 0.0229406 -0.1596159 -0.57071024] -> 1.0\n", + "[ 0.07264617 0.21989843 -0.1710301 -0.9091196 ] -> 1.0\n", + "[ 0.07704414 0.02745241 -0.1892125 -0.6747003 ] -> 1.0\n", + "[ 0.07759319 -0.16460665 -0.20270652 -0.4470505 ] -> 1.0\n", + "[ 0.07430106 -0.35637102 -0.21164753 -0.22448184] -> 1.0\n", + "Total reward: 34.0\n" ] } ], @@ -126,7 +159,9 @@ " obs, rew, done, info = env.step(env.action_space.sample())\n", " total_reward += rew\n", " print(f\"{obs} -> {rew}\")\n", - "print(f\"Total reward: {total_reward}\")" + "print(f\"Total reward: {total_reward}\")\n", + "\n", + "env.close()" ] }, { @@ -152,9 +187,35 @@ }, { "cell_type": "code", - "execution_count": 100, + "execution_count": 4, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/local/lib/python3.10/dist-packages/tensorflow/__init__.py:29: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. 
Use setuptools or check PEP 632 for potential alternatives\n", + " import distutils as _distutils\n", + "2022-07-24 16:50:47.597258: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n", + "2022-07-24 16:50:47.597280: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.\n", + "/usr/local/lib/python3.10/dist-packages/flatbuffers/compat.py:19: DeprecationWarning: the imp module is deprecated in favour of importlib and slated for removal in Python 3.12; see the module's documentation for alternative uses\n", + " import imp\n", + "2022-07-24 16:50:49.838826: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", + "2022-07-24 16:50:49.839078: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n", + "2022-07-24 16:50:49.839143: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory\n", + "2022-07-24 16:50:49.839194: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory\n", + "2022-07-24 16:50:49.839245: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory\n", + "2022-07-24 16:50:49.839295: W 
tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory\n", + "2022-07-24 16:50:49.839345: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory\n", + "2022-07-24 16:50:49.839392: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory\n", + "2022-07-24 16:50:49.839441: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory\n", + "2022-07-24 16:50:49.839449: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. 
Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.\n", + "Skipping registering GPU devices...\n", + "2022-07-24 16:50:49.839649: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\n", + "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n" + ] + } + ], "source": [ "import numpy as np\n", "import tensorflow as tf\n", @@ -181,7 +242,7 @@ }, { "cell_type": "code", - "execution_count": 101, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -213,14 +274,14 @@ }, { "cell_type": "code", - "execution_count": 102, + "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Total reward: 13.0\n" + "Total reward: 27.0\n" ] } ], @@ -238,7 +299,7 @@ }, { "cell_type": "code", - "execution_count": 79, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -270,38 +331,56 @@ }, { "cell_type": "code", - "execution_count": 73, + "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "0 -> 44.0\n", - "100 -> 105.0\n", - "200 -> 145.0\n", - "300 -> 70.0\n", - "400 -> 190.0\n", - "500 -> 298.0\n", - "600 -> 289.0\n", - "700 -> 499.0\n", - "800 -> 499.0\n", - "900 -> 499.0\n" + "0 -> 29.0\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2022-07-24 16:50:51.475024: W tensorflow/core/data/root_dataset.cc:247] Optimization loop failed: CANCELLED: Operation was cancelled\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "100 -> 135.0\n", + "200 -> 484.0\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2022-07-24 16:51:35.910774: W tensorflow/core/data/root_dataset.cc:247] Optimization 
loop failed: CANCELLED: Operation was cancelled\n", + "2022-07-24 16:51:37.151017: W tensorflow/core/data/root_dataset.cc:247] Optimization loop failed: CANCELLED: Operation was cancelled\n", + "2022-07-24 16:51:39.284311: W tensorflow/core/data/root_dataset.cc:247] Optimization loop failed: CANCELLED: Operation was cancelled\n", + "2022-07-24 16:51:42.235074: W tensorflow/core/data/root_dataset.cc:247] Optimization loop failed: CANCELLED: Operation was cancelled\n", + "2022-07-24 16:51:44.691458: W tensorflow/core/data/root_dataset.cc:247] Optimization loop failed: CANCELLED: Operation was cancelled\n", + "2022-07-24 16:51:48.381946: W tensorflow/core/data/root_dataset.cc:247] Optimization loop failed: CANCELLED: Operation was cancelled\n" ] }, { "data": { "text/plain": [ - "[]" + "[]" ] }, - "execution_count": 73, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { - "image/png": "", + "image/png": "", "text/plain": [ "
" ] @@ -340,9 +419,33 @@ }, { "cell_type": "code", - "execution_count": 82, + "execution_count": 10, "metadata": {}, - "outputs": [], + "outputs": [ + { + "ename": "error", + "evalue": "display Surface quit", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31merror\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m/tmp/ipykernel_44248/1459719159.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0m_\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrun_episode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrender\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/tmp/ipykernel_44248/3855001447.py\u001b[0m in \u001b[0;36mrun_episode\u001b[0;34m(max_steps_per_episode, render)\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0m_\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmax_steps_per_episode\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0menv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 7\u001b[0m \u001b[0maction_probs\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mmodel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexpand_dims\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstate\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0maction\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrandom\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mchoice\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnum_actions\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msqueeze\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0maction_probs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/core.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 64\u001b[0m )\n\u001b[1;32m 65\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 66\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mrender_func\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 67\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/core.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 429\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 430\u001b[0m \u001b[0;34m\"\"\"Renders the environment.\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 431\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 432\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 433\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/core.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 64\u001b[0m )\n\u001b[1;32m 65\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 66\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mrender_func\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 67\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/wrappers/order_enforcing.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 49\u001b[0m \u001b[0;34m\"set 
`disable_render_order_enforcing=True` on the OrderEnforcer wrapper.\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 50\u001b[0m )\n\u001b[0;32m---> 51\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 52\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 53\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0mproperty\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/core.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 64\u001b[0m )\n\u001b[1;32m 65\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 66\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mrender_func\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 67\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/core.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 429\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 430\u001b[0m \u001b[0;34m\"\"\"Renders the environment.\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 431\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 432\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 433\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/core.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 64\u001b[0m )\n\u001b[1;32m 65\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 66\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mrender_func\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 67\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/wrappers/env_checker.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 53\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0menv_render_passive_checker\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menv\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 54\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 55\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/core.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 64\u001b[0m )\n\u001b[1;32m 65\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 66\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mrender_func\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 67\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/envs/classic_control/cartpole.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, mode)\u001b[0m\n\u001b[1;32m 215\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrenderer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_renders\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 216\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 217\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_render\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmode\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 218\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 219\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_render\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmode\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"human\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/envs/classic_control/cartpole.py\u001b[0m in \u001b[0;36m_render\u001b[0;34m(self, mode)\u001b[0m\n\u001b[1;32m 296\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 297\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msurf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpygame\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtransform\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mflip\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msurf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 298\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscreen\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mblit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msurf\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 299\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mmode\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m\"human\"\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 300\u001b[0m \u001b[0mpygame\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mevent\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpump\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31merror\u001b[0m: display Surface quit" + ] + } + ], "source": [ "_ = run_episode(render=True)" ] @@ -364,7 +467,7 @@ }, { "cell_type": "code", - "execution_count": 103, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -389,7 +492,7 @@ }, { "cell_type": "code", - "execution_count": 104, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -422,7 +525,7 @@ }, { "cell_type": "code", - "execution_count": 105, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -504,7 +607,7 @@ }, { "cell_type": "code", - "execution_count": 99, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -520,7 +623,7 @@ }, { "cell_type": "code", - "execution_count": 106, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -540,11 +643,8 @@ } ], "metadata": { - "interpreter": { - "hash": "16af2a8bbb083ea23e5e41c7f5787656b2ce26968575d8763f2c4b17f9cd711f" - }, "kernelspec": { - "display_name": "Python 3.8.12 ('py38')", + "display_name": "Python 3.10.4 64-bit", "language": "python", "name": "python3" }, @@ -558,9 +658,14 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.12" + "version": "3.10.4" }, - "orig_nbformat": 4 + "orig_nbformat": 4, + "vscode": { + "interpreter": { + "hash": 
"916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" + } + } }, "nbformat": 4, "nbformat_minor": 2 diff --git a/lessons/6-Other/22-DeepRL/README.md b/lessons/6-Other/22-DeepRL/README.md index f5c631d0..26e3f691 100644 --- a/lessons/6-Other/22-DeepRL/README.md +++ b/lessons/6-Other/22-DeepRL/README.md @@ -83,6 +83,7 @@ After running one of those algorithms, we can expect our CartPole to behave like Continue your learning in the following notebooks: * [RL in TensorFlow](CartPole-RL-TF.ipynb) +* [RL in PyTorch](CartPole-RL-PyTorch.ipynb) ## Other RL Tasks diff --git a/lessons/6-Other/22-DeepRL/notebook.ipynb b/lessons/6-Other/22-DeepRL/notebook.ipynb index b6338c89..50c2c33f 100644 --- a/lessons/6-Other/22-DeepRL/notebook.ipynb +++ b/lessons/6-Other/22-DeepRL/notebook.ipynb @@ -1,39 +1,15 @@ { - "metadata": { - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.0" - }, - "orig_nbformat": 4, - "kernelspec": { - "name": "python3", - "display_name": "Python 3.7.0 64-bit ('3.7')" - }, - "interpreter": { - "hash": "70b38d7a306a849643e446cd70466270a13445e5987dfa1344ef2b127438fa4d" - } - }, - "nbformat": 4, - "nbformat_minor": 2, "cells": [ { + "cell_type": "markdown", + "metadata": {}, "source": [ "## CartPole Skating\n", "\n", "> **Problem**: If Peter wants to escape from the wolf, he needs to be able to move faster than him. 
We will see how Peter can learn to skate, in particular, to keep balance, using Q-Learning.\n", "\n", "First, let's install the gym and import required libraries:" - ], - "cell_type": "markdown", - "metadata": {} + ] }, { "cell_type": "code", @@ -41,23 +17,34 @@ "metadata": {}, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ - "Requirement already satisfied: gym in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (0.18.3)\n", - "Requirement already satisfied: Pillow<=8.2.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from gym) (7.0.0)\n", - "Requirement already satisfied: scipy in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from gym) (1.4.1)\n", - "Requirement already satisfied: numpy>=1.10.4 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from gym) (1.19.2)\n", - "Requirement already satisfied: cloudpickle<1.7.0,>=1.2.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from gym) (1.6.0)\n", - "Requirement already satisfied: pyglet<=1.5.15,>=1.4.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from gym) (1.5.15)\n", - "\u001b[33mWARNING: You are using pip version 20.2.3; however, version 21.1.2 is available.\n", - "You should consider upgrading via the '/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7 -m pip install --upgrade pip' command.\u001b[0m\n" + "Defaulting to user installation because normal site-packages is not writeable\n", + "Collecting gym\n", + " Downloading gym-0.25.0.tar.gz (720 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m720.4/720.4 KB\u001b[0m \u001b[31m3.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", + "\u001b[?25h Installing build dependencies ... 
\u001b[?25ldone\n", + "\u001b[?25h Getting requirements to build wheel ... \u001b[?25ldone\n", + "\u001b[?25h Preparing metadata (pyproject.toml) ... \u001b[?25ldone\n", + "\u001b[?25hRequirement already satisfied: numpy>=1.18.0 in /usr/lib/python3/dist-packages (from gym) (1.21.5)\n", + "Collecting gym-notices>=0.0.4\n", + " Downloading gym_notices-0.0.7-py3-none-any.whl (2.7 kB)\n", + "Collecting cloudpickle>=1.2.0\n", + " Downloading cloudpickle-2.1.0-py3-none-any.whl (25 kB)\n", + "Building wheels for collected packages: gym\n", + " Building wheel for gym (pyproject.toml) ... \u001b[?25ldone\n", + "\u001b[?25h Created wheel for gym: filename=gym-0.25.0-py3-none-any.whl size=824430 sha256=3f4ed647f1d12814bb457f7d83a7ccd0f682d12a0259ca07b7fab0db5100fc6e\n", + " Stored in directory: /home/leo/.cache/pip/wheels/c0/3c/33/32d86254a5bd554f5f07759ae1794646e490dd5fa81ebdcda3\n", + "Successfully built gym\n", + "Installing collected packages: gym-notices, cloudpickle, gym\n", + "Successfully installed cloudpickle-2.1.0 gym-0.25.0 gym-notices-0.0.7\n" ] } ], "source": [ "import sys\n", - "!pip install gym \n", + "!pip install gym pygame\n", "\n", "import gym\n", "import matplotlib.pyplot as plt\n", @@ -66,86 +53,118 @@ ] }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "## Create a cartpole environment" - ], - "cell_type": "markdown", - "metadata": {} + ] }, { - "source": [ - "env = gym.make(\"CartPole-v1\")\n", - "print(env.action_space)\n", - "print(env.observation_space)\n", - "print(env.action_space.sample())" - ], "cell_type": "code", - "metadata": {}, "execution_count": 2, + "metadata": {}, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", + "text": [ + "Discrete(2)\n", + "Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)\n", + "1\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", "text": [ - 
"Discrete(2)\nBox(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)\n0\n" + "/home/leo/.local/lib/python3.10/site-packages/gym/core.py:329: DeprecationWarning: \u001b[33mWARN: Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.\u001b[0m\n", + " deprecation(\n", + "/home/leo/.local/lib/python3.10/site-packages/gym/wrappers/step_api_compatibility.py:39: DeprecationWarning: \u001b[33mWARN: Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.\u001b[0m\n", + " deprecation(\n" ] } + ], + "source": [ + "env = gym.make(\"CartPole-v1\")\n", + "print(env.action_space)\n", + "print(env.observation_space)\n", + "print(env.action_space.sample())" ] }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "To see how the environment works, let's run a short simulation for 100 steps." - ], - "cell_type": "markdown", - "metadata": {} + ] }, { - "source": [ - "env.reset()\n", - "\n", - "for i in range(100):\n", - " env.render()\n", - " env.step(env.action_space.sample())\n", - "env.close()" - ], "cell_type": "code", - "metadata": {}, "execution_count": 3, + "metadata": {}, "outputs": [ { - "output_type": "stream", "name": "stderr", + "output_type": "stream", "text": [ - "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gym/logger.py:30: UserWarning: \u001b[33mWARN: You are calling 'step()' even though this environment has already returned done = True. 
You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.\u001b[0m\n warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))\n" + "/home/leo/.local/lib/python3.10/site-packages/gym/core.py:57: DeprecationWarning: \u001b[33mWARN: You are calling render method, but you didn't specified the argument render_mode at environment initialization. To maintain backward compatibility, the environment will render in human mode.\n", + "If you want to render in human mode, initialize the environment in this way: gym.make('EnvName', render_mode='human') and don't call the render method.\n", + "See here for more information: https://www.gymlibrary.ml/content/api/\u001b[0m\n", + " deprecation(\n" + ] + }, + { + "ename": "DependencyNotInstalled", + "evalue": "pygame is not installed, run `pip install gym[classic_control]`", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/envs/classic_control/cartpole.py\u001b[0m in \u001b[0;36m_render\u001b[0;34m(self, mode)\u001b[0m\n\u001b[1;32m 221\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 222\u001b[0;31m \u001b[0;32mimport\u001b[0m \u001b[0mpygame\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 223\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mpygame\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mgfxdraw\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'pygame'", + "\nDuring handling of the above exception, another exception occurred:\n", + "\u001b[0;31mDependencyNotInstalled\u001b[0m Traceback (most recent call last)", + 
"\u001b[0;32m/tmp/ipykernel_32716/4123126963.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m100\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0menv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 5\u001b[0m \u001b[0menv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstep\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0menv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0maction_space\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msample\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0menv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/core.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 64\u001b[0m )\n\u001b[1;32m 65\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 66\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mrender_func\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 67\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/core.py\u001b[0m 
in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 429\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 430\u001b[0m \u001b[0;34m\"\"\"Renders the environment.\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 431\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 432\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 433\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/core.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 64\u001b[0m )\n\u001b[1;32m 65\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 66\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mrender_func\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 67\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + 
"\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/wrappers/order_enforcing.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 49\u001b[0m \u001b[0;34m\"set `disable_render_order_enforcing=True` on the OrderEnforcer wrapper.\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 50\u001b[0m )\n\u001b[0;32m---> 51\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 52\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 53\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0mproperty\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/core.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 64\u001b[0m )\n\u001b[1;32m 65\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 66\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mrender_func\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 67\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/core.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 429\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 430\u001b[0m \u001b[0;34m\"\"\"Renders the environment.\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 431\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 432\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 433\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/core.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 64\u001b[0m )\n\u001b[1;32m 65\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 66\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mrender_func\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 67\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/wrappers/env_checker.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 51\u001b[0m \u001b[0;32mif\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mchecked_render\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 52\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mchecked_render\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 53\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0menv_render_passive_checker\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menv\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 54\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 55\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/utils/passive_env_checker.py\u001b[0m in \u001b[0;36menv_render_passive_checker\u001b[0;34m(env, *args, **kwargs)\u001b[0m\n\u001b[1;32m 322\u001b[0m )\n\u001b[1;32m 323\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 324\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0menv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 325\u001b[0m 
\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 326\u001b[0m \u001b[0;31m# TODO: Check that the result is correct\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/core.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 64\u001b[0m )\n\u001b[1;32m 65\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 66\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mrender_func\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 67\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mrender\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/envs/classic_control/cartpole.py\u001b[0m in \u001b[0;36mrender\u001b[0;34m(self, mode)\u001b[0m\n\u001b[1;32m 215\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrenderer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_renders\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 216\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 217\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_render\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmode\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 218\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 219\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_render\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mmode\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"human\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.local/lib/python3.10/site-packages/gym/envs/classic_control/cartpole.py\u001b[0m in \u001b[0;36m_render\u001b[0;34m(self, mode)\u001b[0m\n\u001b[1;32m 223\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mpygame\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mgfxdraw\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 224\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mImportError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 225\u001b[0;31m raise DependencyNotInstalled(\n\u001b[0m\u001b[1;32m 226\u001b[0m \u001b[0;34m\"pygame is not installed, run `pip install gym[classic_control]`\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 227\u001b[0m )\n", + "\u001b[0;31mDependencyNotInstalled\u001b[0m: pygame is not installed, run `pip install gym[classic_control]`" ] } + ], + "source": [ + "env.reset()\n", + "\n", + "for i in range(100):\n", + " env.render()\n", + " env.step(env.action_space.sample())\n", + "env.close()" ] }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "During simulation, we need to get observations in order to decide how to act. 
In fact, the `step` function returns the current observation, the reward, and the `done` flag that indicates whether it makes sense to continue the simulation:" - ], - "cell_type": "markdown", - "metadata": {} + ] }, { - "source": [ - "env.reset()\n", - "\n", - "done = False\n", - "while not done:\n", - " env.render()\n", - " obs, rew, done, info = env.step(env.action_space.sample())\n", - " print(f\"{obs} -> {rew}\")\n", - "env.close()" - ], "cell_type": "code", - "metadata": {}, "execution_count": 4, + "metadata": {}, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "[ 0.03044442 -0.19543914 -0.04496216 0.28125618] -> 1.0\n", "[ 0.02653564 -0.38989186 -0.03933704 0.55942606] -> 1.0\n", @@ -168,14 +187,24 @@ "[ 0.12921301 0.59883361 -0.22594088 -1.22169133] -> 1.0\n" ] } + ], + "source": [ + "env.reset()\n", + "\n", + "done = False\n", + "while not done:\n", + " env.render()\n", + " obs, rew, done, info = env.step(env.action_space.sample())\n", + " print(f\"{obs} -> {rew}\")\n", + "env.close()" ] }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "We can get the minimum and maximum values of those numbers:" - ], - "cell_type": "markdown", - "metadata": {} + ] }, { "cell_type": "code", @@ -183,10 +212,11 @@ "metadata": {}, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ - "[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]\n[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]\n" + "[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]\n", + "[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]\n" ] } ], @@ -196,11 +226,11 @@ ] }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "## State Discretization" - ], - "cell_type": "markdown", - "metadata": {} + ] }, { "cell_type": "code", @@ -213,11 +243,11 @@ ] }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "Let's also explore another discretization method using 
bins:" - ], - "cell_type": "markdown", - "metadata": {} + ] }, { "cell_type": "code", @@ -225,10 +255,11 @@ "metadata": {}, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ - "Sample bins for interval (-5,5) with 10 bins\n [-5. -4. -3. -2. -1. 0. 1. 2. 3. 4. 5.]\n" + "Sample bins for interval (-5,5) with 10 bins\n", + " [-5. -4. -3. -2. -1. 0. 1. 2. 3. 4. 5.]\n" ] } ], @@ -247,11 +278,11 @@ ] }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "Let's now run a short simulation and observe those discrete environment values." - ], - "cell_type": "markdown", - "metadata": {} + ] }, { "cell_type": "code", @@ -259,10 +290,21 @@ "metadata": {}, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ - "(0, 0, -1, -3)\n(0, 0, -2, 0)\n(0, 0, -2, -3)\n(0, 1, -3, -6)\n(0, 2, -4, -9)\n(0, 3, -6, -12)\n(0, 2, -8, -9)\n(0, 3, -10, -13)\n(0, 4, -13, -16)\n(0, 4, -16, -19)\n(0, 4, -20, -17)\n(0, 4, -24, -20)\n" + "(0, 0, -1, -3)\n", + "(0, 0, -2, 0)\n", + "(0, 0, -2, -3)\n", + "(0, 1, -3, -6)\n", + "(0, 2, -4, -9)\n", + "(0, 3, -6, -12)\n", + "(0, 2, -8, -9)\n", + "(0, 3, -10, -13)\n", + "(0, 4, -13, -16)\n", + "(0, 4, -16, -19)\n", + "(0, 4, -20, -17)\n", + "(0, 4, -24, -20)\n" ] } ], @@ -279,11 +321,11 @@ ] }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "## Q-Table Structure" - ], - "cell_type": "markdown", - "metadata": {} + ] }, { "cell_type": "code", @@ -299,11 +341,11 @@ ] }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "## Let's Start Q-Learning!" 
- ], - "cell_type": "markdown", - "metadata": {} + ] }, { "cell_type": "code", @@ -323,8 +365,8 @@ "metadata": {}, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "0: 108.0, alpha=0.3, epsilon=0.9\n" ] @@ -370,11 +412,11 @@ ] }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "## Plotting Training Progress" - ], - "cell_type": "markdown", - "metadata": {} + ] }, { "cell_type": "code", @@ -382,25 +424,27 @@ "metadata": {}, "outputs": [ { - "output_type": "execute_result", "data": { "text/plain": [ "[]" ] }, + "execution_count": 20, "metadata": {}, - "execution_count": 20 + "output_type": "execute_result" }, { - "output_type": "display_data", "data": { - "text/plain": "
", + "image/png": "", "image/svg+xml": "\r\n\r\n\r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n\r\n", - "image/png": "\n" + "text/plain": [ + "
" + ] }, "metadata": { "needs_background": "light" - } + }, + "output_type": "display_data" } ], "source": [ @@ -408,11 +452,11 @@ ] }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "From this graph, it is not possible to tell anything, because due to the nature of stochastic training process the length of training sessions varies greatly. To make more sense of this graph, we can calculate **running average** over series of experiments, let's say 100. This can be done conveniently using `np.convolve`:" - ], - "cell_type": "markdown", - "metadata": {} + ] }, { "cell_type": "code", @@ -420,25 +464,27 @@ "metadata": {}, "outputs": [ { - "output_type": "execute_result", "data": { "text/plain": [ "[]" ] }, + "execution_count": 22, "metadata": {}, - "execution_count": 22 + "output_type": "execute_result" }, { - "output_type": "display_data", "data": { - "text/plain": "
", + "image/png": "", "image/svg+xml": "\r\n\r\n\r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n\r\n", - "image/png": "\n" + "text/plain": [ + "
" + ] }, "metadata": { "needs_background": "light" - } + }, + "output_type": "display_data" } ], "source": [ @@ -449,13 +495,13 @@ ] }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "## Varying Hyperparameters and Seeing the Result in Action\n", "\n", "Now it would be interesting to actually see how the trained model behaves. Let's run the simulation, and we will be following the same action selection strategy as during training: sampling according to the probability distribution in Q-Table: " - ], - "cell_type": "markdown", - "metadata": {} + ] }, { "cell_type": "code", @@ -475,14 +521,14 @@ ] }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "\n", "## Saving result to an animated GIF\n", "\n", "If you want to impress your friends, you may want to send them the animated GIF picture of the balancing pole. To do this, we can invoke `env.render` to produce an image frame, and then save those to animated GIF using PIL library:" - ], - "cell_type": "markdown", - "metadata": {} + ] }, { "cell_type": "code", @@ -490,8 +536,8 @@ "metadata": {}, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "360\n" ] @@ -516,5 +562,32 @@ "print(i)" ] } - ] -} \ No newline at end of file + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.10.4 64-bit", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.4" + }, + "orig_nbformat": 4, + "vscode": { + "interpreter": { + "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" + } + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}