diff --git a/README.md b/README.md index 9bf292f2..7aa4f8bd 100644 --- a/README.md +++ b/README.md @@ -22,22 +22,23 @@ # Basics -`lagom` balances between the flexibility and the userability when developing reinforcement learning (RL) algorithms. The library is built on top of [PyTorch](https://pytorch.org/) and provides modular tools to quickly prototype RL algorithms. However, we do not go overboard, because going too low level is rather time consuming and prone to potential bugs, while going too high level degrades the flexibility which makes it difficult to try out some crazy ideas. +`lagom` balances between the flexibility and the usability when developing reinforcement learning (RL) algorithms. The library is built on top of [PyTorch](https://pytorch.org/) and provides modular tools to quickly prototype RL algorithms. However, it does not go overboard, because too low level is often time consuming and prone to potential bugs, while too high level degrades the flexibility which makes it difficult to try out some crazy ideas fast. -We are continuously making `lagom` more 'self-contained' to run experiments quickly. Now, it internally supports base classes for multiprocessing ([master-worker framework](https://en.wikipedia.org/wiki/Master/slave_(technology))) to parallelize (e.g. experiments and evolution strategies). It also supports hyperparameter search by defining configurations either as grid search or random search. +We are continuously making `lagom` more 'self-contained' to set up and run experiments quickly. It internally supports base classes for multiprocessing ([master-worker framework](https://en.wikipedia.org/wiki/Master/slave_(technology))) for parallelization (e.g. experiments and evolution strategies). It also supports hyperparameter search by defining configurations either as grid search or random search. -One of the main pipelines to use `lagom` can be done as following: -1. Define environment and RL agent -2. 
User runner to collect data for agent -3. Define algorithm to train agent -4. Define experiment and configurations. +A common pipeline to use `lagom` can be done as following: +1. Define [environment](lagom/envs) and [agent](lagom/agents) (mainly for RL) +2. Use [runner](lagom/runner) to collect data (trajectories or segments) for agent +3. Define [engine](lagom/engine) for training and evaluating the agent +4. Define [algorithm](lagom/base_algo.py) +5. Define [experiment](lagom/experiment) and [configurations](lagom/experiment/configurator.py) A graphical illustration is coming soon. # Installation ## Install dependencies -Run the following command to install [all the dependencies](./requirements.txt): +Run the following command to install [all required dependencies](./requirements.txt): ```bash pip install -r requirements.txt @@ -53,7 +54,7 @@ We also provide some bash scripts in [scripts/](scripts/) directory to automatic ## Install lagom -Run the following command to install from source: +Run the following commands to install lagom from source: ```bash git clone https://github.com/zuoxingdong/lagom.git @@ -73,7 +74,7 @@ The documentation hosted by ReadTheDocs is available online at [http://lagom.rea # Examples -We shall continuously provide [examples/](examples/) to use lagom. +We are continuously providing [examples/](examples/) to use lagom. 
# Test @@ -86,7 +87,6 @@ pytest test -v # Roadmap ## Core - - Readthedocs Documentation - Tutorials ## More standard RL baselines - TRPO/PPO @@ -99,7 +99,6 @@ pytest test -v ## More standard networks - Monte Carlo Dropout/Concrete Dropout ## Misc - - VecEnv: similar to that of OpenAI baseline - Support pip install - Technical report diff --git a/examples/es/rl/README.md b/examples/es/rl/README.md index 00583572..7621c412 100644 --- a/examples/es/rl/README.md +++ b/examples/es/rl/README.md @@ -14,4 +14,4 @@ One could modify [experiment.py](./experiment.py) to quickly set up different co # Results - + diff --git a/examples/policy_gradient/README.md b/examples/policy_gradient/README.md index 01367380..0e5455e3 100644 --- a/examples/policy_gradient/README.md +++ b/examples/policy_gradient/README.md @@ -1,5 +1,5 @@ -We benchmark three baselines for policy gradient method in several different perspectives -1. REINFORCE -2. Actor-Critic/Vanilla Policy Gradient -3. Advantage Actor-Critic (A2C) +This example includes the implementations of the following policy gradient algorithms: +- [REINFORCE](reinforce) +- [Vanilla Policy Gradient (VPG)](vpg) +- [Advantage Actor-Critic (A2C)](a2c) diff --git a/examples/policy_gradient/a2c/README.md b/examples/policy_gradient/a2c/README.md new file mode 100644 index 00000000..1a2742ff --- /dev/null +++ b/examples/policy_gradient/a2c/README.md @@ -0,0 +1,17 @@ +# Advantage Actor Critic (A2C) + +This is an implementation of [A2C](https://blog.openai.com/baselines-acktr-a2c/) algorithm. + +# Usage + +Run the following command to start parallelized training: + +```bash +python main.py +``` + +One could modify [experiment.py](./experiment.py) to quickly set up different configurations. 
+ +# Results + + diff --git a/examples/policy_gradient/a2c/experiment.py b/examples/policy_gradient/a2c/experiment.py index b508ee3e..cff242e0 100644 --- a/examples/policy_gradient/a2c/experiment.py +++ b/examples/policy_gradient/a2c/experiment.py @@ -28,6 +28,7 @@ def make_configs(self): configurator.fixed('algo.gamma', 0.99) configurator.fixed('agent.standardize_Q', False) # whether to standardize discounted returns + configurator.fixed('agent.standardize_adv', True) # whether to standardize advantage estimates configurator.fixed('agent.max_grad_norm', 0.5) # grad clipping, set None to turn off configurator.fixed('agent.entropy_coef', 0.01) configurator.fixed('agent.value_coef', 0.5) diff --git a/examples/policy_gradient/a2c/main.ipynb b/examples/policy_gradient/a2c/main.ipynb new file mode 100644 index 00000000..fc472a7d --- /dev/null +++ b/examples/policy_gradient/a2c/main.ipynb @@ -0,0 +1,232 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/zuo/Code/lagom/lagom/core/plotter/__init__.py:9: UserWarning: ImageViewer failed to import due to pyglet. \n", + " warnings.warn('ImageViewer failed to import due to pyglet. ')\n" + ] + } + ], + "source": [ + "from pathlib import Path\n", + "from lagom.experiment import Configurator\n", + "\n", + "from lagom import pickle_load\n", + "\n", + "from lagom.core.plotter import CurvePlot" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " ID cuda env.id env.standardize network.hidden_sizes algo.lr \\\n", + "0 0 True HalfCheetah-v2 True [64, 64] 0.001 \n", + "\n", + " algo.use_lr_scheduler algo.gamma agent.standardize_Q \\\n", + "0 True 0.99 False \n", + "\n", + " agent.standardize_adv ... agent.constant_std \\\n", + "0 True ... None \n", + "\n", + " agent.std_state_dependent agent.init_std train.timestep train.N train.T \\\n", + "0 False 0.5 1000000.0 16 5 \n", + "\n", + " eval.N log.record_interval log.print_interval log.dir \n", + "0 10 100 1000 logs \n", + "\n", + "[1 rows x 25 columns]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "log_folder = Path('logs')\n", + "\n", + "list_config = pickle_load(log_folder/'configs.pkl')\n", + "configs = Configurator.to_dataframe(list_config)\n", + "configs" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "def load_results(log_folder, ID, f):\n", + " p = Path(log_folder)/str(ID)\n", + " \n", + " list_result = []\n", + " for sub in p.iterdir():\n", + " if sub.is_dir() and (sub/f).exists():\n", + " list_result.append(pickle_load(sub/f))\n", + " \n", + " return list_result\n", + "\n", + "\n", + "def get_returns(list_result):\n", + " returns = []\n", + " for result in list_result:\n", + " #x_values = [i['evaluation_iteration'][0] for i in result]\n", + " x_values = [i['accumulated_trained_timesteps'][0] for i in result]\n", + " y_values = [i['average_return'][0] for i in result]\n", + " returns.append([x_values, y_values])\n", + " \n", + " return returns\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "ID = 0\n", + "env_id = configs.loc[configs['ID'] == ID]['env.id'].values[0]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "list_result = load_results('logs', ID, 
'eval_logs.pkl')\n", + "returns = get_returns(list_result)\n", + "x_values, y_values = zip(*returns)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot = CurvePlot()\n", + "plot.add('A2C', y_values, xvalues=x_values)\n", + "ax = plot(title=f'A2C on {env_id}', \n", + " xlabel='Iteration', \n", + " ylabel='Mean Episode Reward', \n", + " num_tick=6, \n", + " xscale_magnitude=None)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ax.figure.savefig('data/result.png')" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.6" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/policy_gradient/reinforce/README.md b/examples/policy_gradient/reinforce/README.md new file mode 100644 index 00000000..c4b3b215 --- /dev/null +++ b/examples/policy_gradient/reinforce/README.md @@ -0,0 +1,17 @@ +# REINFORCE + +This is an implementation of [REINFORCE](https://link.springer.com/article/10.1007/BF00992696) algorithm. + +# Usage + +Run the following command to start parallelized training: + +```bash +python main.py +``` + +One could modify [experiment.py](./experiment.py) to quickly set up different configurations. 
+ +# Results + + diff --git a/examples/policy_gradient/reinforce/experiment.py b/examples/policy_gradient/reinforce/experiment.py index 7052fcf3..a2035059 100644 --- a/examples/policy_gradient/reinforce/experiment.py +++ b/examples/policy_gradient/reinforce/experiment.py @@ -18,7 +18,7 @@ def make_configs(self): configurator.fixed('cuda', True) # whether to use GPU - configurator.fixed('env.id', 'Reacher-v2') + configurator.fixed('env.id', 'HalfCheetah-v2') configurator.fixed('env.standardize', True) # whether to use VecStandardize configurator.fixed('network.hidden_sizes', [64, 64]) diff --git a/examples/policy_gradient/reinforce/main.ipynb b/examples/policy_gradient/reinforce/main.ipynb index 30d01b74..2a46a7af 100644 --- a/examples/policy_gradient/reinforce/main.ipynb +++ b/examples/policy_gradient/reinforce/main.ipynb @@ -77,18 +77,18 @@ " 0\n", " 0\n", " True\n", - " Reacher-v2\n", + " HalfCheetah-v2\n", " True\n", - " [64]\n", + " [64, 64]\n", " 0.001\n", - " False\n", + " True\n", " 0.99\n", " True\n", " 0.5\n", " ...\n", " None\n", " False\n", - " 0.5\n", + " 1.0\n", " 1000000.0\n", " 1\n", " 200\n", @@ -103,17 +103,17 @@ "" ], "text/plain": [ - " ID cuda env.id env.standardize network.hidden_sizes algo.lr \\\n", - "0 0 True Reacher-v2 True [64] 0.001 \n", + " ID cuda env.id env.standardize network.hidden_sizes algo.lr \\\n", + "0 0 True HalfCheetah-v2 True [64, 64] 0.001 \n", "\n", " algo.use_lr_scheduler algo.gamma agent.standardize_Q \\\n", - "0 False 0.99 True \n", + "0 True 0.99 True \n", "\n", " agent.max_grad_norm ... agent.constant_std \\\n", "0 0.5 ... 
None \n", "\n", " agent.std_state_dependent agent.init_std train.timestep train.N train.T \\\n", - "0 False 0.5 1000000.0 1 200 \n", + "0 False 1.0 1000000.0 1 200 \n", "\n", " eval.N log.record_interval log.print_interval log.dir \n", "0 10 100 1000 logs \n", @@ -136,7 +136,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -164,23 +164,33 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 17, "metadata": {}, "outputs": [], "source": [ - "list_result = load_results('logs', 0, 'eval_logs.pkl')\n", + "ID = 0\n", + "env_id = configs.loc[configs['ID'] == ID]['env.id'].values[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "list_result = load_results('logs', ID, 'eval_logs.pkl')\n", "returns = get_returns(list_result)\n", "x_values, y_values = zip(*returns)" ] }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 20, "metadata": {}, "outputs": [ { "data": { - "image/png": "\n", + "image/png": "\n", "text/plain": [ "
" ] @@ -192,12 +202,21 @@ "source": [ "plot = CurvePlot()\n", "plot.add('REINFORCE', y_values, xvalues=x_values)\n", - "ax = plot(title='REINFORCE', \n", + "ax = plot(title=f'REINFORCE on {env_id}', \n", " xlabel='Iteration', \n", " ylabel='Mean Episode Reward', \n", " num_tick=6, \n", " xscale_magnitude=None)" ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "ax.figure.savefig('data/result.png')" + ] } ], "metadata": { diff --git a/examples/policy_gradient/vpg/README.md b/examples/policy_gradient/vpg/README.md new file mode 100644 index 00000000..76d38e2c --- /dev/null +++ b/examples/policy_gradient/vpg/README.md @@ -0,0 +1,17 @@ +# Vanilla Policy Gradient (VPG) + +This is an implementation of [VPG](http://rll.berkeley.edu/deeprlcoursesp17/docs/lec2.pdf) algorithm. + +# Usage + +Run the following command to start parallelized training: + +```bash +python main.py +``` + +One could modify [experiment.py](./experiment.py) to quickly set up different configurations. 
+ +# Results + + diff --git a/examples/policy_gradient/vpg/experiment.py b/examples/policy_gradient/vpg/experiment.py index d2077b75..6a237289 100644 --- a/examples/policy_gradient/vpg/experiment.py +++ b/examples/policy_gradient/vpg/experiment.py @@ -28,6 +28,7 @@ def make_configs(self): configurator.fixed('algo.gamma', 0.99) configurator.fixed('agent.standardize_Q', True) # whether to standardize discounted returns + configurator.fixed('agent.standardize_adv', False) # whether to standardize advantage estimates configurator.fixed('agent.max_grad_norm', 0.5) # grad clipping, set None to turn off configurator.fixed('agent.entropy_coef', 0.01) configurator.fixed('agent.value_coef', 0.5) diff --git a/examples/policy_gradient/vpg/main.ipynb b/examples/policy_gradient/vpg/main.ipynb new file mode 100644 index 00000000..01c3cabd --- /dev/null +++ b/examples/policy_gradient/vpg/main.ipynb @@ -0,0 +1,234 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "from lagom.experiment import Configurator\n", + "\n", + "from lagom import pickle_load\n", + "\n", + "from lagom.core.plotter import CurvePlot" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " ID cuda env.id env.standardize network.hidden_sizes algo.lr \\\n", + "0 0 True HalfCheetah-v2 True [64, 64] 0.001 \n", + "\n", + " algo.use_lr_scheduler algo.gamma agent.standardize_Q \\\n", + "0 True 0.99 True \n", + "\n", + " agent.max_grad_norm ... agent.constant_std \\\n", + "0 0.5 ... None \n", + "\n", + " agent.std_state_dependent agent.init_std train.timestep train.N train.T \\\n", + "0 False 0.5 1000000.0 1 200 \n", + "\n", + " eval.N log.record_interval log.print_interval log.dir \n", + "0 10 100 1000 logs \n", + "\n", + "[1 rows x 24 columns]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "log_folder = Path('logs')\n", + "\n", + "list_config = pickle_load(log_folder/'configs.pkl')\n", + "configs = Configurator.to_dataframe(list_config)\n", + "configs" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "def load_results(log_folder, ID, f):\n", + " p = Path(log_folder)/str(ID)\n", + " \n", + " list_result = []\n", + " for sub in p.iterdir():\n", + " if sub.is_dir() and (sub/f).exists():\n", + " list_result.append(pickle_load(sub/f))\n", + " \n", + " return list_result\n", + "\n", + "\n", + "def get_returns(list_result):\n", + " returns = []\n", + " for result in list_result:\n", + " #x_values = [i['evaluation_iteration'][0] for i in result]\n", + " x_values = [i['accumulated_trained_timesteps'][0] for i in result]\n", + " y_values = [i['average_return'][0] for i in result]\n", + " returns.append([x_values, y_values])\n", + " \n", + " return returns\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "ID = 0\n", + "env_id = configs.loc[configs['ID'] == ID]['env.id'].values[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "list_result = load_results('logs', ID, 'eval_logs.pkl')\n", 
+ "returns = get_returns(list_result)\n", + "x_values, y_values = zip(*returns)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plot = CurvePlot()\n", + "plot.add('VPG', y_values, xvalues=x_values)\n", + "ax = plot(title=f'VPG on {env_id}', \n", + " xlabel='Iteration', \n", + " ylabel='Mean Episode Reward', \n", + " num_tick=6, \n", + " xscale_magnitude=None)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "ax.figure.savefig('data/result.png')" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.6" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/lagom/agents/a2c_agent.py b/lagom/agents/a2c_agent.py index db7985bf..9d878bc3 100644 --- a/lagom/agents/a2c_agent.py +++ b/lagom/agents/a2c_agent.py @@ -57,7 +57,6 @@ def learn(self, D): for segment in D: # iterate over segments # Get all boostrapped discounted returns as estimate of Q Qs = segment.all_bootstrapped_discounted_returns - # Standardize: encourage/discourage half of performed actions if self.config['agent.standardize_Q']: Qs = Standardize()(Qs).tolist() @@ -69,6 +68,9 @@ def learn(self, D): # Advantage estimates As = [Q - V.item() for Q, V in zip(Qs, Vs)] + # Standardize advantage: encourage/discourage half of performed actions + if self.config['agent.standardize_adv']: + As = Standardize()(As).tolist() # Get all log-probabilities and entropies logprobs = segment.all_info('action_logprob') diff --git a/lagom/agents/vpg_agent.py b/lagom/agents/vpg_agent.py index 7c6cf848..b2bbb41d 100644 --- a/lagom/agents/vpg_agent.py +++ b/lagom/agents/vpg_agent.py @@ -41,7 +41,6 @@ def learn(self, D): for trajectory in D: # iterate over trajectories # Get all discounted 
returns as estimate of Q Qs = trajectory.all_discounted_returns - # Standardize: encourage/discourage half of performed actions if self.config['agent.standardize_Q']: Qs = Standardize()(Qs).tolist() @@ -53,6 +52,9 @@ def learn(self, D): # Advantage estimates As = [Q - V.item() for Q, V in zip(Qs, Vs)] + # Standardize advantage: encourage/discourage half of performed actions + if self.config['agent.standardize_adv']: + As = Standardize()(As).tolist() # Get all log-probabilities and entropies logprobs = trajectory.all_info('action_logprob')
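
The `agent.standardize_adv` option added to both agents standardizes the advantage estimates after they are computed as `Q - V`. Below is a minimal, dependency-free sketch of that step. The helper names `standardize` and `advantages`, the `eps` term, and the use of the population standard deviation are assumptions made for illustration; lagom's own `Standardize` transform may differ in these details.

```python
import math


def standardize(values, eps=1e-8):
    """Shift values to zero mean and scale them to (roughly) unit std.

    Hypothetical stand-in for lagom's Standardize transform; eps guards
    against division by zero when all values are identical.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / (std + eps) for v in values]


def advantages(Qs, Vs, standardize_adv=True):
    """Compute advantage estimates A = Q - V, optionally standardized."""
    As = [Q - V for Q, V in zip(Qs, Vs)]
    if standardize_adv:
        As = standardize(As)
    return As


# Toy data: four returns and a constant value baseline.
Qs = [1.0, 2.0, 3.0, 4.0]
Vs = [0.5, 0.5, 0.5, 0.5]

raw_As = advantages(Qs, Vs, standardize_adv=False)
std_As = advantages(Qs, Vs, standardize_adv=True)
print(raw_As)
print(std_As)
```

In this toy case every raw advantage is positive, so all four actions would be encouraged. After standardization the estimates are centered at zero, so two of them become negative and two stay positive, which is the effect the comment "encourage/discourage half of performed actions" refers to.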