From 4af4a32d1b5acb06d585ef7bb0a00c83810fe5c3 Mon Sep 17 00:00:00 2001
From: Antonin Raffin
Date: Mon, 22 Apr 2024 10:24:53 +0200
Subject: [PATCH] Update RL Tips and Tricks section

---
 docs/guide/rl_tips.rst  | 43 +++++++++++++++++++++--------------------
 docs/misc/changelog.rst |  1 +
 2 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/docs/guide/rl_tips.rst b/docs/guide/rl_tips.rst
index ae37640c7..c4f277f3e 100644
--- a/docs/guide/rl_tips.rst
+++ b/docs/guide/rl_tips.rst
@@ -4,7 +4,7 @@
 Reinforcement Learning Tips and Tricks
 ======================================
 
-The aim of this section is to help you do reinforcement learning experiments.
+The aim of this section is to help you run reinforcement learning experiments.
 It covers general advice about RL (where to start, which algorithm to choose,
 how to evaluate an algorithm, ...),
 as well as tips and tricks when using a custom environment or implementing an RL algorithm.
@@ -14,6 +14,11 @@ as well as tips and tricks when using a custom environment or implementing an RL
    this section in more details. You can also find the `slides here `_.
 
+.. note::
+
+   We also have a `video on Designing and Running Real-World RL Experiments `_, slides `can be found online `_.
+
+
 General advice when using Reinforcement Learning
 ================================================
 
@@ -103,19 +108,19 @@ and this `issue `_ by Cé
 Which algorithm should I use?
 =============================
 
-There is no silver bullet in RL, depending on your needs and problem, you may choose one or the other.
+There is no silver bullet in RL; you can choose one algorithm or another depending on your needs and problem.
 The first distinction comes from your action space, i.e., do you have discrete (e.g. LEFT, RIGHT, ...)
 or continuous actions (ex: go to a certain speed)?
 
-Some algorithms are only tailored for one or the other domain: ``DQN`` only supports discrete actions, where ``SAC`` is restricted to continuous actions.
+Some algorithms are only tailored for one or the other domain: ``DQN`` supports only discrete actions, while ``SAC`` is restricted to continuous actions.
 
-The second difference that will help you choose is whether you can parallelize your training or not.
+The second difference that will help you decide is whether you can parallelize your training or not.
 If what matters is the wall clock training time, then you should lean towards ``A2C`` and its derivatives (PPO, ...).
 Take a look at the `Vectorized Environments `_ to learn more about training with multiple workers.
 
-To accelerate training, you can also take a look at `SBX`_, which is SB3 + Jax, it has fewer features than SB3 but can be up to 20x faster than SB3 PyTorch thanks to JIT compilation of the gradient update.
+To accelerate training, you can also take a look at `SBX`_, which is SB3 + Jax; it has fewer features than SB3 but can be up to 20x faster than SB3 PyTorch thanks to JIT compilation of the gradient update.
 
-In sparse reward settings, we either recommend to use dedicated methods like HER (see below) or population-based algorithms like ARS (available in our :ref:`contrib repo `).
+In sparse reward settings, we recommend using either dedicated methods like HER (see below) or population-based algorithms like ARS (available in our :ref:`contrib repo `).
 
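+For instance, here is a minimal sketch of multiprocessed training with ``PPO`` (the environment id, number of workers and training budget are placeholders, not tuned values; take proper hyperparameters from the RL zoo):
+
+.. code-block:: python
+
+    from stable_baselines3 import PPO
+    from stable_baselines3.common.env_util import make_vec_env
+    from stable_baselines3.common.vec_env import SubprocVecEnv
+
+    if __name__ == "__main__":
+        # Run 8 copies of the environment in separate processes
+        # to reduce wall-clock training time
+        vec_env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
+        model = PPO("MlpPolicy", vec_env, verbose=1)
+        model.learn(total_timesteps=100_000)
+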
 To sum it up:
@@ -146,7 +151,7 @@ Continuous Actions
 Continuous Actions - Single Process
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Current State Of The Art (SOTA) algorithms are ``SAC``, ``TD3`` and ``TQC`` (available in our :ref:`contrib repo `).
+Current State Of The Art (SOTA) algorithms are ``SAC``, ``TD3``, ``CrossQ`` and ``TQC`` (available in our :ref:`contrib repo ` and :ref:`SBX (SB3 + Jax) repo `).
 Please use the hyperparameters in the `RL zoo `_ for best results.
 
 If you want an extremely sample-efficient algorithm, we recommend using the `DroQ configuration `_ in `SBX`_ (it does many gradient steps per step in the environment).
@@ -155,8 +160,7 @@ If you want an extremely sample-efficient algorithm, we recommend using the `Dro
 Continuous Actions - Multiprocessed
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Take a look at ``PPO``, ``TRPO`` (available in our :ref:`contrib repo `) or ``A2C``. Again, don't forget to take the hyperparameters from the `RL zoo `_
-for continuous actions problems (cf *Bullet* envs).
+Take a look at ``PPO``, ``TRPO`` (available in our :ref:`contrib repo `) or ``A2C``. Again, don't forget to take the hyperparameters from the `RL zoo `_ for continuous action problems (cf *Bullet* envs).
 
 .. note::
 
@@ -181,26 +185,23 @@ Tips and Tricks when creating a custom environment
 ==================================================
 
 If you want to learn about how to create a custom environment, we recommend you read this `page `_.
-We also provide a `colab notebook `_ for
-a concrete example of creating a custom gym environment.
+We also provide a `colab notebook `_ for a concrete example of creating a custom gym environment.
 
 Some basic advice:
 
-- always normalize your observation space when you can, i.e., when you know the boundaries
-- normalize your action space and make it symmetric when continuous (cf potential issue below) A good practice is to rescale your actions to lie in [-1, 1]. This does not limit you as you can easily rescale the action inside the environment
-- start with shaped reward (i.e. informative reward) and simplified version of your problem
-- debug with random actions to check that your environment works and follows the gym interface:
+- always normalize your observation space if you can, i.e. if you know the boundaries
+- normalize your action space and make it symmetric if it is continuous (see potential problem below). A good practice is to rescale your actions so that they lie in [-1, 1]. This does not limit you, as you can easily rescale the action within the environment (see the example below).
+- start with a shaped reward (i.e. informative reward) and a simplified version of your problem
+- debug with random actions to check if your environment works and follows the gym interface (with ``check_env``, see below)
 
-Two important things to keep in mind when creating a custom environment is to avoid breaking Markov assumption
+Two important things to keep in mind when creating a custom environment are to avoid breaking the Markov assumption
 and properly handle termination due to a timeout (maximum number of steps in an episode).
-For instance, if there is some time delay between action and observation (e.g. due to wifi communication), you should give a history of observations
-as input.
+For example, if there is a time delay between action and observation (e.g. due to wifi communication), you should provide a history of observations as input.
 
 Termination due to timeout (max number of steps per episode) needs to be handled separately.
 You should return ``truncated = True``.
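+
+Putting this advice together, here is a minimal sketch of a custom environment (the dynamics, bounds and reward are made up for illustration): it uses a symmetric normalized action space, rescales the action inside ``step()``, uses a shaped reward and returns ``truncated = True`` on timeout:
+
+.. code-block:: python
+
+    import gymnasium as gym
+    import numpy as np
+    from gymnasium import spaces
+
+
+    class GoToSpeedEnv(gym.Env):
+        """Toy example: reach a target speed (dynamics and reward are made up)."""
+
+        def __init__(self, max_episode_steps: int = 200):
+            super().__init__()
+            self.max_episode_steps = max_episode_steps
+            # Symmetric, normalized action space: the agent outputs values in [-1, 1]
+            self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
+            # Normalized observation space (the speed bounds are known)
+            self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
+            self._speed = 0.0
+            self._n_steps = 0
+
+        def reset(self, seed=None, options=None):
+            super().reset(seed=seed)
+            self._speed = 0.0
+            self._n_steps = 0
+            return np.array([self._speed], dtype=np.float32), {}
+
+        def step(self, action):
+            # Rescale the normalized action inside the env: [-1, 1] -> [-2, 2] (made-up range)
+            acceleration = 2.0 * float(action[0])
+            self._speed = float(np.clip(self._speed + 0.05 * acceleration, -1.0, 1.0))
+            self._n_steps += 1
+            # Shaped (informative) reward: negative distance to the target speed
+            reward = -abs(self._speed - 0.5)
+            terminated = abs(self._speed - 0.5) < 0.01  # task solved
+            truncated = self._n_steps >= self.max_episode_steps  # timeout, handled separately
+            return np.array([self._speed], dtype=np.float32), reward, terminated, truncated, {}
+
+The important parts are the ``Box`` spaces bounded by [-1, 1], the rescaling inside ``step()``, and returning ``truncated = True`` when the step budget is exhausted.
+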
 If you are using the gym ``TimeLimit`` wrapper, this will be done automatically.
-You can read `Time Limit in RL `_ or take a look at the `RL Tips and Tricks video `_
-for more details.
+You can read `Time Limit in RL `_ or take a look at the `Designing and Running Real-World RL Experiments video `_ and the `RL Tips and Tricks video `_ for more details.
 
 We provide a helper to check that your environment runs without error:
 
@@ -234,7 +235,7 @@ If you want to quickly try a random agent on your environment, you can also do:
 
 Most reinforcement learning algorithms rely on a Gaussian distribution (initially centered at 0 with std 1) for continuous actions.
 So, if you forget to normalize the action space when using a custom environment,
-this can harm learning and be difficult to debug (cf attached image and `issue #473 `_).
+this can harm learning and can be difficult to debug (cf attached image and `issue #473 `_).
 
 
 .. figure:: ../_static/img/mistake.png

diff --git a/docs/misc/changelog.rst b/docs/misc/changelog.rst
index 9080b6245..db065ee4a 100644
--- a/docs/misc/changelog.rst
+++ b/docs/misc/changelog.rst
@@ -13,6 +13,7 @@ Bug Fixes:
 
 Documentation:
 ^^^^^^^^^^^^^^
 - Updated SBX documentation (CrossQ and deprecated DroQ)
+- Updated RL Tips and Tricks section
 
 Release 2.3.0 (2024-03-31)
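
As a companion to the action-space advice in ``rl_tips.rst`` above, here is a short sketch (the environment and the wrapper choice are only examples) that rescales an existing environment's actions to [-1, 1] and runs the SB3 environment checker:

.. code-block:: python

    import gymnasium as gym
    from gymnasium.wrappers import RescaleAction

    from stable_baselines3.common.env_checker import check_env

    # Pendulum-v1 has a [-2, 2] action space; rescale it to the [-1, 1] range
    # that most continuous-action algorithms implicitly assume
    env = RescaleAction(gym.make("Pendulum-v1"), min_action=-1.0, max_action=1.0)

    # Check that the (wrapped) environment follows the Gym interface
    check_env(env, warn=True)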