RL quickstart guide
Joseph Suarez committed Nov 16, 2024
1 parent 6d80c2e commit b0f867b
Showing 1 changed file with 48 additions and 6 deletions.
54 changes: 48 additions & 6 deletions docs/blog.html
@@ -11,14 +11,56 @@
<main class="content">
<nav class="nav-box">
<ul>
<li><a href="#post-1">The Puffer Stack</a></li>
<li><a href="#post-2">PufferLib 0.7: Puffing Up Performance with Shared Memory</a></li>
<li><a href="#post-3">PufferLib 0.6: 🐡🌊 An Ocean of Environments for Learning Pufferfish</a></li>
<li><a href="#post-4">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a></li>
<li><a href="#post-5">PufferLib 0.4: Ready to Take on Bigger Fish</a></li>
<li><a href="#post-6">PufferLib 0.2: Ready to Take on the Big Fish</a></li>
<li><a href="#post-2">Reinforcement Learning Quickstart Guide</a></li>

<li><a href="#post-2">The Puffer Stack</a></li>
<li><a href="#post-3">PufferLib 0.7: Puffing Up Performance with Shared Memory</a></li>
<li><a href="#post-4">PufferLib 0.6: 🐡🌊 An Ocean of Environments for Learning Pufferfish</a></li>
<li><a href="#post-5">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a></li>
<li><a href="#post-6">PufferLib 0.4: Ready to Take on Bigger Fish</a></li>
<li><a href="#post-7">PufferLib 0.2: Ready to Take on the Big Fish</a></li>
</ul>
</nav>
<article id="post-1" class="blog-post">
<header class="post-header">
<h1>Reinforcement Learning Quickstart Guide</h1>
</header>
<p>So you want to learn reinforcement learning? It's a hard mountain to climb, but I'm going to give you some of the best tricks and insights from my playbook. Star PufferLib on GitHub if you learn something useful. It's the library I'm building to make RL fast and sane.</p>

<h2>What is RL?</h2>
<p>(Deep) reinforcement learning is a branch of ML focused on learning through interaction. You are training an agent or policy. Both of these just mean neural network. The world, game, or sim the agent is interacting with is called the environment, which is in a particular state at any point in time. The agent makes an observation of the state at each timestep. That's the data it sees and can use to make decisions. In some environments, this is simply the full state, in which case we say the environment is fully observed. Otherwise, it is partially observed. In math-heavy RL literature, you will see these described by Markov Decision Processes (MDPs) and Partially Observable MDPs (POMDPs). Consider this nomenclature optional. The agent makes an action based on the observation, which is then used to step the environment forward in time. The environment then returns a reward based on what happens. This can be 0 and often is, but it might be 0.5 if the agent scores a point and -1 if the agent dies, for example.</p>
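<p>To make that loop concrete, here is a minimal sketch using the Gymnasium-style reset/step API. The environment name and the random action choice are just placeholders for a real policy.</p>
<pre><code>
# Minimal agent-environment loop (Gymnasium-style API).
# "CartPole-v1" and the random actions are placeholders for a real agent.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # a real agent would map obs to action
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f"Episode return: {total_reward}")
</code></pre>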

<h2>Fundamentals</h2>
<p>There are a few different common classes of algorithms, as well as some stuff masquerading as RL that really isn't. I'd rather eat broken glass than read academic papers all day, so I'll keep the background material light.</p>
<p><strong>On-policy:</strong> You learn a function that maps observations to actions. Read Karpathy's blog post on policy gradients, then skip right to the PPO paper. You might need to refer to the Generalized Advantage Estimation paper for context.</p>
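<p>If it helps to see the objective, here is a rough sketch of a vanilla policy gradient loss in PyTorch. It assumes a policy network that outputs discrete-action logits; the tensor names are illustrative, not from any particular library.</p>
<pre><code>
# Vanilla policy gradient loss for one trajectory (REINFORCE-style sketch).
# `policy` maps observations to action logits; `obs`, `actions`, and
# `returns` are tensors collected from the environment.
import torch
import torch.nn.functional as F

def policy_gradient_loss(policy, obs, actions, returns):
    logits = policy(obs)                                   # [T, num_actions]
    log_probs = F.log_softmax(logits, dim=-1)
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Maximize return-weighted log-probability = minimize its negative
    return -(taken * returns).mean()
</code></pre>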
<p><strong>Off-policy:</strong> You learn a function that maps (observation, action) to some value that tells you how good that action is. The agent then acts either by selecting the highest-value action, or sometimes by sampling. You should absolutely read DQN as one of the first fundamental papers in this area. Then, skip right to Rainbow because it summarizes and cites most of the intermediate improvements anyways. Nowadays, Soft Actor Critic is the most widely used off-policy algorithm. (Edit: Yes, I am aware that off-policy is typically defined as training on data that doesn't come from the policy. Stay tuned for a follow-up on why this distinction is largely irrelevant and misleading)</p>
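<p>For flavor, here is a rough sketch of the one-step TD target and loss that DQN-style methods optimize. The network names are illustrative, and the done flags are assumed to be a float mask.</p>
<pre><code>
# One-step TD target used by DQN-style off-policy methods (sketch).
# `q_net` and `target_net` map observations to one Q-value per action.
import torch
import torch.nn.functional as F

@torch.no_grad()
def td_target(target_net, rewards, next_obs, dones, gamma=0.99):
    # dones: float mask, 1.0 where the episode ended
    next_q = target_net(next_obs).max(dim=1).values   # value of the best action
    return rewards + gamma * (1.0 - dones) * next_q

def dqn_loss(q_net, target_net, obs, actions, rewards, next_obs, dones):
    q = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q, td_target(target_net, rewards, next_obs, dones))
</code></pre>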
<p>Researchers made a big deal about on-policy vs. off-policy in the mid-late 2010s. It really doesn't matter that much. There have even been theoretical results showing some equivalences. Kind of expected. On-policy maps observations to actions but usually also predicts a value function. Off-policy predicts a sort of action-conditioned value function. The big difference is that off-policy algorithms are almost always trained with some sort of experience replay, meaning that the algorithm collects and samples training data from a big buffer. In contrast, on-policy methods usually use data as soon as it is collected... but the batch sizes can get pretty huge, so again, the differences are exaggerated.</p>
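<p>If "experience replay" sounds fancy, it isn't: it's just a big container you push transitions into and sample minibatches from. A minimal, illustrative version (not PufferLib's):</p>
<pre><code>
# Minimal uniform experience replay buffer (illustrative).
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size=256):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))   # columns: obs, actions, rewards, ...
</code></pre>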
<p><strong>Model based:</strong> is poorly named. In this context, "model" refers to the environment. Your agent is trained to directly predict future observations. You can use this as an auxiliary loss or even use the learned world model to simulate new training data. Model-based training is intuitively appealing and has shown some impressive results, but some of the recent lit is a bit dodgy. Most of RL is model free. I suggest the original World Models paper by David Ha & Schmidhuber.</p>
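<p>In case it's unclear what the "model" actually predicts, here is a tiny sketch of a one-step world model trained to predict the next observation, which could serve as an auxiliary loss. Everything here is illustrative.</p>
<pre><code>
# Sketch of a one-step world model: predict the next observation from the
# current observation and (encoded) action. Usable as an auxiliary loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WorldModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs, action):
        return self.net(torch.cat([obs, action], dim=-1))

def model_loss(model, obs, action, next_obs):
    return F.mse_loss(model(obs, action), next_obs)
</code></pre>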
<p><strong>Offline RL:</strong> is not RL. It's supervised learning on a fixed set of observations, actions, and rewards usually collected from humans or by an expert policy. This is similar to imitation learning or behavioral cloning, but with the addition of a reward signal. Either way, it is missing the key element of learning through interaction, since the policy does not have any control over its data collection.</p>
<p><strong>Multiagent RL:</strong> is the same as single-agent RL except that some of the environments and tools are jank. The most common approach is to use the same policy for all agents, applied independently. This is as if you had N single-agent environments instead of one N-agent environment. See? No different from single-agent. You can also compute actions jointly just by concatenating all the observations for a single environment together. Sometimes researchers come up with separate algorithm names to describe these techniques, like IPPO and MAPPO... but this is really all there is to it. There are also some dedicated multiagent algorithms, but having worked extensively in multiagent RL, I've found the results from those to be pretty mixed.</p>
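<p>Here is a sketch of the shared-policy approach: batch the per-agent observations, run one forward pass, and hand each agent its action back. Names and shapes are illustrative.</p>
<pre><code>
# One shared policy applied independently to N agents by batching observations.
# `policy` maps a [N, obs_dim] tensor to action logits; names are illustrative.
import torch

def act_shared_policy(policy, agent_obs: dict):
    ids = list(agent_obs.keys())
    batch = torch.stack([agent_obs[i] for i in ids])        # [N, obs_dim]
    logits = policy(batch)                                   # [N, num_actions]
    actions = torch.distributions.Categorical(logits=logits).sample()
    return {i: a.item() for i, a in zip(ids, actions)}       # per-agent actions
</code></pre>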

<h2>Perspective</h2>
<p>Learn which areas of research to pay attention to and which you can ignore. The large-scale industry papers are great for developing this intuition. If I had to pick just one, it's OpenAI Five. PPO with simple historical self-play solves Dota 2. There's a lot more in that paper, too. The core architecture is a 1-layer, 4096-dim LSTM. The other papers are AlphaStar, Learning Dexterity, Emergent Tool Use, and Capture the Flag in roughly that order. Don't forget about the whole AlphaGo line from DeepMind!</p>
<p>So why is this relevant? There's some important missing context here... RL is very sensitive to hyperparameters, and many of the common benchmarks are slow. Couple this with starving academic budgets and you inevitably get a lot of bad science. Algorithm A does 20% better than algorithm B, but hyperparameters alone make a difference of 3x. So why even bother developing fancy new algorithms if you can't test them properly? Well, that's how you get published. And a lot of the people developing faster envs were treated so badly by academia that they took their ball and went home (hint, that's why I just write blogs now!).</p>
<p>So how do you know what lines of work are promising? Look for papers with comprehensive experiments and ablations. Especially the ones that do this on one core idea. I particularly like the OpenAI blog post "How AI Training Scales" and the paper "Scaling Laws for Single-Agent Reinforcement Learning". Personally, I think the most promising thing right now is to just rerun old work with more experiments on faster environments. We're developing tons of these at Puffer, so you can run hundreds of experiments per GPU per day. If that sounds boring, learn to be excited by the result rather than the method. The goal is to understand, and science is just one tool for doing so.</p>
<p>I also avoid work that advances research now at the price of making it slower in the future. Anything introducing slow environments or expensive training had better have a very good reason for it. Conversely, anything that improves the pace of research is shortlisted. DreamerV3 caught my eye because it worked with one set of hyperparameters... but that was before blowing 10,000 A100 hours on ablation studies. You won't always be right!</p>
<p>When I'm assessing a new area of work, I always look for wrong fundamental assumptions. For example, a lot of work in curiosity or exploration doesn't reasonably define those terms. Several papers in this area abuse human intuition to propose environments that look easy, but are actually hard or impossible to learn tabula rasa, or from a blank slate. That doesn't mean I won't consider any of the ideas from these papers... but I'm going to assume that the results don't generalize until a mountain of evidence proves otherwise.</p>

<h2>Things I use a lot</h2>
<p><strong>PPO:</strong> This is my go-to algorithm. It's simple and solved Dota 2. Actually, it's simpler than most people appreciate. The way to think about PPO is vanilla policy gradients + GAE. That's just fancy exponential reward discounting with a value function baseline. Then, it adds policy clipping. This just means the policy can't change too much on any single update. Clipping lets PPO use the same batch of observations for multiple gradient updates. But if your environment is really fast, there's not much reason to do that, since new data is free. So it's just a simple and effective sample efficiency hack. Read Costa's 37 PPO implementation details blog if you really want to understand the algorithm.</p>
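<p>For reference, here is a sketch of the clipped surrogate objective (policy term only). Value and entropy terms are omitted, and the tensor names are illustrative.</p>
<pre><code>
# PPO's clipped surrogate objective, policy term only (sketch).
# `log_probs` come from the current policy, `old_log_probs` from the policy
# that collected the data, `advantages` from GAE.
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_coef=0.2):
    ratio = (log_probs - old_log_probs).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef) * advantages
    return -torch.min(unclipped, clipped).mean()
</code></pre>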
<p><strong>Hyperparameter Intuition:</strong> I'll cover these for PPO, but several are common across many algorithms. Learning rate, gamma, and lambda are the most important. You always sweep learning rate. Gamma and lambda are GAE parameters that relate to the effective horizon of your task. For instance, in Pong, you don't need to look ahead more than a couple of seconds. If the framerate is 15, then there are 30 frames in 2 seconds, so I might try 1 - 1/30 ≈ 0.97 as a starting point. Lambda is usually set a bit below gamma, so I would try 0.95. This should at least give you a decent starting point for an automated sweep. I leave clipping parameters at 0.1 or 0.2. Tuning these lower will cause aggressive "on rails" runs that learn well for a while before diverging. Batch size, minibatch size, and number of environments should be set based on hardware. For my fast environments with small networks, a forward pass over 4096 observations is hardware efficient, so I use 4096 environments. Then I multiply by 128 to get the batch size. The reason for this is to allow GAE to compute discounted returns over 128-length trajectory segments. Decrease this if your environment has very short horizons or increase it for longer ones. I set minibatch size to a quarter of the batch size by default, but I also set this based on GPU memory. Update epochs is 1 for fast environments or 3-4 for slow environments. You can go higher, but then you have to also worry about KL targets.</p>
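<p>A back-of-the-envelope sketch of that arithmetic, with made-up numbers you'd swap out for your own environment:</p>
<pre><code>
# Back-of-the-envelope hyperparameter starting points (illustrative numbers).
fps = 15                      # environment frames per second
horizon_seconds = 2           # how far ahead the agent needs to look
horizon_steps = fps * horizon_seconds        # 30
gamma = 1 - 1 / horizon_steps                # ~0.97
gae_lambda = 0.95                            # a bit below gamma

num_envs = 4096               # sized for an efficient GPU forward pass
segment_length = 128          # trajectory length for GAE / discounted returns
batch_size = num_envs * segment_length       # 524,288 steps per update
minibatch_size = batch_size // 4             # also bounded by GPU memory
</code></pre>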
<p><strong>CARBS:</strong> A really good hyperparameter tuning algorithm from ImbueAI. We have bindings in PufferLib, and it is way better than standard random search or Bayesian optimization. We're still learning how best to configure this, but even if you do it wrong, it's still usually pretty good. As I mentioned above, don't sweep clip coefficients, otherwise you get some pretty nutty runs.</p>
<p><strong>Common Architectures:</strong> I use an LSTM by default because it's fast and PufferLib makes adding one trivial. This replaces the main hidden layer, so networks can be as simple as fc-relu-lstm-fc-atn/val with 128-512 hidden dim. For 2d data, I will usually use a stack of 2-3 convolution layers with ReLUs as an encoder. Avoid redundant fully connected layers when combining data from multiple sources, such as flat and 2d data. Deeper networks are not always better in RL, and they can sometimes be much harder to train. Also, know that RL tends to be more data hungry than other areas of AI, so you are often better off running more samples on a smaller network. That presumes your problem has a fast simulator. Feel free to experiment more here if you don't. The ResNet architecture from the IMPALA paper is a decent, slower option.</p>
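<p>A minimal PyTorch sketch of that fc-relu-lstm-actor/critic layout. Sizes are illustrative, and PufferLib has its own wrappers for handling LSTM state, so treat this as a shape reference rather than the library's implementation.</p>
<pre><code>
# fc-relu-lstm-fc-atn/val layout (sketch). Expects obs of shape [B, T, obs_dim].
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim, num_actions, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.actor = nn.Linear(hidden, num_actions)   # action logits
        self.critic = nn.Linear(hidden, 1)            # value estimate

    def forward(self, obs, state=None):
        x = self.encoder(obs)                         # [B, T, hidden]
        x, state = self.lstm(x, state)
        return self.actor(x), self.critic(x), state
</code></pre>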
<p><strong>Normalize your data:</strong> Observations should be divided into discrete and continuous. One-hot or embed discrete data. Divide continuous data by its maximum value per channel. Do not do this with running mean and standard deviation statistics. Just say "max health is 100, so I will divide by 100," etc. Do the same for rewards.</p>
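<p>A tiny illustrative example, with made-up field names and maxima:</p>
<pre><code>
# Normalize continuous fields by known maxima, embed discrete fields.
# The field names and maxima are made-up examples.
import torch
import torch.nn as nn

MAX_HEALTH, MAX_GOLD = 100.0, 10_000.0
item_embed = nn.Embedding(num_embeddings=64, embedding_dim=8)  # discrete field

def preprocess(health, gold, item_id):
    # health, gold: float tensors [B]; item_id: long tensor [B]
    continuous = torch.stack([health / MAX_HEALTH, gold / MAX_GOLD], dim=-1)
    discrete = item_embed(item_id)                     # [B, 8]
    return torch.cat([continuous, discrete], dim=-1)   # [B, 10]
</code></pre>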
<p><strong>Designing Rewards:</strong> These days, I just pick 3-5 things in the environment that are relevant to performance. For my MOBA, I did agent dying, getting xp, and taking a tower. I come up with rough guesses of the coefficients in the range of -5 to 5 (-0.5 to 0.5 for more common rewards). Then, I add the reward components to a hyperparameter sweep and tune them automatically. Be careful with continuous rewards. If you want an agent to go to a target, 1 for getting closer and -1 for getting farther is way better than just negative distance to target. The reason is that if the agent gets 0.01 closer, it might have a reward of -0.95 one step and -0.94 the next. Not much of a magnitude change to differentiate.</p>
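<p>A toy comparison of the two reward schemes for a go-to-target task (values are illustrative):</p>
<pre><code>
# Two ways to reward "go to the target". The sign-of-progress version gives a
# much clearer learning signal than raw negative distance.
def distance_reward(dist):
    return -dist                  # e.g. -0.95 then -0.94: tiny relative change

def progress_reward(prev_dist, dist):
    if dist > prev_dist:
        return -1.0               # got farther
    if prev_dist > dist:
        return 1.0                # got closer
    return 0.0
</code></pre>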
<p><strong>Whitebox software:</strong> Try not to over-modularize RL. CleanRL provides single-file training implementations, and if you talk to any researchers, you'll quickly find that it's the best thing since sliced bread. Anything and everything can go wrong in RL, and you don't want to be digging through several layers of abstraction searching for the issue. I can't tell you how many times I've seen environments break because someone forgot they were using some wrapper that no longer made sense. Or an environment was simply passing data in a weird way. Seriously, just keep it simple. A high-level API isn't going to save you. Assume anything you build will break, and you or someone else will have to read the source for it.</p>
<p><strong>General engineering:</strong> Lots of AI researchers work out of notebooks. That doesn't fly in RL. In addition to the whole normal ML stack, RL requires you to deal with high-performance distributed simulation. The biggest innovations in PufferLib required me to get my hands dirty with asynchronous multiprocessing. Once that was done, I was able to 100x the standard training speed by writing envs from scratch in C. If you're coming from high level dev, it's much easier than you'd think. I wrote Python for 10 years. Within just a few weeks, I was as productive in C as in Python, and now I'm actually more productive. How is that possible? Well, I don't have to think about fancy performance optimization tricks. I just write the braindead loops and it's fast, done. You wouldn't believe some of the hoops I had to jump through during my PhD to get Neural MMO to run fast enough in Python.</p>
<p><strong>Write better code:</strong> This one is more personal, but I'm irrationally obsessive about code quality. In order to get better, bad code has to cause you severe mental distress. I've gone through several phases here, some of which involved me writing a lot of bad code. One thing I want to emphasize: good engineers don't use every design pattern in the book. If you learned Java, unlearn it. No abstraction is zero-cost, and first year CS students should be able to read almost all of your code. I've hit a kind of zen state where dev is pretty easy, and I'd like to think I make it easier for new contributors too.</p>

<h2>Contribute to PufferLib</h2>
<p>I'm not here to sell you courses. I wrote this mostly so new contributors would have a place to start. Many came in with zero RL experience. I spend a lot of time going through PRs and helping fill in knowledge gaps on stream or in voice chat. This all happens through the Discord. Folks usually start off by contributing to environments and then move into the science side as they get more comfortable. Major contributors even get hardware access for running experiments!</p>
</article>


<article id="post-1" class="blog-post">
<header class="post-header">
