Commit

polish(nyz): polish ch7 docs
PaParaZz1 committed Jul 24, 2023
1 parent 4c996af commit def9d68
Showing 3 changed files with 29 additions and 24 deletions.
11 changes: 6 additions & 5 deletions docs/dual_clip.html
@@ -1,16 +1,17 @@
PPO (Policy) Dual Clip
View code on GitHub: https://github.com/opendilab/PPOxFamily/tree/main/chapter7_tricks/dual_clip.py

The Dual-Clip Proximal Policy Optimization (PPO) method is designed to constrain updates to the policy, effectively preventing it from diverging excessively from its preceding iterations. This approach ensures a more stable and reliable learning process during training. For further details, please refer to the source paper "Mastering Complex Control in MOBA Games with Deep Reinforcement Learning" (https://arxiv.org/pdf/1912.09729.pdf).

Overview:
    This function implements the Proximal Policy Optimization (PPO) policy loss with the dual-clip mechanism, a variant of PPO that provides more reliable and stable training by limiting updates to the policy, preventing it from deviating too much from its previous versions.
Arguments:
    - logp_new (:obj:`torch.FloatTensor`): The log probability calculated by the new policy.
    - logp_old (:obj:`torch.FloatTensor`): The log probability calculated by the old policy.
    - adv (:obj:`torch.FloatTensor`): The advantage value, which measures how much better an action is compared to the average action at that state.
    - clip_ratio (:obj:`float`): The clipping ratio used to limit the change of the policy during an update.
    - dual_clip (:obj:`float`): The dual clipping ratio used to further limit the change of the policy during an update.
Returns:
    - policy_loss (:obj:`torch.FloatTensor`): The calculated policy loss, which is the objective we want to minimize to improve the policy.
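For reference, the full dual-clip objective that the annotated code below assembles step by step can be summarized as follows. This consolidated form is our own restatement of the per-section formulas, following the source paper:

$$r(\theta) = \frac{\pi_{new}(a|s)}{\pi_{old}(a|s)}, \qquad clip_1 = min(r(\theta)*A(s,a),\ clip(r(\theta), 1-clip\_ratio, 1+clip\_ratio)*A(s,a))$$

$$J(\theta) = \begin{cases} max(clip_1,\ dual\_clip * A(s,a)) & \text{if } A(s,a) < 0 \\ clip_1 & \text{otherwise} \end{cases}$$

The returned policy_loss is the negative batch mean of J(θ).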


import torch


def ppo_dual_clip(logp_new: torch.FloatTensor, logp_old: torch.FloatTensor, adv: torch.FloatTensor, clip_ratio: float,
                  dual_clip: float) -> torch.FloatTensor:

This is the ratio of the new policy probability to the old policy probability.
$$r(\theta) = \frac{\pi_{new}(a|s)}{\pi_{old}(a|s)}$$

    ratio = torch.exp(logp_new - logp_old)

The first clipping operation is performed here: we limit the update to lie within a certain range around the old policy.
$$clip_1 = min(r(\theta)*A(s,a), clip(r(\theta), 1-clip\_ratio, 1+clip\_ratio)*A(s,a))$$

    surr1 = ratio * adv
    surr2 = ratio.clamp(1 - clip_ratio, 1 + clip_ratio) * adv
    clip1 = torch.min(surr1, surr2)

The second clipping operation is performed here: the objective is additionally bounded from below by dual_clip * A(s,a), which prevents an excessively large update when the probability ratio is large and the advantage is negative.
$$clip_2 = max(clip_1, dual\_clip * A(s,a))$$

    clip2 = torch.max(clip1, dual_clip * adv)

We only apply the dual clip when the advantage is negative, i.e., when the action is worse than the average.

    policy_loss = -(torch.where(adv < 0, clip2, clip1)).mean()
    return policy_loss
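To make the effect of the dual clip concrete, here is a small hand-checked example (not part of the original file; the numbers are chosen only for illustration). With a probability ratio of 5, an advantage of -1, clip_ratio = 0.2 and dual_clip = 3, the standard clipped objective would contribute -5, while the dual clip floors it at -3, so the resulting loss is 3 instead of 5:

import math

import torch

# Hypothetical single-sample example: ratio = exp(logp_new - logp_old) = 5, advantage = -1.
logp_new = torch.tensor([math.log(5.0)])
logp_old = torch.tensor([0.0])
adv = torch.tensor([-1.0])

# Standard PPO clip keeps min(5 * -1, 1.2 * -1) = -5, so the loss would be 5.
# Dual clip takes max(-5, 3 * -1) = -3, so the loss becomes 3.
loss = ppo_dual_clip(logp_new, logp_old, adv, clip_ratio=0.2, dual_clip=3.0)
assert torch.isclose(loss, torch.tensor(3.0))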

Overview:
    This function tests the ppo_dual_clip function. It generates some sample data, calculates the policy loss with ppo_dual_clip, and checks that the returned value is a scalar.

def test_ppo_dual_clip() -> None:

Generate random data for testing. The batch size is 6.

    B = 6
    logp_new = torch.randn(B)
    logp_old = torch.randn(B)
    adv = torch.randn(B)

Calculate the policy loss using the ppo_dual_clip function.

    policy_loss = ppo_dual_clip(logp_new, logp_old, adv, 0.2, 0.2)

Assert that the returned policy loss is a scalar (i.e., its shape is torch.Size([])).

    assert policy_loss.shape == torch.Size([])
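As an additional sanity check (our own addition, not in the original test): when every advantage is positive, the dual-clip branch is never selected, so the loss should be independent of the dual_clip value.

    adv_pos = torch.rand(B) + 0.1  # strictly positive advantages
    loss_a = ppo_dual_clip(logp_new, logp_old, adv_pos, 0.2, 3.0)
    loss_b = ppo_dual_clip(logp_new, logp_old, adv_pos, 0.2, 10.0)
    # With adv > 0 everywhere, torch.where selects clip1, which does not depend on dual_clip.
    assert torch.isclose(loss_a, loss_b)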

If you have any questions or advice about this documentation, you can raise issues on GitHub (https://github.com/opendilab/PPOxFamily) or email us ([email protected]).