Showing 3 changed files with 29 additions and 24 deletions.
@@ -1,16 +1,17 @@
 <!DOCTYPE html>
-<html><head><meta charset="utf-8"></meta><title>Annonated Algorithm Visualization</title><link rel="stylesheet" href="pylit.css?v=1"></link><link rel="stylesheet" href="solarized.css"></link><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css" integrity="sha384-Juol1FqnotbkyZUT5Z7gUPjQ9gzlwCENvUZTpQBAPxtusdwFLRy382PSDx5UUJ4/" crossorigin="anonymous"></link><script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.js" integrity="sha384-97gW6UIJxnlKemYavrqDHSX3SiygeOwIZhwyOKRfSaf0JWKRVj9hLASHgFTzT+0O" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous" onload="renderMathInElement(document.body);" defer="True"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/lib/codemirror.min.css"></link><script src="https://cdn.jsdelivr.net/npm/[email protected]/lib/codemirror.min.js"></script><script src="https://cdn.jsdelivr.net/npm/[email protected]/mode/python/python.min.js"></script></head><body><div class="section" id="section-0"><div class="docs doc-strings"><p><p><a href="index.html"><b>HOME<br></b></a></p></p><a href="https://github.com/opendilab/PPOxFamily" target="_blank"><img alt="GitHub" style="max-width:100%;" src="https://img.shields.io/github/stars/opendilab/PPOxFamily?style=social"></img></a> <a href="https://space.bilibili.com/1112854351?spm_id_from=333.337.0.0" target="_blank"><img alt="bilibili" style="max-width:100%;" src="https://img.shields.io/badge/bilibili-video%20course-blue"></img></a> <a href="https://twitter.com/OpenDILab" rel="nofollow" target="_blank"><img alt="twitter" style="max-width:100%;" src="https://img.shields.io/twitter/follow/opendilab?style=social"></img></a><br><a href="https://github.com/opendilab/PPOxFamily/tree/main/chapter7_tricks/dual_clip.py" target="_blank">View code on GitHub</a><br><br>PPO Dual Clip. These method limit the updates to policy, preventing it from deviating too much from its previous versions and ensuring more stable and reliable training.<br><a href="https://arxiv.org/pdf/1912.09729.pdf">Related Link</a></div></div><div class="section" id="section-1"><div class="docs doc-strings"><p> <b>Overview</b><br> Implementation of Dual Clip.<br> Arguments:<br> - logp_new (:obj:`torch.FloatTensor`): log_p calculated by old policy.<br> - logp_old (:obj:`torch.FloatTensor`): log_p calculated by new policy.<br> - adv (:obj:`torch.FloatTensor`): The advantage value.<br> - clip_ratio (:obj:`float`): The clip ratio of policy.<br> - dual_clip (:obj:`float`): The dual clip ratio of policy.<br> Returns:<br> - policy_loss (:obj:`torch.FloatTensor`): the calculated policy loss.</p></div><div class="code"><pre><code id="code_1" name="py_code">import torch
+<html><head><meta charset="utf-8"></meta><title>Annonated Algorithm Visualization</title><link rel="stylesheet" href="pylit.css?v=1"></link><link rel="stylesheet" href="solarized.css"></link><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css" integrity="sha384-Juol1FqnotbkyZUT5Z7gUPjQ9gzlwCENvUZTpQBAPxtusdwFLRy382PSDx5UUJ4/" crossorigin="anonymous"></link><script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.js" integrity="sha384-97gW6UIJxnlKemYavrqDHSX3SiygeOwIZhwyOKRfSaf0JWKRVj9hLASHgFTzT+0O" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous" onload="renderMathInElement(document.body);" defer="True"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/lib/codemirror.min.css"></link><script src="https://cdn.jsdelivr.net/npm/[email protected]/lib/codemirror.min.js"></script><script src="https://cdn.jsdelivr.net/npm/[email protected]/mode/python/python.min.js"></script></head><body><div class="section" id="section-0"><div class="docs doc-strings"><p><p><a href="index.html"><b>HOME<br></b></a></p></p><a href="https://github.com/opendilab/PPOxFamily" target="_blank"><img alt="GitHub" style="max-width:100%;" src="https://img.shields.io/github/stars/opendilab/PPOxFamily?style=social"></img></a> <a href="https://space.bilibili.com/1112854351?spm_id_from=333.337.0.0" target="_blank"><img alt="bilibili" style="max-width:100%;" src="https://img.shields.io/badge/bilibili-video%20course-blue"></img></a> <a href="https://twitter.com/OpenDILab" rel="nofollow" target="_blank"><img alt="twitter" style="max-width:100%;" src="https://img.shields.io/twitter/follow/opendilab?style=social"></img></a><br><a href="https://github.com/opendilab/PPOxFamily/tree/main/chapter7_tricks/dual_clip.py" target="_blank">View code on GitHub</a><br><br>PPO (Policy) Dual Clip.<br><br>The Dual-Clip Proximal Policy Optimization (PPO) method is designed to constrain updates to<br>the policy,effectively preventing it from diverging excessively from its preceding iterations.<br>This approach thereby ensures a more stable and reliable learning process during training.<br>For further details, please refer to the source paper: Mastering Complex Control in MOBA Games with Deep Reinforcement Learning.
+<a href="https://arxiv.org/pdf/1912.09729.pdf">Related Link</a>.</div></div><div class="section" id="section-1"><div class="docs doc-strings"><p> <b>Overview</b><br> This function implements the Proximal Policy Optimization (PPO) policy loss with dual-clip<br> mechanism, which is a variant of PPO that provides more reliable and stable training by<br> limiting the updates to the policy, preventing it from deviating too much from its previous versions.<br> Arguments:<br> - logp_new (:obj:`torch.FloatTensor`): The log probability calculated by the new policy.<br> - logp_old (:obj:`torch.FloatTensor`): The log probability calculated by the old policy.<br> - adv (:obj:`torch.FloatTensor`): The advantage value, which measures how much better an<br> action is compared to the average action at that state.<br> - clip_ratio (:obj:`float`): The clipping ratio used to limit the change of policy during an update.<br> - dual_clip (:obj:`float`): The dual clipping ratio used to further limit the change of policy during an update.<br> Returns:<br> - policy_loss (:obj:`torch.FloatTensor`): The calculated policy loss, which is the objective we<br> want to minimize for improving the policy.</p></div><div class="code"><pre><code id="code_1" name="py_code">import torch
 
 
-def ppo_dual_clip(logp_new: torch.FloatTensor, logp_old: torch.FloatTensor, adv: torch.FloatTensor, clip_ratio: float, dual_clip: float) -> torch.FloatTensor:</code></pre></div></div><div class="section" id="section-3"><div class="docs doc-strings"><p> $$r(\theta) = \frac{\pi_{new}(a|s)}{\pi_{old}(a|s)}$$</p></div><div class="code"><pre><code id="code_3" name="py_code"> ratio = torch.exp(logp_new - logp_old)</code></pre></div></div><div class="section" id="section-4"><div class="docs doc-strings"><p> $$clip_1 = min(r(\theta)*A(s,a), clip(r(\theta), 1-clip\_ratio, 1+clip\_ratio)*A(s,a))$$</p></div><div class="code"><pre><code id="code_4" name="py_code"> surr1 = ratio * adv
+def ppo_dual_clip(logp_new: torch.FloatTensor, logp_old: torch.FloatTensor, adv: torch.FloatTensor, clip_ratio: float,
+dual_clip: float) -> torch.FloatTensor:</code></pre></div></div><div class="section" id="section-3"><div class="docs doc-strings"><p> This is the ratio of the new policy probability to the old policy probability.<br> $$r(\theta) = \frac{\pi_{new}(a|s)}{\pi_{old}(a|s)}$$</p></div><div class="code"><pre><code id="code_3" name="py_code"> ratio = torch.exp(logp_new - logp_old)</code></pre></div></div><div class="section" id="section-4"><div class="docs doc-strings"><p> The first clipping operation is performed here, we limit the update to be within a certain range.<br> $$clip_1 = min(r(\theta)*A(s,a), clip(r(\theta), 1-clip\_ratio, 1+clip\_ratio)*A(s,a))$$</p></div><div class="code"><pre><code id="code_4" name="py_code"> surr1 = ratio * adv
 surr2 = ratio.clamp(1 - clip_ratio, 1 + clip_ratio) * adv
-clip1 = torch.min(surr1, surr2)</code></pre></div></div><div class="section" id="section-5"><div class="docs doc-strings"><p> $$clip_2 = max(clip_1, dual\_clip * A(s,a))$$</p></div><div class="code"><pre><code id="code_5" name="py_code"> clip2 = torch.max(clip1, dual_clip * adv)</code></pre></div></div><div class="section" id="section-6"><div class="docs doc-strings"><p> Only use dual_clip when adv < 0.</p></div><div class="code"><pre><code id="code_6" name="py_code"> policy_loss = -(torch.where(adv < 0, clip2, clip1)).mean()
+clip1 = torch.min(surr1, surr2)</code></pre></div></div><div class="section" id="section-5"><div class="docs doc-strings"><p> The second clipping operation is performed here, we further limit the update to be within a stricter range.<br> $$clip_2 = max(clip_1, dual\_clip * A(s,a))$$</p></div><div class="code"><pre><code id="code_5" name="py_code"> clip2 = torch.max(clip1, dual_clip * adv)</code></pre></div></div><div class="section" id="section-6"><div class="docs doc-strings"><p> We only apply the dual-clip when the advantage is negative, i.e., when the action is worse than the average.</p></div><div class="code"><pre><code id="code_6" name="py_code"> policy_loss = -(torch.where(adv < 0, clip2, clip1)).mean()
 return policy_loss
 
-</code></pre></div></div><div class="section" id="section-7"><div class="docs doc-strings"><p> <b>Overview</b><br> Test <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">dual_clip</span> function.</p></div><div class="code"><pre><code id="code_7" name="py_code">def test_ppo_dual_clip() -> None:</code></pre></div></div><div class="section" id="section-9"><div class="docs doc-strings"><p> Generate data, batch size is 6.</p></div><div class="code"><pre><code id="code_9" name="py_code"> B = 6
+</code></pre></div></div><div class="section" id="section-7"><div class="docs doc-strings"><p> <b>Overview</b><br> This function tests the ppo_dual_clip function. It generates some sample data, calculates the<br> policy loss using the ppo_dual_clip function, and checks if the returned value is a scalar.</p></div><div class="code"><pre><code id="code_7" name="py_code">def test_ppo_dual_clip() -> None:</code></pre></div></div><div class="section" id="section-9"><div class="docs doc-strings"><p> Generate random data for testing. The batch size is 6.</p></div><div class="code"><pre><code id="code_9" name="py_code"> B = 6
 logp_new = torch.randn(B)
 logp_old = torch.randn(B)
-adv = torch.randn(B)</code></pre></div></div><div class="section" id="section-10"><div class="docs doc-strings"><p> Calculate policy loss with policy loss.</p></div><div class="code"><pre><code id="code_10" name="py_code"> policy_loss = ppo_dual_clip(logp_new, logp_old, adv, 0.2, 0.2)</code></pre></div></div><div class="section" id="section-11"><div class="docs doc-strings"><p> The returned value is a scalar.</p></div><div class="code"><pre><code id="code_11" name="py_code"> assert policy_loss.shape == torch.Size([])
+adv = torch.randn(B)</code></pre></div></div><div class="section" id="section-10"><div class="docs doc-strings"><p> Calculate policy loss using the ppo_dual_clip function.</p></div><div class="code"><pre><code id="code_10" name="py_code"> policy_loss = ppo_dual_clip(logp_new, logp_old, adv, 0.2, 0.2)</code></pre></div></div><div class="section" id="section-11"><div class="docs doc-strings"><p> Assert that the returned policy loss is a scalar (i.e., its shape is an empty tuple).</p></div><div class="code"><pre><code id="code_11" name="py_code"> assert policy_loss.shape == torch.Size([])
 
 </code></pre></div></div><div class="section" id="section-11"><div class="docs doc-strings"><p><i>If you have any questions or advices about this documation, you can raise issues in GitHub (https://github.com/opendilab/PPOxFamily) or email us ([email protected]).</i></p></div></div></body><script type="text/javascript">
 window.onload = function(){
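
Since the Python in this commit is embedded in HTML markup, here is the same code collected into a single runnable sketch for convenience. The executable statements are identical in both versions shown in the diff (only the annotations and line wrapping change); folding the test into a __main__ guard is an arrangement added here, not part of the original file.

import torch


def ppo_dual_clip(logp_new: torch.FloatTensor, logp_old: torch.FloatTensor, adv: torch.FloatTensor,
                  clip_ratio: float, dual_clip: float) -> torch.FloatTensor:
    # r(theta) = pi_new(a|s) / pi_old(a|s), computed from log probabilities for numerical stability.
    ratio = torch.exp(logp_new - logp_old)
    # Standard PPO clipping: clip_1 = min(r * A, clamp(r, 1 - clip_ratio, 1 + clip_ratio) * A).
    surr1 = ratio * adv
    surr2 = ratio.clamp(1 - clip_ratio, 1 + clip_ratio) * adv
    clip1 = torch.min(surr1, surr2)
    # Dual clip: clip_2 = max(clip_1, dual_clip * A), a lower bound on the clipped objective.
    clip2 = torch.max(clip1, dual_clip * adv)
    # The dual clip is applied only where the advantage is negative.
    policy_loss = -(torch.where(adv < 0, clip2, clip1)).mean()
    return policy_loss


def test_ppo_dual_clip() -> None:
    # Random data with batch size 6, as in the test shown in the diff.
    B = 6
    logp_new = torch.randn(B)
    logp_old = torch.randn(B)
    adv = torch.randn(B)
    policy_loss = ppo_dual_clip(logp_new, logp_old, adv, 0.2, 0.2)
    # The returned policy loss is a scalar.
    assert policy_loss.shape == torch.Size([])


if __name__ == "__main__":
    test_ppo_dual_clip()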