From 829c246d5a85da9395be18107df5522831239d3b Mon Sep 17 00:00:00 2001
From: niuyazhe <niuyazhe@sensetime.com>
Date: Mon, 24 Jul 2023 15:15:58 +0800
Subject: [PATCH] doc(nyz): add ch4 popart doc

---
 docs/index.html  |  2 +-
 docs/popart.html | 98 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+), 1 deletion(-)
 create mode 100644 docs/popart.html
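
Note on the ART and POP steps documented in docs/popart.html: they can be checked by hand on a single scalar output. The sketch below is a plain-Python illustration under stated assumptions, not code from this patch. `popart_update` and `unnormalized` are hypothetical helper names, `beta=0.5` is an assumed value, and the variance is floored at zero before the square root (the documented layer instead replaces NaN results with the old sigma).

```python
import math


def popart_update(w, b, mu, v, sigma, targets, beta=0.5):
    """One Pop-Art step for a scalar output head (hypothetical helper)."""
    # ART: soft-update the first and second moments of the targets.
    batch_mean = sum(targets) / len(targets)
    batch_sq_mean = sum(t * t for t in targets) / len(targets)
    new_mu = (1 - beta) * mu + beta * batch_mean
    new_v = (1 - beta) * v + beta * batch_sq_mean
    # sigma = sqrt(v - mu^2); floor at 0 (assumption) and clamp to [1e-4, 1e6].
    new_sigma = math.sqrt(max(new_v - new_mu ** 2, 0.0))
    new_sigma = min(max(new_sigma, 1e-4), 1e6)
    # POP: rescale weight and bias so that sigma * (w * x + b) + mu is unchanged.
    new_w = w * sigma / new_sigma
    new_b = (sigma * b + mu - new_mu) / new_sigma
    return new_w, new_b, new_mu, new_v, new_sigma


def unnormalized(w, b, mu, sigma, x):
    """Map the normalized head output back to the original return scale."""
    return sigma * (w * x + b) + mu


# Start from the buffer defaults (mu=0, sigma=1, v=1) and feed large returns.
w, b, mu, v, sigma = 0.3, -0.1, 0.0, 1.0, 1.0
before = unnormalized(w, b, mu, sigma, 2.0)
w, b, mu, v, sigma = popart_update(w, b, mu, v, sigma, [100.0, 120.0, 80.0])
after = unnormalized(w, b, mu, sigma, 2.0)
print(before, after)  # the two values agree up to floating-point error
```

The statistics move toward the scale of the returns while the unnormalized prediction stays put, which is why the statistics update can run separately from the gradient step.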
diff --git a/docs/index.html b/docs/index.html
index 7b797d6..8adb145 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -1,5 +1,5 @@
 <!DOCTYPE html>
-<html><head><meta charset="utf-8"></meta><title>Annonated Algorithm Visualization</title><link rel="stylesheet" href="pylit.css?v=1"></link><link rel="stylesheet" href="solarized.css"></link><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.css" integrity="sha384-Juol1FqnotbkyZUT5Z7gUPjQ9gzlwCENvUZTpQBAPxtusdwFLRy382PSDx5UUJ4/" crossorigin="anonymous"></link><script src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.js" integrity="sha384-97gW6UIJxnlKemYavrqDHSX3SiygeOwIZhwyOKRfSaf0JWKRVj9hLASHgFTzT+0O" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous" onload="renderMathInElement(document.body);" defer="True"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/lib/codemirror.min.css"></link><script src="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/lib/codemirror.min.js"></script><script src="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/mode/python/python.min.js"></script></head><body><div class="section" id="section0"><div class="docs doc-strings"><p><a href="index.html"><b>HOME</b></a></p><a href="https://github.com/opendilab/PPOxFamily" target="_blank"><img alt="GitHub" style="max-width:100%;" src="https://img.shields.io/github/stars/opendilab/PPOxFamily?style=social"></img></a>  <a href="https://space.bilibili.com/1112854351?spm_id_from=333.337.0.0" target="_blank"><img alt="bilibili" style="max-width:100%;" src="https://img.shields.io/badge/bilibili-video%20course-blue"></img></a>  <a href="https://twitter.com/OpenDILab" rel="nofollow" target="_blank"><img alt="twitter" style="max-width:100%;" src="https://img.shields.io/twitter/follow/opendilab?style=social"></img></a><br><a href="https://github.com/opendilab/PPOxFamily" target="_blank">View code on GitHub</a></div></div><div class="section" 
id="section1"><div class="docs doc-strings"><h1><a href="https://github.com/opendilab/PPOxFamily">PPO × Family PyTorch 注解文档</a></h1><img alt="logo" src="./imgs/ppof_logo.png"></img><p>作为 PPO × Family 决策智能入门公开课的“算法-代码”注解文档，力求发掘 PPO 算法的每一个细节，帮助读者快速掌握设计决策人工智能的万能钥匙。</p></div></div><div class="section" id="section1"><div class="docs doc-strings"><h2>各章节代码解读示例目录</h2><h4>开启决策 AI 探索之旅</h4><li><a href="./pg_zh.html">策略梯度（PG）算法核心代码</a>  |  <a href="./pg.html">Policy Gradient core loss function</a></li><li><a href="./a2c_zh.html">A2C 算法核心代码</a>  |  <a href="./a2c.html">A2C core loss function</a></li><li><a href="./ppo_zh.html">PPO 算法核心代码</a>  |  <a href="./ppo.html">PPO core loss function</a></li><br><h4>解构复杂动作空间</h4><li><a href="./discrete_zh.html">PPO 建模离散动作空间</a>  |  <a href="./discrete.html">PPO in discrete action space</a></li><li><a href="./continuous_zh.html">PPO 建模连续动作空间</a>  |  <a href="./continuous.html">PPO in continuous action space</a></li><li><a href="./hybrid_zh.html">PPO 建模混合动作空间</a>  |  <a href="./hybrid.html">PPO in hybrid action space</a></li><br><h4>表征多模态观察空间</h4><li><a href="./encoding_zh.html">特征编码的各种技巧</a>  |  <a href="./encoding.html">Encoding methods for vector obs space</a></li><li><a href="./mario_wrapper_zh.html">图片动作空间的各类环境包装器</a>  |  <a href="./mario_wrapper.html">Env wrappers for image obs space</a></li><li><a href="./gradient_zh.html">神经网络梯度计算的代码解析</a>  |  <a href="./gradient.html">Automatic gradient mechanism</a></li><br><h4>统筹多智能体</h4><li><a href="./marl_network.html">Multi-Agent cooperation network</a></li><li><a href="./independentpg.html">Independent policy gradient training</a></li><li><a href="./mapg.html">Multi-Agent policy gradient training</a></li><li><a href="./mappo.html">Multi-Agent PPO training</a></li><br><h4>挖掘黑科技</h4><li><a href="./gae.html">GAE technique used in PPO</a></li><li><a href="./recompute.html">Recompute adv trick used in PPO</a></li><li><a href="./grad_clip_norm_zh.html">PPO 中使用的梯度范数裁剪</a>  |  <a 
href="./grad_clip_norm.html">Gradient norm clip trick used in PPO</a></li><li><a href="./grad_clip_value.html">Gradient value clip trick used in PPO</a></li><li><a href="./grad_ignore.html">Gradient ignore trick used in PPO</a></li><li><a href="./orthogonal_init.html">Orthogonal initialization of networks used in PPO</a></li><li><a href="./dual_clip.html">Dual clip trick used in PPO</a></li><li><a href="./value_clip.html">Value clip trick used in PPO</a></li></div></div><div class="section" id="section-final"><div class="docs doc-strings"><p><i>如果读者关于本文档有任何问题和建议，可以在 GitHub 提 issue 或是直接发邮件给我们 (opendilab@pjlab.org.cn) 。</i></p></div></div></body><script type="text/javascript">
+<html><head><meta charset="utf-8"></meta><title>Annonated Algorithm Visualization</title><link rel="stylesheet" href="pylit.css?v=1"></link><link rel="stylesheet" href="solarized.css"></link><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.css" integrity="sha384-Juol1FqnotbkyZUT5Z7gUPjQ9gzlwCENvUZTpQBAPxtusdwFLRy382PSDx5UUJ4/" crossorigin="anonymous"></link><script src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.js" integrity="sha384-97gW6UIJxnlKemYavrqDHSX3SiygeOwIZhwyOKRfSaf0JWKRVj9hLASHgFTzT+0O" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous" onload="renderMathInElement(document.body);" defer="True"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/lib/codemirror.min.css"></link><script src="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/lib/codemirror.min.js"></script><script src="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/mode/python/python.min.js"></script></head><body><div class="section" id="section0"><div class="docs doc-strings"><p><a href="index.html"><b>HOME</b></a></p><a href="https://github.com/opendilab/PPOxFamily" target="_blank"><img alt="GitHub" style="max-width:100%;" src="https://img.shields.io/github/stars/opendilab/PPOxFamily?style=social"></img></a>  <a href="https://space.bilibili.com/1112854351?spm_id_from=333.337.0.0" target="_blank"><img alt="bilibili" style="max-width:100%;" src="https://img.shields.io/badge/bilibili-video%20course-blue"></img></a>  <a href="https://twitter.com/OpenDILab" rel="nofollow" target="_blank"><img alt="twitter" style="max-width:100%;" src="https://img.shields.io/twitter/follow/opendilab?style=social"></img></a><br><a href="https://github.com/opendilab/PPOxFamily" target="_blank">View code on GitHub</a></div></div><div class="section" 
id="section1"><div class="docs doc-strings"><h1><a href="https://github.com/opendilab/PPOxFamily">PPO × Family PyTorch 注解文档</a></h1><img alt="logo" src="./imgs/ppof_logo.png"></img><p>作为 PPO × Family 决策智能入门公开课的“算法-代码”注解文档，力求发掘 PPO 算法的每一个细节，帮助读者快速掌握设计决策人工智能的万能钥匙。</p></div></div><div class="section" id="section1"><div class="docs doc-strings"><h2>各章节代码解读示例目录</h2><h4>开启决策 AI 探索之旅</h4><li><a href="./pg_zh.html">策略梯度（PG）算法核心代码</a>  |  <a href="./pg.html">Policy Gradient core loss function</a></li><li><a href="./a2c_zh.html">A2C 算法核心代码</a>  |  <a href="./a2c.html">A2C core loss function</a></li><li><a href="./ppo_zh.html">PPO 算法核心代码</a>  |  <a href="./ppo.html">PPO core loss function</a></li><br><h4>解构复杂动作空间</h4><li><a href="./discrete_zh.html">PPO 建模离散动作空间</a>  |  <a href="./discrete.html">PPO in discrete action space</a></li><li><a href="./continuous_zh.html">PPO 建模连续动作空间</a>  |  <a href="./continuous.html">PPO in continuous action space</a></li><li><a href="./hybrid_zh.html">PPO 建模混合动作空间</a>  |  <a href="./hybrid.html">PPO in hybrid action space</a></li><br><h4>表征多模态观察空间</h4><li><a href="./encoding_zh.html">特征编码的各种技巧</a>  |  <a href="./encoding.html">Encoding methods for vector obs space</a></li><li><a href="./mario_wrapper_zh.html">图片动作空间的各类环境包装器</a>  |  <a href="./mario_wrapper.html">Env wrappers for image obs space</a></li><li><a href="./gradient_zh.html">神经网络梯度计算的代码解析</a>  |  <a href="./gradient.html">Automatic gradient mechanism</a></li><br><h4>解密稀疏奖励空间</h4><li><a href="./popart.html">Pop-Art normalization trick used in PPO</a></li><br><h4>统筹多智能体</h4><li><a href="./marl_network.html">Multi-Agent cooperation network</a></li><li><a href="./independentpg.html">Independent policy gradient training</a></li><li><a href="./mapg.html">Multi-Agent policy gradient training</a></li><li><a href="./mappo.html">Multi-Agent PPO training</a></li><br><h4>挖掘黑科技</h4><li><a href="./gae.html">GAE technique used in PPO</a></li><li><a href="./recompute.html">Recompute adv trick used in 
PPO</a></li><li><a href="./grad_clip_norm_zh.html">PPO 中使用的梯度范数裁剪</a>  |  <a href="./grad_clip_norm.html">Gradient norm clip trick used in PPO</a></li><li><a href="./grad_clip_value.html">Gradient value clip trick used in PPO</a></li><li><a href="./grad_ignore.html">Gradient ignore trick used in PPO</a></li><li><a href="./orthogonal_init.html">Orthogonal initialization of networks used in PPO</a></li><li><a href="./dual_clip.html">Dual clip trick used in PPO</a></li><li><a href="./value_clip.html">Value clip trick used in PPO</a></li></div></div><div class="section" id="section-final"><div class="docs doc-strings"><p><i>如果读者关于本文档有任何问题和建议，可以在 GitHub 提 issue 或是直接发邮件给我们 (opendilab@pjlab.org.cn) 。</i></p></div></div></body><script type="text/javascript">
 window.onload = function(){
     var codeElement = document.getElementsByName('py_code');
     var lineCount = 1;
diff --git a/docs/popart.html b/docs/popart.html
new file mode 100644
index 0000000..fa3e18e
--- /dev/null
+++ b/docs/popart.html
@@ -0,0 +1,98 @@
+<!DOCTYPE html>
+<html><head><meta charset="utf-8"></meta><title>Annonated Algorithm Visualization</title><link rel="stylesheet" href="pylit.css?v=1"></link><link rel="stylesheet" href="solarized.css"></link><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.css" integrity="sha384-Juol1FqnotbkyZUT5Z7gUPjQ9gzlwCENvUZTpQBAPxtusdwFLRy382PSDx5UUJ4/" crossorigin="anonymous"></link><script src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.js" integrity="sha384-97gW6UIJxnlKemYavrqDHSX3SiygeOwIZhwyOKRfSaf0JWKRVj9hLASHgFTzT+0O" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous" onload="renderMathInElement(document.body);" defer="True"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/lib/codemirror.min.css"></link><script src="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/lib/codemirror.min.js"></script><script src="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/mode/python/python.min.js"></script></head><body><div class="section" id="section-0"><div class="docs doc-strings"><p><p><a href="index.html"><b>HOME<br></b></a></p></p><a href="https://github.com/opendilab/PPOxFamily" target="_blank"><img alt="GitHub" style="max-width:100%;" src="https://img.shields.io/github/stars/opendilab/PPOxFamily?style=social"></img></a>  <a href="https://space.bilibili.com/1112854351?spm_id_from=333.337.0.0" target="_blank"><img alt="bilibili" style="max-width:100%;" src="https://img.shields.io/badge/bilibili-video%20course-blue"></img></a>  <a href="https://twitter.com/OpenDILab" rel="nofollow" target="_blank"><img alt="twitter" style="max-width:100%;" src="https://img.shields.io/twitter/follow/opendilab?style=social"></img></a><br><a href="https://github.com/opendilab/PPOxFamily/tree/main/chapter4_reward/popart.py" target="_blank">View code on 
GitHub</a><br><br>PyTorch implementation of the <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">Pop-Art</span> algorithm for adaptive normalization.<br><a href="https://arxiv.org/abs/1602.07714">Related Link</a><br><br>Pop-Art is an adaptive normalization algorithm that normalizes the targets used in the learning updates.<br>It can be used for value normalization in the PPO algorithm to address the multi-magnitude reward problem.<br><br>The two main components in Pop-Art are:<br>- <b>ART</b> to update scale and shift such that the return is appropriately normalized<br>- <b>POP</b> to preserve the outputs of the unnormalized function when we change the scale and shift.</div></div><div class="section" id="section-1"><div class="docs doc-strings"><p>    <b>Overview</b><br>        The definition of the Pop-Art layer, i.e., a linear layer with popart normalization, which should be<br>        used as the last layer of a network.<br>        For more information, you can refer to the paper <a href="https://arxiv.org/abs/1809.04474">Related Link</a></p></div><div class="code"><pre><code id="code_1" name="py_code">import pickle
+import math
+import torch
+import torch.nn as nn
+import treetensor.torch as ttorch
+from torch.optim import AdamW
+from torch.utils.data import DataLoader
+
+
+class PopArt(nn.Module):</code></pre></div></div><div class="section" id="section-3"><div class="docs doc-strings"><p>        PyTorch necessary requirements for extending <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">nn.Module</span> . Our network should also subclass this class.</p></div><div class="code"><pre><code id="code_3" name="py_code">        super(PopArt, self).__init__()
+</code></pre></div></div><div class="section" id="section-4"><div class="docs doc-strings"><p>        Define soft-update parameter beta.</p></div><div class="code"><pre><code id="code_4" name="py_code">        self.beta = beta</code></pre></div></div><div class="section" id="section-5"><div class="docs doc-strings"><p>        Define the input and output feature dimensions of the linear layer.</p></div><div class="code"><pre><code id="code_5" name="py_code">        self.input_features = input_features
+        self.output_features = output_features</code></pre></div></div><div class="section" id="section-6"><div class="docs doc-strings"><p>        Initialize the linear layer parameters, weight and bias.</p></div><div class="code"><pre><code id="code_6" name="py_code">        self.weight = nn.Parameter(torch.Tensor(output_features, input_features))
+        self.bias = nn.Parameter(torch.Tensor(output_features))</code></pre></div></div><div class="section" id="section-7"><div class="docs doc-strings"><p>        Register buffers for the normalization parameters, which are not model parameters.<br>        Tensors registered as buffers do not take part in gradient propagation but are still<br>        saved in state_dict.<br>        The normalization parameters are used later to store the scale and shift of the target value.</p></div><div class="code"><pre><code id="code_7" name="py_code">        self.register_buffer('mu', torch.zeros(output_features, requires_grad=False))
+        self.register_buffer('sigma', torch.ones(output_features, requires_grad=False))
+        self.register_buffer('v', torch.ones(output_features, requires_grad=False))
+</code></pre></div></div><div class="section" id="section-8"><div class="docs doc-strings"><p>        Reset the learned parameters, i.e., weight and bias.</p></div><div class="code"><pre><code id="code_8" name="py_code">        self.reset_parameters()
+</code></pre></div></div><div class="section" id="section-9"><div class="docs doc-strings"><p>        <b>Overview</b><br>            Parameter initialization of the linear layer (i.e., weight and bias).</p></div><div class="code"><pre><code id="code_9" name="py_code">    def reset_parameters(self) -> None:</code></pre></div></div><div class="section" id="section-11"><div class="docs doc-strings"><p>        Kaiming initialization scales the weights so that the variance of the activations stays roughly constant across layers,<br>        which helps avoid the vanishing and exploding gradient problems of deep models.<br>        Specifically, the Kaiming initialization function is as follows:<br>        $$std = \sqrt{\frac{2}{(1+a^2)\times fan\_in}}$$<br>        where a is the negative slope of the rectifier used after this layer (0 for ReLU by default),<br>        and fan_in is the number of input features.<br>        For more information on Kaiming initialization, you can refer to the paper:<br>        <a href="https://arxiv.org/pdf/1502.01852.pdf">Related Link</a></p></div><div class="code"><pre><code id="code_11" name="py_code">        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
+        if self.bias is not None:
+            fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight)
+            bound = 1 / math.sqrt(fan_in)
+            nn.init.uniform_(self.bias, -bound, bound)
+</code></pre></div></div><div class="section" id="section-12"><div class="docs doc-strings"><p>        <b>Overview</b><br>            The computation graph of the linear layer with the popart mechanism, which returns both the unnormalized output and the normalized output of the layer.</p></div><div class="code"><pre><code id="code_12" name="py_code">    def forward(self, x: torch.Tensor) -> ttorch.Tensor:</code></pre></div></div><div class="section" id="section-14"><div class="docs doc-strings"><p>        Execute the linear layer computation $$y=Wx+b$$. Note that we use expand and broadcast to add the bias.</p></div><div class="code"><pre><code id="code_14" name="py_code">        normalized_output = x.mm(self.weight.t())
+        normalized_output += self.bias.unsqueeze(0).expand_as(normalized_output)</code></pre></div></div><div class="section" id="section-15"><div class="docs doc-strings"><p>        Unnormalize the output for more convenient usage.</p></div><div class="code"><pre><code id="code_15" name="py_code">        with torch.no_grad():
+            output = normalized_output * self.sigma + self.mu
+
+        return ttorch.as_tensor({'output': output, 'normalized_output': normalized_output})
+</code></pre></div></div><div class="section" id="section-16"><div class="docs doc-strings"><p>        <b>Overview</b><br>            The parameter update defined in Pop-Art, which returns the updated mean and standard deviation of the target value.</p></div><div class="code"><pre><code id="code_16" name="py_code">    def update_parameters(self, value: torch.Tensor) -> ttorch.Tensor:</code></pre></div></div><div class="section" id="section-18"><div class="docs doc-strings"><p>        Tensor device conversion of the normalization parameters.</p></div><div class="code"><pre><code id="code_18" name="py_code">        self.mu = self.mu.to(value.device)
+        self.sigma = self.sigma.to(value.device)
+        self.v = self.v.to(value.device)
+</code></pre></div></div><div class="section" id="section-19"><div class="docs doc-strings"><p>        Store the old normalization parameters for later usage.</p></div><div class="code"><pre><code id="code_19" name="py_code">        old_mu = self.mu
+        old_std = self.sigma</code></pre></div></div><div class="section" id="section-20"><div class="docs doc-strings"><p>        Calculate the first and second moments (mean and mean of squares) of the target value over a batch of size B:<br>        $$\mu = \frac{1}{B}\sum_{t=1}^{B} G_t$$<br>        $$v = \frac{1}{B}\sum_{t=1}^{B} G_t^2$$.</p></div><div class="code"><pre><code id="code_20" name="py_code">        batch_mean = torch.mean(value, 0)
+        batch_v = torch.mean(torch.pow(value, 2), 0)</code></pre></div></div><div class="section" id="section-21"><div class="docs doc-strings"><p>        Replace NaN values with the old values for more stable training.</p></div><div class="code"><pre><code id="code_21" name="py_code">        batch_mean[torch.isnan(batch_mean)] = self.mu[torch.isnan(batch_mean)]
+        batch_v[torch.isnan(batch_v)] = self.v[torch.isnan(batch_v)]</code></pre></div></div><div class="section" id="section-22"><div class="docs doc-strings"><p>        Soft update the normalization parameters according to:<br>        $$\mu_t = (1-\beta)\mu_{t-1}+\beta G^v_t$$<br>        $$v_t = (1-\beta)v_{t-1}+\beta(G^v_t)^2$$.</p></div><div class="code"><pre><code id="code_22" name="py_code">        batch_mean = (1 - self.beta) * self.mu + self.beta * batch_mean
+        batch_v = (1 - self.beta) * self.v + self.beta * batch_v</code></pre></div></div><div class="section" id="section-23"><div class="docs doc-strings"><p>        Calculate the standard deviation from the mean and the second moment:<br>        $$\sigma = \sqrt{v-\mu^2}$$</p></div><div class="code"><pre><code id="code_23" name="py_code">        batch_std = torch.sqrt(batch_v - (batch_mean ** 2))</code></pre></div></div><div class="section" id="section-24"><div class="docs doc-strings"><p>        Clip the standard deviation to reject outlier data.</p></div><div class="code"><pre><code id="code_24" name="py_code">        batch_std = torch.clamp(batch_std, min=1e-4, max=1e+6)</code></pre></div></div><div class="section" id="section-25"><div class="docs doc-strings"><p>        Replace NaN values with the old values.</p></div><div class="code"><pre><code id="code_25" name="py_code">        batch_std[torch.isnan(batch_std)] = self.sigma[torch.isnan(batch_std)]
+</code></pre></div></div><div class="section" id="section-26"><div class="docs doc-strings"><p>        Update the normalization parameters.</p></div><div class="code"><pre><code id="code_26" name="py_code">        self.mu = batch_mean
+        self.v = batch_v
+        self.sigma = batch_std</code></pre></div></div><div class="section" id="section-27"><div class="docs doc-strings"><p>        Update weight and bias with mean and standard deviation to preserve unnormalised outputs:<br>        $$w'_i = \frac{\sigma_i}{\sigma'_i}w_i$$<br>        $$b'_i = \frac{\sigma_i b_i + \mu_i-\mu'_i}{\sigma'_i}$$</p></div><div class="code"><pre><code id="code_27" name="py_code">        self.weight.data = (self.weight.t() * old_std / self.sigma).t()
+        self.bias.data = (old_std * self.bias + old_mu - self.mu) / self.sigma
+</code></pre></div></div><div class="section" id="section-28"><div class="docs doc-strings"><p>        Return treetensor-type statistics.</p></div><div class="code"><pre><code id="code_28" name="py_code">        return ttorch.as_tensor({'new_mean': batch_mean, 'new_std': batch_std})
+
+</code></pre></div></div><div class="section" id="section-29"><div class="docs doc-strings"><p>        <b>Overview</b><br>            An MLP network with popart as the final layer.<br>            Input: observations and actions<br>            Output: Estimated Q value<br>            <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">cat(obs,actions) -> encoder -> popart</span> .</p></div><div class="code"><pre><code id="code_29" name="py_code">class MLP(nn.Module):
+
+    def __init__(self, obs_shape: int, action_shape: int) -> None:</code></pre></div></div><div class="section" id="section-31"><div class="docs doc-strings"><p>        Initialize the parent <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">nn.Module</span> , then define the encoder and popart layer.<br>        Here we use an MLP with two layers and ReLU as the activation function. The final layer is the popart layer.</p></div><div class="code"><pre><code id="code_31" name="py_code">        super(MLP, self).__init__()
+        self.encoder = nn.Sequential(
+            nn.Linear(obs_shape + action_shape, 16),
+            nn.ReLU(),
+            nn.Linear(16, 32),
+            nn.ReLU(),
+        )
+        self.popart = PopArt(32, 1)
+</code></pre></div></div><div class="section" id="section-32"><div class="docs doc-strings"><p>        <b>Overview</b><br>            Forward computation of the MLP network with popart layer.</p></div><div class="code"><pre><code id="code_32" name="py_code">    def forward(self, obs: torch.Tensor, actions: torch.Tensor) -> ttorch.Tensor:</code></pre></div></div><div class="section" id="section-34"><div class="docs doc-strings"><p>        First concatenate the observation vectors and actions,<br>        then let the encoder map the result to an embedding vector.</p></div><div class="code"><pre><code id="code_34" name="py_code">        x = torch.cat((obs, actions), 1)
+        x = self.encoder(x)</code></pre></div></div><div class="section" id="section-35"><div class="docs doc-strings"><p>        The popart layer maps the embedding vector to a normalized value.</p></div><div class="code"><pre><code id="code_35" name="py_code">        x = self.popart(x)
+        return x
+
+</code></pre></div></div><div class="section" id="section-36"><div class="docs doc-strings"><p>    <b>Overview</b><br>        Example training function that uses the MLP network with a Pop-Art layer for fixed Q value approximation.</p></div><div class="code"><pre><code id="code_36" name="py_code">def train(obs_shape: int, action_shape: int, NUM_EPOCH: int, train_data):</code></pre></div></div><div class="section" id="section-38"><div class="docs doc-strings"><p>    Define the MLP network, optimizer, and loss function.</p></div><div class="code"><pre><code id="code_38" name="py_code">    model = MLP(obs_shape, action_shape)
+    optimizer = AdamW(model.parameters(), lr=0.0001, weight_decay=0.0001)
+    MSEloss = nn.MSELoss()</code></pre></div></div><div class="section" id="section-39"><div class="docs doc-strings"><p>    Read the preprocessed data of a trained agent on the LunarLander environment.<br>    Each sample in the dataset should be a dict with the following format:<br>    $$key\quad dim$$<br>    $$observations\quad (*,8)$$<br>    $$actions\quad (*,)$$<br>    $$returns\quad (*,)$$<br>    where returns is the discounted return from the current state.</p></div><div class="code"><pre><code id="code_39" name="py_code">    train_data = DataLoader(train_data, batch_size=64, shuffle=True)
+</code></pre></div></div><div class="section" id="section-40"><div class="docs doc-strings"><p>    For loop 1: train MLP network for <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">NUM_EPOCH</span> epochs.</p></div><div class="code"><pre><code id="code_40" name="py_code">    running_loss = 0.0
+    for epoch in range(NUM_EPOCH):</code></pre></div></div><div class="section" id="section-41"><div class="docs doc-strings"><p>        For loop 2: Inside each epoch, split the entire dataset into mini-batches, then train on each mini-batch.</p></div><div class="code"><pre><code id="code_41" name="py_code">        for idx, data in enumerate(train_data):</code></pre></div></div><div class="section" id="section-42"><div class="docs doc-strings"><p>            Compute the original output and the normalized output.</p></div><div class="code"><pre><code id="code_42" name="py_code">            output = model(data['observations'], data['actions'])
+            mu = model.popart.mu
+            sigma = model.popart.sigma</code></pre></div></div><div class="section" id="section-43"><div class="docs doc-strings"><p>            Normalize the target return to align with the normalized Q value.</p></div><div class="code"><pre><code id="code_43" name="py_code">            with torch.no_grad():
+                normalized_return = (data['returns'] - mu) / sigma</code></pre></div></div><div class="section" id="section-44"><div class="docs doc-strings"><p>            The loss is calculated as the MSE loss between normalized Q value and normalized target return.</p></div><div class="code"><pre><code id="code_44" name="py_code">            loss = MSEloss(output.normalized_output, normalized_return)</code></pre></div></div><div class="section" id="section-45"><div class="docs doc-strings"><p>            Loss backward and optimizer update step.</p></div><div class="code"><pre><code id="code_45" name="py_code">            optimizer.zero_grad()
+            loss.backward()
+            optimizer.step()</code></pre></div></div><div class="section" id="section-46"><div class="docs doc-strings"><p>            After the model parameters are updated with the gradient,<br>            the weights and bias should be updated to preserve unnormalised outputs.</p></div><div class="code"><pre><code id="code_46" name="py_code">            model.popart.update_parameters(data['returns'])
+</code></pre></div></div><div class="section" id="section-47"><div class="docs doc-strings"><p>            Use <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">item</span> method to get the pure Python scalar of the loss, then add it into <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">running_loss</span> .</p></div><div class="code"><pre><code id="code_47" name="py_code">            running_loss += loss.item()
+</code></pre></div></div><div class="section" id="section-48"><div class="docs doc-strings"><p>        Print the loss every 100 epochs.</p></div><div class="code"><pre><code id="code_48" name="py_code">        if epoch % 100 == 99:
+            print('Epoch [%d] loss: %.6f' % (epoch + 1, running_loss / 100))
+            running_loss = 0.0
+
+</code></pre></div></div><div class="section" id="section-48"><div class="docs doc-strings"><p><i>If you have any questions or suggestions about this documentation, you can raise an issue on GitHub (https://github.com/opendilab/PPOxFamily) or email us (opendilab@pjlab.org.cn).</i></p></div></div></body><script type="text/javascript">
+window.onload = function(){
+    var codeElement = document.getElementsByName('py_code');
+    var lineCount = 1;
+    for (var i = 0; i < codeElement.length; i++) {
+        var code = codeElement[i].innerText;
+        if (code.length <= 1) {
+            continue;
+        }
+
+        codeElement[i].innerHTML = "";
+
+        var codeMirror = CodeMirror(
+          codeElement[i],
+          {
+            value: code,
+            mode: "python",
+            theme: "solarized dark",
+            lineNumbers: true,
+            firstLineNumber: lineCount,
+            readOnly: false,
+            lineWrapping: true,
+          }
+        );
+        var noNewLineCode = code.replace(/[\r\n]/g, "");
+        lineCount += code.length - noNewLineCode.length + 1;
+    }
+};
+</script></html>
\ No newline at end of file