diff --git a/grad_clip_value_zh.html b/grad_clip_value_zh.html
new file mode 100644
index 0000000..79ff015
--- /dev/null
+++ b/grad_clip_value_zh.html
@@ -0,0 +1,48 @@
+<!DOCTYPE html>
+<html><head><meta charset="utf-8"></meta><title>Annotated Algorithm Visualization</title><link rel="stylesheet" href="pylit.css?v=1"></link><link rel="stylesheet" href="solarized.css"></link><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.css" integrity="sha384-Juol1FqnotbkyZUT5Z7gUPjQ9gzlwCENvUZTpQBAPxtusdwFLRy382PSDx5UUJ4/" crossorigin="anonymous"></link><script src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.js" integrity="sha384-97gW6UIJxnlKemYavrqDHSX3SiygeOwIZhwyOKRfSaf0JWKRVj9hLASHgFTzT+0O" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous" onload="renderMathInElement(document.body);" defer="True"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/lib/codemirror.min.css"></link><script src="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/lib/codemirror.min.js"></script><script src="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/mode/python/python.min.js"></script></head><body><div class="section" id="section-0"><div class="docs doc-strings"><p><p><a href="index.html"><b>HOME<br></b></a></p></p><a href="https://github.com/opendilab/PPOxFamily" target="_blank"><img alt="GitHub" style="max-width:100%;" src="https://img.shields.io/github/stars/opendilab/PPOxFamily?style=social"></img></a>  <a href="https://space.bilibili.com/1112854351?spm_id_from=333.337.0.0" target="_blank"><img alt="bilibili" style="max-width:100%;" src="https://img.shields.io/badge/bilibili-video%20course-blue"></img></a>  <a href="https://twitter.com/OpenDILab" rel="nofollow" target="_blank"><img alt="twitter" style="max-width:100%;" src="https://img.shields.io/twitter/follow/opendilab?style=social"></img></a><br><a href="https://github.com/opendilab/PPOxFamily/tree/main/chapter7_tricks/grad_clip_value_zh.py" target="_blank">View 
code on GitHub</a><br><br>This document is a PyTorch implementation of gradient clipping by value, mirroring <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">torch.nn.utils.clip_grad_value_</span> .</div></div><div class="section" id="section-1"><div class="docs doc-strings"><p>    <b>Overview</b><br>        An implementation of the gradient clipping function grad_clip_value.  <a href="https://pytorch.org/docs/stable/_modules/torch/nn/utils/clip_grad.html#clip_grad_value_">Related Link</a><br>        Call this function after the loss has been backpropagated. It clips every gradient element of the network parameters to the fixed range [-clip_value, clip_value].<br>        Note that the function works in place: it modifies the gradients directly and returns nothing.</p></div><div class="code"><pre><code id="code_1" name="py_code">from typing import Union, Iterable
+import torch
+
+_tensor_or_tensors = Union[torch.Tensor, Iterable[torch.Tensor]]
+
+
+def grad_clip_value(parameters: _tensor_or_tensors, clip_value: float) -> None:</code></pre></div></div><div class="section" id="section-3"><div class="docs doc-strings"><p>    Collect the non-None gradients of the trainable parameters into a list.</p></div><div class="code"><pre><code id="code_3" name="py_code">    if isinstance(parameters, torch.Tensor):
+        parameters = [parameters]
+    grads = [p.grad for p in parameters if p.grad is not None]</code></pre></div></div><div class="section" id="section-4"><div class="docs doc-strings"><p>    Convert the given clip_value to a float.</p></div><div class="code"><pre><code id="code_4" name="py_code">    clip_value = float(clip_value)</code></pre></div></div><div class="section" id="section-5"><div class="docs doc-strings"><p>    Clip the gradients in place to [-clip_value, clip_value].</p></div><div class="code"><pre><code id="code_5" name="py_code">    for grad in grads:
+        grad.data.clamp_(min=-clip_value, max=clip_value)
+
+</code></pre></div></div><div class="section" id="section-6"><div class="docs doc-strings"><p>    <b>Overview</b><br>        Test function for gradient clipping with a fixed value.</p></div><div class="code"><pre><code id="code_6" name="py_code">def test_grad_clip_value():</code></pre></div></div><div class="section" id="section-8"><div class="docs doc-strings"><p>    Prepare the hyper-parameters: batch size B=4, action dimension N=32.</p></div><div class="code"><pre><code id="code_8" name="py_code">    B, N = 4, 32</code></pre></div></div><div class="section" id="section-9"><div class="docs doc-strings"><p>    Set clip_value to 1e-3.</p></div><div class="code"><pre><code id="code_9" name="py_code">    clip_value = 1e-3</code></pre></div></div><div class="section" id="section-10"><div class="docs doc-strings"><p>    Generate the regression logits and labels. In practice, the logits are the output of the whole network and require gradient computation.</p></div><div class="code"><pre><code id="code_10" name="py_code">    logit = torch.randn(B, N).requires_grad_(True)
+    label = torch.randn(B, N)</code></pre></div></div><div class="section" id="section-11"><div class="docs doc-strings"><p>    Define the criterion and compute the loss.</p></div><div class="code"><pre><code id="code_11" name="py_code">    criterion = torch.nn.MSELoss()
+    output = criterion(logit, label)</code></pre></div></div><div class="section" id="section-12"><div class="docs doc-strings"><p>    Backpropagate the loss to compute the gradients.</p></div><div class="code"><pre><code id="code_12" name="py_code">    output.backward()</code></pre></div></div><div class="section" id="section-13"><div class="docs doc-strings"><p>    Clip the gradients with the fixed value.</p></div><div class="code"><pre><code id="code_13" name="py_code">    grad_clip_value(logit, clip_value)</code></pre></div></div><div class="section" id="section-14"><div class="docs doc-strings"><p>    After clipping, assert that every gradient value lies within the clipping range.</p></div><div class="code"><pre><code id="code_14" name="py_code">    assert isinstance(logit.grad, torch.Tensor)
+    for g in logit.grad:
+        assert (g <= clip_value).all()
+        assert (g >= -clip_value).all()
+
+</code></pre></div></div><div class="section" id="section-14"><div class="docs doc-strings"><p><i>If you have any questions or suggestions about this documentation, you can open an issue on GitHub or email us (opendilab@pjlab.org.cn).</i></p></div></div></body><script type="text/javascript">
+window.onload = function(){
+    var codeElement = document.getElementsByName('py_code');
+    var lineCount = 1;
+    for (var i = 0; i < codeElement.length; i++) {
+        var code = codeElement[i].innerText;
+        if (code.length <= 1) {
+            continue;
+        }
+
+        codeElement[i].innerHTML = "";
+
+        var codeMirror = CodeMirror(
+          codeElement[i],
+          {
+            value: code,
+            mode: "python",
+            theme: "solarized dark",
+            lineNumbers: true,
+            firstLineNumber: lineCount,
+            readOnly: false,
+            lineWrapping: true,
+          }
+        );
+        var noNewLineCode = code.replace(/[\r\n]/g, "");
+        lineCount += code.length - noNewLineCode.length + 1;
+    }
+};
+</script></html>
\ No newline at end of file
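A note on the page above: its Related Link points at PyTorch's built-in value clipping, so the annotated function should behave the same way. The following is a minimal sketch that checks this on a toy model, assuming the built-in counterpart is `torch.nn.utils.clip_grad_value_`. The two `Linear` models, the seed, the clip value of 0.5 and the factor of 100 on the output are illustrative choices, not part of the page.

```python
import torch


def grad_clip_value(parameters, clip_value: float) -> None:
    # Restated from the annotated page so that this sketch runs on its own.
    if isinstance(parameters, torch.Tensor):
        parameters = [parameters]
    grads = [p.grad for p in parameters if p.grad is not None]
    clip_value = float(clip_value)
    for grad in grads:
        grad.data.clamp_(min=-clip_value, max=clip_value)


torch.manual_seed(0)
# Two toy models with identical weights (illustrative, not from the page).
model_a = torch.nn.Linear(8, 2)
model_b = torch.nn.Linear(8, 2)
model_b.load_state_dict(model_a.state_dict())

x, y = torch.randn(16, 8), torch.randn(16, 2)
for model in (model_a, model_b):
    # Scale the output so that many gradient elements exceed the clip range.
    torch.nn.functional.mse_loss(model(x) * 100, y).backward()

# Clip one copy with the annotated function and the other with the built-in.
grad_clip_value(model_a.parameters(), 0.5)
torch.nn.utils.clip_grad_value_(model_b.parameters(), 0.5)

grads_a = [p.grad for p in model_a.parameters()]
grads_b = [p.grad for p in model_b.parameters()]
```

Value clipping bounds each gradient element independently, so the direction of the overall gradient vector can change. Norm clipping, covered on the neighbouring grad_clip_norm page, rescales the whole vector instead.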
diff --git a/index.html b/index.html
new file mode 100644
index 0000000..3e5820a
--- /dev/null
+++ b/index.html
@@ -0,0 +1,30 @@
+<!DOCTYPE html>
+<html><head><meta charset="utf-8"></meta><title>Annotated Algorithm Visualization</title><link rel="stylesheet" href="pylit.css?v=1"></link><link rel="stylesheet" href="solarized.css"></link><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.css" integrity="sha384-Juol1FqnotbkyZUT5Z7gUPjQ9gzlwCENvUZTpQBAPxtusdwFLRy382PSDx5UUJ4/" crossorigin="anonymous"></link><script src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.js" integrity="sha384-97gW6UIJxnlKemYavrqDHSX3SiygeOwIZhwyOKRfSaf0JWKRVj9hLASHgFTzT+0O" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous" onload="renderMathInElement(document.body);" defer="True"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/lib/codemirror.min.css"></link><script src="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/lib/codemirror.min.js"></script><script src="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/mode/python/python.min.js"></script></head><body><div class="section" id="section0"><div class="docs doc-strings"><p><a href="index.html"><b>HOME</b></a></p><a href="https://github.com/opendilab/PPOxFamily" target="_blank"><img alt="GitHub" style="max-width:100%;" src="https://img.shields.io/github/stars/opendilab/PPOxFamily?style=social"></img></a>  <a href="https://space.bilibili.com/1112854351?spm_id_from=333.337.0.0" target="_blank"><img alt="bilibili" style="max-width:100%;" src="https://img.shields.io/badge/bilibili-video%20course-blue"></img></a>  <a href="https://twitter.com/OpenDILab" rel="nofollow" target="_blank"><img alt="twitter" style="max-width:100%;" src="https://img.shields.io/twitter/follow/opendilab?style=social"></img></a><br><a href="https://github.com/opendilab/PPOxFamily" target="_blank">View code on GitHub</a></div></div><div class="section" 
id="section1"><div class="docs doc-strings"><h1><a href="https://github.com/opendilab/PPOxFamily">PPO × Family PyTorch Annotated Documentation</a></h1><img alt="logo" src="./imgs/ppof_logo.png"></img><p>This is the "algorithm-to-code" annotated documentation for PPO × Family, an introductory open course on decision intelligence. It aims to uncover every detail of the PPO algorithm and to give readers a master key for designing decision AI.</p></div></div><div class="section" id="section1"><div class="docs doc-strings"><h2>Annotated code examples by chapter</h2><h4>Starting the Decision AI Journey</h4><li><a href="./pg_zh.html">Policy gradient (PG) core code</a>  |  <a href="./pg.html">Policy Gradient core loss function</a></li><li><a href="./a2c_zh.html">A2C core code</a>  |  <a href="./a2c.html">A2C core loss function</a></li><li><a href="./ppo_zh.html">PPO core code</a>  |  <a href="./ppo.html">PPO core loss function</a></li><br><h4>Deconstructing Complex Action Spaces</h4><li><a href="./discrete_zh.html">PPO for discrete action spaces</a>  |  <a href="./discrete.html">PPO in discrete action space</a></li><li><a href="./continuous_zh.html">PPO for continuous action spaces</a>  |  <a href="./continuous.html">PPO in continuous action space</a></li><li><a href="./hybrid_zh.html">PPO for hybrid action spaces</a>  |  <a href="./hybrid.html">PPO in hybrid action space</a></li><br><h4>Representing Multi-Modal Observation Spaces</h4><li><a href="./encoding_zh.html">Feature encoding techniques</a>  |  <a href="./encoding.html">Encoding methods for vector obs space</a></li><li><a href="./mario_wrapper_zh.html">Environment wrappers for image observation spaces</a>  |  <a href="./mario_wrapper.html">Env wrappers for image obs space</a></li><li><a href="./gradient_zh.html">Code walkthrough of neural network gradient computation</a>  |  <a href="./gradient.html">Automatic gradient mechanism</a></li><br><h4>Decoding Sparse Reward Spaces</h4><li><a href="./popart.html">Pop-Art normalization trick used in PPO</a></li><li><a href="./value_rescale.html">Value rescale trick used in PPO</a></li><br><h4>Exploring Temporal Modeling</h4><li><a href="./lstm.html">PPO + LSTM</a></li><li><a href="./gtrxl.html">PPO + Gated Transformer-XL</a></li><br><h4>Coordinating Multiple Agents</h4><li><a href="./marl_network_zh.html">Classic neural network architectures for multi-agent cooperation</a>  |  <a href="./marl_network.html">Multi-Agent cooperation network</a></li><li><a href="./independentpg_zh.html">Policy gradient training for independent multi-agent decision making</a>  |  <a 
href="./independentpg.html">Independent policy gradient training</a></li><li><a href="./mapg_zh.html">Policy gradient training for cooperative multi-agent decision making</a>  |  <a href="./mapg.html">Multi-Agent policy gradient training</a></li><li><a href="./mappo_zh.html">PPO training for cooperative multi-agent decision making</a>  |  <a href="./mappo.html">Multi-Agent PPO training</a></li><br><h4>Uncovering Advanced Tricks</h4><li><a href="./gae.html">GAE technique used in PPO</a></li><li><a href="./recompute.html">Recompute adv trick used in PPO</a></li><li><a href="./grad_clip_norm_zh.html">Gradient norm clipping used in PPO</a>  |  <a href="./grad_clip_norm.html">Gradient norm clip trick used in PPO</a></li><li><a href="./grad_clip_value_zh.html">Gradient value clipping used in PPO</a>  |  <a href="./grad_clip_value.html">Gradient value clip trick used in PPO</a></li><li><a href="./grad_ignore.html">Gradient ignore trick used in PPO</a></li><li><a href="./orthogonal_init.html">Orthogonal initialization of networks used in PPO</a></li><li><a href="./dual_clip.html">Dual clip trick used in PPO</a></li><li><a href="./value_clip.html">Value clip trick used in PPO</a></li></div></div><div class="section" id="section-final"><div class="docs doc-strings"><p><i>If you have any questions or suggestions about this documentation, you can open an issue on GitHub or email us (opendilab@pjlab.org.cn).</i></p></div></div></body><script type="text/javascript">
+window.onload = function(){
+    var codeElement = document.getElementsByName('py_code');
+    var lineCount = 1;
+    for (var i = 0; i < codeElement.length; i++) {
+        var code = codeElement[i].innerText;
+        if (code.length <= 1) {
+            continue;
+        }
+
+        codeElement[i].innerHTML = "";
+
+        var codeMirror = CodeMirror(
+          codeElement[i],
+          {
+            value: code,
+            mode: "python",
+            theme: "solarized dark",
+            lineNumbers: true,
+            firstLineNumber: lineCount,
+            readOnly: true,
+            lineWrapping: true,
+          }
+        );
+        var noNewLineCode = code.replace(/[\r\n]/g, "");
+        lineCount += code.length - noNewLineCode.length + 1;
+    }
+};
+</script></html>
\ No newline at end of file
diff --git a/mappo_zh.html b/mappo_zh.html
new file mode 100644
index 0000000..0f46b96
--- /dev/null
+++ b/mappo_zh.html
@@ -0,0 +1,59 @@
+<!DOCTYPE html>
+<html><head><meta charset="utf-8"></meta><title>Annotated Algorithm Visualization</title><link rel="stylesheet" href="pylit.css?v=1"></link><link rel="stylesheet" href="solarized.css"></link><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.css" integrity="sha384-Juol1FqnotbkyZUT5Z7gUPjQ9gzlwCENvUZTpQBAPxtusdwFLRy382PSDx5UUJ4/" crossorigin="anonymous"></link><script src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.js" integrity="sha384-97gW6UIJxnlKemYavrqDHSX3SiygeOwIZhwyOKRfSaf0JWKRVj9hLASHgFTzT+0O" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous" onload="renderMathInElement(document.body);" defer="True"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/lib/codemirror.min.css"></link><script src="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/lib/codemirror.min.js"></script><script src="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/mode/python/python.min.js"></script></head><body><div class="section" id="section-0"><div class="docs doc-strings"><p><p><a href="index.html"><b>HOME<br></b></a></p></p><a href="https://github.com/opendilab/PPOxFamily" target="_blank"><img alt="GitHub" style="max-width:100%;" src="https://img.shields.io/github/stars/opendilab/PPOxFamily?style=social"></img></a>  <a href="https://space.bilibili.com/1112854351?spm_id_from=333.337.0.0" target="_blank"><img alt="bilibili" style="max-width:100%;" src="https://img.shields.io/badge/bilibili-video%20course-blue"></img></a>  <a href="https://twitter.com/OpenDILab" rel="nofollow" target="_blank"><img alt="twitter" style="max-width:100%;" src="https://img.shields.io/twitter/follow/opendilab?style=social"></img></a><br><a href="https://github.com/opendilab/PPOxFamily/tree/main/chapter6_marl/mappo_zh.py" target="_blank">View code on 
GitHub</a><br><br>A PyTorch tutorial on the basic MAPPO algorithm with centralized training and decentralized execution (CTDE), suited to cooperative multi-agent scenarios.<br>This tutorial uses the CTDEActorCriticNetwork defined in <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">marl_network</span> and the loss function defined in <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">ppo</span> , together with the advantage computation defined in <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">gae</span> .<br>In addition, the main function uses constructed test data to describe the core part of the CTDE MAPPO algorithm.<br>More details about cooperative multi-agent reinforcement learning can be found in <a href="https://github.com/opendilab/PPOxFamily/blob/main/chapter6_marl/chapter6_lecture.pdf">Related Link</a> .</div></div><div class="section" id="section-2"><div class="docs doc-strings"><p>You need to copy the ppo implementation from chapter1_overview into the current directory.</p></div><div class="code"><pre><code id="code_2" name="py_code">from ppo import ppo_policy_data, ppo_policy_error</code></pre></div></div><div class="section" id="section-3"><div class="docs doc-strings"><p>You need to copy the gae implementation from chapter7_tricks into the current directory.</p></div><div class="code"><pre><code id="code_3" name="py_code">from gae import gae
+
+</code></pre></div></div><div class="section" id="section-4"><div class="docs doc-strings"><p>    <b>Overview</b><br>    This is the core function of the CTDE MAPPO training procedure.<br>    First, define the hyper-parameters, the neural network and the optimizer; then generate constructed test data and compute the actor-critic loss.<br>    Finally, update the network parameters with the optimizer. In practice, the training data should come from online interaction with the environment.<br>    Note that in this file the policy network is the actor and the value network is the critic.</p></div><div class="code"><pre><code id="code_4" name="py_code">def mappo_training_opeator() -> None:</code></pre></div></div><div class="section" id="section-6"><div class="docs doc-strings"><p>    Set the necessary hyper-parameters.</p></div><div class="code"><pre><code id="code_6" name="py_code">    batch_size, agent_num, local_state_shape, agent_specific_global_state_shape, action_shape = 4, 5, 10, 25, 6</code></pre></div></div><div class="section" id="section-7"><div class="docs doc-strings"><p>    Entropy bonus weight, which encourages exploration.</p></div><div class="code"><pre><code id="code_7" name="py_code">    entropy_weight = 0.001</code></pre></div></div><div class="section" id="section-8"><div class="docs doc-strings"><p>    Value loss weight, which balances the magnitudes of the different loss terms.</p></div><div class="code"><pre><code id="code_8" name="py_code">    value_weight = 0.5</code></pre></div></div><div class="section" id="section-9"><div class="docs doc-strings"><p>    Discount factor for future rewards.</p></div><div class="code"><pre><code id="code_9" name="py_code">    discount_factor = 0.99</code></pre></div></div><div class="section" id="section-10"><div class="docs doc-strings"><p>    Set the tensor device to cuda or cpu according to the runtime environment.</p></div><div class="code"><pre><code id="code_10" name="py_code">    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+</code></pre></div></div><div class="section" id="section-11"><div class="docs doc-strings"><p>    Define the multi-agent neural network and the optimizer.</p></div><div class="code"><pre><code id="code_11" name="py_code">    model = CTDEActorCriticNetwork(agent_num, local_state_shape, agent_specific_global_state_shape, action_shape)
+    model.to(device)</code></pre></div></div><div class="section" id="section-12"><div class="docs doc-strings"><p>    Adam is the most commonly used optimizer in deep reinforcement learning. If you want to add weight decay, use <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">torch.optim.AdamW</span> .</p></div><div class="code"><pre><code id="code_12" name="py_code">    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
+</code></pre></div></div><div class="section" id="section-13"><div class="docs doc-strings"><p>    Define the corresponding test data, keeping the same format as data generated by environment interaction.<br>    Note that the data should be on the same device as the network.<br>    For simplicity, we treat the whole batch as one complete episode.<br>    In practice, a training batch combines multiple episodes, and the <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">done</span> variable is usually used to separate them.</p></div><div class="code"><pre><code id="code_13" name="py_code">    local_state = torch.randn(batch_size, agent_num, local_state_shape).to(device)
+    agent_specific_global_state = torch.randn(batch_size, agent_num, agent_specific_global_state_shape).to(device)
+    logit_old = torch.randn(batch_size, agent_num, action_shape).to(device)
+    value_old = torch.randn(batch_size, agent_num).to(device)
+    done = torch.zeros(batch_size).to(device)
+    done[-1] = 1
+    action = torch.randint(0, action_shape, (batch_size, agent_num)).to(device)
+    reward = torch.randn(batch_size, agent_num).to(device)</code></pre></div></div><div class="section" id="section-14"><div class="docs doc-strings"><p>    The target return can be computed in different ways. Here we use the discounted cumulative sum of rewards.<br>    Generalized Advantage Estimation (GAE), n-step TD and other methods can also be used.</p></div><div class="code"><pre><code id="code_14" name="py_code">    return_ = torch.zeros_like(reward)
+    for i in reversed(range(batch_size)):
+        return_[i] = reward[i] + (discount_factor * return_[i + 1] if i + 1 < batch_size else 0)
+</code></pre></div></div><div class="section" id="section-15"><div class="docs doc-strings"><p>    Forward pass of the actor-critic network.</p></div><div class="code"><pre><code id="code_15" name="py_code">    output = model(local_state, agent_specific_global_state)</code></pre></div></div><div class="section" id="section-16"><div class="docs doc-strings"><p>    The <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">squeeze</span> operation converts the shape from $$(B, A, 1)$$ to $$(B, A)$$.</p></div><div class="code"><pre><code id="code_16" name="py_code">    value = output.value.squeeze(-1)</code></pre></div></div><div class="section" id="section-17"><div class="docs doc-strings"><p>    Use Generalized Advantage Estimation (GAE) to compute the advantage.<br>    The advantage acts as a "weight" on the policy loss, so it is computed inside <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">torch.no_grad()</span> , which disables gradient computation.<br>    <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">done</span> is the episode termination flag. <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">traj_flag</span> is the trajectory flag.<br>    Here we treat the whole batch as one complete episode, so <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">done</span> and <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">traj_flag</span> are the same.</p></div><div class="code"><pre><code id="code_17" name="py_code">    with torch.no_grad():
+        traj_flag = done
+        gae_data = (value, value_old, reward, done, traj_flag)
+        adv = gae(gae_data, discount_factor, 0.95)</code></pre></div></div><div class="section" id="section-18"><div class="docs doc-strings"><p>    Prepare the data for the PPO policy loss.</p></div><div class="code"><pre><code id="code_18" name="py_code">    data = ppo_policy_data(output.logit, logit_old, action, adv, None)</code></pre></div></div><div class="section" id="section-19"><div class="docs doc-strings"><p>    Compute the PPO policy loss.</p></div><div class="code"><pre><code id="code_19" name="py_code">    loss, info = ppo_policy_error(data)</code></pre></div></div><div class="section" id="section-20"><div class="docs doc-strings"><p>    Compute the value loss.</p></div><div class="code"><pre><code id="code_20" name="py_code">    value_loss = torch.nn.functional.mse_loss(value, return_)</code></pre></div></div><div class="section" id="section-21"><div class="docs doc-strings"><p>    Weighted sum of the PPO policy loss, the value loss and the entropy loss.</p></div><div class="code"><pre><code id="code_21" name="py_code">    total_loss = loss.policy_loss + value_weight * value_loss - entropy_weight * loss.entropy_loss
+</code></pre></div></div><div class="section" id="section-22"><div class="docs doc-strings"><p>    PyTorch loss backpropagation and optimizer update.</p></div><div class="code"><pre><code id="code_22" name="py_code">    optimizer.zero_grad()
+    total_loss.backward()
+    optimizer.step()</code></pre></div></div><div class="section" id="section-23"><div class="docs doc-strings"><p>    Print the training information.</p></div><div class="code"><pre><code id="code_23" name="py_code">    print(
+        'total_loss: {:.4f}, policy_loss: {:.4f}, value_loss: {:.4f}, entropy_loss: {:.4f}'.format(
+            total_loss, loss.policy_loss, value_loss, loss.entropy_loss
+        )
+    )
+    print('approximate_kl_divergence: {:.4f}, clip_fraction: {:.4f}'.format(info.approx_kl, info.clipfrac))
+    print('mappo_training_opeator is ok')
+
+</code></pre></div></div><div class="section" id="section-23"><div class="docs doc-strings"><p><i>If you have any questions or suggestions about this documentation, you can open an issue on GitHub or email us (opendilab@pjlab.org.cn).</i></p></div></div></body><script type="text/javascript">
+window.onload = function(){
+    var codeElement = document.getElementsByName('py_code');
+    var lineCount = 1;
+    for (var i = 0; i < codeElement.length; i++) {
+        var code = codeElement[i].innerText;
+        if (code.length <= 1) {
+            continue;
+        }
+
+        codeElement[i].innerHTML = "";
+
+        var codeMirror = CodeMirror(
+          codeElement[i],
+          {
+            value: code,
+            mode: "python",
+            theme: "solarized dark",
+            lineNumbers: true,
+            firstLineNumber: lineCount,
+            readOnly: false,
+            lineWrapping: true,
+          }
+        );
+        var noNewLineCode = code.replace(/[\r\n]/g, "");
+        lineCount += code.length - noNewLineCode.length + 1;
+    }
+};
+</script></html>
\ No newline at end of file
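A note on the page above: the reversed loop over the batch computes the discounted return, treating the batch as one episode ordered in time. The following is a small worked sketch of that recursion with hand-checkable numbers. The helper name `discounted_return`, the toy rewards (3 steps, 2 agents) and the discount of 0.5 are assumptions made for illustration; the page itself uses random rewards and a discount of 0.99.

```python
import torch


def discounted_return(reward: torch.Tensor, discount_factor: float) -> torch.Tensor:
    # reward has shape (T, agent_num); the first dimension is time within one episode.
    return_ = torch.zeros_like(reward)
    running = torch.zeros_like(reward[0])
    # Walk backwards in time: G_t = r_t + gamma * G_{t+1}, with G_T = 0.
    for i in reversed(range(reward.shape[0])):
        running = reward[i] + discount_factor * running
        return_[i] = running
    return return_


# Toy rewards for 3 time steps and 2 agents (illustrative values).
reward = torch.tensor([[1.0, 0.0], [1.0, 2.0], [1.0, 0.0]])
ret = discounted_return(reward, 0.5)
# Agent 0: [1 + 0.5 * 1.5, 1 + 0.5 * 1.0, 1.0] = [1.75, 1.5, 1.0]
# Agent 1: [0 + 0.5 * 2.0, 2 + 0.5 * 0.0, 0.0] = [1.0, 2.0, 0.0]
```

Each agent's column is processed independently, which matches the per-agent reward tensor of shape (batch_size, agent_num) used on the page.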
diff --git a/value_rescale.html b/value_rescale.html
new file mode 100644
index 0000000..fb08903
--- /dev/null
+++ b/value_rescale.html
@@ -0,0 +1,47 @@
+<!DOCTYPE html>
+<html><head><meta charset="utf-8"></meta><title>Annotated Algorithm Visualization</title><link rel="stylesheet" href="pylit.css?v=1"></link><link rel="stylesheet" href="solarized.css"></link><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.css" integrity="sha384-Juol1FqnotbkyZUT5Z7gUPjQ9gzlwCENvUZTpQBAPxtusdwFLRy382PSDx5UUJ4/" crossorigin="anonymous"></link><script src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/katex.min.js" integrity="sha384-97gW6UIJxnlKemYavrqDHSX3SiygeOwIZhwyOKRfSaf0JWKRVj9hLASHgFTzT+0O" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/npm/katex@0.16.3/dist/contrib/auto-render.min.js" integrity="sha384-+VBxd3r6XgURycqtZ117nYw44OOcIax56Z4dCRWbxyPt0Koah1uHoK0o4+/RRE05" crossorigin="anonymous" onload="renderMathInElement(document.body);" defer="True"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/lib/codemirror.min.css"></link><script src="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/lib/codemirror.min.js"></script><script src="https://cdn.jsdelivr.net/npm/codemirror@5.61.0/mode/python/python.min.js"></script></head><body><div class="section" id="section-0"><div class="docs doc-strings"><p><p><a href="index.html"><b>HOME<br></b></a></p></p><a href="https://github.com/opendilab/PPOxFamily" target="_blank"><img alt="GitHub" style="max-width:100%;" src="https://img.shields.io/github/stars/opendilab/PPOxFamily?style=social"></img></a>  <a href="https://space.bilibili.com/1112854351?spm_id_from=333.337.0.0" target="_blank"><img alt="bilibili" style="max-width:100%;" src="https://img.shields.io/badge/bilibili-video%20course-blue"></img></a>  <a href="https://twitter.com/OpenDILab" rel="nofollow" target="_blank"><img alt="twitter" style="max-width:100%;" src="https://img.shields.io/twitter/follow/opendilab?style=social"></img></a><br><a href="https://github.com/opendilab/PPOxFamily/tree/main/chapter4_reward/value_rescale.py" target="_blank">View code 
on GitHub</a><br><br>In RL training, it is common to apply a normalization function that reduces the scale of some neural network predictions (e.g. the value function), which can make training more stable.<br>This document demonstrates two data normalization methods and their corresponding inverse operations.<br>- The first one is <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">value_transform</span> , which reduces the scale of the action-value function. Its inverse operation is <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">value_inv_transform</span> . <a href="https://arxiv.org/pdf/1805.11593.pdf">Related Link</a><br>- The second one is <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">symlog</span> , which is another approach to normalizing the input tensor. Its inverse operation is <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">inv_symlog</span> . <a href="https://arxiv.org/pdf/2301.04104.pdf">Related Link</a></div></div><div class="section" id="section-1"><div class="docs doc-strings"><p>    <b>Overview</b><br>        A function to reduce the scale of the action-value function. For further reading, please refer to: Observe and Look Further: Achieving Consistent Performance on Atari <a href="https://arxiv.org/abs/1805.11593">Related Link</a><br>        Given the input tensor <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">x</span> , this function returns the normalized tensor.<br>        The argument <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">eps</span> is a hyper-parameter that scales the additive regularization term, which ensures that the corresponding inverse operation is Lipschitz continuous.</p></div><div class="code"><pre><code id="code_1" name="py_code">import torch
+
+
+def value_transform(x: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:</code></pre></div></div><div class="section" id="section-3"><div class="docs doc-strings"><p>    Core implementation.<br>    The formula of the normalization is: $$h(x) = sign(x)(\sqrt{(|x|+1)} - 1) + \epsilon * x$$</p></div><div class="code"><pre><code id="code_3" name="py_code">    return torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1) + eps * x
+
+</code></pre></div></div><div class="section" id="section-4"><div class="docs doc-strings"><p>    <b>Overview</b><br>        The inverse form of value transform. Given the input tensor <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">x</span> , this function will return the unnormalized tensor.</p></div><div class="code"><pre><code id="code_4" name="py_code">def value_inv_transform(x: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:</code></pre></div></div><div class="section" id="section-6"><div class="docs doc-strings"><p>    The formula of the unnormalization is: $$h^{-1}(x) = sign(x)({(\frac{\sqrt{1+4\epsilon(|x|+1+\epsilon)}-1}{2\epsilon})}^2-1)$$</p></div><div class="code"><pre><code id="code_6" name="py_code">    return torch.sign(x) * (((torch.sqrt(1 + 4 * eps * (torch.abs(x) + 1 + eps)) - 1) / (2 * eps)) ** 2 - 1)
+
+</code></pre></div></div><div class="section" id="section-7"><div class="docs doc-strings"><p>    <b>Overview</b><br>        A function to normalize the targets. For further reading, please refer to: Mastering Diverse Domains through World Models <a href="https://arxiv.org/abs/2301.04104">Related Link</a><br>        Given the input tensor <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">x</span> , this function returns the normalized tensor.</p></div><div class="code"><pre><code id="code_7" name="py_code">def symlog(x: torch.Tensor) -> torch.Tensor:</code></pre></div></div><div class="section" id="section-9"><div class="docs doc-strings"><p>    The formula of the normalization is: $$symlog(x) = sign(x)\ln(|x|+1)$$</p></div><div class="code"><pre><code id="code_9" name="py_code">    return torch.sign(x) * (torch.log(torch.abs(x) + 1))
+
+</code></pre></div></div><div class="section" id="section-10"><div class="docs doc-strings"><p>    <b>Overview</b><br>        The inverse of symlog, written as symexp in the formula. Given the input tensor <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">x</span> , this function returns the unnormalized tensor.</p></div><div class="code"><pre><code id="code_10" name="py_code">def inv_symlog(x: torch.Tensor) -> torch.Tensor:</code></pre></div></div><div class="section" id="section-12"><div class="docs doc-strings"><p>    The formula of the unnormalization is: $$symexp(x) = sign(x)(\exp(|x|)-1)$$</p></div><div class="code"><pre><code id="code_12" name="py_code">    return torch.sign(x) * (torch.exp(torch.abs(x)) - 1)
+
+</code></pre></div></div><div class="section" id="section-13"><div class="docs doc-strings"><p>    <b>Overview</b><br>        Generate fake data and test the <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">value_transform</span> and <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">value_inv_transform</span> functions.</p></div><div class="code"><pre><code id="code_13" name="py_code">def test_value_transform():</code></pre></div></div><div class="section" id="section-15"><div class="docs doc-strings"><p>    Generate fake data.</p></div><div class="code"><pre><code id="code_15" name="py_code">    test_x = torch.randn(10)</code></pre></div></div><div class="section" id="section-16"><div class="docs doc-strings"><p>    Normalize the generated data.</p></div><div class="code"><pre><code id="code_16" name="py_code">    normalized_x = value_transform(test_x)
+    assert normalized_x.shape == (10,)</code></pre></div></div><div class="section" id="section-17"><div class="docs doc-strings"><p>    Unnormalize the data.</p></div><div class="code"><pre><code id="code_17" name="py_code">    unnormalized_x = value_inv_transform(normalized_x)</code></pre></div></div><div class="section" id="section-18"><div class="docs doc-strings"><p>    Test whether the data before and after the transformation is the same.</p></div><div class="code"><pre><code id="code_18" name="py_code">    assert torch.sum(torch.abs(test_x - unnormalized_x)) < 1e-3
+
+</code></pre></div></div><div class="section" id="section-19"><div class="docs doc-strings"><p>    <b>Overview</b><br>        Generate fake data and test the <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">symlog</span> and <span style="color:#00cbf694;font-family:Monaco,IBMPlexMono;">inv_symlog</span> functions.</p></div><div class="code"><pre><code id="code_19" name="py_code">def test_symlog():</code></pre></div></div><div class="section" id="section-21"><div class="docs doc-strings"><p>    Generate fake data.</p></div><div class="code"><pre><code id="code_21" name="py_code">    test_x = torch.randn(10)</code></pre></div></div><div class="section" id="section-22"><div class="docs doc-strings"><p>    Normalize the generated data.</p></div><div class="code"><pre><code id="code_22" name="py_code">    normalized_x = symlog(test_x)
+    assert normalized_x.shape == (10,)</code></pre></div></div><div class="section" id="section-23"><div class="docs doc-strings"><p>    Unnormalize the data.</p></div><div class="code"><pre><code id="code_23" name="py_code">    unnormalized_x = inv_symlog(normalized_x)</code></pre></div></div><div class="section" id="section-24"><div class="docs doc-strings"><p>    Test whether the data before and after the transformation is the same.</p></div><div class="code"><pre><code id="code_24" name="py_code">    assert torch.sum(torch.abs(test_x - unnormalized_x)) < 1e-3
+
+</code></pre></div></div><div class="section" id="section-24"><div class="docs doc-strings"><p><i>If you have any questions or suggestions about this documentation, you can raise an issue on GitHub (https://github.com/opendilab/PPOxFamily) or email us (opendilab@pjlab.org.cn).</i></p></div></div></body><script type="text/javascript">
+window.onload = function(){
+    var codeElement = document.getElementsByName('py_code');
+    var lineCount = 1;
+    for (var i = 0; i < codeElement.length; i++) {
+        var code = codeElement[i].innerText;
+        if (code.length <= 1) {
+            continue;
+        }
+
+        codeElement[i].innerHTML = "";
+
+        var codeMirror = CodeMirror(
+          codeElement[i],
+          {
+            value: code,
+            mode: "python",
+            theme: "solarized dark",
+            lineNumbers: true,
+            firstLineNumber: lineCount,
+            readOnly: false,
+            lineWrapping: true,
+          }
+        );
+        var noNewLineCode = code.replace(/[\r\n]/g, "");
+        lineCount += code.length - noNewLineCode.length + 1;
+    }
+};
+</script></html>
\ No newline at end of file
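A note on the page above: its tests only draw inputs from a standard normal, where both transforms are close to the identity. The following is a minimal sketch of how strongly each transform compresses large magnitudes, and of the round trip through its inverse. The four functions are restated from the page; the sample values and the use of float64 (chosen here so that the round trip is numerically tight) are assumptions for illustration.

```python
import torch


def value_transform(x: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:
    return torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1) + eps * x


def value_inv_transform(x: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:
    return torch.sign(x) * (((torch.sqrt(1 + 4 * eps * (torch.abs(x) + 1 + eps)) - 1) / (2 * eps)) ** 2 - 1)


def symlog(x: torch.Tensor) -> torch.Tensor:
    return torch.sign(x) * torch.log(torch.abs(x) + 1)


def inv_symlog(x: torch.Tensor) -> torch.Tensor:
    return torch.sign(x) * (torch.exp(torch.abs(x)) - 1)


# Illustrative inputs spanning three orders of magnitude.
x = torch.tensor([-1000.0, -1.0, 0.0, 1.0, 1000.0], dtype=torch.float64)

h = value_transform(x)  # h(1000) = sqrt(1001) - 1 + 10, roughly 40.64
s = symlog(x)           # symlog(1000) = ln(1001), roughly 6.91

# Both transforms are odd functions and are undone by their inverses.
x_from_h = value_inv_transform(h)
x_from_s = inv_symlog(s)
```

For large inputs `value_transform` grows like a square root plus the small linear term `eps * x`, while `symlog` grows logarithmically, so `symlog` compresses extreme targets more aggressively.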