diff --git a/grad_clip_value_zh.html b/grad_clip_value_zh.html
new file mode 100644
index 0000000..79ff015
--- /dev/null
+++ b/grad_clip_value_zh.html
@@ -0,0 +1,48 @@
Annotated Algorithm Visualization


This file is a PyTorch implementation of the gradient value clipping utility grad_clip_value (the counterpart of torch.nn.utils.clip_grad_value_).

Overview
Implementation of the gradient value clipping function, i.e. grad_clip_value Related Link
This function is used after the loss backward pass: it clips all gradients of the network parameters into a fixed range [-clip_value, clip_value].
Note that this function operates in place; it modifies the gradients and returns nothing.

from typing import Union, Iterable
import torch

_tensor_or_tensors = Union[torch.Tensor, Iterable[torch.Tensor]]


def grad_clip_value(parameters: _tensor_or_tensors, clip_value: float) -> None:

Collect the non-None gradients of the trainable parameters into a list.

    if isinstance(parameters, torch.Tensor):
        parameters = [parameters]
    grads = [p.grad for p in parameters if p.grad is not None]

Convert the raw clip_value to float.

    clip_value = float(clip_value)

Clip the gradients in place into [-clip_value, clip_value].

    for grad in grads:
        grad.data.clamp_(min=-clip_value, max=clip_value)
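
A minimal usage sketch, assuming an illustrative two-layer network, learning rate and clip threshold: grad_clip_value is called between loss.backward() and optimizer.step().

import torch.nn as nn

# Illustrative model, optimizer and data.
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2)
x, y = torch.randn(4, 8), torch.randn(4, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(net(x), y)
loss.backward()
# Clip every gradient element into [-0.1, 0.1] before the optimizer step.
grad_clip_value(net.parameters(), 0.1)
optimizer.step()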

Overview
Test function for gradient clipping with a fixed value.

def test_grad_clip_value():

Prepare the hyper-parameters: batch size = 4, action dimension = 32.

    B, N = 4, 32

Set clip_value to 1e-3.

    clip_value = 1e-3

Generate the regression logit and label. In practical applications, the logit is the output of the whole network and requires gradient computation.

    logit = torch.randn(B, N).requires_grad_(True)
    label = torch.randn(B, N)

Define the criterion and compute the loss.

    criterion = torch.nn.MSELoss()
    output = criterion(logit, label)

Back-propagate the loss and compute the gradients.

    output.backward()

Clip the gradients with the fixed value.

    grad_clip_value(logit, clip_value)

After clipping, assert that the clipped gradient values lie within the valid range.

    assert isinstance(logit.grad, torch.Tensor)
    for g in logit.grad:
        assert (g <= clip_value).all()
        assert (g >= -clip_value).all()

If you have any questions or suggestions about this document, you can raise an issue on GitHub or email us directly (opendilab@pjlab.org.cn).

\ No newline at end of file
diff --git a/index.html b/index.html
new file mode 100644
index 0000000..3e5820a
--- /dev/null
+++ b/index.html
@@ -0,0 +1,30 @@
Annotated Algorithm Visualization


PPO × Family PyTorch Annotated Documentation

logo

As the "algorithm-code" annotated documentation for the PPO × Family introductory open course on decision intelligence, it strives to uncover every detail of the PPO algorithm and help readers quickly master this master key to designing decision-making AI.

Table of contents of the code walkthrough examples for each chapter

Starting the journey of decision AI exploration

  • Core code of the Policy Gradient (PG) algorithm | Policy Gradient core loss function
  • Core code of the A2C algorithm | A2C core loss function
  • Core code of the PPO algorithm | PPO core loss function

  • Deconstructing complex action spaces

  • Modeling discrete action spaces with PPO | PPO in discrete action space
  • Modeling continuous action spaces with PPO | PPO in continuous action space
  • Modeling hybrid action spaces with PPO | PPO in hybrid action space

  • Representing multimodal observation spaces

  • Various feature encoding techniques | Encoding methods for vector obs space
  • Various environment wrappers for image observation spaces | Env wrappers for image obs space
  • Code walkthrough of neural network gradient computation | Automatic gradient mechanism

  • Demystifying sparse reward spaces

  • Pop-Art normalization trick used in PPO
  • Value rescale trick used in PPO

  • Exploring temporal modeling

  • PPO + LSTM
  • PPO + Gated Transformer-XL

  • Coordinating multiple agents

  • Classic neural network architectures for multi-agent cooperation | Multi-Agent cooperation network
  • Policy gradient training pipeline for independent multi-agent decision-making | Independent policy gradient training
  • Policy gradient training pipeline for cooperative multi-agent decision-making | Multi-Agent policy gradient training
  • PPO training pipeline for cooperative multi-agent decision-making | Multi-Agent PPO training

  • Digging into the tricks

  • GAE technique used in PPO
  • Recompute adv trick used in PPO
  • Gradient norm clipping used in PPO | Gradient norm clip trick used in PPO
  • Gradient value clipping used in PPO | Gradient value clip trick used in PPO
  • Gradient ignore trick used in PPO
  • Orthogonal initialization of networks used in PPO
  • Dual clip trick used in PPO
  • Value clip trick used in PPO
If you have any questions or suggestions about this document, you can raise an issue on GitHub or email us directly (opendilab@pjlab.org.cn).

\ No newline at end of file
diff --git a/mappo_zh.html b/mappo_zh.html
new file mode 100644
index 0000000..0f46b96
--- /dev/null
+++ b/mappo_zh.html
@@ -0,0 +1,59 @@
Annotated Algorithm Visualization


A PyTorch-based tutorial on the centralized training with decentralized execution (CTDE) MAPPO algorithm, applicable to multi-agent cooperation scenarios.
This tutorial uses the CTDEActorCriticNetwork defined in marl_network and the loss function defined in ppo, combined with the advantage estimation method defined in gae.
In addition, the main function demonstrates the core parts of the CTDE MAPPO algorithm on constructed test data.
More details about cooperative multi-agent reinforcement learning can be found in the Related Link.

You need to copy the ppo implementation from chapter1_overview into the current directory.

from ppo import ppo_policy_data, ppo_policy_error

You need to copy the gae implementation from chapter7_tricks into the current directory.

from gae import gae

# torch and the CTDE network from marl_network (mentioned above) are also required below.
import torch
from marl_network import CTDEActorCriticNetwork

Overview
This is the core function of the CTDE MAPPO algorithm training procedure.
First, define some hyper-parameters, the neural network and the optimizer, then generate constructed test data and compute the Actor-Critic loss.
Finally, update the network parameters with the optimizer. In practical applications, the training data should be obtained from online interaction with the environment.
Note that in this file, the policy network refers to the Actor and the value network refers to the Critic.

def mappo_training_operator() -> None:

Set the necessary hyper-parameters.

    batch_size, agent_num, local_state_shape, agent_specific_global_state_shape, action_shape = 4, 5, 10, 25, 6

Weight of the entropy bonus, which encourages exploration.

    entropy_weight = 0.001

Weight of the value loss, intended to balance the magnitudes of the different loss terms.

    value_weight = 0.5

Discount factor for future rewards.

    discount_factor = 0.99

Set the tensor device to cuda or cpu according to the runtime environment.

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Define the multi-agent neural network and the optimizer.

    model = CTDEActorCriticNetwork(agent_num, local_state_shape, agent_specific_global_state_shape, action_shape)
    model.to(device)

Adam is the most commonly used optimizer in deep reinforcement learning. If you want to add a weight decay mechanism, you should use torch.optim.AdamW instead.

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
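
A hedged sketch of the AdamW alternative mentioned above; the weight_decay value is only illustrative, and the line is kept as a comment so the walkthrough still uses Adam.

    # optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-4)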

Define the corresponding test data, keeping the same data format as the data generated by interacting with the environment.
Note that the data should be on the same computing device as the network.
For simplicity, here we treat the whole batch of data as one complete episode.
In practical applications, a training batch is a combination of several episodes; we usually use the done variable to divide different episodes.

    local_state = torch.randn(batch_size, agent_num, local_state_shape).to(device)
    agent_specific_global_state = torch.randn(batch_size, agent_num, agent_specific_global_state_shape).to(device)
    logit_old = torch.randn(batch_size, agent_num, action_shape).to(device)
    value_old = torch.randn(batch_size, agent_num).to(device)
    done = torch.zeros(batch_size).to(device)
    done[-1] = 1
    action = torch.randint(0, action_shape, (batch_size, agent_num)).to(device)
    reward = torch.randn(batch_size, agent_num).to(device)

The target return can be computed in different ways. Here we use the discounted cumulative sum of rewards.
Generalized Advantage Estimation (GAE), n-step TD methods and so on can also be used.

    return_ = torch.zeros_like(reward)
    for i in reversed(range(batch_size)):
        return_[i] = reward[i] + (discount_factor * return_[i + 1] if i + 1 < batch_size else 0)
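
In the general case where a batch concatenates several episodes, the same recursion can be cut at episode boundaries with the done flag. A hedged sketch, kept as comments since for this single-episode batch it is equivalent to the loop above:

    # for i in reversed(range(batch_size)):
    #     bootstrap = return_[i + 1] if i + 1 < batch_size else 0.
    #     return_[i] = reward[i] + discount_factor * (1 - done[i]) * bootstrap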

Forward pass of the Actor-Critic network.

    output = model(local_state, agent_specific_global_state)

The squeeze operation changes the shape from $$(B, A, 1)$$ to $$(B, A)$$.

    value = output.value.squeeze(-1)

Use Generalized Advantage Estimation (GAE) to compute the advantage.
The advantage acts as a kind of "weight" for the policy loss, so its computation is wrapped in torch.no_grad(), indicating that no gradient is computed.
done is the episode-termination flag, and traj_flag is the trajectory flag.
Here we treat the whole batch of data as one complete episode, so done and traj_flag are the same.

    with torch.no_grad():
        traj_flag = done
        gae_data = (value, value_old, reward, done, traj_flag)
        adv = gae(gae_data, discount_factor, 0.95)

Prepare the data for the PPO policy loss computation.

    data = ppo_policy_data(output.logit, logit_old, action, adv, None)

Compute the PPO policy loss.

    loss, info = ppo_policy_error(data)

Compute the value loss.

    value_loss = torch.nn.functional.mse_loss(value, return_)

Weighted sum of the PPO policy loss, the value loss and the entropy loss.

    total_loss = loss.policy_loss + value_weight * value_loss - entropy_weight * loss.entropy_loss

PyTorch loss back-propagation and optimizer update.

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

Print the training information.

    print(
        'total_loss: {:.4f}, policy_loss: {:.4f}, value_loss: {:.4f}, entropy_loss: {:.4f}'.format(
            total_loss, loss.policy_loss, value_loss, loss.entropy_loss
        )
    )
    print('approximate_kl_divergence: {:.4f}, clip_fraction: {:.4f}'.format(info.approx_kl, info.clipfrac))
    print('mappo_training_operator is ok')
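
For reference, a minimal self-contained sketch of the GAE recursion that the imported gae helper is expected to perform; this is an illustrative assumption, and the real helper may use a different signature and data layout.

def gae_reference(value, next_value, reward, done, gamma=0.99, lambda_=0.95):
    # All inputs are 1-D tensors over time steps: V(s_t), V(s_{t+1}), r_t and done_t.
    # Temporal-difference residual: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t).
    delta = reward + gamma * next_value * (1 - done) - value
    adv = torch.zeros_like(reward)
    running = torch.zeros_like(reward[0])
    # Accumulate the discounted sum of residuals backwards in time,
    # resetting the accumulator at episode boundaries (done_t = 1).
    for t in reversed(range(reward.shape[0])):
        running = delta[t] + gamma * lambda_ * (1 - done[t]) * running
        adv[t] = running
    return adv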

If you have any questions or suggestions about this document, you can raise an issue on GitHub or email us directly (opendilab@pjlab.org.cn).

\ No newline at end of file
diff --git a/value_rescale.html b/value_rescale.html
new file mode 100644
index 0000000..fb08903
--- /dev/null
+++ b/value_rescale.html
@@ -0,0 +1,47 @@
Annotated Algorithm Visualization


Typically, we need to apply normalization functions in RL training to reduce the scale of some neural network predictions (e.g. the value function) and thus improve the RL training process.
In this document, we will demonstrate two kinds of data normalization methods and their corresponding inverse operations.
- The first one is value_transform , which can reduce the scale of the action-value function. Its corresponding inverse operation is value_inv_transform . Related Link
- The second one is symlog , which is another approach to normalize the input tensor. Its corresponding inverse operation is inv_symlog . Related Link

Overview
A function to reduce the scale of the action-value function. For further reading, please refer to: Achieving Consistent Performance on Atari Related Link
Given the input tensor x , this function will return the normalized tensor.
The argument eps is a hyper-parameter that controls the additive regularization term, which ensures the corresponding inverse operation is Lipschitz continuous.

import torch


def value_transform(x: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:

Core implementation.
The formula of the normalization is: $$h(x) = \operatorname{sign}(x)(\sqrt{|x|+1} - 1) + \epsilon x$$

    return torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1) + eps * x

Overview
The inverse form of value transform. Given the input tensor x , this function will return the unnormalized tensor.

def value_inv_transform(x: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:

The formula of the unnormalization is: $$h^{-1}(x) = \operatorname{sign}(x)\left(\left(\frac{\sqrt{1+4\epsilon(|x|+1+\epsilon)}-1}{2\epsilon}\right)^{2}-1\right)$$

    return torch.sign(x) * (((torch.sqrt(1 + 4 * eps * (torch.abs(x) + 1 + eps)) - 1) / (2 * eps)) ** 2 - 1)
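
As a usage sketch (the reward, discount and target_q tensors below are illustrative assumptions), this pair is typically applied to Bellman targets in the transformed Bellman operator style of the paper above: the target is built as h(r + gamma * h^{-1}(q)).

reward = torch.randn(8)
target_q = torch.randn(8)
discount = 0.99
# TD target expressed in the transformed (reduced-scale) space.
transformed_target = value_transform(reward + discount * value_inv_transform(target_q))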

Overview
A function to normalize the targets. For further reading, please refer to: Mastering Diverse Domains through World Models Related Link
Given the input tensor x , this function will return the normalized tensor.

def symlog(x: torch.Tensor) -> torch.Tensor:

The formula of the normalization is: $$\operatorname{symlog}(x) = \operatorname{sign}(x)\ln(|x|+1)$$

    return torch.sign(x) * (torch.log(torch.abs(x) + 1))

Overview
The inverse form of symlog. Given the input tensor x , this function will return the unnormalized tensor.

def inv_symlog(x: torch.Tensor) -> torch.Tensor:

The formula of the unnormalization is: $$\operatorname{symexp}(x) = \operatorname{sign}(x)(\exp(|x|)-1)$$

    return torch.sign(x) * (torch.exp(torch.abs(x)) - 1)
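
A related usage sketch, assuming illustrative target and prediction tensors: following the world-model paper linked above, networks can be trained against symlog-compressed targets and their predictions decoded with inv_symlog at read-out time.

raw_target = torch.randn(8) * 100.
prediction = torch.zeros(8, requires_grad=True)
# Train in the compressed space, decode back to the original scale for evaluation.
symlog_loss = torch.nn.functional.mse_loss(prediction, symlog(raw_target))
decoded_prediction = inv_symlog(prediction.detach())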

Overview
Generate fake data and test the value_transform and value_inv_transform functions.

def test_value_transform():

Generate fake data.

    test_x = torch.randn(10)

Normalize the generated data.

    normalized_x = value_transform(test_x)
    assert normalized_x.shape == (10,)

Unnormalize the data.

    unnormalized_x = value_inv_transform(normalized_x)

Test whether the data before and after the transformation is the same.

    assert torch.sum(torch.abs(test_x - unnormalized_x)) < 1e-3

Overview
Generate fake data and test the symlog and inv_symlog functions.

def test_symlog():

Generate fake data.

    test_x = torch.randn(10)

Normalize the generated data.

    normalized_x = symlog(test_x)
    assert normalized_x.shape == (10,)

Unnormalize the data.

    unnormalized_x = inv_symlog(normalized_x)

Test whether the data before and after the transformation is the same.

    assert torch.sum(torch.abs(test_x - unnormalized_x)) < 1e-3

If you have any questions or advice about this documentation, you can raise issues in GitHub (https://github.com/opendilab/PPOxFamily) or email us (opendilab@pjlab.org.cn).

\ No newline at end of file