
Chapter1 Discussion #3

Open
PaParaZz1 opened this issue Dec 22, 2022 · 2 comments

Labels
discussion Topic discussion

Comments

@PaParaZz1
Member

PaParaZz1 commented Dec 22, 2022

This issue will track and record questions and follow-up thoughts related to Lecture 1 of the course. Interested students are welcome to comment here; the course team will compile the information periodically.

Latest QA collection document (updated 2022.12.22)

@PaParaZz1 PaParaZz1 added the discussion Topic discussion label Dec 22, 2022
@PaParaZz1 PaParaZz1 pinned this issue Dec 22, 2022
@PaParaZz1 PaParaZz1 unpinned this issue Jan 13, 2023
@hccngu

hccngu commented May 16, 2023

from collections import namedtuple

import torch

# Assumed container definitions, inferred from how they are used below.
pg_data = namedtuple('pg_data', ['logit', 'action', 'return_'])
pg_loss = namedtuple('pg_loss', ['policy_loss', 'entropy_loss'])


def pg_error(data: namedtuple) -> namedtuple:
    """
    Overview:
        PyTorch implementation of the Policy Gradient (PG) algorithm.
    """
    # Unpack the data: $$<\pi(a|s), a, G_t>$$
    logit, action, return_ = data
    # Build the policy distribution from the logit, then take the log-probability of the chosen action.
    dist = torch.distributions.categorical.Categorical(logits=logit)
    log_prob = dist.log_prob(action)
    # Policy loss: $$- \frac 1 N \sum_{n=1}^{N} log(\pi(a^n|s^n)) G_t^n$$
    policy_loss = -(log_prob * return_).mean()
    # Entropy bonus loss: $$\frac 1 N \sum_{n=1}^{N} \sum_{a^n}\pi(a^n|s^n) log(\pi(a^n|s^n))$$
    # NOTE: the final loss is policy_loss - entropy_weight * entropy_loss.
    entropy_loss = dist.entropy().mean()
    # Return the individual loss terms: the policy loss and the entropy loss.
    return pg_loss(policy_loss, entropy_loss)


def test_pg():
    """
    Overview:
        Test function for the policy gradient algorithm, covering the forward and backward passes.
    """
    # Set the relevant parameters: batch size = 4, action space size = 32.
    B, N = 4, 32
    # Generate test data from random distributions: logit, action, return_.
    logit = torch.randn(B, N).requires_grad_(True)
    action = torch.randint(0, N, size=(B, ))
    return_ = torch.randn(B) * 2
    # Compute the PG error.
    data = pg_data(logit, action, return_)
    loss = pg_error(data)
    # Test whether the loss is differentiable and correctly produces gradients.
    assert all([l.shape == tuple() for l in loss])
    assert logit.grad is None
    total_loss = sum(loss)
    total_loss.backward()
    assert isinstance(logit.grad, torch.Tensor)

The above is the content of the chapter1_overview/pg_zh.py file. Since total_loss = sum(loss), should entropy_loss = dist.entropy().mean() actually be entropy_loss = -dist.entropy().mean() (i.e., is a minus sign missing)?
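
For context, summing the returned namedtuple in test_pg just adds its two scalar fields, which is the premise of the question; a short sketch under the definitions above:

# What sum(loss) computes in test_pg: a plain sum of the namedtuple fields,
# with no entropy weight and no sign flip applied.
total_loss = loss.policy_loss + loss.entropy_loss  # equivalent to sum(loss)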

@PaParaZz1
Member Author

(quoted the code snippet and question from the comment above)

This part only computes the individual loss terms; the minus sign is applied later, when the multiple losses are combined into the final loss. See the linked code for the exact snippet.
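
For illustration, a minimal sketch of that combination step, reusing pg_data / pg_error from above (entropy_weight is an assumed name and value for the entropy-bonus coefficient, not necessarily the exact code in the linked snippet):

# Sketch of combining the loss terms; the minus sign appears here, not inside pg_error.
entropy_weight = 0.001  # assumed coefficient for the entropy bonus
logit = torch.randn(4, 32).requires_grad_(True)
action = torch.randint(0, 32, size=(4, ))
return_ = torch.randn(4) * 2
loss = pg_error(pg_data(logit, action, return_))
total_loss = loss.policy_loss - entropy_weight * loss.entropy_loss
total_loss.backward()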
