From d2d1496d582fc14dbd277cc9da1b23a1781b78d5 Mon Sep 17 00:00:00 2001
From: H
Date: Thu, 10 Mar 2022 21:48:38 -0800
Subject: [PATCH] Sampling of VPG should be over D*T

---
 docs/spinningup/rl_intro3.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/spinningup/rl_intro3.rst b/docs/spinningup/rl_intro3.rst
index 34e4d5d57..43f3cad36 100644
--- a/docs/spinningup/rl_intro3.rst
+++ b/docs/spinningup/rl_intro3.rst
@@ -91,7 +91,7 @@ This is an expectation, which means that we can estimate it with a sample mean.
 
 .. math::
 
-    \hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau),
+    \hat{g} = \frac{1}{|\mathcal{D}| T} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau),
 
 where :math:`|\mathcal{D}|` is the number of trajectories in :math:`\mathcal{D}` (here, :math:`N`).
 
@@ -474,4 +474,4 @@ In this chapter, we described the basic theory of policy gradient methods and co
 .. _`advantage of an action`: ../spinningup/rl_intro.html#advantage-functions
 .. _`this page`: ../spinningup/extra_pg_proof2.html
 .. _`Generalized Advantage Estimation`: https://arxiv.org/abs/1506.02438
-.. _`Vanilla Policy Gradient`: ../algorithms/vpg.html
\ No newline at end of file
+.. _`Vanilla Policy Gradient`: ../algorithms/vpg.html
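
Not part of the patch itself: below is a minimal, hypothetical NumPy sketch of the two normalizations the subject line refers to, averaging the per-timestep policy gradient terms over trajectories only (divide by |D|, as the docs currently state) versus over every timestep in the batch (divide by |D|*T, as proposed here). All names (grad_log_probs, returns, num_traj, horizon) are illustrative stand-ins, not Spinning Up code.

import numpy as np

# Toy stand-ins: grad_log_probs[i, t] plays the role of
# grad_theta log pi_theta(a_t | s_t) along trajectory i (a scalar per
# step for simplicity), and returns[i] plays the role of R(tau_i).
num_traj, horizon = 4, 5                 # |D| trajectories, T steps each
rng = np.random.default_rng(0)
grad_log_probs = rng.normal(size=(num_traj, horizon))
returns = rng.normal(size=(num_traj, 1))

per_step_terms = grad_log_probs * returns            # one term per (tau, t)

# Normalization in the current docs: average over trajectories only.
g_hat_per_traj = per_step_terms.sum(axis=1).mean()   # divide by |D|

# Normalization this patch proposes: average over all |D|*T samples,
# i.e. a plain .mean() over a flat batch of timesteps.
g_hat_per_step = per_step_terms.mean()               # divide by |D|*T

print(g_hat_per_traj, g_hat_per_step)

With a fixed horizon the two estimates differ only by the constant factor T, so the choice mainly changes how the effective learning rate scales with episode length.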