openai · HunderlineK · Mar 11, 2022
diff --git a/docs/spinningup/rl_intro3.rst b/docs/spinningup/rl_intro3.rst
@@ -91,7 +91,7 @@ This is an expectation, which means that we can estimate it with a sample mean.
 
 .. math::
 
- \hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau),
+ \hat{g} = \frac{1}{|\mathcal{D}| \, T} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau),
 
 where :math:`|\mathcal{D}|` is the number of trajectories in :math:`\mathcal{D}` (here, :math:`N`).
 
@@ -474,4 +474,4 @@ In this chapter, we described the basic theory of policy gradient methods and co
 .. _`advantage of an action`: ../spinningup/rl_intro.html#advantage-functions
 .. _`this page`: ../spinningup/extra_pg_proof2.html
 .. _`Generalized Advantage Estimation`: https://arxiv.org/abs/1506.02438
-.. _`Vanilla Policy Gradient`: ../algorithms/vpg.html
+.. _`Vanilla Policy Gradient`: ../algorithms/vpg.html
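
To see what the changed denominator in the first hunk does numerically, here is a minimal NumPy sketch that computes the sample-mean estimate both ways: divided by the number of trajectories, and divided by the total number of timesteps. The tabular softmax policy, the helper names (``grad_log_pi``, ``estimate_gradient``), the toy trajectories, and the assumption that every trajectory has the same number of steps are all illustrative choices, not part of the Spinning Up code.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) w.r.t. theta for a tabular softmax policy.

    Assumption: theta has shape (n_states, n_actions) and
    pi_theta(.|s) = softmax(theta[s]).
    """
    grad = np.zeros_like(theta)
    grad[s] = -softmax(theta[s])  # -pi(.|s) for every action in state s
    grad[s, a] += 1.0             # +1 for the action actually taken
    return grad


def estimate_gradient(theta, trajectories, divide_by_steps=False):
    """Sample estimate of the policy gradient using the full return R(tau).

    divide_by_steps=False: divide by the number of trajectories.
    divide_by_steps=True:  divide by the total number of timesteps.
    """
    total = np.zeros_like(theta)
    n_steps = 0
    for states, actions, rewards in trajectories:
        ret = float(np.sum(rewards))  # R(tau), undiscounted return
        for s, a in zip(states, actions):
            total += grad_log_pi(theta, s, a) * ret
            n_steps += 1
    denom = n_steps if divide_by_steps else len(trajectories)
    return total / denom


# Toy data (hypothetical): 2 states, 2 actions, 2 trajectories of 3 steps each.
theta = np.zeros((2, 2))
trajectories = [
    ([0, 1, 0], [1, 0, 1], [1.0, 0.0, 1.0]),
    ([1, 1, 0], [0, 1, 0], [0.0, 0.0, 1.0]),
]

g_traj = estimate_gradient(theta, trajectories)
g_step = estimate_gradient(theta, trajectories, divide_by_steps=True)
print(g_traj)
print(g_step)
```

With equal-length trajectories, the two estimates differ only by a constant factor (the number of steps per trajectory), so they point in the same direction and the choice of denominator acts like a rescaling of the step size. Note that the sum in the formula runs from ``t=0`` to ``T``, which is ``T+1`` terms; the sketch simply counts the steps actually present in each trajectory.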