$$
\begin{aligned}
q_\pi(s, a)
& = \sum_r p(r|s, a)r + \gamma \sum_{s'} p(s'|s, a) v_\pi(s')
\end{aligned}
$$
where $\sum_r p(r|s, a)r$ is the expected immediate reward and $\gamma \sum_{s'} p(s'|s, a) v_\pi(s')$ is the discounted expected value of the next state.
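The decomposition above can be checked numerically. The sketch below uses a made-up reward distribution, transition distribution, and next-state values (none of these numbers come from the text) to evaluate $q_\pi(s, a)$ for a single state-action pair:

```python
import numpy as np

# q_pi(s, a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v_pi(s').
# All numbers below are illustrative assumptions for one (s, a) pair.
gamma = 0.9

rewards = np.array([0.0, 1.0])  # possible reward values r
p_r = np.array([0.2, 0.8])      # p(r|s, a)

p_next = np.array([0.5, 0.5])   # p(s'|s, a) over two successor states
v_next = np.array([2.0, 4.0])   # v_pi(s') for each successor state

immediate = p_r @ rewards                # expected immediate reward = 0.8
discounted = gamma * (p_next @ v_next)   # discounted next-state value = 2.7
q_sa = immediate + discounted            # q_pi(s, a) = 3.5
print(q_sa)
```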
The Bellman equation in matrix-vector form,
$$
v_\pi = r_\pi + \gamma P_\pi v_\pi
$$
is usually solved iteratively rather than in closed form, where $[P_\pi]_{s, s'} \triangleq \sum_a \pi(a|s)\, p(s'|s, a)$.
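The iterative solution is the fixed-point update $v_{k+1} = r_\pi + \gamma P_\pi v_k$, which converges because the map is a $\gamma$-contraction. A minimal sketch, assuming a made-up 3-state chain with row-stochastic $P_\pi$, and comparing against the closed-form solution $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$:

```python
import numpy as np

# Iterative policy evaluation: v_{k+1} = r_pi + gamma * P_pi @ v_k.
# P_pi and r_pi are a made-up 3-state example (rows of P_pi sum to 1).
gamma = 0.9
P_pi = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
r_pi = np.array([1.0, 0.0, 2.0])

v = np.zeros(3)
for _ in range(1000):
    v_new = r_pi + gamma * P_pi @ v
    if np.max(np.abs(v_new - v)) < 1e-10:  # stop once the fixed point is reached
        break
    v = v_new

# Closed-form solution (I - gamma * P_pi)^{-1} r_pi for comparison;
# it exists because gamma < 1 makes (I - gamma * P_pi) invertible.
v_closed = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(np.allclose(v, v_closed, atol=1e-8))
```

The iteration is preferred in practice because the matrix inverse costs $O(|S|^3)$ and is infeasible for large state spaces.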
The Bellman optimality equation:
$$
\begin{aligned}
v(s)
&= \max_\pi \sum_a \pi(a|s) q(s, a) \\
&= \max_{a \in \mathbb{A}(s)} q(s, a)
\end{aligned}
$$
and optimality is achieved by the deterministic greedy policy
$$
\pi(a|s) =
\begin{cases}
1, & a = a^* \\
0, & a \neq a^* \\
\end{cases}
$$
It, too, is usually solved iteratively rather than in closed form, where $a^* = \arg\max_a q(s, a)$.
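The iterative solution is value iteration: repeatedly compute $q(s, a)$ from the current $v$, take $v(s) = \max_a q(s, a)$, and read off the greedy action $a^*$ at convergence. A sketch, assuming a made-up 2-state, 2-action MDP (the tensors `P` and `R` are illustrative, not from the text):

```python
import numpy as np

# Value iteration: v(s) = max_a q(s, a), with the greedy deterministic
# policy placing probability 1 on a* = argmax_a q(s, a).
gamma = 0.9
# P[a, s, s'] = p(s'|s, a); R[s, a] = expected immediate reward (made up).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

v = np.zeros(2)
for _ in range(1000):
    # q(s, a) = R[s, a] + gamma * sum_{s'} p(s'|s, a) v(s')
    q = R + gamma * np.einsum('asn,n->sa', P, v)
    v_new = q.max(axis=1)                    # v(s) = max_a q(s, a)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

a_star = q.argmax(axis=1)  # greedy action a* in each state
print(v, a_star)
```

At convergence `v` satisfies the Bellman optimality equation, and `a_star` defines the deterministic policy in the cases expression above.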