<!DOCTYPE html>
<html>
<head>
<title>The SR - YSA</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta charset="utf-8">
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Nunito">
<link rel="stylesheet" href="styles.css"> <!-- Adjust path if using folders -->
<!-- MathJax Configuration -->
<script>
window.MathJax = {
tex: {
inlineMath: [['$', '$'], ['\\(', '\\)']],
displayMath: [['$$', '$$'], ['\\[', '\\]']]
},
svg: {
fontCache: 'global'
}
};
</script>
<!-- Load MathJax -->
<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js" async></script>
</head>
<body>
<div class="header">
<a href="index.html" class="logo">About me</a>
<div class="header-right">
<a href="index.html">Home</a>
<a href="https://github.com/YSanchezAraujo">Github</a>
<a href="writing.html">Writing</a>
<a href="https://twitter.com/ysanchezaraujo">Twitter</a>
</div>
</div>
<div class="unfloated-container">
<h1>Random things on the Successor Representation (SR)</h1>
<p>
Within reinforcement learning, <a href="https://www.mpg.de/12309370/biological-cybernetics-dayan">Peter Dayan</a> introduced the <a href="http://www.gatsby.ucl.ac.uk/~dayan/papers/d93b.pdf">successor representation</a>, or the SR for short, in 1993. Lots of articles and summaries have been written about the SR, so my goal here
isn't exactly to introduce it. Instead, I'll lay out some details that I've had to write out for myself to get a better understanding of the thing.
</p>
<p>
First let's lay out a reference point from which we can make sense of the SR. This reference point is the Bellman equation
under a particular policy $\pi$, taken directly from Sutton & Barto chapter 3, equation 3.14
\[
v_{\pi}(s) = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a)\Big[r + \gamma v_{\pi}(s') \Big] \; \; \; \; (1)
\]
</p>
<p>
we'll re-express this equation to make future comparisons a bit clearer:
\[
v_{\pi}(s) = \sum_a \pi(a | s) \Big[r(s, a) + \gamma \sum_{s'}p(s' | s, a) v_{\pi}(s') \Big] \; \; \; \; (2)
\]
the term $r(s, a) = \mathbb{E}\big[R_{t+1} | S_{t} = s, A_{t} = a \big] = \sum_r r \sum_{s'} p(s', r | s, a)$, so if you distribute the sum over $s'$ and $r$
in equation (1) across the bracket, the first piece collapses to exactly $r(s, a)$. What I want to do now is put this into a form where
we are primarily dealing with vectors and matrices. To do that I'll define two new expressions
\[
r_{\pi}(s) = \sum_a \pi(a | s)r(s, a) \; \; \; \; (3)
\]
\[
P_{\pi}(s, s') = \sum_a \pi(a | s)p(s' | s, a) \; \; \; \; (4)
\]
with these, we can write down a form of the Bellman equation under a particular policy $\pi$ that obeys the linear properties of matrix algebra:
\[
\boldsymbol{v}_{\pi} = \boldsymbol{r}_{\pi} + \gamma P_{\pi} \boldsymbol{v}_{\pi} \; \; \; \; (5)
\]
and just so that we are clear, the two vectors are $\boldsymbol{v}_{\pi} = \big[ v_{\pi}(s_1) \; \; v_{\pi}(s_2) \; \; \cdots \; \; v_{\pi}(s_{|S|})\big]^{\top} $ and
</p>
\[
\boldsymbol{r}_{\pi} = \begin{bmatrix}
\sum_a \pi(a | s_1) r(s_1, a) \\
\sum_a \pi(a | s_2) r(s_2, a) \\
\vdots \\
\sum_a \pi(a | s_{|S|}) r(s_{|S|}, a)
\end{bmatrix}
\]
and the matrix $P_{\pi}$ is:
\[
P_{\pi} = \begin{bmatrix}
\sum\limits_a \pi(a \mid s_1) p(s_1 \mid s_1, a) & \sum\limits_a \pi(a \mid s_1) p(s_2 \mid s_1, a) & \cdots & \sum\limits_a \pi(a \mid s_1) p(s_{|S|} \mid s_1, a) \\
\sum\limits_a \pi(a \mid s_2) p(s_1 \mid s_2, a) & \sum\limits_a \pi(a \mid s_2) p(s_2 \mid s_2, a) & \cdots & \sum\limits_a \pi(a \mid s_2) p(s_{|S|} \mid s_2, a) \\
\vdots & \vdots & \ddots & \vdots \\
\sum\limits_a \pi(a \mid s_{|S|}) p(s_1 \mid s_{|S|}, a) & \sum\limits_a \pi(a \mid s_{|S|}) p(s_2 \mid s_{|S|}, a) & \cdots & \sum\limits_a \pi(a \mid s_{|S|}) p(s_{|S|} \mid s_{|S|}, a)
\end{bmatrix}
\]
hopefully that spells out each element of equation (5).
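To make equations (3) and (4) concrete, here is a minimal NumPy sketch; the array shapes and the toy numbers are my own invention, just to have something runnable:
<pre><code>
import numpy as np

# Toy MDP: N states, A actions; every number here is made up.
N, A = 3, 2
rng = np.random.default_rng(0)

# pi[s, a] = pi(a | s): each row is a distribution over actions.
pi = rng.random((N, A))
pi /= pi.sum(axis=1, keepdims=True)

# P[a, s, s2] = p(s2 | s, a): each P[a, s] is a distribution over next states.
P = rng.random((A, N, N))
P /= P.sum(axis=2, keepdims=True)

# r_sa[s, a] = r(s, a), the expected immediate reward.
r_sa = rng.random((N, A))

# Equation (4): P_pi[s, s2] = sum_a pi(a | s) p(s2 | s, a)
P_pi = np.einsum('sa,asn->sn', pi, P)

# Equation (3): r_pi[s] = sum_a pi(a | s) r(s, a)
r_pi = np.einsum('sa,sa->s', pi, r_sa)

assert np.allclose(P_pi.sum(axis=1), 1.0)  # rows of P_pi are distributions
</code></pre>
Now if $\gamma$ is set to some known number in $[0, 1)$ then: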
\[
\boldsymbol{v}_{\pi} = \boldsymbol{r}_{\pi} + \gamma P_{\pi} \boldsymbol{v}_{\pi}
\]
\[
\implies \big(I - \gamma P_{\pi} \big) \boldsymbol{v}_{\pi} = \boldsymbol{r}_{\pi}
\]
\[
\implies \boldsymbol{v}_{\pi} = \big(I - \gamma P_{\pi} \big)^{-1}\boldsymbol{r}_{\pi} \; \; \; \; (6)
\]
the inverted term in equation (6) happens to be the successor representation, which I will denote as $M_{\pi} = \big(I - \gamma P_{\pi} \big)^{-1}$.
What we've written above is typically not how the SR is introduced. Instead you usually see some variant of:
\[
M_{\pi} = \sum_{t=0}^{\infty} (\gamma P_{\pi})^t = I + \gamma P_{\pi} + \gamma^2 P_{\pi}^2 + \cdots \; \; \; \; (7)
\]
which is equal to the inverse in (6) by the Neumann series identity, the matrix analogue of the geometric series $\sum_{t=0}^{\infty} x^t = 1/(1 - x)$;
it applies here because the rows of $P_{\pi}$ sum to one, so the powers of $\gamma P_{\pi}$ shrink to zero whenever $\gamma$ is below one.
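As a quick numerical sanity check of that identity, here is a sketch that reuses the toy $P_{\pi}$ from the earlier snippet, truncates the series in (7), and compares it against the exact inverse:
<pre><code>
# Truncate the series in (7) at 200 terms and compare to the inverse.
gamma = 0.9
M_exact = np.linalg.inv(np.eye(N) - gamma * P_pi)

M_series = np.zeros((N, N))
term = np.eye(N)                # (gamma * P_pi)^0
for t in range(200):
    M_series += term
    term = gamma * P_pi @ term  # next power of gamma * P_pi

print(np.abs(M_series - M_exact).max())  # tiny; truncation error decays like gamma^200
</code></pre>
Another way that the SR is defined is through the use of indicator functions: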
\[
M_{\pi}(s, s') = \mathbb{E}_{\pi} \Bigg[\sum_{t=0}^{\infty} \gamma^t \mathbb{I}\big[ S_{j + t} = s'\big] | S_j = s \Bigg] \; \; \; \; (8)
\]
here, I'm willing to bet that I'm just too simple-minded to see right away how this connects to the SR we arrived at in equation (6),
so let's try to derive the SR from an equation that easily gets us to equation (6). I'll write out another definition from Sutton & Barto
chapter 3, equation 3.12
\[
v_{\pi}(s) = \mathbb{E}_{\pi} \Bigg[\sum_{t=0}^{\infty} \gamma^t R_{j + t + 1} | S_j = s \Bigg] \; \; \; \; (9)
\]
for the following step, recall equation (3) and apply it to a future state $s'$: $r_{\pi}(s') = \sum_{a'}\pi(a' | s') r(s', a')$.
By the tower property of conditional expectation, each reward $R_{j + t + 1}$ in (9) can be replaced by its expected value given the state $S_{j+t}$, which is exactly $r_{\pi}(S_{j+t})$:
\[
v_{\pi}(s) = \mathbb{E}_{\pi}\Bigg[\sum_{t=0}^{\infty} \gamma^t r_{\pi}(S_{j+t}) | S_j = s \Bigg] \; \; \; \; (10)
\]
and now apply an indicator function:
\[
r_{\pi}(S_{j+t}) = \sum_{s'} r_{\pi}(s')\mathbb{I}\big[S_{j+t} = s'\big] \; \; \; \; (11)
\]
substitute this back into the value function:
\[
v_{\pi}(s) = \mathbb{E}_{\pi}\Bigg[\sum_{t=0}^{\infty} \gamma^t \sum_{s'} r_{\pi}(s')\mathbb{I}\big[S_{j+t} = s'\big] | S_j = s \Bigg] \; \; \; \; (12)
\]
as you can see, the quantity inside the outermost expectation now looks just about like the expression for the SR. We just need to
justify swapping the order of operations and taking $\sum_{s'} r_{\pi}(s')$ outside of the expectation. As far as I can tell, we have
to assume that $r_{\pi}(s')$ is a known, deterministic function of $s'$, and when that is the case, we can use linearity of
expectation to move it to the outside.
\[
v_{\pi}(s) = \sum_{s'} r_{\pi}(s') \mathbb{E}_{\pi}\Bigg[\sum_{t=0}^{\infty} \gamma^t \mathbb{I}\big[S_{j+t} = s'\big] | S_j = s \Bigg]
\]
\[
\implies v_{\pi}(s) = \sum_{s'} r_{\pi}(s') M_{\pi}(s, s') \; \; \; \; (13)
\]
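Equation (13) is also easy to check in code: compute $\boldsymbol{v}_{\pi}$ once through the SR and once by solving the linear system in (5) directly, still reusing the toy arrays from above:
<pre><code>
# Equation (13): the value function is the reward composed with the SR.
v_from_sr = M_exact @ r_pi

# Compare against solving (I - gamma * P_pi) v = r_pi directly.
v_direct = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)

print(np.abs(v_from_sr - v_direct).max())  # agreement up to floating point
</code></pre>
One practical consequence of this factorization: if the rewards change but the policy and dynamics stay fixed, $M_{\pi}$ does not need to be recomputed, and re-evaluating $\boldsymbol{v}_{\pi}$ is a single matrix-vector product.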
OK, enough of the math; we should get to conceptual matters. What was the point of going to the trouble of showing the SR as equation (6),
only to then show equation (7) and, in more detail, equation (8)? The first derivation gives us an extremely simple
way to compute and code the SR, the second expresses it as a discounted sum of powers of the transition matrix, and the third gives intuition about what it measures.
To the point of computing the SR, what would one need? We'd need a policy and its probabilities $\pi(a | s)$, and a transition matrix encoding each $P(s'|s, a)$;
with those we can compute $P_{\pi}(s, s') = \sum_{a \in A} \pi(a | s) P(s'|s, a)$ and then perform an (in some cases simple) matrix inversion
$M_{\pi} = \big(I - \gamma P_{\pi}\big)^{-1}$. Alternatively, if the matrix inversion is not feasible, we can use an iterative method:
<!-- Inserted Algorithm -->
<div class="algorithm">
<h2>Algorithm: Iterative Computation of the Successor Representation (SR)</h2>
<pre><code>
Input:
Number of states N
State transition matrix under policy π, P_pi ∈ ℝ^{N × N}
Discount factor γ ∈ [0, 1)
Convergence threshold ε > 0
Output:
Successor Representation matrix M_pi ∈ ℝ^{N × N}
Initialize:
M_pi^(0) ← I // Identity matrix of size N × N
k ← 0
delta ← ∞
While (delta > ε):
M_pi^(k+1) ← I + γ * P_pi * M_pi^(k)
delta ← || M_pi^(k+1) - M_pi^(k) ||_∞
k ← k + 1
M_pi ← M_pi^(k)
Return M_pi
</code></pre>
</div>
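The algorithm above translates nearly line for line into NumPy. Here is one possible sketch; the function name and the default tolerance are mine:
<pre><code>
def successor_representation(P_pi, gamma, eps=1e-10):
    """Iterate M = I + gamma * P_pi @ M to its fixed point, which is
    the SR matrix (I - gamma * P_pi)^{-1} for gamma in [0, 1)."""
    N = P_pi.shape[0]
    I = np.eye(N)
    M = I.copy()
    delta = np.inf
    while delta > eps:
        M_next = I + gamma * P_pi @ M
        delta = np.abs(M_next - M).max()  # largest change in any entry
        M = M_next
    return M

M_iter = successor_representation(P_pi, gamma)
print(np.abs(M_iter - M_exact).max())     # matches the direct inverse
</code></pre>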
<!-- Continue with content -->
Now for the last and perhaps most important point: what is the SR? In English, it's a matrix whose entry $M_{\pi}(s, s')$ measures
occupancy of state $s'$ when you start in state $s$ and follow $\pi$. There is a current-occupancy piece (is the current state $s$ equal to $s'$? If so, count a value of 1),
plus a discounted (via $\gamma$) expected future-occupancy piece: $\gamma \sum_a \pi(a|s) \sum_{s''}P(s''|s,a) M_{\pi}(s'', s')$. Writing the SR out
for a single timestep:
\[
M_{\pi}(s, s') = \mathbb{I}\big[s=s'\big] + \gamma \sum_{a \in A} \pi(a|s) \sum_{s'' \in S} P(s''|s, a)M_{\pi}(s'', s') \; \; \; \; (14)
\]
and the real gem in equation (14) is the recursion: because $M_{\pi}$ appears on both sides, the one-step bookkeeping above unrolls to encode
occupancy information for all future time steps.
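Finally, to tie the recursion back to the occupancy view in equation (8), here is a rough Monte Carlo sketch, entirely my own construction: simulate trajectories under $P_{\pi}$ from a start state, accumulate the discounted indicator counts, and compare against the corresponding row of the exact $M_{\pi}$:
<pre><code>
# Estimate a row of M_pi as a discounted expected occupancy, by simulation.
def mc_sr_row(P_pi, gamma, s, n_episodes=5000, horizon=150, seed=1):
    rng = np.random.default_rng(seed)
    N = P_pi.shape[0]
    row = np.zeros(N)
    for _ in range(n_episodes):
        state = s
        for t in range(horizon):
            row[state] += gamma ** t  # gamma^t * indicator[S_t = state]
            state = rng.choice(N, p=P_pi[state])
    return row / n_episodes

print(mc_sr_row(P_pi, gamma, s=0))  # noisy estimate of the row below
print(M_exact[0])
</code></pre>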
<p>
OK, I may or may not expand on this in the future, but for now, I hope this is somewhat helpful.
</p>
<h2>References</h2>
<ul>
<li>
Sutton, R. S., & Barto, A. G. (2018).
<em>Reinforcement Learning: An Introduction</em>.
MIT Press.
<a href="http://incompleteideas.net/book/the-book.html" target="_blank">http://incompleteideas.net/book/the-book.html</a>
</li>
<li>
Dayan, P. (1993).
Improving Generalization for Temporal Difference Learning: The Successor Representation.
<em>Neural Computation</em>, 5(4), 613-624.
</li>
<li>
Gershman, S. J. (2018).
The Successor Representation: Its Computational Logic and Neural Substrates.
<em>Journal of Neuroscience</em>, 38(33), 7193-7200.
</li>
</ul>
</div>
</body>
</html>