
Commit

Built site for gh-pages
alexchen4ai committed Mar 11, 2024
1 parent e10a95a commit 36758ef
Showing 9 changed files with 126 additions and 6 deletions.
Binary file modified .DS_Store
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
-e28e16e1
+300494ba
Binary file added news/.DS_Store
4 changes: 2 additions & 2 deletions notes.html
@@ -250,7 +250,7 @@ <h3 class="no-anchor listing-title">
</a>
</div>
</div>
<div class="quarto-post image-right" data-index="1" data-categories="Math Theories" data-listing-date-sort="1708848000000" data-listing-file-modified-sort="1708933374067" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="1" data-listing-word-count-sort="99">
<div class="quarto-post image-right" data-index="1" data-categories="Math Theories" data-listing-date-sort="1708848000000" data-listing-file-modified-sort="1710136781487" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="4" data-listing-word-count-sort="669">
<div class="thumbnail">
<p><a href="./notes/Math Theories/complexanalysis.html" class="no-external"></a></p><a href="./notes/Math Theories/complexanalysis.html" class="no-external">
<div class="listing-item-img-placeholder card-img-top" >&nbsp;</div>
@@ -275,7 +275,7 @@ <h3 class="no-anchor listing-title">
<div class="metadata">
<p><a href="./notes/Math Theories/complexanalysis.html" class="no-external"></a></p><a href="./notes/Math Theories/complexanalysis.html" class="no-external">
<div class="listing-reading-time">
-1 min
+4 min
</div>
</a>
</div>
36 changes: 36 additions & 0 deletions notes.xml
@@ -378,6 +378,42 @@ Tip
<p>In this section, we just mention some critical formulas. In the case of complex numbers, we have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ae%5E%7Bi%20%5Ctheta%7D=%5Ccos%20%5Ctheta+i%20%5Csin%20%5Ctheta%0A"></p>
<p>The equation can be proved using the Taylor series of <img src="https://latex.codecogs.com/png.latex?e%5Ex">, <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Bcos%7D%20x"> and <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7Bsin%7D%20x">. This formula will be highlighted when we use complex analysis in ML.</p>
<p>Additionally, note that the imaginary unit can be viewed as a special dimension in which <img src="https://latex.codecogs.com/png.latex?i%5E2=-1">. This <img src="https://latex.codecogs.com/png.latex?i"> will be helpful for many special computations.</p>
</section>
<section id="consider-the-rotary-embedding-using-complex-analysis" class="level2">
<h2 class="anchored" data-anchor-id="consider-the-rotary-embedding-using-complex-analysis">Consider the rotary embedding using complex analysis</h2>
<p>The token positional embedding is used to capture the features of a token that arise from its position in the sequence. To put it simply, the token at position 0 contributes differently from the token at position 10.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>📝 <strong>Paper</strong>: <a href="https://arxiv.org/pdf/2104.09864.pdf">ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING</a></p>
</div>
</div>
<p>Present research indicates that we wish to add both an absolute embedding and a relative embedding to the token based on its position. The absolute embedding is determined solely by the token's own position, while the relative embedding is determined by the relative positions of tokens. Note that for the embedding computation, we need to compute the attention score between the tokens at positions <img src="https://latex.codecogs.com/png.latex?m"> and <img src="https://latex.codecogs.com/png.latex?n">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Cboldsymbol%7Bq%7D_m%20&amp;%20=f_q%5Cleft(%5Cboldsymbol%7Bx%7D_m,%20m%5Cright)%20%5C%5C%0A%5Cboldsymbol%7Bk%7D_n%20&amp;%20=f_k%5Cleft(%5Cboldsymbol%7Bx%7D_n,%20n%5Cright)%20%5C%5C%0A%5Cboldsymbol%7Bv%7D_n%20&amp;%20=f_v%5Cleft(%5Cboldsymbol%7Bx%7D_n,%20n%5Cright).%0A%5Cend%7Baligned%7D%0A"></p>
<p>When considering this problem, it helps to think of LLM inference (not training): the query corresponds to the new token being predicted, and the keys and values correspond to the existing tokens. We also need to encode the position information between them. The original transformer paper uses an absolute position embedding:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Af_%7Bt:%20t%20%5Cin%5C%7Bq,%20k,%20v%5C%7D%7D%5Cleft(%5Cboldsymbol%7Bx%7D_i,%20i%5Cright):=%5Cboldsymbol%7BW%7D_%7Bt:%20t%20%5Cin%5C%7Bq,%20k,%20v%5C%7D%7D%5Cleft(%5Cboldsymbol%7Bx%7D_i+%5Cboldsymbol%7Bp%7D_i%5Cright),%0A"></p>
<p>and</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Bcases%7D%5Cboldsymbol%7Bp%7D_%7Bi,%202%20t%7D%20&amp;%20=%5Csin%20%5Cleft(k%20/%2010000%5E%7B2%20t%20/%20d%7D%5Cright)%20%5C%5C%20%5Cboldsymbol%7Bp%7D_%7Bi,%202%20t+1%7D%20&amp;%20=%5Ccos%20%5Cleft(k%20/%2010000%5E%7B2%20t%20/%20d%7D%5Cright)%5Cend%7Bcases%7D%0A"></p>
<p>If we think about this structure further, we find that the <code>sin</code> and <code>cos</code> functions are periodic, which means that for the same relative distance we could observe similar embeddings.</p>
<p>Relative positional embedding instead notes that the relative position between tokens <img src="https://latex.codecogs.com/png.latex?m"> and <img src="https://latex.codecogs.com/png.latex?n"> is <img src="https://latex.codecogs.com/png.latex?m-n">, and makes the embedding depend on this difference. Note that attention uses <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7Bq%7D_m%5ET%5Cboldsymbol%7Bk%7D_n">, and this product should reflect the relative position information between the two tokens. <strong>Current research indicates that relative position embedding is important for capturing positional information</strong>. We wish the inner product to encode position information via:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5Clangle%20f_q%5Cleft(%5Cboldsymbol%7Bx%7D_m,%20m%5Cright),%20f_k%5Cleft(%5Cboldsymbol%7Bx%7D_n,%20n%5Cright)%5Cright%5Crangle=g%5Cleft(%5Cboldsymbol%7Bx%7D_m,%20%5Cboldsymbol%7Bx%7D_n,%20m-n%5Cright)%20.%0A"></p>
<p>The idea here is to express the relative position as an angle rather than as a position along a line segment, and complex analysis is the natural tool for this. It is similar to signal processing, where each signal has a frequency and a magnitude. Suppose <img src="https://latex.codecogs.com/png.latex?d=2">; we can then write the embeddings as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0Af_q%5Cleft(%5Cboldsymbol%7Bx%7D_q,%20m%5Cright)%20&amp;%20=R_q%5Cleft(%5Cboldsymbol%7Bx%7D_q,%20m%5Cright)%20e%5E%7Bi%20%5CTheta_q%5Cleft(%5Cboldsymbol%7Bx%7D_q,%20m%5Cright)%7D%20%5C%5C%0Af_k%5Cleft(%5Cboldsymbol%7Bx%7D_k,%20n%5Cright)%20&amp;%20=R_k%5Cleft(%5Cboldsymbol%7Bx%7D_k,%20n%5Cright)%20e%5E%7Bi%20%5CTheta_k%5Cleft(%5Cboldsymbol%7Bx%7D_k,%20n%5Cright)%7D%20%5C%5C%0Ag%5Cleft(%5Cboldsymbol%7Bx%7D_q,%20%5Cboldsymbol%7Bx%7D_k,%20n-m%5Cright)%20&amp;%20=R_g%5Cleft(%5Cboldsymbol%7Bx%7D_q,%20%5Cboldsymbol%7Bx%7D_k,%20n-m%5Cright)%20e%5E%7Bi%20%5CTheta_g%5Cleft(%5Cboldsymbol%7Bx%7D_q,%20%5Cboldsymbol%7Bx%7D_k,%20n-m%5Cright)%7D%0A%5Cend%7Baligned%7D%0A"></p>
<p>Substituting these into the inner-product condition above and matching magnitudes and phases, we have</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0AR_q%5Cleft(%5Cboldsymbol%7Bx%7D_q,%20m%5Cright)%20R_k%5Cleft(%5Cboldsymbol%7Bx%7D_k,%20n%5Cright)%20&amp;%20=R_g%5Cleft(%5Cboldsymbol%7Bx%7D_q,%20%5Cboldsymbol%7Bx%7D_k,%20n-m%5Cright),%20%5C%5C%0A%5CTheta_k%5Cleft(%5Cboldsymbol%7Bx%7D_k,%20n%5Cright)-%5CTheta_q%5Cleft(%5Cboldsymbol%7Bx%7D_q,%20m%5Cright)%20&amp;%20=%5CTheta_g%5Cleft(%5Cboldsymbol%7Bx%7D_q,%20%5Cboldsymbol%7Bx%7D_k,%20n-m%5Cright),%0A%5Cend%7Baligned%7D%0A"></p>
<p>After some derivation, we find that choosing the following expressions satisfies the conditions above:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0Af_q%5Cleft(%5Cboldsymbol%7Bx%7D_m,%20m%5Cright)%20&amp;%20=%5Cleft(%5Cboldsymbol%7BW%7D_q%20%5Cboldsymbol%7Bx%7D_m%5Cright)%20e%5E%7Bi%20m%20%5Ctheta%7D%20%5C%5C%0Af_k%5Cleft(%5Cboldsymbol%7Bx%7D_n,%20n%5Cright)%20&amp;%20=%5Cleft(%5Cboldsymbol%7BW%7D_k%20%5Cboldsymbol%7Bx%7D_n%5Cright)%20e%5E%7Bi%20n%20%5Ctheta%7D%20%5C%5C%0Ag%5Cleft(%5Cboldsymbol%7Bx%7D_m,%20%5Cboldsymbol%7Bx%7D_n,%20m-n%5Cright)%20&amp;%20=%5Cleft(%5Cboldsymbol%7BW%7D_q%20%5Cboldsymbol%7Bx%7D_m%5Cright)%5Cleft(%5Cboldsymbol%7BW%7D_k%20%5Cboldsymbol%7Bx%7D_n%5Cright)%5ET%20e%5E%7Bi(m-n)%20%5Ctheta%7D%0A%5Cend%7Baligned%7D%0A"></p>
<p>The derivation is as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Baligned%7D%0A%5Clangle%20f_q,%20f_k%5Crangle%20&amp;=%20f_q%5E*%20f_k%20%5C%5C%0A%20%20%20%20%20%20&amp;=%20%5Cleft(%5Cboldsymbol%7BW%7D_q%20%5Cboldsymbol%7Bx%7D_m%5Cright)%5E*%20e%5E%7B-i%20m%20%5Ctheta%7D%20%5Cleft(%5Cboldsymbol%7BW%7D_k%20%5Cboldsymbol%7Bx%7D_n%5Cright)%20e%5E%7Bi%20n%20%5Ctheta%7D%20%5C%5C%0A%20%20%20%20%20%20&amp;=%20%5Cleft(%5Cboldsymbol%7BW%7D_q%20%5Cboldsymbol%7Bx%7D_m%5Cright)%5E*%20%5Cleft(%5Cboldsymbol%7BW%7D_k%20%5Cboldsymbol%7Bx%7D_n%5Cright)%20e%5E%7Bi(n-m)%20%5Ctheta%7D%20%5C%5C%0A%20%20%20%20%20%20&amp;=%20%5Cleft(%5Cboldsymbol%7BW%7D_q%20%5Cboldsymbol%7Bx%7D_m%5Cright)%5ET%20%5Cleft(%5Cboldsymbol%7BW%7D_k%20%5Cboldsymbol%7Bx%7D_n%5Cright)%20e%5E%7Bi(n-m)%20%5Ctheta%7D.%0A%5Cend%7Baligned%7D%0A"></p>
<p>From the expression <img src="https://latex.codecogs.com/png.latex?%5Cleft(%5Cboldsymbol%7BW%7D_q%20%5Cboldsymbol%7Bx%7D_m%5Cright)%5ET%20e%5E%7B-i%20m%20%5Ctheta%7D">, we can design the rotary embedding setup used in Llama 2: each query and key vector is rotated by an angle proportional to its position.</p>
</section>
Binary file modified notes/.DS_Store
72 changes: 72 additions & 0 deletions notes/Math Theories/complexanalysis.html
@@ -301,6 +301,78 @@ <h2 class="anchored" data-anchor-id="basics-or-formulas-required.">Basics or for
e^{i \theta}=\cos \theta+i \sin \theta
\]</span></p>
<p>The equation can be proved using the Taylor series of <span class="math inline">\(e^x\)</span>, <span class="math inline">\(\operatorname{cos} x\)</span> and <span class="math inline">\(\operatorname{sin} x\)</span>. This formula will be highlighted when we use complex analysis in ML.</p>
<p>Additionally, note that the imaginary unit can be viewed as a special dimension in which <span class="math inline">\(i^2=-1\)</span>. This <span class="math inline">\(i\)</span> will be helpful for many special computations.</p>
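<p>As a quick numerical sanity check (a minimal sketch, not part of the original note), we can compare a truncated Taylor series of <span class="math inline">\(e^{i\theta}\)</span> against <span class="math inline">\(\cos \theta + i \sin \theta\)</span>:</p>
<pre><code class="language-python">import cmath
import math

def exp_taylor(z, terms=30):
    # Truncated Taylor series of e^z: sum of z**k / k! for k = 0 .. terms-1
    return sum(z**k / math.factorial(k) for k in range(terms))

theta = 0.7
lhs = exp_taylor(1j * theta)
rhs = cmath.cos(theta) + 1j * cmath.sin(theta)
print(abs(lhs - rhs))  # ~1e-16: both sides agree to machine precision
</code></pre>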
</section>
<section id="consider-the-rotary-embedding-using-complex-analysis" class="level2">
<h2 class="anchored" data-anchor-id="consider-the-rotary-embedding-using-complex-analysis">Consider the rotary embedding using complex analysis</h2>
<p>The token positional embedding is used to capture the features of a token that arise from its position in the sequence. To put it simply, the token at position 0 contributes differently from the token at position 10.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>📝 <strong>Paper</strong>: <a href="https://arxiv.org/pdf/2104.09864.pdf">ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING</a></p>
</div>
</div>
<p>Present research indicates that we wish to add both an absolute embedding and a relative embedding to the token based on its position. The absolute embedding is determined solely by the token's own position, while the relative embedding is determined by the relative positions of tokens. Note that for the embedding computation, we need to compute the attention score between the tokens at positions <span class="math inline">\(m\)</span> and <span class="math inline">\(n\)</span>:</p>
<p><span class="math display">\[
\begin{aligned}
\boldsymbol{q}_m &amp; =f_q\left(\boldsymbol{x}_m, m\right) \\
\boldsymbol{k}_n &amp; =f_k\left(\boldsymbol{x}_n, n\right) \\
\boldsymbol{v}_n &amp; =f_v\left(\boldsymbol{x}_n, n\right).
\end{aligned}
\]</span></p>
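<p>As a baseline (a minimal sketch with illustrative shapes, not from the paper), note that if <span class="math inline">\(f_q\)</span> and <span class="math inline">\(f_k\)</span> ignore the positions, the attention score between positions <span class="math inline">\(m\)</span> and <span class="math inline">\(n\)</span> is the same no matter where the two tokens sit; the rest of this section is about making these maps position-dependent:</p>
<pre><code class="language-python">import numpy as np

d = 4
rng = np.random.default_rng(0)
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d))
x_m = rng.standard_normal(d)  # embedding of the query token at position m
x_n = rng.standard_normal(d)  # embedding of a key/value token at position n

# Position-agnostic projections: m and n do not appear anywhere below.
q_m = W_q @ x_m
k_n = W_k @ x_n
v_n = W_v @ x_n
score = q_m @ k_n / np.sqrt(d)  # identical for any m, n as long as the tokens are the same
print(score)
</code></pre>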
<p>When considering this problem, it helps to think of LLM inference (not training): the query corresponds to the new token being predicted, and the keys and values correspond to the existing tokens. We also need to encode the position information between them. The original transformer paper uses an absolute position embedding:</p>
<p><span class="math display">\[
f_{t: t \in\{q, k, v\}}\left(\boldsymbol{x}_i, i\right):=\boldsymbol{W}_{t: t \in\{q, k, v\}}\left(\boldsymbol{x}_i+\boldsymbol{p}_i\right),
\]</span></p>
<p>and</p>
<p><span class="math display">\[
\begin{cases}\boldsymbol{p}_{i, 2 t} &amp; =\sin \left(i / 10000^{2 t / d}\right) \\ \boldsymbol{p}_{i, 2 t+1} &amp; =\cos \left(i / 10000^{2 t / d}\right)\end{cases}
\]</span></p>
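<p>A minimal sketch of this sinusoidal table (assuming <code>pos</code> is the token position and <code>d</code> the embedding dimension; the function name is illustrative):</p>
<pre><code class="language-python">import numpy as np

def sinusoidal_embedding(pos, d):
    # p[2t]   = sin(pos / 10000**(2t/d))
    # p[2t+1] = cos(pos / 10000**(2t/d))
    t = np.arange(d // 2)
    angle = pos / (10000 ** (2 * t / d))
    p = np.empty(d)
    p[0::2] = np.sin(angle)
    p[1::2] = np.cos(angle)
    return p

print(sinusoidal_embedding(pos=0, d=8))   # the vector added to the token at position 0
print(sinusoidal_embedding(pos=10, d=8))  # a different vector for position 10
</code></pre>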
<p>If we think about this structure further, we find that the <code>sin</code> and <code>cos</code> functions are periodic, which means that for the same relative distance we could observe similar embeddings.</p>
<p>Relative positional embedding instead notes that the relative position between tokens <span class="math inline">\(m\)</span> and <span class="math inline">\(n\)</span> is <span class="math inline">\(m-n\)</span>, and makes the embedding depend on this difference. Note that attention uses <span class="math inline">\(\boldsymbol{q}_m^T\boldsymbol{k}_n\)</span>, and this product should reflect the relative position information between the two tokens. <strong>Current research indicates that relative position embedding is important for capturing positional information</strong>. We wish the inner product to encode position information via:</p>
<p><span class="math display">\[
\left\langle f_q\left(\boldsymbol{x}_m, m\right), f_k\left(\boldsymbol{x}_n, n\right)\right\rangle=g\left(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n\right) .
\]</span></p>
<p>The idea here is to express the relative position as an angle rather than as a position along a line segment, and complex analysis is the natural tool for this. It is similar to signal processing, where each signal has a frequency and a magnitude. Suppose <span class="math inline">\(d=2\)</span>; we can then write the embeddings as:</p>
<p><span class="math display">\[
\begin{aligned}
f_q\left(\boldsymbol{x}_q, m\right) &amp; =R_q\left(\boldsymbol{x}_q, m\right) e^{i \Theta_q\left(\boldsymbol{x}_q, m\right)} \\
f_k\left(\boldsymbol{x}_k, n\right) &amp; =R_k\left(\boldsymbol{x}_k, n\right) e^{i \Theta_k\left(\boldsymbol{x}_k, n\right)} \\
g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right) &amp; =R_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right) e^{i \Theta_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right)}
\end{aligned}
\]</span></p>
<p>Substituting these into the inner-product condition above and matching magnitudes and phases, we have</p>
<p><span class="math display">\[
\begin{aligned}
R_q\left(\boldsymbol{x}_q, m\right) R_k\left(\boldsymbol{x}_k, n\right) &amp; =R_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right), \\
\Theta_k\left(\boldsymbol{x}_k, n\right)-\Theta_q\left(\boldsymbol{x}_q, m\right) &amp; =\Theta_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right),
\end{aligned}
\]</span></p>
<p>After some derivation, we find that choosing the following expressions satisfies the conditions above:</p>
<p><span class="math display">\[
\begin{aligned}
f_q\left(\boldsymbol{x}_m, m\right) &amp; =\left(\boldsymbol{W}_q \boldsymbol{x}_m\right) e^{i m \theta} \\
f_k\left(\boldsymbol{x}_n, n\right) &amp; =\left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i n \theta} \\
g\left(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n\right) &amp; =\left(\boldsymbol{W}_q \boldsymbol{x}_m\right)\left(\boldsymbol{W}_k \boldsymbol{x}_n\right)^T e^{i(m-n) \theta}
\end{aligned}
\]</span></p>
<p>The derivation is as follows:</p>
<p><span class="math display">\[
\begin{aligned}
\langle f_q, f_k\rangle &amp;= f_q^* f_k \\
&amp;= \left(\boldsymbol{W}_q \boldsymbol{x}_m\right)^* e^{-i m \theta} \left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i n \theta} \\
&amp;= \left(\boldsymbol{W}_q \boldsymbol{x}_m\right)^* \left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i(n-m) \theta} \\
&amp;= \left(\boldsymbol{W}_q \boldsymbol{x}_m\right)^T \left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i(n-m) \theta}.
\end{aligned}
\]</span></p>
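<p>A small numerical sketch of this result for <span class="math inline">\(d=2\)</span>, treating each projected vector as a single complex number (all names and constants below are illustrative):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
d, theta = 2, 0.5
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))
x_m = rng.standard_normal(d)
x_n = rng.standard_normal(d)

def as_complex(v):
    # Pack a 2-d real vector into one complex number.
    return v[0] + 1j * v[1]

def f(W, x, pos):
    # (W x), viewed as a complex number, rotated by exp(i * pos * theta)
    return as_complex(W @ x) * np.exp(1j * pos * theta)

def score(m, n):
    # Complex inner product f_q^* f_k from the derivation above.
    return np.conj(f(W_q, x_m, m)) * f(W_k, x_n, n)

# The score depends only on the offset n - m, not on m and n individually:
print(score(3, 7), score(10, 14), score(100, 104))
</code></pre>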
<p>From the expression <span class="math inline">\(\left(\boldsymbol{W}_q \boldsymbol{x}_m\right)^T e^{-i m \theta}\)</span>, we can design the rotary embedding setup used in Llama 2: each query and key vector is rotated by an angle proportional to its position.</p>
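<p>In practice, implementations such as Llama 2 apply this rotation in the real domain by pairing up consecutive coordinates of each query and key and rotating every pair by a position-dependent angle. A simplified sketch follows (the base 10000 and the interleaved pairing are one common convention; actual implementations may split the dimensions differently):</p>
<pre><code class="language-python">import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate consecutive pairs (x[2t], x[2t+1]) by the angle pos * theta_t,
    # where theta_t = base**(-2t/d), mirroring the complex rotation e^{i pos theta}.
    d = x.shape[-1]
    t = np.arange(d // 2)
    angle = pos * base ** (-2 * t / d)
    cos, sin = np.cos(angle), np.sin(angle)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

q = np.array([1.0, 0.0, 0.5, -0.5])
k = np.array([0.3, 0.7, -0.2, 0.1])
# The attention logit depends only on the relative offset of the positions:
print(rope(q, 3) @ rope(k, 7), rope(q, 20) @ rope(k, 24))  # equal up to float error
</code></pre>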


</section>
