From 36758eff38f528f229413e78ceaa835e0bb5ddf7 Mon Sep 17 00:00:00 2001 From: AlexCHEN Date: Sun, 10 Mar 2024 22:59:59 -0700 Subject: [PATCH] Built site for gh-pages --- .DS_Store | Bin 6148 -> 6148 bytes .nojekyll | 2 +- news/.DS_Store | Bin 0 -> 6148 bytes notes.html | 4 +- notes.xml | 36 ++++++++++++ notes/.DS_Store | Bin 6148 -> 6148 bytes notes/Math Theories/complexanalysis.html | 72 +++++++++++++++++++++++ search.json | 16 ++++- sitemap.xml | 2 +- 9 files changed, 126 insertions(+), 6 deletions(-) create mode 100644 news/.DS_Store diff --git a/.DS_Store b/.DS_Store index 77b2cd2502f127971dd0430ff8d785f1930fa94c..5742cdfe3bd3de287aa7bf2e6462dda6a597776e 100644 GIT binary patch delta 202 zcmZoMXffEJ%2dx}|CfP*frTNDA(f$=p*T0+#U&{xKM5$t5!|6wa^~DoM^yO~yz&JZ zhQZ1CxdlKy3=B+QiWO*fK0^u6{NkK+Bw3cYvkVNAmoUXKEnt{j$1FBkikXjzcRHBq nz|6yD;3r=icYbmWv&7^qWUdY3esNAZk}OLA3j@RCB}{Ql6Bs7fF^f%>V&-EKoepL? nF!QjTJ-otU*SX0x%o3BckhyA%3pQ_L4r7^Ez_*#5<1aq|f;~AS diff --git a/.nojekyll b/.nojekyll index 21a12f5..31f15cb 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -e28e16e1 \ No newline at end of file +300494ba \ No newline at end of file diff --git a/news/.DS_Store b/news/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..168be32efa0defd4523aee09da7d992e6fa01737 GIT binary patch literal 6148 zcmeHKOHRWu5S^i2L_$bqgIM|q^aiF1d!$|fN>Q^&Nkz#%_u&*QSi+fj^8qCyi;4w8 zXhyQ%*q-OuPm1j!B3`uXInj)W3KT&WXT+p?bm+)~r$E*@8hW5xswvUa$V7j!OV)l$ z*K|*hw5R_1E3{hLP=&qh>&<+5*KBvX1wLyJV{MzVT-D7A(Z$K+?fK>X_3}{W`iIP2 z)nDm_E;y=CAQT7%LV-}ArvUD3vEtY;>QEpQ2nD_rkn&8&ngRgj8Lb9i zatYxi$L!b{Vg$xU1sat-#bBeOKY3hs>Qf@U&c(3v9~o)^FRByEdWRpomFaCk_n$;1Pg^oFnHt cXyZwI#AV0MP-c;Sr32$3pn^md3jBfsAF1L!fB*mh literal 0 HcmV?d00001 diff --git a/notes.html b/notes.html index 6556085..6fb0244 100644 --- a/notes.html +++ b/notes.html @@ -250,7 +250,7 @@

-
+

 
@@ -275,7 +275,7 @@

diff --git a/notes.xml b/notes.xml index 424d651..4821724 100644 --- a/notes.xml +++ b/notes.xml @@ -378,6 +378,42 @@ Tip

In this section, we just mention some critical formulas. For a complex number, we have
\[
e^{i \theta}=\cos \theta+i \sin \theta
\]

The equation can be proved using the Taylor series of \(e^x\), \(\cos x\) and \(\sin x\). This formula will be highlighted when we use complex analysis in ML.

+

Additionally, note that the imaginary unit \(i\) can be viewed as spanning an extra dimension, with \(i^2=-1\). This \(i\) will be helpful for many of the computations below.

+ +
+

Consider the rotary embedding using complex analysis

+

The token positional embedding captures the features a token has because of its position in the sequence. To put it simply, a token at position 0 contributes differently than the same token at position 10.

+
+
+
+ +
+
+Tip +
+
+ +
+

Current research indicates that we wish to add both an absolute embedding and a relative embedding to a token based on its position. The absolute embedding is determined only by the token's own position, while the relative embedding is determined by the token's position relative to other tokens. Note that for the attention computation, we need the attention score between the tokens at positions \(m\) and \(n\):

+

+

When you consider this problem, it helps to think of LLM inference rather than training: the query corresponds to the new token being generated, while the keys and values correspond to the existing tokens. We also need to account for the positional information between them. The original transformer paper uses an absolute position embedding:

+

+

and

+

+

If we think about this structure further, we find that sine and cosine are periodic functions, which means tokens at the same relative distance give rise to similar embedding patterns.

+

Another approach, relative positional embedding, notes that the relative position between tokens \(m\) and \(n\) is \(m-n\), so the embedding should depend on this difference. Note that attention uses \(\boldsymbol{q}_m^T\boldsymbol{k}_n\), which should therefore reflect the relative position information between the two tokens. Current research indicates that relative position embeddings are important for capturing positional information. We wish the inner product to encode position information via:

+

+

The idea here is to express the relative position as an angle rather than as a position along a line segment, and complex analysis is the natural tool for that. It is similar to signal processing, where each signal has a frequency and a magnitude. Suppose \(d=2\); we can then write the embeddings as:

+

+

Thus, using this information, we have

+

+

After some derivation, we find that the following choice satisfies the conditions above:

+

+

The derivation is as follows:

+

+

From the expression \(f_q(\boldsymbol{x}_m, m)=\left(\boldsymbol{W}_q \boldsymbol{x}_m\right) e^{i m \theta}\) (and similarly for the key), we can design the rotary embedding setup used in Llama 2.

diff --git a/notes/.DS_Store b/notes/.DS_Store index 2ee28dae1ea7bdc4f4972f14be435c48af6ba3c8..37a49b6778672de2738daf255ece650c6e145c18 100644 GIT binary patch delta 123 zcmZoMXffDe!@`uDJK2s!kE31s?COK(jyeK4DwAih)G?)aPrk#VBa)l%;*yk;pTxkx zz_Fh597GLJVsZ;>4O6|~?MS%Hm*t%V~%Tkh246gCB+-24={7Pzn;W5LG462{H!9Dn%% D<>w{v diff --git a/notes/Math Theories/complexanalysis.html b/notes/Math Theories/complexanalysis.html index 57f087f..5cbfa6b 100644 --- a/notes/Math Theories/complexanalysis.html +++ b/notes/Math Theories/complexanalysis.html @@ -301,6 +301,78 @@

Basics or formulas required.
In this section, we just mention some critical formulas. For a complex number, we have
\[
e^{i \theta}=\cos \theta+i \sin \theta
\]

The equation can be proved using the Taylor series of \(e^x\), \(\cos x\) and \(\sin x\). This formula will be highlighted when we use complex analysis in ML.
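As a quick sketch of that Taylor-series argument, grouping the even and odd powers of \(i\theta\):
\[
e^{i \theta}=\sum_{k=0}^{\infty} \frac{(i \theta)^k}{k!}=\left(1-\frac{\theta^2}{2!}+\frac{\theta^4}{4!}-\cdots\right)+i\left(\theta-\frac{\theta^3}{3!}+\frac{\theta^5}{5!}-\cdots\right)=\cos \theta+i \sin \theta.
\]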

+

Additionally, note that the imaginary unit \(i\) can be viewed as spanning an extra dimension, with \(i^2=-1\). This \(i\) will be helpful for many of the computations below.
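As a small, purely illustrative sketch (not part of the original notes): multiplying by \(e^{i\theta}\) rotates a point in the plane, which is exactly the property the rotary embedding below exploits.

```python
import cmath
import math

# Treat the 2-D point (x, y) as the complex number x + 1j*y.
z = complex(1.0, 0.0)        # the point (1, 0)
theta = math.pi / 2          # rotate by 90 degrees

rotated = z * cmath.exp(1j * theta)
print(rotated)               # approximately 0 + 1j, i.e. the point (0, 1)
```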

+ +
+

Consider the rotary embedding using complex analysis

+

The token positional embedding captures the features a token has because of its position in the sequence. To put it simply, a token at position 0 contributes differently than the same token at position 10.

+
+
+
+ +
+
+Tip +
+
šŸ“ Paper: ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING
+

Current research indicates that we wish to add both an absolute embedding and a relative embedding to a token based on its position. The absolute embedding is determined only by the token's own position, while the relative embedding is determined by the token's position relative to other tokens. Note that for the attention computation, we need the attention score between the tokens at positions \(m\) and \(n\):

+

\[ +\begin{aligned} +\boldsymbol{q}_m & =f_q\left(\boldsymbol{x}_m, m\right) \\ +\boldsymbol{k}_n & =f_k\left(\boldsymbol{x}_n, n\right) \\ +\boldsymbol{v}_n & =f_v\left(\boldsymbol{x}_n, n\right). +\end{aligned} +\]

+

When you consider this problem, it helps to think of LLM inference rather than training: the query corresponds to the new token being generated, while the keys and values correspond to the existing tokens. We also need to account for the positional information between them. The original transformer paper uses an absolute position embedding:

+

\[ +f_{t: t \in\{q, k, v\}}\left(\boldsymbol{x}_i, i\right):=\boldsymbol{W}_{t: t \in\{q, k, v\}}\left(\boldsymbol{x}_i+\boldsymbol{p}_i\right), +\]

+

and

+

\[
+\begin{cases}\boldsymbol{p}_{i, 2 t} & =\sin \left(i / 10000^{2 t / d}\right) \\ \boldsymbol{p}_{i, 2 t+1} & =\cos \left(i / 10000^{2 t / d}\right)\end{cases}
+\]
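A minimal NumPy sketch of this sinusoidal table (a sketch only; the function name and shapes are illustrative and assume an even dimension \(d\)):

```python
import numpy as np

def sinusoidal_positions(num_positions: int, d: int) -> np.ndarray:
    """Table p with p[i, 2t] = sin(i / 10000**(2t/d)) and p[i, 2t+1] = cos(i / 10000**(2t/d))."""
    assert d % 2 == 0, "assumes an even embedding dimension"
    i = np.arange(num_positions)[:, None]                # positions, shape (num_positions, 1)
    inv_freq = 1.0 / 10000 ** (np.arange(0, d, 2) / d)   # shape (d // 2,)
    angles = i * inv_freq                                # shape (num_positions, d // 2)
    p = np.empty((num_positions, d))
    p[:, 0::2] = np.sin(angles)
    p[:, 1::2] = np.cos(angles)
    return p

# The original transformer adds p[i] to x_i before the q/k/v projections:
# q_i = W_q @ (x_i + p[i]), and similarly for k_i and v_i.
```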

+

If we think about this structure further, we find that sine and cosine are periodic functions, which means tokens at the same relative distance give rise to similar embedding patterns.

+

Another approach, relative positional embedding, notes that the relative position between tokens \(m\) and \(n\) is \(m-n\), so the embedding should depend on this difference. Note that attention uses \(\boldsymbol{q}_m^T\boldsymbol{k}_n\), which should therefore reflect the relative position information between the two tokens. Current research indicates that relative position embeddings are important for capturing positional information. We wish the inner product to encode position information via:

+

\[ +\left\langle f_q\left(\boldsymbol{x}_m, m\right), f_k\left(\boldsymbol{x}_n, n\right)\right\rangle=g\left(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n\right) . +\]

+

The idea here is to express the relative position as an angle rather than as a position along a line segment, and complex analysis is the natural tool for that. It is similar to signal processing, where each signal has a frequency and a magnitude. Suppose \(d=2\); we can then write the embeddings as:

+

\[ +\begin{aligned} +f_q\left(\boldsymbol{x}_q, m\right) & =R_q\left(\boldsymbol{x}_q, m\right) e^{i \Theta_q\left(\boldsymbol{x}_q, m\right)} \\ +f_k\left(\boldsymbol{x}_k, n\right) & =R_k\left(\boldsymbol{x}_k, n\right) e^{i \Theta_k\left(\boldsymbol{x}_k, n\right)} \\ +g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right) & =R_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right) e^{i \Theta_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right)} +\end{aligned} +\]

+

Thus, using this information, we have

+

\[ +\begin{aligned} +R_q\left(\boldsymbol{x}_q, m\right) R_k\left(\boldsymbol{x}_k, n\right) & =R_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right), \\ +\Theta_k\left(\boldsymbol{x}_k, n\right)-\Theta_q\left(\boldsymbol{x}_q, m\right) & =\Theta_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right), +\end{aligned} +\]

+

After some derivation, we find that the following choice satisfies the conditions above:

+

\[ +\begin{aligned} +f_q\left(\boldsymbol{x}_m, m\right) & =\left(\boldsymbol{W}_q \boldsymbol{x}_m\right) e^{i m \theta} \\ +f_k\left(\boldsymbol{x}_n, n\right) & =\left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i n \theta} \\ +g\left(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n\right) & =\left(\boldsymbol{W}_q \boldsymbol{x}_m\right)\left(\boldsymbol{W}_k \boldsymbol{x}_n\right)^T e^{i(m-n) \theta} +\end{aligned} +\]

+

The derivation is as follows:

+

\[ +\begin{aligned} +\langle f_q, f_k\rangle &= f_q^* f_k \\ + &= \left(\boldsymbol{W}_q \boldsymbol{x}_m\right)^* e^{-i m \theta} \left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i n \theta} \\ + &= \left(\boldsymbol{W}_q \boldsymbol{x}_m\right)^* \left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i(n-m) \theta} \\ + &= \left(\boldsymbol{W}_q \boldsymbol{x}_m\right)^T \left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i(n-m) \theta}. +\end{aligned} +\]

+

Note that the phase here is \(e^{i(n-m) \theta}\) while \(g\) above was written with \(e^{i(m-n) \theta}\); these are complex conjugates, and since the RoFormer formulation takes the real part of this product as the attention score, the two conventions agree. From the expression \(f_q(\boldsymbol{x}_m, m)=\left(\boldsymbol{W}_q \boldsymbol{x}_m\right) e^{i m \theta}\) (and similarly for the key), we can design the rotary embedding setup used in Llama 2.
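To make the last step concrete, here is a minimal numerical sketch of a single two-dimensional rotary block (illustrative only; the actual Llama 2 setup applies the same rotation to many 2-D pairs of each attention head, with per-pair frequencies \(\theta_t=10000^{-2t/d}\) as in the RoFormer paper):

```python
import numpy as np

def rotate_2d(v: np.ndarray, angle: float) -> np.ndarray:
    """Multiply the complex number v[0] + 1j*v[1] by e^{1j*angle}, i.e. rotate the 2-D vector."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

rng = np.random.default_rng(0)
theta = 0.1                                 # the frequency of this 2-D block
q = rng.normal(size=2)                      # stand-in for W_q x_m
k = rng.normal(size=2)                      # stand-in for W_k x_n

# The attention score depends only on the relative position m - n:
score_a = rotate_2d(q, 3 * theta) @ rotate_2d(k, 1 * theta)   # m = 3, n = 1
score_b = rotate_2d(q, 7 * theta) @ rotate_2d(k, 5 * theta)   # m = 7, n = 5
print(np.isclose(score_a, score_b))         # True: both pairs have m - n = 2
```

Because \(R(m\theta)^T R(n\theta)=R((n-m)\theta)\), the score depends only on the offset, which is exactly the property \(\langle f_q, f_k\rangle=g(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n)\) derived above.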

diff --git a/search.json b/search.json index 2a7a77a..9def958 100644 --- a/search.json +++ b/search.json @@ -162,7 +162,7 @@ "href": "notes.html", "title": "Research notes", "section": "", - "text": "Large language model distributed training\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nComplex analysis for machine learning\n\n\n\n\n\n\nMath Theories\n\n\n\n\n\n\n\n\n\n1 min\n\n\n\n\n\n\n\n\n\n\n\n\nLarge language model evaluation\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n11 min\n\n\n\n\n\n\n\n\n\n\n\n\nMixture of expert\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nScalable diffusion models with transformers\n\n\n\n\n\n\nDiffusion Model\n\n\n\n\n\n\n\n\n\n1 min\n\n\n\n\n\n\n\n\n\n\n\n\nReinforcement learning for large language model\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n19 min\n\n\n\n\n\n\nNo matching items" + "text": "Large language model distributed training\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nComplex analysis for machine learning\n\n\n\n\n\n\nMath Theories\n\n\n\n\n\n\n\n\n\n4 min\n\n\n\n\n\n\n\n\n\n\n\n\nLarge language model evaluation\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n11 min\n\n\n\n\n\n\n\n\n\n\n\n\nMixture of expert\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nScalable diffusion models with transformers\n\n\n\n\n\n\nDiffusion Model\n\n\n\n\n\n\n\n\n\n1 min\n\n\n\n\n\n\n\n\n\n\n\n\nReinforcement learning for large language model\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n19 min\n\n\n\n\n\n\nNo matching items" }, { "objectID": "index.html", @@ -272,7 +272,19 @@ "href": "notes/Math Theories/complexanalysis.html#basics-or-formulas-required.", "title": "Complex analysis for machine learning", "section": "Basics or formulas required.", - "text": "Basics or formulas required.\nIn this section, we just mention some critical formulas. In the case of complex number, we have\n\\[\ne^{i \\theta}=\\cos \\theta+i \\sin \\theta\n\\]\nThe equation can be proved if we use the Taylor series of \\(e^x\\), \\(\\operatorname{cos} x\\) and \\(\\operatorname{sin} x\\) to prove it. This formula will be highlighted when we use complex analysis in the ML.", + "text": "Basics or formulas required.\nIn this section, we just mention some critical formulas. In the case of complex number, we have\n\\[\ne^{i \\theta}=\\cos \\theta+i \\sin \\theta\n\\]\nThe equation can be proved if we use the Taylor series of \\(e^x\\), \\(\\operatorname{cos} x\\) and \\(\\operatorname{sin} x\\) to prove it. This formula will be highlighted when we use complex analysis in the ML.\nAdditionally, we note that the complex number can be viewed as a special dimension so that we have \\(i^2=-1\\). This \\(i\\) will be helpful for many special computation.", + "crumbs": [ + "Home", + "ā™¾ **Math Theories**", + "Complex analysis for machine learning" + ] + }, + { + "objectID": "notes/Math Theories/complexanalysis.html#consider-the-rotary-embedding-using-complex-analysis", + "href": "notes/Math Theories/complexanalysis.html#consider-the-rotary-embedding-using-complex-analysis", + "title": "Complex analysis for machine learning", + "section": "Consider the rotary embedding using complex analysis", + "text": "Consider the rotary embedding using complex analysis\nThe token positional embedding is used to capture the features of token because of its position in the sequence. 
To put it simple, the token in the position 0 has different contribution from the token to the position 10.\n\n\n\n\n\n\nTip\n\n\n\nšŸ“ Paper: ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING\n\n\nThe present research indicate that we wish to add the absolute embedding and the relative embedding to the token based on its position. The absolute embedding is just decided by the position of the token. And the relative embedding is decided by the relative position of the token. Note that for the embedding computation, we need to compute the attendence score between the token at the position of \\(m\\) and \\(n\\):\n\\[\n\\begin{aligned}\n\\boldsymbol{q}_m & =f_q\\left(\\boldsymbol{x}_m, m\\right) \\\\\n\\boldsymbol{k}_n & =f_k\\left(\\boldsymbol{x}_n, n\\right) \\\\\n\\boldsymbol{v}_n & =f_v\\left(\\boldsymbol{x}_n, n\\right).\n\\end{aligned}\n\\]\nWhen you consider this problem, try to simulate the LLM inference (not training). The query is the new token to be predicted, and the key and value are the existing tokens. Now, we also need to consider the position info between them. The original transformer paper uses the absolute position embedding:\n\\[\nf_{t: t \\in\\{q, k, v\\}}\\left(\\boldsymbol{x}_i, i\\right):=\\boldsymbol{W}_{t: t \\in\\{q, k, v\\}}\\left(\\boldsymbol{x}_i+\\boldsymbol{p}_i\\right),\n\\]\nand\n\\[\n\\begin{cases}\\boldsymbol{p}_{i, 2 t} & =\\sin \\left(k / 10000^{2 t / d}\\right) \\\\ \\boldsymbol{p}_{i, 2 t+1} & =\\cos \\left(k / 10000^{2 t / d}\\right)\\end{cases}\n\\]\nIf we think about this structure further, we found that the sin and cos function is the periodic functions, which means for the same relative distance, we could observe similar embedding.\nAnother relative positional embedding is to note that the relative position between the token \\(m\\) and \\(n\\) is \\(m-n\\), and the embedding is dependent on the \\(m-n\\), the difference. Note that we need to use \\(\\boldsymbol{q}_m^T\\boldsymbol{k}_n\\), and this should be able to reflect the relative position information between the two tokens. And the current research indicates that the relative position embedding is important for the positional information. We wish the inner product encodes position information by:\n\\[\n\\left\\langle f_q\\left(\\boldsymbol{x}_m, m\\right), f_k\\left(\\boldsymbol{x}_n, n\\right)\\right\\rangle=g\\left(\\boldsymbol{x}_m, \\boldsymbol{x}_n, m-n\\right) .\n\\]\nThe idea here is that we express the relative position as the information of angle rather than the position in a linear segment. And we can use complex analysis for it. It is like the signal processing where each signal has the frequency and the magnitude. 
Suppose \\(d=2\\), and we can assume the embedding information as:\n\\[\n\\begin{aligned}\nf_q\\left(\\boldsymbol{x}_q, m\\right) & =R_q\\left(\\boldsymbol{x}_q, m\\right) e^{i \\Theta_q\\left(\\boldsymbol{x}_q, m\\right)} \\\\\nf_k\\left(\\boldsymbol{x}_k, n\\right) & =R_k\\left(\\boldsymbol{x}_k, n\\right) e^{i \\Theta_k\\left(\\boldsymbol{x}_k, n\\right)} \\\\\ng\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right) & =R_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right) e^{i \\Theta_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right)}\n\\end{aligned}\n\\]\nThus, using this information, we have\n\\[\n\\begin{aligned}\nR_q\\left(\\boldsymbol{x}_q, m\\right) R_k\\left(\\boldsymbol{x}_k, n\\right) & =R_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right), \\\\\n\\Theta_k\\left(\\boldsymbol{x}_k, n\\right)-\\Theta_q\\left(\\boldsymbol{x}_q, m\\right) & =\\Theta_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right),\n\\end{aligned}\n\\]\nAfter derivation, we found that if we choose the following expression, we can satisfy the condition above:\n\\[\n\\begin{aligned}\nf_q\\left(\\boldsymbol{x}_m, m\\right) & =\\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right) e^{i m \\theta} \\\\\nf_k\\left(\\boldsymbol{x}_n, n\\right) & =\\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right) e^{i n \\theta} \\\\\ng\\left(\\boldsymbol{x}_m, \\boldsymbol{x}_n, m-n\\right) & =\\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)\\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right)^T e^{i(m-n) \\theta}\n\\end{aligned}\n\\]\nThe derivation is as the following:\n\\[\n\\begin{aligned}\n\\langle f_q, f_k\\rangle &= f_q^* f_k \\\\\n &= \\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)^* e^{-i m \\theta} \\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right) e^{i n \\theta} \\\\\n &= \\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)^* \\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right) e^{i(n-m) \\theta} \\\\\n &= \\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)^T \\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right) e^{i(n-m) \\theta}.\n\\end{aligned}\n\\]\nFrom the expression of \\(\\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)^T e^{-i m \\theta}\\), we can design the rotary embedding setup in the llama2.", "crumbs": [ "Home", "ā™¾ **Math Theories**", diff --git a/sitemap.xml b/sitemap.xml index 220c21b..7619d45 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -42,7 +42,7 @@ https://alexchen4ai.github.io/blog/notes/Math Theories/complexanalysis.html - 2024-02-26T07:42:54.067Z + 2024-03-11T05:59:41.487Z https://alexchen4ai.github.io/blog/news/Research news/researchnews.html