From 36758eff38f528f229413e78ceaa835e0bb5ddf7 Mon Sep 17 00:00:00 2001 From: AlexCHEN Date: Sun, 10 Mar 2024 22:59:59 -0700 Subject: [PATCH] Built site for gh-pages --- .DS_Store | Bin 6148 -> 6148 bytes .nojekyll | 2 +- news/.DS_Store | Bin 0 -> 6148 bytes notes.html | 4 +- notes.xml | 36 ++++++++++++ notes/.DS_Store | Bin 6148 -> 6148 bytes notes/Math Theories/complexanalysis.html | 72 +++++++++++++++++++++++ search.json | 16 ++++- sitemap.xml | 2 +- 9 files changed, 126 insertions(+), 6 deletions(-) create mode 100644 news/.DS_Store diff --git a/.DS_Store b/.DS_Store index 77b2cd2502f127971dd0430ff8d785f1930fa94c..5742cdfe3bd3de287aa7bf2e6462dda6a597776e 100644 GIT binary patch delta 202 zcmZoMXffEJ%2dx}|CfP*frTNDA(f$=p*T0+#U&{xKM5$t5!|6wa^~DoM^yO~yz&JZ zhQZ1CxdlKy3=B+QiWO*fK0^u6{NkK+Bw3cYvkVNAmoUXKEnt{j$1FBkikXjzcRHBq nz|6yD;3r=icYbmWv&7^qWUdY3esNAZk}OLA3j@RCB}{Ql6Bs7fF^f%>V&-EKoepL? nF!QjTJ-otU*SX0x%o3BckhyA%3pQ_L4r7^Ez_*#5<1aq|f;~AS diff --git a/.nojekyll b/.nojekyll index 21a12f5..31f15cb 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -e28e16e1 \ No newline at end of file +300494ba \ No newline at end of file diff --git a/news/.DS_Store b/news/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..168be32efa0defd4523aee09da7d992e6fa01737 GIT binary patch literal 6148 zcmeHKOHRWu5S^i2L_$bqgIM|q^aiF1d!$|fN>Q^&Nkz#%_u&*QSi+fj^8qCyi;4w8 zXhyQ%*q-OuPm1j!B3`uXInj)W3KT&WXT+p?bm+)~r$E*@8hW5xswvUa$V7j!OV)l$ z*K|*hw5R_1E3{hLP=&qh>&<+5*KBvX1wLyJV{MzVT-D7A(Z$K+?fK>X_3}{W`iIP2 z)nDm_E;y=CAQT7%LV-}ArvUD3vEtY;>QEpQ2nD_rkn&8&ngRgj8Lb9i zatYxi$L!b{Vg$xU1sat-#bBeOKY3hs>Qf@U&c(3v9~o)^FRByEdWRpomFaCk_n$;1Pg^oFnHt cXyZwI#AV0MP-c;Sr32$3pn^md3jBfsAF1L!fB*mh literal 0 HcmV?d00001 diff --git a/notes.html b/notes.html index 6556085..6fb0244 100644 --- a/notes.html +++ b/notes.html @@ -250,7 +250,7 @@

-
+

 
@@ -275,7 +275,7 @@

diff --git a/notes.xml b/notes.xml index 424d651..4821724 100644 --- a/notes.xml +++ b/notes.xml @@ -378,6 +378,42 @@ Tip

In this section, we just mention some critical formulas. For a complex number, we have
\[
e^{i \theta}=\cos \theta+i \sin \theta
\]

The equation can be proved using the Taylor series of \(e^x\), \(\cos x\) and \(\sin x\). This formula will be highlighted when we use complex analysis in ML.

+

Additionally, note that the imaginary unit \(i\) can be viewed as spanning an extra dimension, with \(i^2=-1\). This \(i\) will be helpful for many of the computations below.

+ +
+

Consider the rotary embedding using complex analysis

+

The token positional embedding captures the features a token has because of its position in the sequence. To put it simply, a token at position 0 contributes differently than the same token at position 10.

+
+
+
+ +
+
+Tip +
+
+ +
+

Current research indicates that we wish to add both an absolute embedding and a relative embedding to a token based on its position. The absolute embedding is determined only by the token's own position, while the relative embedding is determined by the token's position relative to other tokens. Note that for the attention computation, we need the attention score between the tokens at positions \(m\) and \(n\):

+

+

When you consider this problem, it helps to think of LLM inference rather than training: the query corresponds to the new token being generated, while the keys and values correspond to the existing tokens. We also need to account for the positional information between them. The original transformer paper uses an absolute position embedding:

+

+

and

+

+

If we think about this structure further, we find that sine and cosine are periodic functions, which means tokens at the same relative distance give rise to similar embedding patterns.

+

Another approach, relative positional embedding, notes that the relative position between tokens \(m\) and \(n\) is \(m-n\), so the embedding should depend on this difference. Note that attention uses \(\boldsymbol{q}_m^T\boldsymbol{k}_n\), which should therefore reflect the relative position information between the two tokens. Current research indicates that relative position embeddings are important for capturing positional information. We wish the inner product to encode position information via:

+

+

The idea here is to express the relative position as an angle rather than as a position along a line segment, and complex analysis is the natural tool for that. It is similar to signal processing, where each signal has a frequency and a magnitude. Suppose \(d=2\); we can then write the embeddings as:

+

+

Thus, using this information, we have

+

+

After some derivation, we find that the following choice satisfies the conditions above:

+

+

The derivation is as follows:

+

+

From the expression \(f_q(\boldsymbol{x}_m, m)=\left(\boldsymbol{W}_q \boldsymbol{x}_m\right) e^{i m \theta}\) (and similarly for the key), we can design the rotary embedding setup used in Llama 2.

diff --git a/notes/.DS_Store b/notes/.DS_Store index 2ee28dae1ea7bdc4f4972f14be435c48af6ba3c8..37a49b6778672de2738daf255ece650c6e145c18 100644 GIT binary patch delta 123 zcmZoMXffDe!@`uDJK2s!kE31s?COK(jyeK4DwAih)G?)aPrk#VBa)l%;*yk;pTxkx zz_Fh597GLJVsZ;>4O6|~?MS%Hm*t%V~%Tkh246gCB+-24={7Pzn;W5LG462{H!9Dn%% D<>w{v diff --git a/notes/Math Theories/complexanalysis.html b/notes/Math Theories/complexanalysis.html index 57f087f..5cbfa6b 100644 --- a/notes/Math Theories/complexanalysis.html +++ b/notes/Math Theories/complexanalysis.html @@ -301,6 +301,78 @@

Basics or formulas required.
In this section, we just mention some critical formulas. For a complex number, we have
\[
e^{i \theta}=\cos \theta+i \sin \theta
\]

The equation can be proved using the Taylor series of \(e^x\), \(\cos x\) and \(\sin x\). This formula will be highlighted when we use complex analysis in ML.
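As a quick sketch of that Taylor-series argument, grouping the even and odd powers of \(i\theta\):
\[
e^{i \theta}=\sum_{k=0}^{\infty} \frac{(i \theta)^k}{k!}=\left(1-\frac{\theta^2}{2!}+\frac{\theta^4}{4!}-\cdots\right)+i\left(\theta-\frac{\theta^3}{3!}+\frac{\theta^5}{5!}-\cdots\right)=\cos \theta+i \sin \theta.
\]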

+

Additionally, note that the imaginary unit \(i\) can be viewed as spanning an extra dimension, with \(i^2=-1\). This \(i\) will be helpful for many of the computations below.
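As a small, purely illustrative sketch (not part of the original notes): multiplying by \(e^{i\theta}\) rotates a point in the plane, which is exactly the property the rotary embedding below exploits.

```python
import cmath
import math

# Treat the 2-D point (x, y) as the complex number x + 1j*y.
z = complex(1.0, 0.0)        # the point (1, 0)
theta = math.pi / 2          # rotate by 90 degrees

rotated = z * cmath.exp(1j * theta)
print(rotated)               # approximately 0 + 1j, i.e. the point (0, 1)
```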

+ +
+

Consider the rotary embedding using complex analysis

+

The token positional embedding captures the features a token has because of its position in the sequence. To put it simply, a token at position 0 contributes differently than the same token at position 10.

+
+
+
+ +
+
+Tip +
+
šŸ“ Paper: ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING
+

Current research indicates that we wish to add both an absolute embedding and a relative embedding to a token based on its position. The absolute embedding is determined only by the token's own position, while the relative embedding is determined by the token's position relative to other tokens. Note that for the attention computation, we need the attention score between the tokens at positions \(m\) and \(n\):

+

\[ +\begin{aligned} +\boldsymbol{q}_m & =f_q\left(\boldsymbol{x}_m, m\right) \\ +\boldsymbol{k}_n & =f_k\left(\boldsymbol{x}_n, n\right) \\ +\boldsymbol{v}_n & =f_v\left(\boldsymbol{x}_n, n\right). +\end{aligned} +\]

+

When you consider this problem, it helps to think of LLM inference rather than training: the query corresponds to the new token being generated, while the keys and values correspond to the existing tokens. We also need to account for the positional information between them. The original transformer paper uses an absolute position embedding:

+

\[ +f_{t: t \in\{q, k, v\}}\left(\boldsymbol{x}_i, i\right):=\boldsymbol{W}_{t: t \in\{q, k, v\}}\left(\boldsymbol{x}_i+\boldsymbol{p}_i\right), +\]

+

and

+

\[
+\begin{cases}\boldsymbol{p}_{i, 2 t} & =\sin \left(i / 10000^{2 t / d}\right) \\ \boldsymbol{p}_{i, 2 t+1} & =\cos \left(i / 10000^{2 t / d}\right)\end{cases}
+\]
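A minimal NumPy sketch of this sinusoidal table (a sketch only; the function name and shapes are illustrative and assume an even dimension \(d\)):

```python
import numpy as np

def sinusoidal_positions(num_positions: int, d: int) -> np.ndarray:
    """Table p with p[i, 2t] = sin(i / 10000**(2t/d)) and p[i, 2t+1] = cos(i / 10000**(2t/d))."""
    assert d % 2 == 0, "assumes an even embedding dimension"
    i = np.arange(num_positions)[:, None]                # positions, shape (num_positions, 1)
    inv_freq = 1.0 / 10000 ** (np.arange(0, d, 2) / d)   # shape (d // 2,)
    angles = i * inv_freq                                # shape (num_positions, d // 2)
    p = np.empty((num_positions, d))
    p[:, 0::2] = np.sin(angles)
    p[:, 1::2] = np.cos(angles)
    return p

# The original transformer adds p[i] to x_i before the q/k/v projections:
# q_i = W_q @ (x_i + p[i]), and similarly for k_i and v_i.
```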

+

If we think about this structure further, we find that sine and cosine are periodic functions, which means tokens at the same relative distance give rise to similar embedding patterns.

+

Another approach, relative positional embedding, notes that the relative position between tokens \(m\) and \(n\) is \(m-n\), so the embedding should depend on this difference. Note that attention uses \(\boldsymbol{q}_m^T\boldsymbol{k}_n\), which should therefore reflect the relative position information between the two tokens. Current research indicates that relative position embeddings are important for capturing positional information. We wish the inner product to encode position information via:

+

\[ +\left\langle f_q\left(\boldsymbol{x}_m, m\right), f_k\left(\boldsymbol{x}_n, n\right)\right\rangle=g\left(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n\right) . +\]

+

The idea here is to express the relative position as an angle rather than as a position along a line segment, and complex analysis is the natural tool for that. It is similar to signal processing, where each signal has a frequency and a magnitude. Suppose \(d=2\); we can then write the embeddings as:

+

\[ +\begin{aligned} +f_q\left(\boldsymbol{x}_q, m\right) & =R_q\left(\boldsymbol{x}_q, m\right) e^{i \Theta_q\left(\boldsymbol{x}_q, m\right)} \\ +f_k\left(\boldsymbol{x}_k, n\right) & =R_k\left(\boldsymbol{x}_k, n\right) e^{i \Theta_k\left(\boldsymbol{x}_k, n\right)} \\ +g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right) & =R_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right) e^{i \Theta_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right)} +\end{aligned} +\]

+

Thus, using this information, we have

+

\[ +\begin{aligned} +R_q\left(\boldsymbol{x}_q, m\right) R_k\left(\boldsymbol{x}_k, n\right) & =R_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right), \\ +\Theta_k\left(\boldsymbol{x}_k, n\right)-\Theta_q\left(\boldsymbol{x}_q, m\right) & =\Theta_g\left(\boldsymbol{x}_q, \boldsymbol{x}_k, n-m\right), +\end{aligned} +\]

+

After some derivation, we find that the following choice satisfies the conditions above:

+

\[ +\begin{aligned} +f_q\left(\boldsymbol{x}_m, m\right) & =\left(\boldsymbol{W}_q \boldsymbol{x}_m\right) e^{i m \theta} \\ +f_k\left(\boldsymbol{x}_n, n\right) & =\left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i n \theta} \\ +g\left(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n\right) & =\left(\boldsymbol{W}_q \boldsymbol{x}_m\right)\left(\boldsymbol{W}_k \boldsymbol{x}_n\right)^T e^{i(m-n) \theta} +\end{aligned} +\]

+

The derivation is as follows:

+

\[ +\begin{aligned} +\langle f_q, f_k\rangle &= f_q^* f_k \\ + &= \left(\boldsymbol{W}_q \boldsymbol{x}_m\right)^* e^{-i m \theta} \left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i n \theta} \\ + &= \left(\boldsymbol{W}_q \boldsymbol{x}_m\right)^* \left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i(n-m) \theta} \\ + &= \left(\boldsymbol{W}_q \boldsymbol{x}_m\right)^T \left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i(n-m) \theta}. +\end{aligned} +\]

+

Note that the phase here is \(e^{i(n-m) \theta}\) while \(g\) above was written with \(e^{i(m-n) \theta}\); these are complex conjugates, and since the RoFormer formulation takes the real part of this product as the attention score, the two conventions agree. From the expression \(f_q(\boldsymbol{x}_m, m)=\left(\boldsymbol{W}_q \boldsymbol{x}_m\right) e^{i m \theta}\) (and similarly for the key), we can design the rotary embedding setup used in Llama 2.
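To make the last step concrete, here is a minimal numerical sketch of a single two-dimensional rotary block (illustrative only; the actual Llama 2 setup applies the same rotation to many 2-D pairs of each attention head, with per-pair frequencies \(\theta_t=10000^{-2t/d}\) as in the RoFormer paper):

```python
import numpy as np

def rotate_2d(v: np.ndarray, angle: float) -> np.ndarray:
    """Multiply the complex number v[0] + 1j*v[1] by e^{1j*angle}, i.e. rotate the 2-D vector."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

rng = np.random.default_rng(0)
theta = 0.1                                 # the frequency of this 2-D block
q = rng.normal(size=2)                      # stand-in for W_q x_m
k = rng.normal(size=2)                      # stand-in for W_k x_n

# The attention score depends only on the relative position m - n:
score_a = rotate_2d(q, 3 * theta) @ rotate_2d(k, 1 * theta)   # m = 3, n = 1
score_b = rotate_2d(q, 7 * theta) @ rotate_2d(k, 5 * theta)   # m = 7, n = 5
print(np.isclose(score_a, score_b))         # True: both pairs have m - n = 2
```

Because \(R(m\theta)^T R(n\theta)=R((n-m)\theta)\), the score depends only on the offset, which is exactly the property \(\langle f_q, f_k\rangle=g(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n)\) derived above.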

diff --git a/search.json b/search.json index 2a7a77a..9def958 100644 --- a/search.json +++ b/search.json @@ -162,7 +162,7 @@ "href": "notes.html", "title": "Research notes", "section": "", - "text": "Large language model distributed training\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nComplex analysis for machine learning\n\n\n\n\n\n\nMath Theories\n\n\n\n\n\n\n\n\n\n1 min\n\n\n\n\n\n\n\n\n\n\n\n\nLarge language model evaluation\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n11 min\n\n\n\n\n\n\n\n\n\n\n\n\nMixture of expert\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nScalable diffusion models with transformers\n\n\n\n\n\n\nDiffusion Model\n\n\n\n\n\n\n\n\n\n1 min\n\n\n\n\n\n\n\n\n\n\n\n\nReinforcement learning for large language model\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n19 min\n\n\n\n\n\n\nNo matching items" + "text": "Large language model distributed training\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nComplex analysis for machine learning\n\n\n\n\n\n\nMath Theories\n\n\n\n\n\n\n\n\n\n4 min\n\n\n\n\n\n\n\n\n\n\n\n\nLarge language model evaluation\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n11 min\n\n\n\n\n\n\n\n\n\n\n\n\nMixture of expert\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nScalable diffusion models with transformers\n\n\n\n\n\n\nDiffusion Model\n\n\n\n\n\n\n\n\n\n1 min\n\n\n\n\n\n\n\n\n\n\n\n\nReinforcement learning for large language model\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n19 min\n\n\n\n\n\n\nNo matching items" }, { "objectID": "index.html", @@ -272,7 +272,19 @@ "href": "notes/Math Theories/complexanalysis.html#basics-or-formulas-required.", "title": "Complex analysis for machine learning", "section": "Basics or formulas required.", - "text": "Basics or formulas required.\nIn this section, we just mention some critical formulas. In the case of complex number, we have\n\\[\ne^{i \\theta}=\\cos \\theta+i \\sin \\theta\n\\]\nThe equation can be proved if we use the Taylor series of \\(e^x\\), \\(\\operatorname{cos} x\\) and \\(\\operatorname{sin} x\\) to prove it. This formula will be highlighted when we use complex analysis in the ML.", + "text": "Basics or formulas required.\nIn this section, we just mention some critical formulas. In the case of complex number, we have\n\\[\ne^{i \\theta}=\\cos \\theta+i \\sin \\theta\n\\]\nThe equation can be proved if we use the Taylor series of \\(e^x\\), \\(\\operatorname{cos} x\\) and \\(\\operatorname{sin} x\\) to prove it. This formula will be highlighted when we use complex analysis in the ML.\nAdditionally, we note that the complex number can be viewed as a special dimension so that we have \\(i^2=-1\\). This \\(i\\) will be helpful for many special computation.", + "crumbs": [ + "Home", + "ā™¾ **Math Theories**", + "Complex analysis for machine learning" + ] + }, + { + "objectID": "notes/Math Theories/complexanalysis.html#consider-the-rotary-embedding-using-complex-analysis", + "href": "notes/Math Theories/complexanalysis.html#consider-the-rotary-embedding-using-complex-analysis", + "title": "Complex analysis for machine learning", + "section": "Consider the rotary embedding using complex analysis", + "text": "Consider the rotary embedding using complex analysis\nThe token positional embedding is used to capture the features of token because of its position in the sequence. 
To put it simple, the token in the position 0 has different contribution from the token to the position 10.\n\n\n\n\n\n\nTip\n\n\n\nšŸ“ Paper: ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING\n\n\nThe present research indicate that we wish to add the absolute embedding and the relative embedding to the token based on its position. The absolute embedding is just decided by the position of the token. And the relative embedding is decided by the relative position of the token. Note that for the embedding computation, we need to compute the attendence score between the token at the position of \\(m\\) and \\(n\\):\n\\[\n\\begin{aligned}\n\\boldsymbol{q}_m & =f_q\\left(\\boldsymbol{x}_m, m\\right) \\\\\n\\boldsymbol{k}_n & =f_k\\left(\\boldsymbol{x}_n, n\\right) \\\\\n\\boldsymbol{v}_n & =f_v\\left(\\boldsymbol{x}_n, n\\right).\n\\end{aligned}\n\\]\nWhen you consider this problem, try to simulate the LLM inference (not training). The query is the new token to be predicted, and the key and value are the existing tokens. Now, we also need to consider the position info between them. The original transformer paper uses the absolute position embedding:\n\\[\nf_{t: t \\in\\{q, k, v\\}}\\left(\\boldsymbol{x}_i, i\\right):=\\boldsymbol{W}_{t: t \\in\\{q, k, v\\}}\\left(\\boldsymbol{x}_i+\\boldsymbol{p}_i\\right),\n\\]\nand\n\\[\n\\begin{cases}\\boldsymbol{p}_{i, 2 t} & =\\sin \\left(k / 10000^{2 t / d}\\right) \\\\ \\boldsymbol{p}_{i, 2 t+1} & =\\cos \\left(k / 10000^{2 t / d}\\right)\\end{cases}\n\\]\nIf we think about this structure further, we found that the sin and cos function is the periodic functions, which means for the same relative distance, we could observe similar embedding.\nAnother relative positional embedding is to note that the relative position between the token \\(m\\) and \\(n\\) is \\(m-n\\), and the embedding is dependent on the \\(m-n\\), the difference. Note that we need to use \\(\\boldsymbol{q}_m^T\\boldsymbol{k}_n\\), and this should be able to reflect the relative position information between the two tokens. And the current research indicates that the relative position embedding is important for the positional information. We wish the inner product encodes position information by:\n\\[\n\\left\\langle f_q\\left(\\boldsymbol{x}_m, m\\right), f_k\\left(\\boldsymbol{x}_n, n\\right)\\right\\rangle=g\\left(\\boldsymbol{x}_m, \\boldsymbol{x}_n, m-n\\right) .\n\\]\nThe idea here is that we express the relative position as the information of angle rather than the position in a linear segment. And we can use complex analysis for it. It is like the signal processing where each signal has the frequency and the magnitude. 
Suppose \\(d=2\\), and we can assume the embedding information as:\n\\[\n\\begin{aligned}\nf_q\\left(\\boldsymbol{x}_q, m\\right) & =R_q\\left(\\boldsymbol{x}_q, m\\right) e^{i \\Theta_q\\left(\\boldsymbol{x}_q, m\\right)} \\\\\nf_k\\left(\\boldsymbol{x}_k, n\\right) & =R_k\\left(\\boldsymbol{x}_k, n\\right) e^{i \\Theta_k\\left(\\boldsymbol{x}_k, n\\right)} \\\\\ng\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right) & =R_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right) e^{i \\Theta_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right)}\n\\end{aligned}\n\\]\nThus, using this information, we have\n\\[\n\\begin{aligned}\nR_q\\left(\\boldsymbol{x}_q, m\\right) R_k\\left(\\boldsymbol{x}_k, n\\right) & =R_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right), \\\\\n\\Theta_k\\left(\\boldsymbol{x}_k, n\\right)-\\Theta_q\\left(\\boldsymbol{x}_q, m\\right) & =\\Theta_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right),\n\\end{aligned}\n\\]\nAfter derivation, we found that if we choose the following expression, we can satisfy the condition above:\n\\[\n\\begin{aligned}\nf_q\\left(\\boldsymbol{x}_m, m\\right) & =\\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right) e^{i m \\theta} \\\\\nf_k\\left(\\boldsymbol{x}_n, n\\right) & =\\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right) e^{i n \\theta} \\\\\ng\\left(\\boldsymbol{x}_m, \\boldsymbol{x}_n, m-n\\right) & =\\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)\\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right)^T e^{i(m-n) \\theta}\n\\end{aligned}\n\\]\nThe derivation is as the following:\n\\[\n\\begin{aligned}\n\\langle f_q, f_k\\rangle &= f_q^* f_k \\\\\n &= \\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)^* e^{-i m \\theta} \\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right) e^{i n \\theta} \\\\\n &= \\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)^* \\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right) e^{i(n-m) \\theta} \\\\\n &= \\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)^T \\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right) e^{i(n-m) \\theta}.\n\\end{aligned}\n\\]\nFrom the expression of \\(\\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)^T e^{-i m \\theta}\\), we can design the rotary embedding setup in the llama2.", "crumbs": [ "Home", "ā™¾ **Math Theories**", diff --git a/sitemap.xml b/sitemap.xml index 220c21b..7619d45 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -42,7 +42,7 @@ https://alexchen4ai.github.io/blog/notes/Math Theories/complexanalysis.html - 2024-02-26T07:42:54.067Z + 2024-03-11T05:59:41.487Z https://alexchen4ai.github.io/blog/news/Research news/researchnews.html