From 0e634e48c87fdafd19f11d074bfeca3d19c685f1 Mon Sep 17 00:00:00 2001 From: AlexCHEN Date: Mon, 11 Mar 2024 00:11:49 -0700 Subject: [PATCH] Built site for gh-pages --- .nojekyll | 2 +- notes.html | 4 ++-- notes.xml | 10 ++++++---- notes/Math Theories/complexanalysis.html | 15 ++++++++------- search.json | 4 ++-- sitemap.xml | 2 +- 6 files changed, 20 insertions(+), 17 deletions(-) diff --git a/.nojekyll b/.nojekyll index 31f15cb..9255e65 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -300494ba \ No newline at end of file +a64115be \ No newline at end of file diff --git a/notes.html b/notes.html index 6fb0244..a375249 100644 --- a/notes.html +++ b/notes.html @@ -250,7 +250,7 @@

-
+

 
@@ -275,7 +275,7 @@

diff --git a/notes.xml b/notes.xml index 4821724..bb8f725 100644 --- a/notes.xml +++ b/notes.xml @@ -410,10 +410,12 @@ Tip

Thus, using this information, we have

After derivation, we found that if we choose the following expression, we can satisfy the condition above:

-

-

The derivation is as the following:

-

-

From the expression of , we can design the rotary embedding setup in the llama2.

+

+

The derivation is as follows (this is not shown in the paper; you can use the derivation below to understand the paper better):

+

Note that if we express two vectors as and , the inner product is . How would this be related to the multiplication of the two vectors? We actually have: .

+

+

From the expression of , we can design the rotary embedding setup in llama2. It is important to note that we introduce complex numbers because we wish to combine the magnitude and the angle in a single quantity: the embedding itself provides the magnitude, and the angle comes from the position. Since real matrix multiplication can only perform real-valued calculations, we need the mapping above.

+

To put it another way, the real and imaginary parts of a complex number both carry useful information. We can express the vectors as complex numbers so that we can incorporate the angle (phase) information from the vectors. Finally, we still need to map back to real operations and continue with that useful information. The complex representation acts as an auxiliary intermediate state that helps us process the information.
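As a quick sanity check of the inner-product/complex-multiplication relation used above, here is a minimal Python sketch. It is not part of the patch; the vectors, positions, and theta are made-up values for illustration only.

```python
import cmath

# Two 2-D "embedding" vectors written as complex numbers: (a, b) -> a + b*i.
q = complex(0.3, -1.2)   # plays the role of W_q x_m
k = complex(0.7, 0.4)    # plays the role of W_k x_n

# The real 2-D inner product ac + bd equals Re(q * conj(k)).
dot = q.real * k.real + q.imag * k.imag
assert abs(dot - (q * k.conjugate()).real) < 1e-12

# Attach the rotary phases e^{i m theta} and e^{i n theta}.
theta, m, n = 0.01, 5, 2
f_q = q * cmath.exp(1j * m * theta)
f_k = k * cmath.exp(1j * n * theta)

# Re(f_q * conj(f_k)) only picks up the relative position m - n.
score = (f_q * f_k.conjugate()).real
target = (q * k.conjugate() * cmath.exp(1j * (m - n) * theta)).real
assert abs(score - target) < 1e-12
print(score, target)
```

The final assertion confirms that the attention score depends on the token contents and on the phase difference e^{i(m-n) theta}, i.e. only on the relative position.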

diff --git a/notes/Math Theories/complexanalysis.html b/notes/Math Theories/complexanalysis.html index 5cbfa6b..42f3fc6 100644 --- a/notes/Math Theories/complexanalysis.html +++ b/notes/Math Theories/complexanalysis.html @@ -360,19 +360,20 @@

\(z_1 = a + bi\) and \(z_2 = c + di\), the inner product is \(ac + bd\). How would this be related to the multiplication of the two complex numbers? We actually have: \(ac+bd = \operatorname{Re}(z_1 * \overline{z_2})\).

\[ \begin{aligned} -\langle f_q, f_k\rangle &= f_q^* f_k \\ - &= \left(\boldsymbol{W}_q \boldsymbol{x}_m\right)^* e^{-i m \theta} \left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i n \theta} \\ - &= \left(\boldsymbol{W}_q \boldsymbol{x}_m\right)^* \left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i(n-m) \theta} \\ - &= \left(\boldsymbol{W}_q \boldsymbol{x}_m\right)^T \left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i(n-m) \theta}. +\langle f_q, f_k\rangle &= \operatorname{Re}(f_q * \overline{f_k}) \\ + &= \operatorname{Re}\left[\left(\boldsymbol{W}_q \boldsymbol{x}_m\right) e^{i m \theta} \overline{\left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i n \theta}}\right] \\ + &= \operatorname{Re}\left[\left(\boldsymbol{W}_q \boldsymbol{x}_m\right) \overline{\left(\boldsymbol{W}_k \boldsymbol{x}_n\right)} e^{i(m-n) \theta}\right] \end{aligned} \]

-

From the expression of \(\left(\boldsymbol{W}_q \boldsymbol{x}_m\right)^T e^{-i m \theta}\), we can design the rotary embedding setup in the llama2.

+

From the expression of \(f_q\left(\boldsymbol{x}_m, m\right) =\left(\boldsymbol{W}_q \boldsymbol{x}_m\right) e^{i m \theta}\), we can design the rotary embedding setup in llama2. It is important to note that we introduce complex numbers because we wish to combine the magnitude and the angle in a single quantity: the embedding itself provides the magnitude, and the angle comes from the position. Since real matrix multiplication can only perform real-valued calculations, we need the mapping above.

+

To put it another way, the real and imaginary parts of a complex number both carry useful information. We can express the vectors as complex numbers so that we can incorporate the angle (phase) information from the vectors. Finally, we still need to map back to real operations and continue with that useful information. The complex representation acts as an auxiliary intermediate state that helps us process the information.
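To make the "map back to real operations" step concrete, here is a hedged Python sketch of how the complex rotation can be carried out with purely real arithmetic on pairs of embedding dimensions. The function name apply_rotary, the frequency schedule, and the toy vectors are illustrative assumptions, not code taken from llama2.

```python
import math

def apply_rotary(x, pos, theta_base=10000.0):
    """Rotate consecutive pairs (x[2t], x[2t+1]) by pos * theta_t.

    This mirrors the complex form (x[2t] + i*x[2t+1]) * e^{i * pos * theta_t},
    implemented with real arithmetic only.
    """
    d = len(x)
    out = []
    for t in range(d // 2):
        theta = theta_base ** (-2 * t / d)           # per-pair frequency
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * t], x[2 * t + 1]
        out.extend([a * c - b * s, a * s + b * c])    # 2x2 rotation
    return out

# Toy query/key vectors at positions m and n.
q, k, m, n = [0.3, -1.2, 0.5, 0.8], [0.7, 0.4, -0.2, 1.1], 7, 3
score = sum(a * b for a, b in zip(apply_rotary(q, m), apply_rotary(k, n)))

# Shifting both positions by the same offset leaves the score unchanged,
# so it depends only on the contents and on m - n.
shifted = sum(a * b for a, b in zip(apply_rotary(q, m + 10), apply_rotary(k, n + 10)))
assert abs(score - shifted) < 1e-9
print(score)
```

Because each pair of dimensions is rotated by a position-dependent angle, the dot product of a rotated query and key reduces to the relative-position form derived above.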

diff --git a/search.json b/search.json index 9def958..7aaa6b7 100644 --- a/search.json +++ b/search.json @@ -162,7 +162,7 @@ "href": "notes.html", "title": "Research notes", "section": "", - "text": "Large language model distributed training\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nComplex analysis for machine learning\n\n\n\n\n\n\nMath Theories\n\n\n\n\n\n\n\n\n\n4 min\n\n\n\n\n\n\n\n\n\n\n\n\nLarge language model evaluation\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n11 min\n\n\n\n\n\n\n\n\n\n\n\n\nMixture of expert\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nScalable diffusion models with transformers\n\n\n\n\n\n\nDiffusion Model\n\n\n\n\n\n\n\n\n\n1 min\n\n\n\n\n\n\n\n\n\n\n\n\nReinforcement learning for large language model\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n19 min\n\n\n\n\n\n\nNo matching items" + "text": "Large language model distributed training\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nComplex analysis for machine learning\n\n\n\n\n\n\nMath Theories\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nLarge language model evaluation\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n11 min\n\n\n\n\n\n\n\n\n\n\n\n\nMixture of expert\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nScalable diffusion models with transformers\n\n\n\n\n\n\nDiffusion Model\n\n\n\n\n\n\n\n\n\n1 min\n\n\n\n\n\n\n\n\n\n\n\n\nReinforcement learning for large language model\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n19 min\n\n\n\n\n\n\nNo matching items" }, { "objectID": "index.html", @@ -284,7 +284,7 @@ "href": "notes/Math Theories/complexanalysis.html#consider-the-rotary-embedding-using-complex-analysis", "title": "Complex analysis for machine learning", "section": "Consider the rotary embedding using complex analysis", - "text": "Consider the rotary embedding using complex analysis\nThe token positional embedding is used to capture the features of token because of its position in the sequence. To put it simple, the token in the position 0 has different contribution from the token to the position 10.\n\n\n\n\n\n\nTip\n\n\n\nšŸ“ Paper: ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING\n\n\nThe present research indicate that we wish to add the absolute embedding and the relative embedding to the token based on its position. The absolute embedding is just decided by the position of the token. And the relative embedding is decided by the relative position of the token. Note that for the embedding computation, we need to compute the attendence score between the token at the position of \\(m\\) and \\(n\\):\n\\[\n\\begin{aligned}\n\\boldsymbol{q}_m & =f_q\\left(\\boldsymbol{x}_m, m\\right) \\\\\n\\boldsymbol{k}_n & =f_k\\left(\\boldsymbol{x}_n, n\\right) \\\\\n\\boldsymbol{v}_n & =f_v\\left(\\boldsymbol{x}_n, n\\right).\n\\end{aligned}\n\\]\nWhen you consider this problem, try to simulate the LLM inference (not training). The query is the new token to be predicted, and the key and value are the existing tokens. Now, we also need to consider the position info between them. 
The original transformer paper uses the absolute position embedding:\n\\[\nf_{t: t \\in\\{q, k, v\\}}\\left(\\boldsymbol{x}_i, i\\right):=\\boldsymbol{W}_{t: t \\in\\{q, k, v\\}}\\left(\\boldsymbol{x}_i+\\boldsymbol{p}_i\\right),\n\\]\nand\n\\[\n\\begin{cases}\\boldsymbol{p}_{i, 2 t} & =\\sin \\left(k / 10000^{2 t / d}\\right) \\\\ \\boldsymbol{p}_{i, 2 t+1} & =\\cos \\left(k / 10000^{2 t / d}\\right)\\end{cases}\n\\]\nIf we think about this structure further, we found that the sin and cos function is the periodic functions, which means for the same relative distance, we could observe similar embedding.\nAnother relative positional embedding is to note that the relative position between the token \\(m\\) and \\(n\\) is \\(m-n\\), and the embedding is dependent on the \\(m-n\\), the difference. Note that we need to use \\(\\boldsymbol{q}_m^T\\boldsymbol{k}_n\\), and this should be able to reflect the relative position information between the two tokens. And the current research indicates that the relative position embedding is important for the positional information. We wish the inner product encodes position information by:\n\\[\n\\left\\langle f_q\\left(\\boldsymbol{x}_m, m\\right), f_k\\left(\\boldsymbol{x}_n, n\\right)\\right\\rangle=g\\left(\\boldsymbol{x}_m, \\boldsymbol{x}_n, m-n\\right) .\n\\]\nThe idea here is that we express the relative position as the information of angle rather than the position in a linear segment. And we can use complex analysis for it. It is like the signal processing where each signal has the frequency and the magnitude. Suppose \\(d=2\\), and we can assume the embedding information as:\n\\[\n\\begin{aligned}\nf_q\\left(\\boldsymbol{x}_q, m\\right) & =R_q\\left(\\boldsymbol{x}_q, m\\right) e^{i \\Theta_q\\left(\\boldsymbol{x}_q, m\\right)} \\\\\nf_k\\left(\\boldsymbol{x}_k, n\\right) & =R_k\\left(\\boldsymbol{x}_k, n\\right) e^{i \\Theta_k\\left(\\boldsymbol{x}_k, n\\right)} \\\\\ng\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right) & =R_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right) e^{i \\Theta_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right)}\n\\end{aligned}\n\\]\nThus, using this information, we have\n\\[\n\\begin{aligned}\nR_q\\left(\\boldsymbol{x}_q, m\\right) R_k\\left(\\boldsymbol{x}_k, n\\right) & =R_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right), \\\\\n\\Theta_k\\left(\\boldsymbol{x}_k, n\\right)-\\Theta_q\\left(\\boldsymbol{x}_q, m\\right) & =\\Theta_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right),\n\\end{aligned}\n\\]\nAfter derivation, we found that if we choose the following expression, we can satisfy the condition above:\n\\[\n\\begin{aligned}\nf_q\\left(\\boldsymbol{x}_m, m\\right) & =\\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right) e^{i m \\theta} \\\\\nf_k\\left(\\boldsymbol{x}_n, n\\right) & =\\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right) e^{i n \\theta} \\\\\ng\\left(\\boldsymbol{x}_m, \\boldsymbol{x}_n, m-n\\right) & =\\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)\\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right)^T e^{i(m-n) \\theta}\n\\end{aligned}\n\\]\nThe derivation is as the following:\n\\[\n\\begin{aligned}\n\\langle f_q, f_k\\rangle &= f_q^* f_k \\\\\n &= \\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)^* e^{-i m \\theta} \\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right) e^{i n \\theta} \\\\\n &= \\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)^* \\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right) e^{i(n-m) \\theta} \\\\\n &= \\left(\\boldsymbol{W}_q 
\\boldsymbol{x}_m\\right)^T \\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right) e^{i(n-m) \\theta}.\n\\end{aligned}\n\\]\nFrom the expression of \\(\\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)^T e^{-i m \\theta}\\), we can design the rotary embedding setup in the llama2.", + "text": "Consider the rotary embedding using complex analysis\nThe token positional embedding is used to capture the features of token because of its position in the sequence. To put it simple, the token in the position 0 has different contribution from the token to the position 10.\n\n\n\n\n\n\nTip\n\n\n\nšŸ“ Paper: ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING\n\n\nThe present research indicate that we wish to add the absolute embedding and the relative embedding to the token based on its position. The absolute embedding is just decided by the position of the token. And the relative embedding is decided by the relative position of the token. Note that for the embedding computation, we need to compute the attendence score between the token at the position of \\(m\\) and \\(n\\):\n\\[\n\\begin{aligned}\n\\boldsymbol{q}_m & =f_q\\left(\\boldsymbol{x}_m, m\\right) \\\\\n\\boldsymbol{k}_n & =f_k\\left(\\boldsymbol{x}_n, n\\right) \\\\\n\\boldsymbol{v}_n & =f_v\\left(\\boldsymbol{x}_n, n\\right).\n\\end{aligned}\n\\]\nWhen you consider this problem, try to simulate the LLM inference (not training). The query is the new token to be predicted, and the key and value are the existing tokens. Now, we also need to consider the position info between them. The original transformer paper uses the absolute position embedding:\n\\[\nf_{t: t \\in\\{q, k, v\\}}\\left(\\boldsymbol{x}_i, i\\right):=\\boldsymbol{W}_{t: t \\in\\{q, k, v\\}}\\left(\\boldsymbol{x}_i+\\boldsymbol{p}_i\\right),\n\\]\nand\n\\[\n\\begin{cases}\\boldsymbol{p}_{i, 2 t} & =\\sin \\left(k / 10000^{2 t / d}\\right) \\\\ \\boldsymbol{p}_{i, 2 t+1} & =\\cos \\left(k / 10000^{2 t / d}\\right)\\end{cases}\n\\]\nIf we think about this structure further, we found that the sin and cos function is the periodic functions, which means for the same relative distance, we could observe similar embedding.\nAnother relative positional embedding is to note that the relative position between the token \\(m\\) and \\(n\\) is \\(m-n\\), and the embedding is dependent on the \\(m-n\\), the difference. Note that we need to use \\(\\boldsymbol{q}_m^T\\boldsymbol{k}_n\\), and this should be able to reflect the relative position information between the two tokens. And the current research indicates that the relative position embedding is important for the positional information. We wish the inner product encodes position information by:\n\\[\n\\left\\langle f_q\\left(\\boldsymbol{x}_m, m\\right), f_k\\left(\\boldsymbol{x}_n, n\\right)\\right\\rangle=g\\left(\\boldsymbol{x}_m, \\boldsymbol{x}_n, m-n\\right) .\n\\]\nThe idea here is that we express the relative position as the information of angle rather than the position in a linear segment. And we can use complex analysis for it. It is like the signal processing where each signal has the frequency and the magnitude. 
Suppose \\(d=2\\), and we can assume the embedding information as:\n\\[\n\\begin{aligned}\nf_q\\left(\\boldsymbol{x}_q, m\\right) & =R_q\\left(\\boldsymbol{x}_q, m\\right) e^{i \\Theta_q\\left(\\boldsymbol{x}_q, m\\right)} \\\\\nf_k\\left(\\boldsymbol{x}_k, n\\right) & =R_k\\left(\\boldsymbol{x}_k, n\\right) e^{i \\Theta_k\\left(\\boldsymbol{x}_k, n\\right)} \\\\\ng\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right) & =R_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right) e^{i \\Theta_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right)}\n\\end{aligned}\n\\]\nThus, using this information, we have\n\\[\n\\begin{aligned}\nR_q\\left(\\boldsymbol{x}_q, m\\right) R_k\\left(\\boldsymbol{x}_k, n\\right) & =R_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right), \\\\\n\\Theta_k\\left(\\boldsymbol{x}_k, n\\right)-\\Theta_q\\left(\\boldsymbol{x}_q, m\\right) & =\\Theta_g\\left(\\boldsymbol{x}_q, \\boldsymbol{x}_k, n-m\\right),\n\\end{aligned}\n\\]\nAfter derivation, we found that if we choose the following expression, we can satisfy the condition above:\n\\[\n\\begin{aligned}\nf_q\\left(\\boldsymbol{x}_m, m\\right) & =\\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right) e^{i m \\theta} \\\\\nf_k\\left(\\boldsymbol{x}_n, n\\right) & =\\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right) e^{i n \\theta} \\\\\ng\\left(\\boldsymbol{x}_m, \\boldsymbol{x}_n, m-n\\right) & =\\operatorname{Re}\\left[\\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right)\\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right)^* e^{i(m-n) \\theta}\\right]\n\\end{aligned}\n\\]\nThe derivation is as the following (This is not shown in the paper, you can use the derivation below to understand the paper better):\nNote that if we express two vectors as \\(z_1 = a + bi\\) and \\(z_2 = c + di\\), the inner product is \\(ac + bd\\). How would this be related to the multiplication of the two vectors? We actually have: \\(ac-bd = \\operatorname{Re}(z_1 * \\overline{z_2})\\).\n\\[\n\\begin{aligned}\n\\langle f_q, f_k\\rangle &= \\operatorname{Re}(f_q * \\overline{f_k}) \\\\\n &= \\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right) e^{i m \\theta} \\overline{\\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right) e^{i n \\theta}} \\\\\n &= \\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right) \\overline{\\left(\\boldsymbol{W}_k \\boldsymbol{x}_n\\right)} e^{i(n-m) \\theta}\n\\end{aligned}\n\\]\nFrom the expression of \\(f_q\\left(\\boldsymbol{x}_m, m\\right) =\\left(\\boldsymbol{W}_q \\boldsymbol{x}_m\\right) e^{i m \\theta}\\), we can design the rotary embedding setup in the llama2. It is important to note here that we introduce the complex number since we wish to integrate the meaning of the magnitude and the angle. The embedding itself represents the magnitude, and the angle is from the position. In real matrix multiplication, we can only do the real calculation, thus, we need the mapping above.\nWe can put it another way. The real and imaginary part of the complex number are useful information for us. We can express the vectors using complex theory, so that we can incorporate the angle or phase information from the vectors. Finally, we still need to map back to the real operations and proceed it with useful information. 
It is like a auxiliary method to help process information using some intermediate state.", "crumbs": [ "Home", "ā™¾ **Math Theories**", diff --git a/sitemap.xml b/sitemap.xml index 7619d45..bdd3bac 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -42,7 +42,7 @@ https://alexchen4ai.github.io/blog/notes/Math Theories/complexanalysis.html - 2024-03-11T05:59:41.487Z + 2024-03-11T07:11:25.495Z https://alexchen4ai.github.io/blog/news/Research news/researchnews.html