
Commit

Built site for gh-pages
alexchen4ai committed Nov 12, 2024
1 parent 702dfe4 commit f275151
Showing 6 changed files with 6 additions and 4 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
11a1aa0f
5a52c4a5
2 changes: 1 addition & 1 deletion notes.html
@@ -220,7 +220,7 @@ <h1 class="title">Research notes</h1>

<div class="quarto-listing quarto-listing-container-default" id="listing-listing">
<div class="list quarto-listing-default">
<div class="quarto-post image-right" data-index="0" data-categories="Large Language Models" data-listing-date-sort="1713596400000" data-listing-file-modified-sort="1731433677233" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="5" data-listing-word-count-sort="810">
<div class="quarto-post image-right" data-index="0" data-categories="Large Language Models" data-listing-date-sort="1713596400000" data-listing-file-modified-sort="1731433836344" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="5" data-listing-word-count-sort="895">
<div class="thumbnail">
<p><a href="./notes/Large Language Model/inference_optimize.html" class="no-external"></a></p><a href="./notes/Large Language Model/inference_optimize.html" class="no-external">
<div class="listing-item-img-placeholder card-img-top" >&nbsp;</div>
1 change: 1 addition & 0 deletions notes.xml
@@ -60,6 +60,7 @@ Tip
<p>Quantization is a model compression technique that converts the weights and activations within an LLM from a high-precision data representation to a lower-precision one, i.e., from a data type that can hold more information to one that holds less. A typical example is converting data from a <code>32-bit</code> floating-point number (<code>FP32</code>) to an <code>8-bit</code> or <code>4-bit</code> integer (<code>INT8</code> or <code>INT4</code>). A good blog post on the topic is available <a href="https://symbl.ai/developers/blog/a-guide-to-quantization-in-llms/">here</a>. The conversion decreases memory and disk usage considerably. Note, however, that for the actual computation we still need to <strong>dequantize</strong> the data back to the original data type, such as <code>float32</code> or <code>bfloat16</code>. The trick is that we only dequantize values when they are needed for a computation, while keeping most of the data in the quantized format, so the memory and disk savings are preserved.</p>
</div>
</div>
<p><strong>Quantization techniques.</strong> There are many ways to quantize a model. The two main approaches are <em>Post-Training Quantization</em> (PTQ) and <em>Quantization-Aware Training</em> (QAT): PTQ quantizes the model after training, while QAT simulates quantization during training so the weights adapt to the lower precision. PTQ can further be divided into <em>Dynamic Quantization</em>, which computes activation quantization parameters on the fly during inference, and <em>Static Quantization</em>, which calibrates them ahead of inference.</p>
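As a concrete illustration of the dynamic flavour of post-training quantization mentioned in the paragraph above, here is a minimal sketch using PyTorch's torch.quantization.quantize_dynamic; the toy two-layer model is an assumed placeholder for illustration, not code from this site:

# Minimal sketch of dynamic post-training quantization (PTQ) with PyTorch.
# The toy model below is a stand-in; an LLM would be handled the same way
# by targeting its nn.Linear layers.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 128),
)
model.eval()  # PTQ operates on an already-trained model

# Weights of nn.Linear are converted to INT8 ahead of time; activations are
# quantized on the fly per batch at inference, matching "dynamic quantization".
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    y = quantized_model(x)
print(y.shape)  # torch.Size([1, 128])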
<p>Let’s first revisit how data is represented in a computer. We mainly study the <code>float32</code>, <code>float16</code>, and <code>bfloat16</code> types.</p>
<ul>
<li><strong>float32</strong>: 32 bits. We have 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. To form a floating-point number, the computer combines the sign, the significand (the number in front of the power of two), and the exponent applied to base 2. For example, <img src="https://latex.codecogs.com/png.latex?6.75=+1.1011%5Ctimes%202%5E2">. Thus, the magnitude of representable (normal) values ranges roughly between <img src="https://latex.codecogs.com/png.latex?1%5Ctimes%2010%5E%7B-38%7D"> and <img src="https://latex.codecogs.com/png.latex?3%5Ctimes%2010%5E%7B38%7D"> (the sign can be chosen freely).</li>
1 change: 1 addition & 0 deletions notes/Large Language Model/inference_optimize.html
@@ -332,6 +332,7 @@ <h2 class="anchored" data-anchor-id="quantization">Quantization</h2>
<p>Quantization is a model compression technique that converts the weights and activations within an LLM from a high-precision data representation to a lower-precision one, i.e., from a data type that can hold more information to one that holds less. A typical example is converting data from a <code>32-bit</code> floating-point number (<code>FP32</code>) to an <code>8-bit</code> or <code>4-bit</code> integer (<code>INT8</code> or <code>INT4</code>). A good blog post on the topic is available <a href="https://symbl.ai/developers/blog/a-guide-to-quantization-in-llms/">here</a>. The conversion decreases memory and disk usage considerably. Note, however, that for the actual computation we still need to <strong>dequantize</strong> the data back to the original data type, such as <code>float32</code> or <code>bfloat16</code>. The trick is that we only dequantize values when they are needed for a computation, while keeping most of the data in the quantized format, so the memory and disk savings are preserved.</p>
</div>
</div>
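To make the quantize/dequantize round trip from the Tip above concrete, here is a minimal NumPy sketch of the affine scheme x_q = round(x/S + Z) that the note introduces later; the per-tensor (single-block) scale and the random test tensor are illustrative assumptions, not the note's actual code:

# Minimal sketch of affine (asymmetric) INT8 quantization and dequantization.
# A single per-tensor scale/zero-point is used for simplicity; real
# implementations usually work block-wise, with one (S, Z) pair per block.
import numpy as np

def quantize(x, n_bits=8):
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1  # -128..127 for INT8
    S = (x.max() - x.min()) / (qmax - qmin)                   # scale factor
    Z = np.round(qmin - x.min() / S)                          # zero point
    x_q = np.clip(np.round(x / S + Z), qmin, qmax).astype(np.int8)
    return x_q, S, Z

def dequantize(x_q, S, Z):
    # Convert back to float only when the value is needed for computation.
    return S * (x_q.astype(np.float32) - Z)

x = np.random.randn(4, 4).astype(np.float32)
x_q, S, Z = quantize(x)
x_hat = dequantize(x_q, S, Z)
print(np.abs(x - x_hat).max())  # small round-trip quantization error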
<p><strong>Quantization techniques.</strong> There are many ways to quantize a model. The two main approaches are <em>Post-Training Quantization</em> (PTQ) and <em>Quantization-Aware Training</em> (QAT): PTQ quantizes the model after training, while QAT simulates quantization during training so the weights adapt to the lower precision. PTQ can further be divided into <em>Dynamic Quantization</em>, which computes activation quantization parameters on the fly during inference, and <em>Static Quantization</em>, which calibrates them ahead of inference.</p>
<p>Let’s first revisit how data is represented in a computer. We mainly study the <code>float32</code>, <code>float16</code>, and <code>bfloat16</code> types.</p>
<ul>
<li><strong>float32</strong>: 32 bits. We have 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. To form a floating-point number, the computer combines the sign, the significand (the number in front of the power of two), and the exponent applied to base 2. For example, <span class="math inline">\(6.75=+1.1011\times 2^2\)</span>. Thus, the magnitude of representable (normal) values ranges roughly between <span class="math inline">\(1\times 10^{-38}\)</span> and <span class="math inline">\(3\times 10^{38}\)</span> (the sign can be chosen freely).</li>
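The range-versus-precision trade-off between float32, float16, and bfloat16 described in the list above can be checked directly; a small sketch using torch.finfo follows (PyTorch is an assumed dependency here, chosen only because plain NumPy has no bfloat16 type):

# Inspect the floating-point formats discussed above.
# float16 spends more bits on the mantissa (precision), bfloat16 on the
# exponent (range); bfloat16 covers roughly the same range as float32.
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16s} max={info.max:.3e}  "
          f"smallest normal={info.tiny:.3e}  eps={info.eps:.3e}")

# Approximate output:
#   torch.float32   max=3.403e+38  smallest normal=1.175e-38  eps=1.192e-07
#   torch.float16   max=6.550e+04  smallest normal=6.104e-05  eps=9.766e-04
#   torch.bfloat16  max=3.390e+38  smallest normal=1.175e-38  eps=7.812e-03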
2 changes: 1 addition & 1 deletion search.json
@@ -136,7 +136,7 @@
"href": "notes/Large Language Model/inference_optimize.html#quantization",
"title": "Optimization for Inference of Large Language Model",
"section": "Quantization",
"text": "Quantization\n\n\n\n\n\n\nTip\n\n\n\nQuantization is a model compression technique that converts the weights and activations within an LLM from a high-precision data representation to a lower-precision data representation, i.e., from a data type that can hold more information to one that holds less. A typical example of this is the conversion of data from a 32-bit floating-point number (FP32) to an 8-bit or 4-bit integer (INT4 or INT8). A good blog from internet is here. We note that the conversion will decrease the memory and disk usage considerably. We note that for real calculation, we still need to dequantize the data to the original data type like float32 or bfloat16. The trick is that we only dequantize the data when we need to calculate the data while keeping the most of data in the quantized format. Therefore, we still save the memory and disk usage.\n\n\nLet’s first revisit the representation of data in computer. We mainly study the float32, float16 and bfloat16 type.\n\nfloat32: 32 bits. We have 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa. To form a float number in computer, we need the sign, the number before the exponent and the exponent number over 2. For example, we have \\(6.75=+1.1011\\times 2^2\\). Thus, we can conclude that the range of the representation is between \\(1\\times 10^{-38}\\) and \\(3\\times 10^{38}\\) (you can add sign freely, though).\nfloat16: 16 bits. We have 1 bit for the sign, 5 bits for the exponent and 10 bits for the mantissa. The range of the representation is between \\(6\\times 10^{-8}\\) and \\(6\\times 10^{4}\\).\nbfloat16: 16 bits. We have 1 bit for the sign, 8 bits for the exponent and 7 bits for the mantissa. The range of the representation is between \\(1\\times 10^{-38}\\) and \\(3\\times 10^{38}\\).\n\nWe can see that float16 and bfloat16 take up the same memory space. But they are different in the bits allocation. The float16 has better precision than bfloat16, but the bfloat16 has better range than float16. For deep neural network, we may need to consider the use of the bfloat16 type since the range is more important than the precision for the deep neural network. The common quantization type are INT8 and INT4. Note that INT8 and INT4 can only represent the integer numbers, not for the float numbers. Thus, INT8 can only represent the numbers between \\(-128\\) and \\(127\\), and INT4 can only represent the numbers between \\(-8\\) and \\(7\\).\nWe use the affine quantization scheme to convert the model:\n\\[\nx_q = \\operatorname{round}\\left(x/S + Z\\right)\n\\]\nwhere we have: - \\(x_q\\): the quantized value - \\(x\\): the original value - \\(S\\): the scale factor - \\(Z\\): the zero point - \\(\\operatorname{round}\\): the rounding function.\nUsually, we will set multiple blocks to quantize the model. It means that we need multiple scale factors and zero points. Note that not all layers are quantized. For some important layers, we still consider the use of the float32 type.\nFor LLM quantization, we have two different methods called post-training quantization and quantization-aware training. If we finally use the quantization model, quantization-aware training is better.\n\nExisiting solutions\nWe can use quantization library provied in huggingface transformers. For more foundamental optimization, we should consider to use GGML (GPT-Generated Model Language) and GGUF (GPT-Generated Unified Format). For on-device deployment, we should consider the usage of GGUF since it is more efficient. 
Refer to github to use it. We can consider another library called ollama which is built based on the llama cpp.",
"text": "Quantization\n\n\n\n\n\n\nTip\n\n\n\nQuantization is a model compression technique that converts the weights and activations within an LLM from a high-precision data representation to a lower-precision data representation, i.e., from a data type that can hold more information to one that holds less. A typical example of this is the conversion of data from a 32-bit floating-point number (FP32) to an 8-bit or 4-bit integer (INT4 or INT8). A good blog from internet is here. We note that the conversion will decrease the memory and disk usage considerably. We note that for real calculation, we still need to dequantize the data to the original data type like float32 or bfloat16. The trick is that we only dequantize the data when we need to calculate the data while keeping the most of data in the quantized format. Therefore, we still save the memory and disk usage.\n\n\n::: {.Quantization techniques} We note there are many methods to quantize the model. The main methods are Post Training Quantization and Quantization Aware Training. The main difference between the two methods is that the Post Training Quantization quantizes the model after the training, while the Quantization Aware Training quantizes the model during the training. The Post Training Quantization includes the Dynamic Quantization and Static Quantization. The Dynamic Quantization quantizes the model dynamically during the inference, while the Static Quantization quantizes the model statically before the inference. :::\nLet’s first revisit the representation of data in computer. We mainly study the float32, float16 and bfloat16 type.\n\nfloat32: 32 bits. We have 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa. To form a float number in computer, we need the sign, the number before the exponent and the exponent number over 2. For example, we have \\(6.75=+1.1011\\times 2^2\\). Thus, we can conclude that the range of the representation is between \\(1\\times 10^{-38}\\) and \\(3\\times 10^{38}\\) (you can add sign freely, though).\nfloat16: 16 bits. We have 1 bit for the sign, 5 bits for the exponent and 10 bits for the mantissa. The range of the representation is between \\(6\\times 10^{-8}\\) and \\(6\\times 10^{4}\\).\nbfloat16: 16 bits. We have 1 bit for the sign, 8 bits for the exponent and 7 bits for the mantissa. The range of the representation is between \\(1\\times 10^{-38}\\) and \\(3\\times 10^{38}\\).\n\nWe can see that float16 and bfloat16 take up the same memory space. But they are different in the bits allocation. The float16 has better precision than bfloat16, but the bfloat16 has better range than float16. For deep neural network, we may need to consider the use of the bfloat16 type since the range is more important than the precision for the deep neural network. The common quantization type are INT8 and INT4. Note that INT8 and INT4 can only represent the integer numbers, not for the float numbers. Thus, INT8 can only represent the numbers between \\(-128\\) and \\(127\\), and INT4 can only represent the numbers between \\(-8\\) and \\(7\\).\nWe use the affine quantization scheme to convert the model:\n\\[\nx_q = \\operatorname{round}\\left(x/S + Z\\right)\n\\]\nwhere we have: - \\(x_q\\): the quantized value - \\(x\\): the original value - \\(S\\): the scale factor - \\(Z\\): the zero point - \\(\\operatorname{round}\\): the rounding function.\nUsually, we will set multiple blocks to quantize the model. It means that we need multiple scale factors and zero points. 
Note that not all layers are quantized. For some important layers, we still consider the use of the float32 type.\nFor LLM quantization, we have two different methods called post-training quantization and quantization-aware training. If we finally use the quantization model, quantization-aware training is better.\n\nExisiting solutions\nWe can use quantization library provied in huggingface transformers. For more foundamental optimization, we should consider to use GGML (GPT-Generated Model Language) and GGUF (GPT-Generated Unified Format). For on-device deployment, we should consider the usage of GGUF since it is more efficient. Refer to github to use it. We can consider another library called ollama which is built based on the llama cpp.",
"crumbs": [
"Home",
"🗣️ **Large language models**",
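The "Existing solutions" paragraph indexed above mentions the quantization support in Hugging Face Transformers; the following is a hedged sketch of loading a causal LM in 4-bit via the bitsandbytes backend (the model id is a placeholder, and the exact config fields may differ across library versions):

# Sketch: 4-bit post-training quantization at load time with Transformers + bitsandbytes.
# Requires a CUDA GPU and the `bitsandbytes` package; the model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any causal LM works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights as 4-bit blocks
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bfloat16 for matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Quantization reduces memory because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))

For on-device GGUF deployment, the analogous workflow would be converting the model with llama.cpp's conversion tooling and running it through its CLI or through Ollama, as the note suggests.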
2 changes: 1 addition & 1 deletion sitemap.xml
@@ -14,7 +14,7 @@
</url>
<url>
<loc>https://alexchen4ai.github.io/blog/notes/Large Language Model/inference_optimize.html</loc>
<lastmod>2024-11-12T17:47:57.233Z</lastmod>
<lastmod>2024-11-12T17:50:36.344Z</lastmod>
</url>
<url>
<loc>https://alexchen4ai.github.io/blog/notes/Large Language Model/rl_llm.html</loc>
