diff --git a/.nojekyll b/.nojekyll
index 8794e57..83323b1 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-e177b579
\ No newline at end of file
+11a1aa0f
\ No newline at end of file
diff --git a/notes.html b/notes.html
index 0e523d6..9805e33 100644
--- a/notes.html
+++ b/notes.html
@@ -220,7 +220,7 @@

Research notes

-
+

 
@@ -245,7 +245,7 @@

diff --git a/notes.xml b/notes.xml
index 21e27e2..31125cd 100644
--- a/notes.xml
+++ b/notes.xml
@@ -57,14 +57,14 @@ Tip

-

Quantization is a model compression technique that converts the weights and activations within an LLM from a high-precision data representation to a lower-precision data representation, i.e., from a data type that can hold more information to one that holds less. A typical example of this is the conversion of data from a 32-bit floating-point number (FP32) to an 8-bit or 4-bit integer (INT4 or INT8). A good blog from internet is here.

+

Quantization is a model compression technique that converts the weights and activations within an LLM from a high-precision data representation to a lower-precision one, i.e., from a data type that can hold more information to one that holds less. A typical example is the conversion of data from a 32-bit floating-point number (FP32) to an 8-bit or 4-bit integer (INT8 or INT4). A good blog post on the topic is here. Note that the conversion decreases memory and disk usage considerably. For the actual computation, however, we still need to dequantize the data back to the original data type such as float32 or bfloat16. The trick is that we only dequantize values at the moment they are needed for a calculation, while keeping most of the data in the quantized format, so we still save memory and disk space.

Let’s first revisit how data is represented in a computer. We mainly study the float32, float16 and bfloat16 types.

We can see that float16 and bfloat16 take up the same memory space, but they differ in how the bits are allocated: float16 has better precision than bfloat16, while bfloat16 has a wider range than float16. For deep neural networks, we may prefer bfloat16, since range matters more than precision in this setting. The common quantization types are INT8 and INT4. Note that INT8 and INT4 can only represent integers, not floating-point numbers. Thus, INT8 can only represent numbers between -128 and 127, and INT4 can only represent numbers between -8 and 7.
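To make these ranges concrete, here is a small sketch, assuming PyTorch is installed; it only inspects dtype metadata, so the exact print format is our own choice:

```python
import torch

# torch.finfo reports the limits of a floating-point dtype,
# torch.iinfo the limits of an integer dtype.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: min={info.min:.2e}, max={info.max:.2e}, eps={info.eps:.2e}")

info = torch.iinfo(torch.int8)
print(f"{torch.int8}: min={info.min}, max={info.max}")   # -128 .. 127
# PyTorch has no plain int4 dtype; the -8 .. 7 range follows from 4 two's-complement bits.
```

Running this shows float16 topping out near 6.6e4 while bfloat16 reaches about 3.4e38, which is why bfloat16 is usually preferred when range matters more than precision.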

We use the affine quantization scheme to convert the model:
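The affine scheme maps a float x to an integer via x_q = round(x / S + Z) and recovers an approximation with x ≈ (x_q − Z) · S. Below is a minimal sketch, assuming PyTorch; the function names and the min/max-based choice of S and Z are ours for illustration, not taken from a specific library:

```python
import torch

def quantize_affine(x: torch.Tensor):
    """Affine (asymmetric) quantization to INT8: x_q = round(x / S + Z)."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)          # S: float range mapped onto the integer range
    zero_point = torch.round(qmin - x.min() / scale)     # Z: integer offset so that x.min() maps near qmin
    x_q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax).to(torch.int8)
    return x_q, scale, zero_point

def dequantize_affine(x_q: torch.Tensor, scale, zero_point):
    """Approximate inverse: x ≈ (x_q - Z) * S."""
    return (x_q.float() - zero_point) * scale

w = torch.randn(4, 8)                                    # a toy weight tensor
w_q, S, Z = quantize_affine(w)
print(w_q.dtype, w_q.element_size())                     # torch.int8, 1 byte per value instead of 4
print((w - dequantize_affine(w_q, S, Z)).abs().max())    # small rounding error
```

The rounding step is the main source of quantization error; the scale and zero point themselves are just a linear change of coordinates.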

diff --git a/notes/Large Language Model/inference_optimize.html b/notes/Large Language Model/inference_optimize.html
index fe69d41..a313e19 100644
--- a/notes/Large Language Model/inference_optimize.html
+++ b/notes/Large Language Model/inference_optimize.html
@@ -329,14 +329,14 @@

Quantization

-

Quantization is a model compression technique that converts the weights and activations within an LLM from a high-precision data representation to a lower-precision data representation, i.e., from a data type that can hold more information to one that holds less. A typical example of this is the conversion of data from a 32-bit floating-point number (FP32) to an 8-bit or 4-bit integer (INT4 or INT8). A good blog from internet is here.

+

Quantization is a model compression technique that converts the weights and activations within an LLM from a high-precision data representation to a lower-precision one, i.e., from a data type that can hold more information to one that holds less. A typical example is the conversion of data from a 32-bit floating-point number (FP32) to an 8-bit or 4-bit integer (INT8 or INT4). A good blog post on the topic is here. Note that the conversion decreases memory and disk usage considerably. For the actual computation, however, we still need to dequantize the data back to the original data type such as float32 or bfloat16. The trick is that we only dequantize values at the moment they are needed for a calculation, while keeping most of the data in the quantized format, so we still save memory and disk space.
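To illustrate the dequantize-only-when-needed idea, here is a hedged sketch of a linear layer that stores its weight as INT8 and converts it back to the compute dtype only inside the forward pass. This assumes PyTorch; the class name QuantizedLinear and the naive per-tensor scheme are ours for illustration, whereas real libraries typically use optimized kernels instead of materializing the full dequantized weight like this.

```python
import torch
import torch.nn as nn

class QuantizedLinear(nn.Module):
    """Stores the weight as INT8 plus a per-tensor scale and zero point;
    dequantizes to the compute dtype only inside forward(). Illustrative sketch."""

    def __init__(self, weight: torch.Tensor, compute_dtype=torch.bfloat16):
        super().__init__()
        qmin, qmax = -128, 127
        scale = (weight.max() - weight.min()) / (qmax - qmin)
        zero_point = torch.round(qmin - weight.min() / scale)
        w_q = torch.clamp(torch.round(weight / scale + zero_point), qmin, qmax).to(torch.int8)
        # The int8 weights live in memory: roughly 4x smaller than the float32 original.
        self.register_buffer("w_q", w_q)
        self.register_buffer("scale", scale)
        self.register_buffer("zero_point", zero_point)
        self.compute_dtype = compute_dtype

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize just in time for the matmul; the temporary is freed afterwards.
        w = (self.w_q.to(self.compute_dtype) - self.zero_point.to(self.compute_dtype)) \
            * self.scale.to(self.compute_dtype)
        return x.to(self.compute_dtype) @ w.t()

layer = QuantizedLinear(torch.randn(64, 32))   # weight shape (out_features, in_features)
y = layer(torch.randn(4, 32))                  # y has shape (4, 64)
```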

Let’s first revisit how data is represented in a computer. We mainly study the float32, float16 and bfloat16 types.

We can see that float16 and bfloat16 take up the same memory space, but they differ in how the bits are allocated: float16 has better precision than bfloat16, while bfloat16 has a wider range than float16. For deep neural networks, we may prefer bfloat16, since range matters more than precision in this setting. The common quantization types are INT8 and INT4. Note that INT8 and INT4 can only represent integers, not floating-point numbers. Thus, INT8 can only represent numbers between \(-128\) and \(127\), and INT4 can only represent numbers between \(-8\) and \(7\).

We use the affine quantization scheme to convert the model:
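Since a single scale factor rarely fits a whole weight matrix, the affine scheme is usually applied block-wise, with one scale factor and zero point per block. A minimal sketch of that idea, assuming PyTorch; the block size of 64 and the function names are illustrative choices only:

```python
import torch

def quantize_blockwise(w: torch.Tensor, block_size: int = 64):
    """Group-wise affine quantization: each contiguous block of `block_size`
    values gets its own scale S and zero point Z. Illustrative sketch only."""
    qmin, qmax = -128, 127
    flat = w.reshape(-1, block_size)              # assumes numel is divisible by block_size
    mins, maxs = flat.min(dim=1).values, flat.max(dim=1).values
    scales = (maxs - mins) / (qmax - qmin)        # one scale per block
    zero_points = torch.round(qmin - mins / scales)
    # (a real implementation would guard against constant blocks where max == min)
    w_q = torch.clamp(torch.round(flat / scales[:, None] + zero_points[:, None]),
                      qmin, qmax).to(torch.int8)
    return w_q, scales, zero_points

def dequantize_blockwise(w_q, scales, zero_points, shape):
    return ((w_q.float() - zero_points[:, None]) * scales[:, None]).reshape(shape)

w = torch.randn(128, 128)
w_q, s, z = quantize_blockwise(w)
w_hat = dequantize_blockwise(w_q, s, z, w.shape)
print((w - w_hat).abs().max())   # per-block error is smaller than with one global scale
```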

diff --git a/search.json b/search.json index a7aa8d2..fe79dfe 100644 --- a/search.json +++ b/search.json @@ -136,7 +136,7 @@ "href": "notes/Large Language Model/inference_optimize.html#quantization", "title": "Optimization for Inference of Large Language Model", "section": "Quantization", - "text": "Quantization\n\n\n\n\n\n\nTip\n\n\n\nQuantization is a model compression technique that converts the weights and activations within an LLM from a high-precision data representation to a lower-precision data representation, i.e., from a data type that can hold more information to one that holds less. A typical example of this is the conversion of data from a 32-bit floating-point number (FP32) to an 8-bit or 4-bit integer (INT4 or INT8). A good blog from internet is here.\n\n\nLet’s first revisit the representation of data in computer. We mainly study the float32, float16 and bfloat16 type.\n\nfloat32: 32 bits. We have 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa. To form a float number in computer, we need the sign, the number before the exponent and the exponent number over 2. For example, we have \\(6.75=+1.1011\\times 2^2\\). Thus, we can conclude that the range of the representation is between \\(1e^{-38}\\) and \\(3e^{38}\\) (you can add sign freely, though).\nfloat16: 16 bits. We have 1 bit for the sign, 5 bits for the exponent and 10 bits for the mantissa. The range of the representation is between \\(6e^{-8}\\) and \\(6e^{4}\\).\nbfloat16: 16 bits. We have 1 bit for the sign, 8 bits for the exponent and 7 bits for the mantissa. The range of the representation is between \\(1e^{-38}\\) and \\(3e^{38}\\).\n\nWe can see that float16 and bfloat16 take up the same memory space. But they are different in the bits allocation. The float16 has better precision than bfloat16, but the bfloat16 has better range than float16. For deep neural network, we may need to consider the use of the bfloat16 type since the range is more important than the precision for the deep neural network. The common quantization type are INT8 and INT4. Note that INT8 and INT4 can only represent the integer numbers, not for the float numbers. Thus, INT8 can only represent the numbers between \\(-128\\) and \\(127\\), and INT4 can only represent the numbers between \\(-8\\) and \\(7\\).\nWe use the affine quantization scheme to convert the model:\n\\[\nx_q = \\operatorname{round}\\left(x/S + Z\\right)\n\\]\nwhere we have: - \\(x_q\\): the quantized value - \\(x\\): the original value - \\(S\\): the scale factor - \\(Z\\): the zero point - \\(\\operatorname{round}\\): the rounding function.\nUsually, we will set multiple blocks to quantize the model. It means that we need multiple scale factors and zero points. Note that not all layers are quantized. For some important layers, we still consider the use of the float32 type.\nFor LLM quantization, we have two different methods called post-training quantization and quantization-aware training. If we finally use the quantization model, quantization-aware training is better.\n\nExisiting solutions\nWe can use quantization library provied in huggingface transformers. For more foundamental optimization, we should consider to use GGML (GPT-Generated Model Language) and GGUF (GPT-Generated Unified Format). For on-device deployment, we should consider the usage of GGUF since it is more efficient. Refer to github to use it. 
We can consider another library called ollama which is built based on the llama cpp.", + "text": "Quantization\n\n\n\n\n\n\nTip\n\n\n\nQuantization is a model compression technique that converts the weights and activations within an LLM from a high-precision data representation to a lower-precision data representation, i.e., from a data type that can hold more information to one that holds less. A typical example of this is the conversion of data from a 32-bit floating-point number (FP32) to an 8-bit or 4-bit integer (INT4 or INT8). A good blog from internet is here. We note that the conversion will decrease the memory and disk usage considerably. We note that for real calculation, we still need to dequantize the data to the original data type like float32 or bfloat16. The trick is that we only dequantize the data when we need to calculate the data while keeping the most of data in the quantized format. Therefore, we still save the memory and disk usage.\n\n\nLet’s first revisit the representation of data in computer. We mainly study the float32, float16 and bfloat16 type.\n\nfloat32: 32 bits. We have 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa. To form a float number in computer, we need the sign, the number before the exponent and the exponent number over 2. For example, we have \\(6.75=+1.1011\\times 2^2\\). Thus, we can conclude that the range of the representation is between \\(1\\times 10^{-38}\\) and \\(3\\times 10^{38}\\) (you can add sign freely, though).\nfloat16: 16 bits. We have 1 bit for the sign, 5 bits for the exponent and 10 bits for the mantissa. The range of the representation is between \\(6\\times 10^{-8}\\) and \\(6\\times 10^{4}\\).\nbfloat16: 16 bits. We have 1 bit for the sign, 8 bits for the exponent and 7 bits for the mantissa. The range of the representation is between \\(1\\times 10^{-38}\\) and \\(3\\times 10^{38}\\).\n\nWe can see that float16 and bfloat16 take up the same memory space. But they are different in the bits allocation. The float16 has better precision than bfloat16, but the bfloat16 has better range than float16. For deep neural network, we may need to consider the use of the bfloat16 type since the range is more important than the precision for the deep neural network. The common quantization type are INT8 and INT4. Note that INT8 and INT4 can only represent the integer numbers, not for the float numbers. Thus, INT8 can only represent the numbers between \\(-128\\) and \\(127\\), and INT4 can only represent the numbers between \\(-8\\) and \\(7\\).\nWe use the affine quantization scheme to convert the model:\n\\[\nx_q = \\operatorname{round}\\left(x/S + Z\\right)\n\\]\nwhere we have: - \\(x_q\\): the quantized value - \\(x\\): the original value - \\(S\\): the scale factor - \\(Z\\): the zero point - \\(\\operatorname{round}\\): the rounding function.\nUsually, we will set multiple blocks to quantize the model. It means that we need multiple scale factors and zero points. Note that not all layers are quantized. For some important layers, we still consider the use of the float32 type.\nFor LLM quantization, we have two different methods called post-training quantization and quantization-aware training. If we finally use the quantization model, quantization-aware training is better.\n\nExisiting solutions\nWe can use quantization library provied in huggingface transformers. For more foundamental optimization, we should consider to use GGML (GPT-Generated Model Language) and GGUF (GPT-Generated Unified Format). 
For on-device deployment, we should consider the usage of GGUF since it is more efficient. Refer to github to use it. We can consider another library called ollama which is built based on the llama cpp.", "crumbs": [ "Home", "🗣️ **Large language models**", @@ -198,7 +198,7 @@ "href": "notes.html", "title": "Research notes", "section": "", - "text": "Optimization for Inference of Large Language Model\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n4 min\n\n\n\n\n\n\n\n\n\n\n\n\nOptimization in machine learning\n\n\n\n\n\n\nMath Theories\n\n\n\n\n\n\n\n\n\n13 min\n\n\n\n\n\n\n\n\n\n\n\n\nLarge language model distributed training\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nComplex analysis for machine learning\n\n\n\n\n\n\nMath Theories\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nLarge language model evaluation\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n11 min\n\n\n\n\n\n\n\n\n\n\n\n\nMixture of expert\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nScalable diffusion models with transformers\n\n\n\n\n\n\nDiffusion Model\n\n\n\n\n\n\n\n\n\n1 min\n\n\n\n\n\n\n\n\n\n\n\n\nReinforcement learning for large language model\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n19 min\n\n\n\n\n\n\nNo matching items" + "text": "Optimization for Inference of Large Language Model\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nOptimization in machine learning\n\n\n\n\n\n\nMath Theories\n\n\n\n\n\n\n\n\n\n13 min\n\n\n\n\n\n\n\n\n\n\n\n\nLarge language model distributed training\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nComplex analysis for machine learning\n\n\n\n\n\n\nMath Theories\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nLarge language model evaluation\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n11 min\n\n\n\n\n\n\n\n\n\n\n\n\nMixture of expert\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n5 min\n\n\n\n\n\n\n\n\n\n\n\n\nScalable diffusion models with transformers\n\n\n\n\n\n\nDiffusion Model\n\n\n\n\n\n\n\n\n\n1 min\n\n\n\n\n\n\n\n\n\n\n\n\nReinforcement learning for large language model\n\n\n\n\n\n\nLarge Language Models\n\n\n\n\n\n\n\n\n\n19 min\n\n\n\n\n\n\nNo matching items" }, { "objectID": "index.html", diff --git a/sitemap.xml b/sitemap.xml index afd7be7..b43ec69 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -14,7 +14,7 @@ https://alexchen4ai.github.io/blog/notes/Large Language Model/inference_optimize.html - 2024-11-12T08:02:58.483Z + 2024-11-12T17:47:57.233Z https://alexchen4ai.github.io/blog/notes/Large Language Model/rl_llm.html