diff --git a/.nojekyll b/.nojekyll index 2640a25..f7a8443 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -61f03ba7 \ No newline at end of file +cc4d3b70 \ No newline at end of file diff --git a/images/profile.jpeg b/images/profile.jpeg index 3f56d27..6ef0667 100644 Binary files a/images/profile.jpeg and b/images/profile.jpeg differ diff --git a/listings.json b/listings.json index afacafb..f7703ce 100644 --- a/listings.json +++ b/listings.json @@ -1,7 +1,15 @@ [ + { + "listing": "/news.html", + "items": [ + "/news/Research news/researchnews.html", + "/news/Personal thoughts/myread.html" + ] + }, { "listing": "/notes.html", "items": [ + "/notes/Large Language Model/inference_optimize.html", "/notes/Math Theories/ml_optimizer.html", "/notes/Large Language Model/llm_train.html", "/notes/Math Theories/complexanalysis.html", @@ -10,12 +18,5 @@ "/notes/Diffusion Model/sd.html", "/notes/Large Language Model/rl_llm.html" ] - }, - { - "listing": "/news.html", - "items": [ - "/news/Research news/researchnews.html", - "/news/Personal thoughts/myread.html" - ] } ] \ No newline at end of file diff --git a/notes.html b/notes.html index 60aa72c..0e267e3 100644 --- a/notes.html +++ b/notes.html @@ -220,7 +220,37 @@

Research notes

-
+
+
+

+
 
+

+
+ + +
+
-
+
-
+
-
+
-
+
-
+
-
+
+
+

Overview of the optimization

+

To optimize inference for a language model, we mainly have the following methods:

+
    +
  1. Quantization: Quantization reduces the precision of the model’s weights and activations, which shrinks the memory footprint and speeds up inference.
  2. +
  3. Pruning: Pruning removes weights that are close to zero, which reduces the number of parameters and speeds up inference.
  4. +
  5. Lower-level implementation: Implementing the model in a lower-level language such as C++ or Rust can speed up inference.
  6. +
  7. KV cache: The key-value cache stores the attention keys and values of already-processed tokens so they are not recomputed at every decoding step, which reduces computation and speeds up generation. On certain devices, the KV cache may need dedicated support. A minimal sketch of the idea follows after this list.
  8. +
  9. Optimization based on hardware: Like FlashAttention for NVIDIA GPUs, we can optimize the model for the specific hardware it runs on, mainly by improving its memory-access patterns.
  10. +
+
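As mentioned in the KV cache item above, here is a minimal NumPy sketch of single-head attention with a cache. The toy dimensions, weight names, and unbatched single-head setup are illustrative assumptions, not a production implementation:

```python
import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = q @ K.T / np.sqrt(q.shape[-1])   # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over past tokens
    return weights @ V                         # (d,)

class KVCache:
    """Keep the keys/values of past tokens so each step only projects the new token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, Wq, Wk, Wv):
        q, k, v = x @ Wq, x @ Wk, x @ Wv       # project only the current token
        self.keys.append(k)
        self.values.append(v)
        return attention(q, np.stack(self.keys), np.stack(self.values))

# Toy decoding loop: hidden size 8, five generated tokens.
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
cache = KVCache()
for _ in range(5):
    x_t = rng.normal(size=d)                   # embedding of the newest token
    out = cache.step(x_t, Wq, Wk, Wv)
print(out.shape)                               # (8,)
```

Without the cache, each decoding step would recompute the keys and values of every previous token; with it, each step only projects the newest token.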
+
+

Quantization

+
+

Let’s first revisit how data is represented in a computer. We mainly study the float32, float16, and bfloat16 types.

+
    +
  • float32: 32 bits. We have 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa. To form a floating-point number, we combine the sign, the mantissa, and the power of 2 given by the exponent. For example, we have \(6.75=+1.1011\times 2^2\). Thus, we can conclude that the range of the representation is between \(10^{-38}\) and \(3\times 10^{38}\) (you can add the sign freely, though).
  • +
  • float16: 16 bits. We have 1 bit for the sign, 5 bits for the exponent and 10 bits for the mantissa. The range of the representation is between \(6\times 10^{-8}\) and \(6\times 10^{4}\).
  • +
  • bfloat16: 16 bits. We have 1 bit for the sign, 8 bits for the exponent and 7 bits for the mantissa. The range of the representation is between \(10^{-38}\) and \(3\times 10^{38}\).
  • +
+

We can see that float16 and bfloat16 take up the same memory space, but they differ in how the bits are allocated. float16 has better precision than bfloat16, while bfloat16 has a wider range than float16. For deep neural networks, we may prefer the bfloat16 type, since range matters more than precision there. The common quantization types are INT8 and INT4. Note that INT8 and INT4 can only represent integers, not floating-point numbers. Thus, INT8 can only represent the numbers between \(-128\) and \(127\), and INT4 can only represent the numbers between \(-8\) and \(7\).

+
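These ranges can be checked directly. The small sketch below assumes PyTorch is installed and simply prints what `torch.finfo` / `torch.iinfo` report for the types discussed above:

```python
import torch

# Dynamic range and precision of the floating-point types discussed above.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16s} max={info.max:.3e}  min_normal={info.tiny:.3e}  eps={info.eps:.3e}")

# INT8, a common quantization target, covers only a small range of integers.
info = torch.iinfo(torch.int8)
print(f"{str(torch.int8):16s} range=[{info.min}, {info.max}]")
```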

We use the affine quantization scheme to convert the model:

+

+

\[ x_q = \operatorname{round}\left(x/S + Z\right) \]
where \(x_q\) is the quantized value, \(x\) is the original value, \(S\) is the scale factor, \(Z\) is the zero point, and \(\operatorname{round}\) is the rounding function.

+

Usually, we quantize the model in multiple blocks, which means we need multiple scale factors and zero points. Note that not all layers are quantized; for some important layers, we still keep the float32 type.

+

For LLM quantization, we have two different methods: post-training quantization and quantization-aware training. If we ultimately deploy the quantized model, quantization-aware training generally gives better accuracy.

+
+

Existing solutions

+

We can use the quantization utilities provided in Hugging Face Transformers. For more fundamental optimization, we should consider GGML (GPT-Generated Model Language) and GGUF (GPT-Generated Unified Format). For on-device deployment, we should consider GGUF, since it is more efficient; refer to its GitHub repository to use it. We can also consider another library called ollama, which is built on top of llama.cpp.

+ + +
+
+ + ]]> + Large Language Models + https://alexchen4ai.github.io/blog/notes/Large Language Model/inference_optimize.html + Sat, 20 Apr 2024 07:00:00 GMT + Optimization in machine learning Alex Chen @@ -1473,8 +1547,12 @@ Tip

📝 Paper: https://arxiv.org/abs/2212.09748

+
+

Introduction

+

Coming soon

+
]]> Diffusion Model diff --git a/notes/Diffusion Model/sd.html b/notes/Diffusion Model/sd.html index 32d003f..c6734dc 100644 --- a/notes/Diffusion Model/sd.html +++ b/notes/Diffusion Model/sd.html @@ -180,6 +180,12 @@

Scalable diffusion models with transformers

📝 Paper: https://arxiv.org/abs/2212.09748

+
+

Introduction

+

Coming soon

+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ + +
+ +
+
+
+

The inference optimization

+
+
Large Language Models
+
+
+
+ + +
+ +
+
Author
+
+

Alex Chen

+
+
+ +
+
Published
+
+

April 20, 2024

+
+
+ + +
+ + +
+ + +
+ + + +
+ + + + + +
+
+
+ +
+
+Tip +
+
+
+

To run a language model faster, especially on edge devices, we need to optimize the model. This can be done in different ways; in this blog, I will introduce different optimization methods and the existing solutions that make language models run faster. Note that my focus is on on-device language models.

+
+
+
+

Overview of the optimization

+

To optimize inference for a language model, we mainly have the following methods:

+
    +
  1. Quantization: Quantization reduces the precision of the model’s weights and activations, which shrinks the memory footprint and speeds up inference.
  2. +
  3. Pruning: Pruning removes weights that are close to zero, which reduces the number of parameters and speeds up inference. A minimal magnitude-pruning sketch follows after this list.
  4. +
  5. Lower-level implementation: Implementing the model in a lower-level language such as C++ or Rust can speed up inference.
  6. +
  7. KV cache: The key-value cache stores the attention keys and values of already-processed tokens so they are not recomputed at every decoding step, which reduces computation and speeds up generation. On certain devices, the KV cache may need dedicated support.
  8. +
  9. Optimization based on hardware: Like FlashAttention for NVIDIA GPUs, we can optimize the model for the specific hardware it runs on, mainly by improving its memory-access patterns.
  10. +
+
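As referenced in the pruning item above, here is a minimal sketch of unstructured magnitude pruning on a toy linear layer, using plain PyTorch tensor operations; the 50% sparsity level and the function name are illustrative:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude entries so roughly `sparsity` of them are removed."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

# Prune a toy linear layer and check the resulting sparsity.
layer = torch.nn.Linear(16, 16)
with torch.no_grad():
    layer.weight.copy_(magnitude_prune(layer.weight, sparsity=0.5))
print((layer.weight == 0).float().mean())   # roughly 0.5
```

In practice, pruning is usually followed by fine-tuning to recover accuracy, and the resulting sparsity only translates into speed if the runtime can exploit it.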
+
+

Quantization

+
+
+
+ +
+
+Tip +
+
+
+

Quantization is a model compression technique that converts the weights and activations within an LLM from a high-precision data representation to a lower-precision one, i.e., from a data type that can hold more information to one that holds less. A typical example is converting data from a 32-bit floating-point number (FP32) to an 8-bit or 4-bit integer (INT8 or INT4). A good blog post from the internet is linked here.

+
+
+

Let’s first revisit how data is represented in a computer. We mainly study the float32, float16, and bfloat16 types.

+
    +
  • float32: 32 bits. We have 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa. To form a floating-point number, we combine the sign, the mantissa, and the power of 2 given by the exponent. For example, we have \(6.75=+1.1011\times 2^2\). Thus, we can conclude that the range of the representation is between \(10^{-38}\) and \(3\times 10^{38}\) (you can add the sign freely, though).
  • +
  • float16: 16 bits. We have 1 bit for the sign, 5 bits for the exponent and 10 bits for the mantissa. The range of the representation is between \(6\times 10^{-8}\) and \(6\times 10^{4}\).
  • +
  • bfloat16: 16 bits. We have 1 bit for the sign, 8 bits for the exponent and 7 bits for the mantissa. The range of the representation is between \(10^{-38}\) and \(3\times 10^{38}\).
  • +
+

We can see that float16 and bfloat16 take up the same memory space, but they differ in how the bits are allocated. float16 has better precision than bfloat16, while bfloat16 has a wider range than float16. For deep neural networks, we may prefer the bfloat16 type, since range matters more than precision there. The common quantization types are INT8 and INT4. Note that INT8 and INT4 can only represent integers, not floating-point numbers. Thus, INT8 can only represent the numbers between \(-128\) and \(127\), and INT4 can only represent the numbers between \(-8\) and \(7\).

+
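To see this range-versus-precision trade-off in action, here is a tiny PyTorch sketch; the specific numbers are chosen only to trigger the effects described above:

```python
import torch

# Range: 70000 overflows float16 (max ~65504) but fits comfortably in bfloat16.
x = torch.tensor(70000.0)
print(x.to(torch.float16))    # inf
print(x.to(torch.bfloat16))   # ~70144 (representable, just coarsely rounded)

# Precision: bfloat16 keeps only 7 mantissa bits, so a small increment is lost.
y = torch.tensor(1.0 + 1e-3)
print(y.to(torch.float16))    # ~1.0010
print(y.to(torch.bfloat16))   # 1.0
```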

We use the affine quantization scheme to convert the model:

+

\[ x_q = \operatorname{round}\left(x/S + Z\right) \]

+

where \(x_q\) is the quantized value, \(x\) is the original value, \(S\) is the scale factor, \(Z\) is the zero point, and \(\operatorname{round}\) is the rounding function.

+

Usually, we quantize the model in multiple blocks, which means we need multiple scale factors and zero points. Note that not all layers are quantized; for some important layers, we still keep the float32 type.

+
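Putting the formula and the block idea together, here is a minimal NumPy sketch of block-wise affine INT8 quantization; the block size of 64 and the function names are illustrative assumptions, not a reference to any particular library:

```python
import numpy as np

def quantize_blockwise(x, block_size=64, qmin=-128, qmax=127):
    """Affine INT8 quantization with one scale and zero point per block.
    Assumes x.size is a multiple of block_size (illustrative simplification)."""
    blocks = x.reshape(-1, block_size)
    x_min = blocks.min(axis=1, keepdims=True)
    x_max = blocks.max(axis=1, keepdims=True)
    scale = (x_max - x_min) / (qmax - qmin)
    scale = np.where(scale == 0, 1.0, scale)          # guard against constant blocks
    zero_point = np.round(qmin - x_min / scale)
    x_q = np.clip(np.round(blocks / scale + zero_point), qmin, qmax).astype(np.int8)
    return x_q, scale, zero_point

def dequantize_blockwise(x_q, scale, zero_point):
    return (x_q.astype(np.float32) - zero_point) * scale

# Quantize a toy weight matrix and check the reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)
w_q, s, z = quantize_blockwise(w)
w_hat = dequantize_blockwise(w_q, s, z).reshape(w.shape)
print(np.abs(w - w_hat).max())   # on the order of half a quantization step
```

Each block stores its own \(S\) and \(Z\), so an outlier in one block does not stretch the scale used for the rest of the weights.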

For LLM quantization, we have two different methods: post-training quantization and quantization-aware training. If we ultimately deploy the quantized model, quantization-aware training generally gives better accuracy.

+
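The core trick behind quantization-aware training is fake quantization with a straight-through estimator: the forward pass sees the quantize-dequantize error, while gradients bypass the non-differentiable rounding. A minimal PyTorch sketch, with the scale and zero point fixed arbitrarily for illustration:

```python
import torch

def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Quantize-dequantize in the forward pass; the straight-through estimator
    lets gradients skip the non-differentiable round()."""
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    x_dq = (q - zero_point) * scale
    return x + (x_dq - x).detach()    # forward: x_dq, backward: identity w.r.t. x

# The quantization error shows up in the loss, yet gradients still reach the weights.
w = torch.randn(8, requires_grad=True)
loss = fake_quantize(w, scale=0.05, zero_point=0.0).pow(2).sum()
loss.backward()
print(w.grad)
```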
+

Existing solutions

+

We can use the quantization utilities provided in Hugging Face Transformers. For more fundamental optimization, we should consider GGML (GPT-Generated Model Language) and GGUF (GPT-Generated Unified Format). For on-device deployment, we should consider GGUF, since it is more efficient; refer to its GitHub repository to use it. We can also consider another library called ollama, which is built on top of llama.cpp.

+ + +
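For reference, one possible way to load a model in 4 bits through the Transformers + bitsandbytes stack is sketched below. The model id is a placeholder, a CUDA GPU and the bitsandbytes package are assumed, and exact argument names may vary across library versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-model"            # placeholder checkpoint name

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4-bit blocks
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bfloat16, per the range argument above
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```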
+
+ +
+ +
+ + + + + \ No newline at end of file diff --git a/notes/Large Language Model/llm_eval.html b/notes/Large Language Model/llm_eval.html index 980bba6..1813f85 100644 --- a/notes/Large Language Model/llm_eval.html +++ b/notes/Large Language Model/llm_eval.html @@ -180,6 +180,12 @@

Large language model evaluation