From c4047f976be42715539d6a1b0c60642dc69b08b6 Mon Sep 17 00:00:00 2001
From: "Xiaoxia (Shirley) Wu" <94406484+xiaoxiawu-microsoft@users.noreply.github.com>
Date: Wed, 6 Mar 2024 11:04:01 -0800
Subject: [PATCH] Update README.md

---
 blogs/deepspeed-fp6/03-05-2024/README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/blogs/deepspeed-fp6/03-05-2024/README.md b/blogs/deepspeed-fp6/03-05-2024/README.md
index 9961918875e0..8cbb9f259065 100755
--- a/blogs/deepspeed-fp6/03-05-2024/README.md
+++ b/blogs/deepspeed-fp6/03-05-2024/README.md
@@ -41,14 +41,14 @@ To cite DeepSpeed-FP6, please cite the following two arxiv reports - ZeroQuant(4
 # 1. Why 6-bit Floating Point (FP6)
 
 The realm of Large Language Models (LLMs) like GPT has been evolving rapidly, with a focus on enhancing performance while managing the computational and storage demands.
 
-*Diving Deep into 4-Bit Quantization's Challenges.* In our recent research — ZeroQuant(4+2) [1], we examine the drawbacks of using 4-bit quantization techniques such as GPTQ in large language models (LLMs). While these techniques hold the potential to decrease model size and computational requirements, they often fall short in critical more general tasks due to overfitting issues. We extend the examination to include more generative tasks like code generation and summarization, areas where standard quantization methods have not been thoroughly explored. We found that INT4 weight quantization does not perform well in these broader applications, underscoring the urgent need for new approaches that improve both the efficiency and effectiveness of LLMs.
+**Diving Deep into 4-Bit Quantization's Challenges.** In our recent research — ZeroQuant(4+2) [1], we examine the drawbacks of using 4-bit quantization techniques such as GPTQ in large language models (LLMs). While these techniques hold the potential to decrease model size and computational requirements, they often fall short in critical, more general tasks due to overfitting issues. We extend the examination to include more generative tasks like code generation and summarization, areas where standard quantization methods have not been thoroughly explored. We found that INT4 weight quantization does not perform well in these broader applications, underscoring the urgent need for new approaches that improve both the efficiency and effectiveness of LLMs.
 
-*Breakthrough with FP6.* Our exploration of different quantization methods brought us to the FP6 precision standard. Despite the difficulties in integrating and speeding up FP6 with current AI hardware — a challenge we will address in the following section — this format excels in performance and flexibility for a variety of tasks. Notably, models quantized with FP6, like the StarCoder-15B, achieve results comparable to their FP16 equivalents in code generation, and smaller models (like BART-406M) meet standard FP16 performance levels in summarization. To improve the efficiency of AI hardware and equal the best performance seen with INT4 quantization, we propose a novel 4+2 FP6 scheme. This innovation makes FP6 a promising avenue for enhancing the efficiency of LLMs, marking a significant leap in the progress of AI technologies. For more details, please refer to our research paper — ZeroQuant(4+2) [1].
+**Breakthrough with FP6.** Our exploration of different quantization methods brought us to the FP6 precision standard. Despite the difficulties in integrating and accelerating FP6 on current AI hardware — a challenge we address in the following section — this format excels in performance and flexibility across a variety of tasks. Notably, models quantized with FP6, like StarCoder-15B, achieve results comparable to their FP16 equivalents in code generation, and smaller models (like BART-406M) meet standard FP16 performance levels in summarization. To improve the efficiency of AI hardware and match the best performance seen with INT4 quantization, we propose a novel 4+2 FP6 scheme. This innovation makes FP6 a promising avenue for enhancing the efficiency of LLMs, marking a significant leap in the progress of AI technologies. For more details, please refer to our research paper — ZeroQuant(4+2) [1].
 
 # 2. System Support for FP6
 
-*Pioneering Full-Stack GPU Kernel Design.* One challenge of FP6 quantization is that there lacks an efficient GPU kernel design for this irregular bit-width. In our recent research — FP6-LLM [2], we introduce TC-FPx, the first full-stack GPU system design scheme with unified Tensor Core support of floating point weights for FP6 and various quantization bit-width (6-bit, 5-bit, 3-bit, etc.), mitigating the "memory wall" issues during LLM inference. TC-FPx breaks the limitations of the underlying GPU hardware, allowing the GPU to support linear layer calculations involving model weights of arbitrary bit width. In TC-FPx, Tensor Cores are utilized for intensive computation of matrix multiplications, while SIMT cores are effectively leveraged for weight dequantization, transforming the x-bit model weights to FP16 type during runtime before feeding them to Tensor Cores. It has the following key innovations:
+**Pioneering Full-Stack GPU Kernel Design.** One challenge of FP6 quantization is the lack of an efficient GPU kernel design for this irregular bit-width. In our recent research — FP6-LLM [2], we introduce TC-FPx, the first full-stack GPU system design scheme with unified Tensor Core support of floating-point weights for FP6 and various quantization bit-widths (6-bit, 5-bit, 3-bit, etc.), mitigating the "memory wall" issues during LLM inference. TC-FPx breaks the limitations of the underlying GPU hardware, allowing the GPU to support linear layer calculations involving model weights of arbitrary bit width. In TC-FPx, Tensor Cores are utilized for the intensive matrix multiplications, while SIMT cores are leveraged for weight dequantization, transforming the x-bit model weights to FP16 at runtime before feeding them to the Tensor Cores. It has the following key innovations:
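
For readers following the blog text quoted in the patch, the weight path it describes (x-bit weights decoded to FP16 before the Tensor Core matrix multiplication) can be sketched in a few lines of NumPy. This is a minimal sketch only: the e3m2 bit layout, the exponent bias, and the function names below are illustrative assumptions, not the actual TC-FPx kernel format or any DeepSpeed API.

```python
import numpy as np

# Assumed 6-bit layout: 1 sign bit, 3 exponent bits, 2 mantissa bits (e3m2).
SIGN_BITS, EXP_BITS, MAN_BITS = 1, 3, 2
EXP_BIAS = (1 << (EXP_BITS - 1)) - 1  # bias 3 for a 3-bit exponent

def fp6_to_fp16(codes: np.ndarray) -> np.ndarray:
    """Decode 6-bit codes (one per uint8, bits 5..0 = sign|exponent|mantissa) to float16."""
    codes = codes.astype(np.uint8)
    sign = np.where((codes >> (EXP_BITS + MAN_BITS)) & 1, -1.0, 1.0)
    exp = (codes >> MAN_BITS) & ((1 << EXP_BITS) - 1)
    man = codes & ((1 << MAN_BITS) - 1)
    frac = man / float(1 << MAN_BITS)
    # exp == 0 encodes subnormals; otherwise there is an implicit leading 1.
    mag = np.where(exp == 0,
                   frac * 2.0 ** (1 - EXP_BIAS),
                   (1.0 + frac) * 2.0 ** (exp.astype(np.int32) - EXP_BIAS))
    return (sign * mag).astype(np.float16)

def linear_with_fp6_weights(x_fp16: np.ndarray, w_codes: np.ndarray, scale: float) -> np.ndarray:
    """Dequantize-then-matmul, mirroring the SIMT-decode + Tensor-Core-GEMM split."""
    w_fp16 = (fp6_to_fp16(w_codes) * scale).astype(np.float16)  # simplified per-tensor scale
    return x_fp16 @ w_fp16
```

In the design described in the patched text, this decode happens on SIMT cores at runtime inside the kernel, so only the narrow x-bit weights travel through memory before the Tensor Core GEMM; the NumPy version materializes an FP16 copy of the weights only to keep the sketch simple.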