From a8ad445020263147d9667240ef9fcad02cef54a7 Mon Sep 17 00:00:00 2001
From: "Xiaoxia (Shirley) Wu" <94406484+xiaoxiawu-microsoft@users.noreply.github.com>
Date: Wed, 6 Mar 2024 18:25:49 -0800
Subject: [PATCH] Update README.md

---
 blogs/deepspeed-fp6/03-05-2024/README.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/blogs/deepspeed-fp6/03-05-2024/README.md b/blogs/deepspeed-fp6/03-05-2024/README.md
index 8cbb9f259065..4037e708ba9d 100755
--- a/blogs/deepspeed-fp6/03-05-2024/README.md
+++ b/blogs/deepspeed-fp6/03-05-2024/README.md
@@ -39,16 +39,18 @@ To cite DeepSpeed-FP6, please cite the following two arxiv reports - ZeroQuant(4
 6. [Acknowledgments and Contributions](#ac)
 # 1. Why 6-bit Floating Point (FP6)
-The realm of Large Language Models (LLMs) like GPT has been evolving rapidly, with a focus on enhancing performance while managing the computational and storage demands.
-**Diving Deep into 4-Bit Quantization's Challenges.** In our recent research — ZeroQuant(4+2) [1], we examine the drawbacks of using 4-bit quantization techniques such as GPTQ in large language models (LLMs). While these techniques hold the potential to decrease model size and computational requirements, they often fall short in critical more general tasks due to overfitting issues. We extend the examination to include more generative tasks like code generation and summarization, areas where standard quantization methods have not been thoroughly explored. We found that INT4 weight quantization does not perform well in these broader applications, underscoring the urgent need for new approaches that improve both the efficiency and effectiveness of LLMs.
-**Breakthrough with FP6.** Our exploration of different quantization methods brought us to the FP6 precision standard. Despite the difficulties in integrating and speeding up FP6 with current AI hardware — a challenge we will address in the following section — this format excels in performance and flexibility for a variety of tasks. Notably, models quantized with FP6, like the StarCoder-15B, achieve results comparable to their FP16 equivalents in code generation, and smaller models (like BART-406M) meet standard FP16 performance levels in summarization. To improve the efficiency of AI hardware and equal the best performance seen with INT4 quantization, we propose a novel 4+2 FP6 scheme. This innovation makes FP6 a promising avenue for enhancing the efficiency of LLMs, marking a significant leap in the progress of AI technologies. For more details, please refer to our research paper — ZeroQuant(4+2) [1].
+In the evolving landscape of Large Language Models (LLMs) like GPT, our research aims to boost computational efficiency and storage while preserving model quality. This focus brings us to tackle the complex challenges of 4-bit quantization, where optimizing performance, efficiency, and accuracy is crucial.
+
+**Exploring the Challenges of 4-bit Quantization** In our recent research findings -- ZeroQuant (4+2)[1], we explore the capabilities of INT4 quantization techniques (like the GPTQ algorithm) in Large Language Models (LLMs). While these techniques reduce model size and computational needs, they often perform poorly on a more general array of tasks due to overfitting issues, including more generative tasks like code generation and summarization. This highlights the urgent need for new methods to improve the efficiency and effectiveness of LLMs.
+
+**Breakthroughs with FP6 Precision** Our exploration of different quantization methods led us to the FP6 precision standard. Despite the challenges in integrating and accelerating FP6 with current AI hardware -- which we will address in the next section - this format excels in performance and flexibility across various tasks. Notably, models quantized with FP6, such as the StarCoder-15B, achieved results in code generation comparable to models using FP16 precision, while smaller models, like BART-406M, reached standard FP16 performance levels in summarization. To enhance the efficiency of AI hardware and achieve the best performance seen with INT4 quantization, we propose a novel 4+2 FP6 scheme. This innovation makes FP6 a promising avenue for improving the efficiency of LLMs, marking a significant leap in AI technology advancement. For more details, please refer to our research paper - ZeroQuant (4+2)[1].
 # 2. System Support for FP6
-**Pioneering Full-Stack GPU Kernel Design.** One challenge of FP6 quantization is that there lacks an efficient GPU kernel design for this irregular bit-width. In our recent research — FP6-LLM [2], we introduce TC-FPx, the first full-stack GPU system design scheme with unified Tensor Core support of floating point weights for FP6 and various quantization bit-width (6-bit, 5-bit, 3-bit, etc.), mitigating the "memory wall" issues during LLM inference. TC-FPx breaks the limitations of the underlying GPU hardware, allowing the GPU to support linear layer calculations involving model weights of arbitrary bit width. In TC-FPx, Tensor Cores are utilized for intensive computation of matrix multiplications, while SIMT cores are effectively leveraged for weight dequantization, transforming the x-bit model weights to FP16 type during runtime before feeding them to Tensor Cores. It has the following key innovations:
+**Pioneering Full-Stack GPU Kernel Design** One challenge of FP6 quantization is that there lacks an efficient GPU kernel design for this irregular bit-width. In our recent research — FP6-LLM [2], we introduce TC-FPx, the first full-stack GPU system design scheme with unified Tensor Core support of floating point weights for FP6 and various quantization bit-width (6-bit, 5-bit, 3-bit, etc.), mitigating the "memory wall" issues during LLM inference. TC-FPx breaks the limitations of the underlying GPU hardware, allowing the GPU to support linear layer calculations involving model weights of arbitrary bit width. In TC-FPx, Tensor Cores are utilized for intensive computation of matrix multiplications, while SIMT cores are effectively leveraged for weight dequantization, transforming the x-bit model weights to FP16 type during runtime before feeding them to Tensor Cores. It has the following key innovations:
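
Reviewer note: the runtime dequantization step described in the patched paragraph (x-bit weights converted to FP16 before the matrix multiplication) can be illustrated on the CPU with a short NumPy sketch. This is not the TC-FPx kernel. It assumes an FP6 layout of 1 sign bit, 3 exponent bits, and 2 mantissa bits with exponent bias 3 and no inf/NaN encodings, which the text above does not specify. The split into an upper 4-bit and a lower 2-bit segment is likewise only one possible reading of the "4+2" scheme, and the function names `split_4_2`, `merge_4_2`, and `fp6_to_fp16` are hypothetical.

```python
import numpy as np

# Assumed FP6 layout (not given in the patch): 1 sign, 3 exponent, 2 mantissa bits,
# exponent bias 3, no inf/NaN encodings.
MAN_BITS = 2
BIAS = 3


def split_4_2(codes):
    """Split 6-bit codes into an upper 4-bit and a lower 2-bit segment (illustrative)."""
    codes = np.asarray(codes, dtype=np.uint8) & 0x3F
    return codes >> MAN_BITS, codes & 0x3


def merge_4_2(hi4, lo2):
    """Rebuild the 6-bit codes from the two segments."""
    hi4 = np.asarray(hi4, dtype=np.uint8)
    lo2 = np.asarray(lo2, dtype=np.uint8)
    return ((hi4 << MAN_BITS) | lo2).astype(np.uint8)


def fp6_to_fp16(codes):
    """Decode 6-bit floating-point codes to FP16 under the assumed layout."""
    codes = np.asarray(codes, dtype=np.uint8) & 0x3F
    sign = (codes >> 5) & 0x1
    exp = (codes >> MAN_BITS) & 0x7
    man = (codes & 0x3) / 4.0  # mantissa fraction in [0, 1)
    # Normal numbers carry an implicit leading 1; exponent field 0 marks subnormals.
    normal = (1.0 + man) * np.exp2(exp.astype(np.float32) - BIAS)
    subnormal = man * np.exp2(1.0 - BIAS)
    magnitude = np.where(exp == 0, subnormal, normal)
    return np.where(sign == 1, -magnitude, magnitude).astype(np.float16)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codes = rng.integers(0, 64, size=(8, 4), dtype=np.uint8)  # random 6-bit weights
    hi4, lo2 = split_4_2(codes)                # stored form: two segments per weight
    w = fp6_to_fp16(merge_4_2(hi4, lo2))       # dequantize to FP16 at "runtime"
    x = rng.standard_normal((2, 8)).astype(np.float16)
    y = x @ w                                  # linear layer on dequantized weights
    print(y.shape)  # (2, 4)
```

In the design the patch describes, the decode step would run on SIMT cores and the matrix multiplication on Tensor Cores; here both run in NumPy purely to show the data flow.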