From e1eb658411a1e4c4c2ccb45576b1a2e8714c8bd7 Mon Sep 17 00:00:00 2001
From: regisss <15324346+regisss@users.noreply.github.com>
Date: Sat, 12 Aug 2023 22:00:42 +0200
Subject: [PATCH] Fix GPTQ doc (#1267)

---
 .github/workflows/upload_pr_documentation.yml |  2 +-
 docs/source/concept_guides/quantization.mdx   |  2 +-
 .../usage_guides/quantization.mdx             | 20 ++++++-------------
 3 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/.github/workflows/upload_pr_documentation.yml b/.github/workflows/upload_pr_documentation.yml
index 1fff33517a..49491b2bfb 100644
--- a/.github/workflows/upload_pr_documentation.yml
+++ b/.github/workflows/upload_pr_documentation.yml
@@ -2,7 +2,7 @@ name: Upload PR Documentation
 
 on:
   workflow_run:
-    workflows: ["Build PR Documentation"]
+    workflows: ["Build PR documentation"]
     types:
       - completed
 
diff --git a/docs/source/concept_guides/quantization.mdx b/docs/source/concept_guides/quantization.mdx
index b9aca25ee9..5580a13e2a 100644
--- a/docs/source/concept_guides/quantization.mdx
+++ b/docs/source/concept_guides/quantization.mdx
@@ -185,7 +185,7 @@ models while respecting accuracy and latency constraints.
   [PyTorch quantization functions](https://pytorch.org/docs/stable/quantization-support.html#torch-quantization-quantize-fx)
   to allow graph-mode quantization of 🤗 Transformers models in PyTorch. This is a lower-level API compared to the two
   mentioned above, giving more flexibility, but requiring more work on your end.
-- The `optimum.llm_quantization` package allows to [quantize and run LLM models](https://huggingface.co/docs/optimum/llm_quantization/usage_guides/quantization)
+- The `optimum.gptq` package allows to [quantize and run LLM models](../llm_quantization/usage_guides/quantization)
 with GPTQ.
 
 ## Going further: How do machines represent numbers?
diff --git a/docs/source/llm_quantization/usage_guides/quantization.mdx b/docs/source/llm_quantization/usage_guides/quantization.mdx
index 58b85d514c..87f21bce01 100644
--- a/docs/source/llm_quantization/usage_guides/quantization.mdx
+++ b/docs/source/llm_quantization/usage_guides/quantization.mdx
@@ -4,16 +4,16 @@
 🤗 Optimum collaborated with [AutoGPTQ library](https://github.com/PanQiWei/AutoGPTQ) to provide a simple API that apply GPTQ quantization on language models. With GPTQ quantization, you can quantize your favorite language model to 8, 6, 4 or even 2 bits. This comes without a big drop of performance and with faster inference speed. This is supported by most GPU hardwares.
 
-If you want to quantize 🤗 Transformers models with GPTQ, follow this [documentation](https://huggingface.co/docs/transformers/main_classes/quantization). 
+If you want to quantize 🤗 Transformers models with GPTQ, follow this [documentation](https://huggingface.co/docs/transformers/main_classes/quantization).
 
-To learn more about the quantization technique used in GPTQ, please refer to: 
+To learn more about the quantization technique used in GPTQ, please refer to:
 - the [GPTQ](https://arxiv.org/pdf/2210.17323.pdf) paper
 - the [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) library used as the backend
 
 Note that the AutoGPTQ library provides more advanced usage (triton backend, fused attention, fused MLP) that are not integrated with Optimum. For now, we leverage only the CUDA kernel for GPTQ.
 
 ### Requirements
 
-You need to have the following requirements installed to run the code below: 
+You need to have the following requirements installed to run the code below:
 
 - AutoGPTQ library:
 `pip install auto-gptq`
@@ -21,7 +21,7 @@ You need to have the following requirements installed to run the code below:
 - Optimum library:
 `pip install --upgrade optimum`
 
-- Install latest `transformers` library from source: 
+- Install latest `transformers` library from source:
 `pip install --upgrade git+https://github.com/huggingface/transformers.git`
 
 - Install latest `accelerate` library:
@@ -90,15 +90,7 @@ quantized_model = load_quantized_model(empty_model, save_folder=save_folder, dev
 
 Note that only 4-bit models are supported with exllama kernels for now. Furthermore, it is recommended to disable the exllama kernel when you are finetuning your model with peft.
 
-#### Fine-tune a quantized model 
+#### Fine-tune a quantized model
 
-With the official support of adapters in the Hugging Face ecosystem, you can fine-tune models that have been quantized with GPTQ. 
+With the official support of adapters in the Hugging Face ecosystem, you can fine-tune models that have been quantized with GPTQ.
 Please have a look at [`peft`](https://github.com/huggingface/peft) library for more details.
-
-### References
-
-[[autodoc]] gtpq.GPTQQuantizer
-    - all
-
-[[autodoc]] gtpq.load_quantized_model
-    - all
\ No newline at end of file
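
For readers skimming the patch, the workflow documented by the guide touched above goes roughly as in the sketch below: quantize a causal language model with `GPTQQuantizer`, save it, then reload it into an empty model with `load_quantized_model` (the call visible in the last hunk header). This sketch is not part of the patch; the calibration arguments (`dataset`, `model_seqlen`), the `quantizer.save` step, the example checkpoint, and the `device_map="auto"` keyword (the hunk header truncates the argument to `dev`) are assumptions to check against the published `optimum.gptq` guide.

```python
# Minimal sketch of the GPTQ flow described in the patched guide (assumptions noted above).
# Requires the packages listed under "Requirements": auto-gptq, optimum, transformers, accelerate.
import torch
from accelerate import init_empty_weights
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer, load_quantized_model

model_name = "facebook/opt-125m"   # small checkpoint chosen only for illustration
save_folder = "opt-125m-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Quantize to 4 bits with a calibration dataset (argument names assumed from the optimum.gptq API).
quantizer = GPTQQuantizer(bits=4, dataset="c4", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)

# Persist the quantized weights and the quantization config.
quantizer.save(quantized_model, save_folder)

# Reload later: build an empty model, then load the quantized weights into it,
# mirroring the `load_quantized_model(empty_model, save_folder=save_folder, dev...)` call
# shown in the last hunk (device placement assumed to be `device_map="auto"`).
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
empty_model.tie_weights()
quantized_model = load_quantized_model(empty_model, save_folder=save_folder, device_map="auto")
```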