Fix GPTQ doc (#1267)
regisss authored Aug 12, 2023
1 parent a86f334 commit e1eb658
Showing 3 changed files with 8 additions and 16 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/upload_pr_documentation.yml
@@ -2,7 +2,7 @@ name: Upload PR Documentation

on:
workflow_run:
workflows: ["Build PR Documentation"]
workflows: ["Build PR documentation"]
types:
- completed

2 changes: 1 addition & 1 deletion docs/source/concept_guides/quantization.mdx
@@ -185,7 +185,7 @@ models while respecting accuracy and latency constraints.
[PyTorch quantization functions](https://pytorch.org/docs/stable/quantization-support.html#torch-quantization-quantize-fx)
to allow graph-mode quantization of 🤗 Transformers models in PyTorch. This is a lower-level API compared to the two
mentioned above, giving more flexibility, but requiring more work on your end.
- - The `optimum.llm_quantization` package allows to [quantize and run LLM models](https://huggingface.co/docs/optimum/llm_quantization/usage_guides/quantization)
+ - The `optimum.gptq` package allows to [quantize and run LLM models](../llm_quantization/usage_guides/quantization) with GPTQ.

## Going further: How do machines represent numbers?

20 changes: 6 additions & 14 deletions docs/source/llm_quantization/usage_guides/quantization.mdx
@@ -4,24 +4,24 @@

🤗 Optimum collaborated with the [AutoGPTQ library](https://github.com/PanQiWei/AutoGPTQ) to provide a simple API that applies GPTQ quantization to language models. With GPTQ quantization, you can quantize your favorite language model to 8, 6, 4 or even 2 bits. This comes without a big drop in performance and with faster inference speed, and is supported by most GPU hardware.
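As an illustration of that API, here is a minimal sketch of a GPTQ quantization run with `optimum.gptq` (the model name, bit width, and calibration dataset are assumptions chosen for the example, not taken from this commit):

```python
# Minimal sketch: 4-bit GPTQ quantization of a small causal LM with optimum.gptq.
# The model name, bit width and calibration dataset are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

model_name = "facebook/opt-125m"  # hypothetical small model used only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Calibrate and quantize to 4 bits using the C4 dataset
quantizer = GPTQQuantizer(bits=4, dataset="c4", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)

# Save the quantized weights together with the quantization config
save_folder = "opt-125m-gptq"
quantizer.save(quantized_model, save_folder)
```

The saved folder can then be reloaded later, as sketched further below in the loading and fine-tuning notes.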

If you want to quantize 🤗 Transformers models with GPTQ, follow this [documentation](https://huggingface.co/docs/transformers/main_classes/quantization).

To learn more about the quantization technique used in GPTQ, please refer to:
- the [GPTQ](https://arxiv.org/pdf/2210.17323.pdf) paper
- the [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) library used as the backend
Note that the AutoGPTQ library provides more advanced features (Triton backend, fused attention, fused MLP) that are not integrated with Optimum. For now, we leverage only the CUDA kernel for GPTQ.

### Requirements

You need to have the following requirements installed to run the code below:

- AutoGPTQ library:
`pip install auto-gptq`

- Optimum library:
`pip install --upgrade optimum`

- Install the latest `transformers` library from source:
`pip install --upgrade git+https://github.com/huggingface/transformers.git`

- Install the latest `accelerate` library:
@@ -90,15 +90,7 @@ quantized_model = load_quantized_model(empty_model, save_folder=save_folder, dev

Note that only 4-bit models are supported with exllama kernels for now. Furthermore, it is recommended to disable the exllama kernel when you are fine-tuning your model with `peft`.
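For context, a minimal sketch of how such a quantized checkpoint might be reloaded, with the exllama kernel explicitly disabled for the fine-tuning case mentioned above (the `disable_exllama` argument and the folder name are assumptions for illustration):

```python
# Sketch: reload the quantized checkpoint into an empty model and place it on GPU.
# The folder name and the `disable_exllama` flag are assumptions for this example.
import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM
from optimum.gptq import load_quantized_model

save_folder = "opt-125m-gptq"  # hypothetical folder produced by the quantization step
config = AutoConfig.from_pretrained(save_folder)

# Create a weight-less (meta-device) model to receive the quantized weights
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)
empty_model.tie_weights()

quantized_model = load_quantized_model(
    empty_model,
    save_folder=save_folder,
    device_map="auto",
    disable_exllama=True,  # recommended above when fine-tuning with peft
)
```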

#### Fine-tune a quantized model

With the official support of adapters in the Hugging Face ecosystem, you can fine-tune models that have been quantized with GPTQ.
Please have a look at the [`peft`](https://github.com/huggingface/peft) library for more details.
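As a rough sketch of what that looks like with `peft` (the LoRA hyperparameters and target module names below are assumptions and depend on the model architecture):

```python
# Sketch: attach LoRA adapters to the GPTQ-quantized model with peft.
# LoRA hyperparameters and target modules are assumptions and depend on the model.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(quantized_model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable
```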

- ### References
-
- [[autodoc]] gtpq.GPTQQuantizer
-   - all
-
- [[autodoc]] gtpq.load_quantized_model
-   - all
