Initial fused GPTQ implementation #141 (new file: `benchmarks/Profiling.MD`)

## Profiling

A script for simple profiling of `unsloth` models versus default `HuggingFace` `peft` training.

It operates in two modes:

- Default: Runs `SFTTrainer` for a small number of training steps and prints the resulting metrics. A good sanity check of training time and loss.

- Profiling: Runs a sample batch of data through the given model, instrumented with `torch.profiler` and with the CUDA profiler API for `nsys` analysis.

  - Every forward / backward pass of every module in the model is marked (using `torch.profiler.record_function` or `torch.cuda.nvtx`), which helps when viewing the `torch.profiler` and `nsys` traces; see the sketch below.

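A minimal sketch of what the profiling mode does, assuming an HF-style model whose forward returns a `.loss`; this is illustrative only, and the actual instrumentation in `benchmark.py` may differ:

```python
# Illustrative only: per-module NVTX ranges via forward hooks, a torch.profiler
# context for the trace, and cudaProfilerStart/Stop markers so that
# `nsys profile --capture-range=cudaProfilerApi` captures just this step.
import torch
import torch.nn as nn


def add_nvtx_hooks(model: nn.Module) -> None:
    """Push/pop a named NVTX range around every module's forward pass."""
    for name, module in model.named_modules():
        def pre_hook(mod, inputs, _name=name):
            torch.cuda.nvtx.range_push(f"fwd::{_name}")

        def post_hook(mod, inputs, outputs, _name=name):
            torch.cuda.nvtx.range_pop()

        module.register_forward_pre_hook(pre_hook)
        module.register_forward_hook(post_hook)


def profile_one_step(model: nn.Module, batch: dict) -> None:
    add_nvtx_hooks(model)
    torch.cuda.profiler.start()  # cudaProfilerStart
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ]
    ) as prof:
        with torch.profiler.record_function("forward"):
            loss = model(**batch).loss  # assumes labels are included in the batch
        with torch.profiler.record_function("backward"):
            loss.backward()
    torch.cuda.profiler.stop()  # cudaProfilerStop
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```
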
### Usage

The command below runs the HuggingFace `SFTTrainer` for 20 steps and prints the resulting training metrics to `stdout`:

```
python benchmark.py --model_name llama --model_type unsloth-bnb --dtype float16 --dataset guanaco --output_dir ./results
```

Alternatively, the following will run the profiler using `torch.profiler`:

```
python benchmark.py --model_name llama --model_type unsloth-bnb --dtype float16 --profile --output_dir ./results
```

Launching the above command under `nsys` (e.g. with `nsys profile`) also works.

### Script args

#### Model names

- llama
  - maps to one of the following, depending on the model type (see below, and the mapping sketch after this list):
    - "TheBloke/Llama-2-7B-GPTQ"
    - "unsloth/llama-2-7b-bnb-4bit"

- mistral
  - maps to one of the following, depending on the model type (see below):
    - "TheBloke/Mistral-7B-v0.1-GPTQ"
    - "unsloth/mistral-7b-bnb-4bit"

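For reference, the model name and type resolve to checkpoints roughly as in the hypothetical mapping below; the actual lookup in `benchmark.py` may be organized differently, and `resolve_model_id` is an illustrative helper, not part of the script:

```python
# Hypothetical mapping of --model_name and quantization family to checkpoints.
MODEL_IDS = {
    ("llama", "gptq"): "TheBloke/Llama-2-7B-GPTQ",
    ("llama", "bnb"): "unsloth/llama-2-7b-bnb-4bit",
    ("mistral", "gptq"): "TheBloke/Mistral-7B-v0.1-GPTQ",
    ("mistral", "bnb"): "unsloth/mistral-7b-bnb-4bit",
}


def resolve_model_id(model_name: str, model_type: str) -> str:
    # GPTQ model types use the TheBloke GPTQ checkpoints; bnb types use the
    # pre-quantized unsloth bitsandbytes checkpoints.
    family = "bnb" if "bnb" in model_type else "gptq"
    return MODEL_IDS[(model_name, family)]
```
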
#### Model types

- unsloth-gptq-triton
  - Fast LoRA implementation that uses the `auto_gptq` `triton` quantized matmul kernels fused with the LoRA adapters in a custom autograd function.

- hf-gptq-default
  - Default `HuggingFace` GPTQ `peft` implementation. Note that HF uses the `auto_gptq` `cuda` quant linear layers under the hood.
    However, this `auto_gptq` layer automatically disables the `cuda` kernel when the layer is trainable and falls back to a pure torch implementation (see [these lines](https://github.com/AutoGPTQ/AutoGPTQ/blob/d2662b18bb91e1864b29e4e05862712382b8a076/auto_gptq/nn_modules/qlinear/qlinear_cuda.py#L40-L41)).

- hf-gptq-triton-patch
  - A patch of the `HuggingFace` GPTQ `peft` model that replaces the default `cuda` `qlinear` layers with `auto_gptq` `triton` `qlinear` layers (see the sketch after this list).

- unsloth-bnb
  - Fast LoRA implementation with fused `bitsandbytes` quantization and LoRA adapters for custom autograd.

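As a rough illustration of the `hf-gptq-triton-patch` idea, the swap could look like the hypothetical sketch below. The `QuantLinear` imports refer to `auto_gptq`'s qlinear modules; the constructor arguments and the `patch_qlinear_to_triton` helper are assumptions, and the PR's actual patch may differ:

```python
# Hypothetical sketch: replace auto_gptq CUDA QuantLinear modules with their
# Triton counterparts so the Triton matmul kernels are used during training.
import torch.nn as nn
from auto_gptq.nn_modules.qlinear.qlinear_cuda import QuantLinear as CudaQuantLinear
from auto_gptq.nn_modules.qlinear.qlinear_triton import QuantLinear as TritonQuantLinear


def patch_qlinear_to_triton(model: nn.Module) -> nn.Module:
    # Collect replacements first so modules are not mutated while iterating.
    replacements = []
    for _, parent in model.named_modules():
        for child_name, child in parent.named_children():
            if isinstance(child, CudaQuantLinear):
                replacements.append((parent, child_name, child))

    for parent, child_name, old in replacements:
        new = TritonQuantLinear(
            old.bits,
            old.group_size,
            old.infeatures,
            old.outfeatures,
            bias=old.bias is not None,
            trainable=True,
        )
        # Carry over the already-quantized qweight/qzeros/scales/g_idx buffers.
        new.load_state_dict(old.state_dict(), strict=False)
        setattr(parent, child_name, new)
    return model
```
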
#### Additional

For additional options:

```
python benchmark.py --help
```

**NOTE**: When profiling the default HuggingFace models, the following lines need to be changed for the backward profiling hooks to work; see [here](https://github.com/huggingface/peft/blob/bfc102c0c095dc9094cdd3523b729583bfad4688/src/peft/tuners/lora/gptq.py#L70).

_Original_

```python
result += output
return result
```

_New_

```python
return result + output
```

The following [lines](https://github.com/huggingface/peft/blob/bfc102c0c095dc9094cdd3523b729583bfad4688/src/peft/tuners/lora/layer.py#L320) also need to be patched:

```python
def forward(self, x: torch.Tensor, *args: Any, **kwargs: Any) -> torch.Tensor:
    previous_dtype = x.dtype

    if self.disable_adapters:
        if self.merged:
            self.unmerge()
        result = self.base_layer(x, *args, **kwargs)
    elif self.merged:
        result = self.base_layer(x, *args, **kwargs)
    else:
        result = self.base_layer(x, *args, **kwargs)
        out = result
        for active_adapter in self.active_adapters:
            if active_adapter not in self.lora_A.keys():
                continue
            lora_A = self.lora_A[active_adapter]
            lora_B = self.lora_B[active_adapter]
            dropout = self.lora_dropout[active_adapter]
            scaling = self.scaling[active_adapter]
            x = x.to(lora_A.weight.dtype)
            # original (in-place): result += lora_B(lora_A(dropout(x))) * scaling
            # accumulate out of place so the backward profiling hooks fire
            out = out + lora_B(lora_A(dropout(x))) * scaling
        return out.to(previous_dtype)

    result = result.to(previous_dtype)
    return result
```