Releases: unslothai/unsloth
Llama 3.3 + Dynamic 4bit Quants
We provide dynamic 4bit quants which use a bit more memory but vastly improve accuracy for finetuning and inference. Unsloth now defaults to these versions! See https://unsloth.ai/blog/dynamic-4bit for more details.
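For example, a minimal sketch of loading a dynamic 4bit checkpoint (the -unsloth-bnb-4bit repo suffix is our assumption here; passing load_in_4bit = True to a supported model now picks the dynamic quant by default):
from unsloth import FastVisionModel

# Assumed repo name for a dynamic 4bit upload.
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit", # assumed naming
    load_in_4bit = True,
)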
Llama 3.3 is out now! Read our blog: https://unsloth.ai/blog/llama3-3
- You can now fine-tune Llama 3.3 (70B) with context lengths of up to 90,000 tokens with Unsloth - 13x longer than the 6,900 tokens Hugging Face + FA2 supports on an 80GB GPU (a loading sketch follows this list).
- For Llama 3.1 (8B), Unsloth can now handle a whopping 342,000-token context length, exceeding the 128K context Llama 3.1 natively supports. HF + FA2 can only do 28,000 on an 80GB GPU, so Unsloth supports 12x longer contexts.
- 70B models can now fit in 41GB of VRAM - just over 40GB!
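As referenced above, a minimal loading sketch for long-context finetuning (the repo name and the exact context length that fits are assumptions; adjust for your GPU):
from unsloth import FastLanguageModel

# Assumed 4bit upload name; max_seq_length sets the finetuning context length.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit", # assumed repo name
    max_seq_length = 90000, # long-context finetuning on an 80GB GPU
    load_in_4bit = True,
)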
All notebooks now use these dynamic quants:
- Llama 3.2 Vision finetuning - Radiography use case. Free Colab Kaggle Notebook
- Qwen 2 VL Vision finetuning - Maths OCR to LaTeX. Free Colab Kaggle Notebook
- Pixtral 12B Vision finetuning - General QA datasets. Free Colab
- Please run
pip install --upgrade --no-cache-dir unsloth unsloth_zoo
Experiments
Quantizing Qwen2-VL-2B-Instruct down to 4 bits across all layers breaks the model entirely, while Unsloth's dynamic quant keeps it intact:
| Qwen2-VL-2B-Instruct | Description | Size | Result |
|---|---|---|---|
| 16bit | The image shows a train traveling on tracks. | 4.11GB | ✅ |
| Default 4bit all layers | The image depicts a vibrant and colorful scene of a coastal area. | 1.36GB | ❌ |
| Unsloth quant | The image shows a train traveling on tracks. | 1.81GB | ✅ |
Merging to 16bit now works as expected.
Fixed a major bug which caused merges to not function correctly for vision models.
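For reference, a minimal sketch of the merge (the same saving API shown further below; the output folder name is just an example):
# Merge LoRA adapters into 16bit weights and save locally.
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit")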
Llama.cpp GGUF saving now uses cmake. All saving modules inside Unsloth have also been updated!
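GGUF export itself is unchanged from the user's side; a minimal sketch (the quantization method follows llama.cpp's naming):
# Save a llama.cpp GGUF locally (llama.cpp is now built with cmake under the hood).
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method = "q4_k_m")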
Apple Cut Cross Entropy
We worked with Apple to add Cut Cross Entropy to Unsloth, which reduces VRAM use and further increases context length.
QwQ 4bit quants and GGUFs
Try an O1-style test-time-compute LLM out! See https://huggingface.co/unsloth
What's Changed
- Vision by @danielhanchen in #1318
- Bug fixes for vision by @danielhanchen in #1340
- Update README.md by @shimmyshimmer in #1374
- Fix llama.cpp GGUF by @danielhanchen in #1375
- Dynamic quants by @danielhanchen in #1379
Full Changelog: November-2024...December-2024
Vision finetuning
- We support Llama 3.2 Vision 11B, 90B; Pixtral; Qwen2VL 2B, 7B, 72B; and any Llava variants like Llava NeXT!
- We support 16bit LoRA or 4bit QLoRA. Both are accelerated and use much less memory!
- Llama 3.2 Vision finetuning - Radiography use case. Free Colab Kaggle Notebook
- Qwen 2 VL Vision finetuning - Maths OCR to LaTeX. Free Colab Kaggle Notebook
- Pixtral 12B Vision finetuning - General QA datasets. Free Colab
- Please run
pip install --upgrade --no-cache-dir unsloth unsloth_zoo
from unsloth import FastVisionModel # NEW instead of FastLanguageModel
import torch

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit = True, # Use 4bit quantization to reduce memory usage. Can be False.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision part
    finetune_language_layers   = True, # False if not finetuning language part
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers
    r = 16,           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 16,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)
from datasets import load_dataset
dataset = load_dataset("unsloth/llava-instruct-mix-vsft-mini", split = "train")

from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model) # Enable for training!

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer), # Must use!
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1, # Reduce to 1 to make Pixtral fit!
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 30,
        # num_train_epochs = 1, # Set this instead of max_steps for full training runs
        learning_rate = 2e-4,
        fp16 = not is_bf16_supported(),
        bf16 = is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # For Weights and Biases
        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        dataset_num_proc = 4,
        max_seq_length = 2048,
    ),
)
trainer_stats = trainer.train()
After finetuning, you can also do inference:
FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[2]["images"][0]
instruction = "Is there something interesting about this image?"

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)
We also support merging QLoRA / LoRA directly into 16bit weights for serving:
# Select ONLY 1 to save! (Both not needed!)
# Save locally to 16bit
if False: model.save_pretrained_merged("unsloth_finetune", tokenizer,)
# To export and save to your Hugging Face account
if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", tokenizer, token = "PUT_HERE")
What's Changed
- Llama 3.2 by @danielhanchen in #1058
- Fix merges by @danielhanchen in #1079
- Handle absolute paths for save_to_gguf using pathlib by @giuliabaldini in #1120
- Only remove folder in sentencepiece check if it was created by @giuliabaldini in #1121
- Gradient Accumulation Fix by @danielhanchen in #1134
- Gradient Accumulation Fix by @danielhanchen in #1146
- fix: compute_loss bug by @vo1d-ai in #1151
- Windows installation guide in README by @timothelaborie in #1165
- chore: update chat_templates.py by @eltociear in #1166
- Many bug fixes by @danielhanchen in #1162
- Fix/patch tokenizer by @Erland366 in #1171
- Fix DPO, ORPO by @danielhanchen in #1177
- fix/transformers-unpack by @Erland366 in #1180
- Fix 4.47 issue by @danielhanchen in #1182
- 25% less mem and 10% faster training: Do not upcast lm_head and embedding to float32 by @Datta0 in #1186
- Cleanup upcast logs by @Datta0 in #1188
- Fix/phi-longrope by @Erland366 in #1193
- Bug fixes by @danielhanchen in #1195
- Fix/casting continue pretraining by @Erland366 in #1200
- Feat/all tmp by @danielhanchen in #1219
- Bug fixes by @danielhanchen in #1245
- Bug fix by @danielhanchen in #1249
- Bug fixes by @danielhanchen in #1255
- Fix: cast logits to float32 in cross_entropy_forward to prevent errors by @Erland366 in #1254
- Throw error when inferencing longer than max_popsition_embeddings by @Datta0 in #1236
- CLI now handles user input strings for dtype correctly by @Rabbidon in #1235
- Bug fixes by @danielhanchen in #1259
- Qwen 2.5 by @danielhanchen in #1280
- Fix/export mistral by @Erland366 in #1281
- DOC Update - Update README.md with os.environ in example by @udaygirish in #1269
- fix/get_chat_template by @Erland366 in #1246
- fix/sft-trainer by @Erland366 in #1276
- Bug fixes by @danielhanchen in #1288
- fix/sfttrainer-compatibility by @Erland366 in #1293
New Contributors
- @giuliabaldini made their first contribution in #1120
- @vo1d-ai made their first contribution in #1151
- @timothelaborie made their first contribution in #1165
- @eltociear made their first contribution in #1166
- @Erland366 made their first contribution in #1171
- @Datta0 made their first contribution in #1186
- @Rabbidon made their first contribution in #1235
- @udaygirish made their first contribution in #1269
Full Changelog: September-2024...November-2024
Gradient Accumulation Fix
We fixed a gradient accumulation bug which was actually first discovered back in 2021 here, and rediscovered here. Read more in our blog post: https://unsloth.ai/blog/gradient
We have a Colab Notebook for Llama 3.2 using the fixed trainer and a Kaggle Notebook as well.
Essentially, training with batch size bsz and gradient accumulation steps ga should theoretically be equivalent to full-batch training (batch size bsz * ga) with no gradient accumulation, but weirdly the training losses do not match up.
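A minimal numeric sketch of the mismatch, assuming the root cause described in the blog (each micro-batch's cross entropy is averaged over its own non-padded token count before the per-step averaging):
import torch

# One "full batch" of token losses split into two micro-batches with
# different numbers of non-padded tokens.
losses_mb1 = torch.tensor([1.0, 1.0, 1.0, 1.0]) # 4 tokens
losses_mb2 = torch.tensor([2.0, 2.0])           # 2 tokens

# Full-batch training: one mean over all 6 tokens.
full_batch = torch.cat([losses_mb1, losses_mb2]).mean()   # 1.3333

# Naive accumulation: average the per-micro-batch means.
naive_accum = (losses_mb1.mean() + losses_mb2.mean()) / 2 # 1.5000

# Fixed accumulation: weight each micro-batch by its token count.
n_tokens = losses_mb1.numel() + losses_mb2.numel()
fixed_accum = (losses_mb1.sum() + losses_mb2.sum()) / n_tokens # 1.3333

print(full_batch.item(), naive_accum.item(), fixed_accum.item())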
To use Unsloth's fixed trainer with gradient accumulation, use:
from unsloth import unsloth_train
# trainer_stats = trainer.train() << Buggy if using gradient accumulation
trainer_stats = unsloth_train(trainer) # << Fixed gradient accumulation
Please update Unsloth on local machines (no need for Colab / Kaggle) via:
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
Read our blog post: https://unsloth.ai/blog/gradient for more details!
What's Changed
- Llama 3.2 by @danielhanchen in #1058
- Fix merges by @danielhanchen in #1079
- Handle absolute paths for save_to_gguf using pathlib by @giuliabaldini in #1120
- Only remove folder in sentencepiece check if it was created by @giuliabaldini in #1121
- Gradient Accumulation Fix by @danielhanchen in #1134
New Contributors
- @giuliabaldini made their first contribution in #1120
Full Changelog: September-2024...October-2024
Qwen 2.5 Support
Qwen 2.5 Support is here!
There are some issues with Qwen 2.5 models which Unsloth has fixed!
- Kaggle Base model finetuning notebook: https://www.kaggle.com/code/danielhanchen/kaggle-qwen-2-5-unsloth-notebook/notebook
- Kaggle Instruct model finetuning notebook: https://www.kaggle.com/code/danielhanchen/kaggle-qwen-2-5-conversational-unsloth
- Colab finetuning notebook: https://colab.research.google.com/drive/1Kose-ucXO1IBaZq5BvbwWieuubP7hxvQ?usp=sharing
- Colab conversational notebook: https://colab.research.google.com/drive/1qN1CEalC70EO1wGKhNxs1go1W9So61R5?usp=sharing
EOS token issues
Qwen 2.5 base models (0.5B all the way up to 72B) - the EOS token should be <|endoftext|>, not <|im_end|>. The base models' <|im_end|> token is actually untrained, so using it will cause NaN gradients. You should re-pull the tokenizer from source, or you can download the fixed base models from https://huggingface.co/unsloth.
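A quick sanity check on the tokenizer (the unsloth/Qwen2.5-7B repo name is an assumption; substitute whichever base size you use):
from transformers import AutoTokenizer

# Assumed fixed base upload; any Qwen 2.5 base size works the same way.
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-7B")
print(tokenizer.eos_token) # base models should report <|endoftext|>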
Chat template issues
- Qwen 2.5 base models should NOT have a chat_template - having one will actually cause errors, especially in Unsloth's finetuning notebooks, since we check whether untrained tokens exist in the chat template to counteract NaN gradients.
- Do NOT use Qwen 2.5's chat template for the base models - this will cause NaN gradients! (A minimal workaround is sketched after this list.)
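If your base tokenizer ships with an instruct chat_template, one minimal workaround (our suggestion, not the notebooks' exact logic) is simply to clear it before finetuning:
# Base models should not carry an instruct chat template; drop it if present.
if getattr(tokenizer, "chat_template", None) is not None:
    tokenizer.chat_template = None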
4bit uploaded models
| Base | Base 4bit | Instruct | Instruct 4bit |
|---|---|---|---|
| Qwen 2.5 0.5b | 4bit 0.5b | Instruct 0.5b | 4bit Instruct 0.5b |
| Qwen 2.5 1.5b | 4bit 1.5b | Instruct 1.5b | 4bit Instruct 1.5b |
| Qwen 2.5 3b | 4bit 3b | Instruct 3b | 4bit Instruct 3b |
| Qwen 2.5 7b | 4bit 7b | Instruct 7b | 4bit Instruct 7b |
| Qwen 2.5 14b | 4bit 14b | Instruct 14b | 4bit Instruct 14b |
| Qwen 2.5 32b | 4bit 32b | Instruct 32b | 4bit Instruct 32b |
| Qwen 2.5 72b | 4bit 72b | Instruct 72b | 4bit Instruct 72b |
What's Changed
- Phi 3.5 by @danielhanchen in #940
- Phi 3.5 by @danielhanchen in #941
- Fix DPO by @danielhanchen in #947
- Phi 3.5 bug fix by @danielhanchen in #955
- Cohere, Bug fixes by @danielhanchen in #984
- Gemma faster inference by @danielhanchen in #987
- Bug fixes by @danielhanchen in #1004
- Update README.md by @danielhanchen in #1033
- Update README.md by @danielhanchen in #1036
- fix: chat_templates.py bug by @NazimHAli in #1048
New Contributors
- @NazimHAli made their first contribution in #1048
Full Changelog: August-2024...September-2024
Phi 3.5
Phi 3.5 is here!
Try it out here: https://colab.research.google.com/drive/1lN6hPQveB_mHSnTOYifygFcrO8C1bxq4?usp=sharing
What's Changed
- Llama 3.1 by @danielhanchen in #797
- Better debugging by @danielhanchen in #826
- fix UnboundLocalError by @xyangk in #834
- Gemma by @danielhanchen in #843
- Fix ROPE extension issue and device mismatch by @xyangk in #840
- Fix RoPE extension by @danielhanchen in #846
- fix: fix config.torch_dtype bug by @relic-yuexi in #874
- pascal support by @emuchogu in #870
- Fix tokenizers by @danielhanchen in #887
- Torch 2.4, Xformers>0.0.27, TRL>0.9, Python 3.12 + bug fixes by @danielhanchen in #902
- Fix DPO stats by @danielhanchen in #906
- Fix Chat Templates by @danielhanchen in #916
- Fix chat templates by @danielhanchen in #917
- Bug Fixes by @danielhanchen in #920
- Fix mapping by @danielhanchen in #921
- untrained tokens llama 3.1 base by @danielhanchen in #929
- Bug #930 by @danielhanchen in #931
- Fix NEFTune by @danielhanchen in #937
- Update README.md by @danielhanchen in #938
New Contributors
- @relic-yuexi made their first contribution in #874
- @emuchogu made their first contribution in #870
Full Changelog: July-Mistral-2024...August-2024
Llama 3.1 Support
We're excited to announce that Unsloth makes finetuning Llama 3.1 2.1x faster with 60% less VRAM! Read up on our release here: https://unsloth.ai/blog/llama3-1
We uploaded a Google Colab notebook to finetune Llama 3.1 (8B) on a free Tesla T4: Llama 3.1 (8B) Notebook. We also have a new UI on Google Colab for chatting with your Llama 3.1 Instruct models which uses our own 2x faster inference engine.
Run UI Preview
We created a new chat UI using Gradio where users can upload and chat with their Llama 3.1 Instruct models online for free on Google Colab.
We uploaded 4bit bitsandbytes quants here: https://huggingface.co/unsloth
To finetune Llama 3.1, please update Unsloth:
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
July-Mistral-2024
Mistral NeMo, Ollama & CSV support
See https://unsloth.ai/blog/mistral-nemo for more details. 4bit pre-quantized weights are at https://huggingface.co/unsloth.
Our 2x faster, 60% less VRAM Colab finetuning notebook is here, and our Kaggle notebook is here.
Export to Ollama & CSV Support
To use it, create and customize your chat template with a dataset, and Unsloth will automatically export the finetune to Ollama, including automatic Modelfile creation. We also created a 'Step-by-Step Tutorial on How to Finetune Llama-3 and Deploy to Ollama'. Check out our Ollama Llama-3 Alpaca and CSV/Excel Ollama Guide notebooks.
Unlike regular chat templates that use 3 columns, Ollama simplifies the process with just 2 columns: instruction and output. And with Ollama, you can save, run, and deploy your finetuned models locally on your own device.
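A minimal sketch of the 2-column flow, assuming a local data.csv with instruction and output columns and a tokenizer already loaded via Unsloth (the chat template name is only an example; the Ollama notebook also writes the Modelfile for you):
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

# Assumed CSV with two columns: "instruction" and "output".
dataset = load_dataset("csv", data_files = "data.csv", split = "train")

# Attach a chat template to the tokenizer (example template name).
tokenizer = get_chat_template(tokenizer, chat_template = "llama-3")

def to_text(examples):
    convos = [
        [{"role": "user", "content": q}, {"role": "assistant", "content": a}]
        for q, a in zip(examples["instruction"], examples["output"])
    ]
    return {"text": [tokenizer.apply_chat_template(c, tokenize = False) for c in convos]}

dataset = dataset.map(to_text, batched = True)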
Train on Completions / Inputs
We now support training only on the output tokens and not the inputs, which can increase accuracy. Try it with:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    ...
    args = TrainingArguments(
        ...
    ),
)

from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(trainer)
RoPE Scaling for all models
We now allow you to finetune Gemma 2, Mistral, Mistral NeMo, Qwen2 and more models with "unlimited" context lengths via RoPE linear scaling in Unsloth. Coupled with our 4x longer context support, Unsloth enables extremely long contexts!
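A minimal sketch (the model choice is just an example): request a max_seq_length beyond the model's native window and Unsloth applies the RoPE scaling for you.
from unsloth import FastLanguageModel

# Example: Gemma 2 9B has a native 8192-token window; asking for more makes
# Unsloth apply RoPE linear scaling automatically.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/gemma-2-9b-bnb-4bit", # example model name
    max_seq_length = 16384, # longer than the native window
    load_in_4bit = True,
)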
New Docs!
Introducing our new Documentation site which has all the most important info about Unsloth in one place. If you'd like to contribute, please contact us! Docs: https://docs.unsloth.ai/
Update instructions
Please update Unsloth on local machines (on Colab and Kaggle, just refresh and reload the notebook) via:
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
2x faster Gemma 2
Gemma 2 support
We now support Gemma 2! It's 2x faster and uses 63% less VRAM than HF+FA2!
We have a Gemma 2 9b notebook here: https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing
To use Gemma 2, please update Unsloth:
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
Head over to our blog post: https://unsloth.ai/blog/gemma2 for more details.
We uploaded 4bit quants for 4x faster downloading to the repos below (a loading sketch follows the list):
https://huggingface.co/unsloth/gemma-2-9b-bnb-4bit
https://huggingface.co/unsloth/gemma-2-27b-bnb-4bit
https://huggingface.co/unsloth/gemma-2-9b-it-bnb-4bit
https://huggingface.co/unsloth/gemma-2-27b-it-bnb-4bit
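For example, a minimal sketch loading the 9B instruct quant from the list above:
from unsloth import FastLanguageModel

# Load the pre-quantized 4bit Gemma 2 9B instruct upload listed above.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/gemma-2-9b-it-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model) # enable Unsloth's faster inference path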
Continued pretraining
You can now do continued pretraining with Unsloth. See https://unsloth.ai/blog/contpretraining for more details!
Continued pretraining is 2x faster and uses 50% less VRAM than HF + FA2 QLoRA. We offload embed_tokens and lm_head to disk to save VRAM!
You can now simply use both in the target modules like below:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,   # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
We also allow 2 learning rates - one for the embedding matrices and another for the LoRA adapters:
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    args = UnslothTrainingArguments(
        ....
        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6,
    ),
)
We also share a free Colab to finetune Mistral v3 to learn Korean (you can select any language you like) using Wikipedia and the Aya Dataset: https://colab.research.google.com/drive/1tEd1FrOXWMnCU9UIvdYhs61tkxdMuKZu?usp=sharing
And we're sharing our free Colab notebook for continued pretraining for text completion: https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing
What's Changed
- Ollama Chat Templates by @danielhanchen in #582
- Fix case where GGUF saving fails when model_dtype is torch.float16 ("f16") by @chrehall68 in #630
- Support revision parameter in FastLanguageModel.from_pretrained by @chrehall68 in #629
- clears any selected_adapters before calling internal_model.save_pretr… by @neph1 in #609
- Check for incompatible modules before importing unsloth by @xyangk in #602
- Fix #603 handling of formatting_func in tokenizer_utils for assitant/chat/completion training by @Oseltamivir in #604
- Add GGML saving option to Unsloth for easier Ollama model creation and testing. by @mahiatlinux in #345
- Add Documentation for LoraConfig Parameters by @sebdg in #619
- llama.cpp failing by @bet0x in #371
- fix libcuda_dirs import for triton 3.0 by @t-vi in #227
- Nightly by @danielhanchen in #632
- README: Fix minor typo. by @shaper in #559
- Qwen bug fixes by @danielhanchen in #639
- Fix segfaults by @danielhanchen in #641
- Nightly by @danielhanchen in #646
- Nightly by @danielhanchen in #648
- Nightly by @danielhanchen in #649
- Fix breaking bug in save.py with interpreting quantization_method as a string when saving to gguf by @ArcadaLabs-Jason in #651
- Revert "Fix breaking bug in save.py with interpreting quantization_method as a string when saving to gguf" by @danielhanchen in #652
- Revert "Revert "Fix breaking bug in save.py with interpreting quantization_method as a string when saving to gguf"" by @danielhanchen in #653
- Fix GGUF by @danielhanchen in #654
- Fix continuing LoRA finetuning by @danielhanchen in #656
New Contributors
- @chrehall68 made their first contribution in #630
- @neph1 made their first contribution in #609
- @xyangk made their first contribution in #602
- @Oseltamivir made their first contribution in #604
- @mahiatlinux made their first contribution in #345
- @sebdg made their first contribution in #619
- @bet0x made their first contribution in #371
- @t-vi made their first contribution in #227
- @shaper made their first contribution in #559
- @ArcadaLabs-Jason made their first contribution in #651
Full Changelog: https://github.com/unslothai/unsloth/commits/June-2024