- Install Olive
This installs Olive from main. Replace with version 0.8.0 when it is released.
pip install git+https://github.com/microsoft/olive
- Install ONNX Runtime generate()
pip install onnxruntime-genai
- Install other dependencies
pip install optimum peft
- Downgrade torch and transformers
TODO: There is an export bug with torch 2.5.0 and an incompatibility with transformers>=4.45.0
pip uninstall torch
pip install torch==2.4
pip uninstall transformers
pip install transformers==4.44
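If you want to confirm that the pins took effect in your environment, a quick check such as the following (just a convenience, not part of the Olive workflow) can help:

```python
# Sanity check that the pinned versions are the ones actually installed.
import torch
import transformers

print("torch:", torch.__version__)                 # expect 2.4.x
print("transformers:", transformers.__version__)   # expect 4.44.x
```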
- Choose a model
In this example we'll use Llama-3.1-8B-Instruct.
You need to register with Meta for a license to use this model. You can do this by visiting the model's page on Hugging Face, signing in, and requesting access. Access should be granted quickly. Ensure that the huggingface-cli is installed (pip install huggingface-hub[cli]) and that you are logged in via huggingface-cli login.
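If you want to verify your login and gated-model access before starting the longer steps below, a sketch like this using the huggingface_hub Python API works; it assumes the model ID used later in this guide (meta-llama/Llama-3.1-8B-Instruct):

```python
# Sanity check: confirms you are logged in and the gated repo is accessible.
from huggingface_hub import model_info, whoami

print("Logged in as:", whoami()["name"])        # raises if you are not logged in
model_info("meta-llama/Llama-3.1-8B-Instruct")  # raises if access has not been granted yet
print("Access OK")
```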
- Locate datasets and/or existing adapters
In this example, we will use two pre-tuned adapters: Coldstart/Llama-3.1-8B-Instruct-Surfer-Dude-Personality and Coldstart/Llama-3.1-8B-Instruct-Hillbilly-Personality.
Note the output path cannot have any period (.) characters.
Note also that this step requires 63GB of memory on the machine on which it is running.
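To double-check that an adapter repo really contains a LoRA adapter, and to see which base model it was tuned from, you can inspect its config with peft. This is a minimal optional sketch, not a required step:

```python
# Load only the adapter config (a small download) and print its key fields.
from peft import PeftConfig

cfg = PeftConfig.from_pretrained("Coldstart/Llama-3.1-8B-Instruct-Surfer-Dude-Personality")
print(cfg.peft_type)                # expected: LORA
print(cfg.base_model_name_or_path)  # the base model the adapter was tuned from
```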
- Export the model to ONNX format
Note: add --use_model_builder when this is ready
olive capture-onnx-graph -m meta-llama/Llama-3.1-8B-Instruct --adapter_path Coldstart/Llama-3.1-8B-Instruct-Surfer-Dude-Personality -o models\Llama-3-1-8B-Instruct-LoRA --torch_dtype float32 --use_ort_genai
- (Optional) Quantize the model
olive quantize -m models\Llama-3-1-8B-Instruct-LoRA --algorithm rtn --implementation matmul4 -o models\Llama-3-1-8B-Instruct-LoRA-int4
- Adapt model
olive generate-adapter -m models\Llama-3-1-8B-Instruct-LoRA-int4 -o models\Llama-3-1-8B-Instruct-LoRA-int4\adapted
- Convert adapters to ONNX
This step assumes you quantized the model in the (optional) quantization step above. If you skipped that step, remove the --quantize_int4 argument.
olive convert-adapters --adapter_path Coldstart/Llama-3.1-8B-Instruct-Surfer-Dude-Personality --output_path adapters\Llama-3.1-8B-Instruct-Surfer-Dude-Personality --dtype float32 --quantize_int4
olive convert-adapters --adapter_path Coldstart/Llama-3.1-8B-Instruct-Hillbilly-Personality --output_path adapters\Llama-3.1-8B-Instruct-Hillbilly-Personality --dtype float32 --quantize_int4
- Run the model with an adapter
See app.py as an example:
python app.py -m models\Llama-3-1-8B-Instruct-LoRA-int4\adapted\model -a adapters\Llama-3.1-8B-Instruct-Hillbilly-Personality.onnx_adapter -t "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -s "You are a friendly chatbot" -p "Hi, how are you today?"
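For reference, the core of app.py looks roughly like the sketch below: load the adapted ONNX model with onnxruntime-genai, load a converted .onnx_adapter file, activate it, and generate. This is an outline only; the exact calls vary between onnxruntime-genai releases, and app.py is the authoritative version:

```python
import onnxruntime_genai as og

# Load the adapted base model and its tokenizer.
model = og.Model(r"models\Llama-3-1-8B-Instruct-LoRA-int4\adapted\model")
tokenizer = og.Tokenizer(model)

# Load a converted adapter and register it under a name we can activate later.
adapters = og.Adapters(model)
adapters.load(r"adapters\Llama-3.1-8B-Instruct-Hillbilly-Personality.onnx_adapter", "hillbilly")

# Build the prompt from the same chat template passed to app.py via -t.
template = ("<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
            "<|start_header_id|>user<|end_header_id|>\n\n{input}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n")
prompt = template.format(system="You are a friendly chatbot", input="Hi, how are you today?")

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.set_active_adapter(adapters, "hillbilly")   # switch personalities by adapter name
generator.append_tokens(tokenizer.encode(prompt))

while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```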
- Fine-tune your own adapters
Note: this requires CUDA.
Use the olive finetune command: https://microsoft.github.io/Olive/features/cli.html#finetune
Here is an example usage of the command:
olive finetune --method qlora -m meta-llama/Meta-Llama-3-8B -d nampdn-ai/tiny-codes --train_split "train[:4096]" --eval_split "train[4096:4224]" --text_template "### Language: {programming_language} \n### Question: {prompt} \n### Answer: {response}" --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --max_steps 150 --logging_steps 50 -o adapters\tiny-codes
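The --text_template fields ({programming_language}, {prompt}, {response}) must match column names in the dataset. A quick peek like the following (assuming you have access to nampdn-ai/tiny-codes on Hugging Face) confirms that before you launch a training run:

```python
# Load a couple of rows and check the columns referenced by --text_template.
from datasets import load_dataset

ds = load_dataset("nampdn-ai/tiny-codes", split="train[:2]")
print(ds.column_names)   # should include programming_language, prompt, and response
print(ds[0]["programming_language"])
```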