Revise tutorial to separate out platforms #397

Merged 3 commits on May 4, 2024
examples/python/phi-3-tutorial.md (125 additions, 56 deletions)
# Run the Phi-3 Mini models with the ONNX Runtime generate() API

## Steps
1. [Setup](#setup)
2. [Choose your platform](#choose-your-platform)
3. [Run with DirectML](#run-with-directml)
4. [Run with NVIDIA CUDA](#run-with-nvidia-cuda)
5. [Run on CPU](#run-on-cpu)

## Introduction

There are two Phi-3 mini models to choose from: the short (4k) context version or the long (128k) context version. The long context version can accept much longer prompts and produce longer output text, but it does consume more memory.

The Phi-3 ONNX models are hosted on HuggingFace: [short](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx) and [long](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx).

This tutorial downloads and runs the short context model. If you would like to use the long context model, change the `4k` to `128k` in the instructions below.

## Setup

1. Install the git Large File Storage (LFS) extension

HuggingFace uses `git` for version control. To download the ONNX models, you need the `git lfs` extension installed, if you do not already have it.

* Windows: `winget install -e --id GitHub.GitLFS` (If you don't have winget, download and run the `exe` from the [official source](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage?platform=windows))
* Linux: `apt-get install git-lfs`
* MacOS: `brew install git-lfs`

Then run `git lfs install`

2. Install the HuggingFace CLI

```bash
pip install huggingface-hub[cli]
```
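
Before moving on, you can optionally sanity-check that both tools are available. These checks are not part of the tutorial's own steps; they just confirm the installs succeeded:

```bash
# Optional sanity checks (not required by the tutorial)
git lfs version        # prints the installed git-lfs version
huggingface-cli --help # lists the available CLI commands
```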

## Choose your platform

Are you on a Windows machine with GPU?
* I don't know → Review [this guide](https://www.microsoft.com/en-us/windows/learning-center/how-to-check-gpu) to see whether you have a GPU in your Windows machine.
* Yes → Follow the instructions for [DirectML](#run-with-directml).
* No → Do you have an NVIDIA GPU?
* I don't know → Review [this guide](https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#verify-you-have-a-cuda-capable-gpu) to see whether you have a CUDA-capable GPU.
* Yes → Follow the instructions for [NVIDIA CUDA GPU](#run-with-nvidia-cuda).
* No → Follow the instructions for [CPU](#run-on-cpu).

**Note: Only one package and one model are required, based on your hardware. Only follow the steps in one of the following sections.**

## Run with DirectML

1. Download the model

```bash
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include directml/* --local-dir .
```

This command downloads the model into a folder called `directml`.
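
If you want the long (128k) context model instead, the same download pattern should work against the 128k repository. The exact subfolder layout of the 128k repository is an assumption here; check its file listing on HuggingFace if the path differs:

```bash
# Long-context variant (assumes the 128k repo mirrors the 4k repo's directml/ layout)
huggingface-cli download microsoft/Phi-3-mini-128k-instruct-onnx --include directml/* --local-dir .
```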


2. Install the generate() API

```
pip install numpy
pip install --pre onnxruntime-genai-directml
```

You should now see `onnxruntime-genai-directml` in your `pip list`.
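
For example, the following optional check filters the package list on Windows; any shell-specific filter works equally well:

```
pip list | findstr onnxruntime
```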

3. Run the model

Run the model with [phi3-qa.py](https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py).

```cmd
curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m directml\directml-int4-awq-block-128
```

Once the script has loaded the model, it will ask you for input in a loop, streaming the output as it is produced by the model. For example:

```bash
Input: Tell me a joke about GPUs

Certainly! Here's a light-hearted joke about GPUs:

Why did the GPU go to school? Because it wanted to improve its "processing power"!

This joke plays on the double meaning of "processing power," referring both to the computational abilities of a GPU and the idea of a student wanting to improve their academic skills.
```
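
Under the hood, phi3-qa.py is a thin wrapper around the generate() API's tokenize-and-generate loop. The sketch below shows roughly what that loop looks like if you call the API directly; it assumes the onnxruntime-genai Python API at the time of writing (names such as `GeneratorParams` and `Generator`) and the Phi-3 chat template used by the example script, so treat it as illustrative rather than a drop-in replacement. The same pattern applies to the CUDA and CPU models if you pass their model folders instead.

```python
# Illustrative sketch of the loop phi3-qa.py runs; API names assume the
# onnxruntime-genai Python package available at the time of writing.
import onnxruntime_genai as og

model = og.Model("directml/directml-int4-awq-block-128")  # or the cuda/cpu model folder
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# Phi-3 chat template, as used by the example script
prompt = "<|user|>\nTell me a joke about GPUs <|end|>\n<|assistant|>"

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)
params.input_ids = tokenizer.encode(prompt)

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    # Decode and print each token as soon as it is generated
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```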

## Run with NVIDIA CUDA

1. Download the model

```bash
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .
```

This command downloads the model into a folder called `cuda`.

2. Install the generate() API

```
pip install numpy
pip install --pre onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/
```

3. Run the model

Run the model with [phi3-qa.py](https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py).

```bash
curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m cuda/cuda-int4-rtn-block-32
```

Once the script has loaded the model, it will ask you for input in a loop, streaming the output as it is produced by the model. For example:

```bash
Input: Tell me a joke about creative writing

Output: Why don't writers ever get lost? Because they always follow the plot!
```
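
The script also accepts generation options on the command line; for example, `-m` points at the model folder and `-l` sets the length of output to generate. The flag below is taken from the example script; run `python phi3-qa.py --help` to see the full set for your copy:

```bash
python phi3-qa.py -m cuda/cuda-int4-rtn-block-32 -l 2048
```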

## Run on CPU

1. Download the model

```bash
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
```

This command downloads the model into a folder called `cpu_and_mobile`.

2. Install the generate() API for CPU

```
pip install numpy
pip install --pre onnxruntime-genai
```

3. Run the model

Run the model with [phi3-qa.py](https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py).

```bash
curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4
```

Once the script has loaded the model, it will ask you for input in a loop, streaming the output as it is produced by the model. For example:

```bash
Input: Tell me a joke about generative AI

Output: Why did the generative AI go to school?

To improve its "creativity" algorithm!


This joke plays on the double meaning of "creativity" in the context of AI. Generative AI is often associated with its ability to produce creative content, but in this joke, it's humorously suggested that the AI is going to school to enhance its creative skills, as if it were a human student.
```