Revise tutorial to separate out platforms #397

Merged 3 commits on May 4, 2024
examples/python/phi-3-tutorial.md (125 additions, 56 deletions)
# Run the Phi-3 Mini models with the ONNX Runtime generate() API

## Steps
1. [Setup](#setup)
2. [Choose your platform](#choose-your-platform)
3. [Run with DirectML](#run-with-directml)
4. [Run with NVIDIA CUDA](#run-with-nvidia-cuda)
5. [Run on CPU](#run-on-cpu)

## Introduction

There are two Phi-3 mini models to choose from: the short (4k) context version or the long (128k) context version. The long context version can accept much longer prompts and produce longer output text, but it does consume more memory.

The Phi-3 ONNX models are hosted on HuggingFace: [short](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx) and [long](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx).

This tutorial downloads and runs the short context model. If you would like to use the long context model, change the `4k` to `128k` in the instructions below.

## Setup

1. Install the git Large File Storage (LFS) extension

HuggingFace uses `git` for version control. To download the ONNX models, you need the `git lfs` extension installed, if you do not already have it.

* Windows: `winget install -e --id GitHub.GitLFS` (If you don't have winget, download and run the `exe` from the [official source](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage?platform=windows))
* Linux: `apt-get install git-lfs`
* MacOS: `brew install git-lfs`

Then run `git lfs install`

2. Install the HuggingFace CLI

```bash
pip install huggingface-hub[cli]
```
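
Before moving on, you can optionally sanity-check that both tools are available. These checks are not part of the tutorial's own steps; they just confirm the installs succeeded:

```bash
# Optional sanity checks (not required by the tutorial)
git lfs version        # prints the installed git-lfs version
huggingface-cli --help # lists the available CLI commands
```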

## Choose your platform

Are you on a Windows machine with GPU?
* I don't know → Review [this guide](https://www.microsoft.com/en-us/windows/learning-center/how-to-check-gpu) to see whether you have a GPU in your Windows machine.
* Yes → Follow the instructions for [DirectML](#run-with-directml).
* No → Do you have an NVIDIA GPU?
* I don't know → Review [this guide](https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#verify-you-have-a-cuda-capable-gpu) to see whether you have a CUDA-capable GPU.
* Yes → Follow the instructions for [NVIDIA CUDA GPU](#run-with-nvidia-cuda).
* No → Follow the instructions for [CPU](#run-on-cpu).

**Note: Only one package and one model are required, based on your hardware. Only follow the steps in one of the following sections.**

## Run with DirectML

1. Download the model

```bash
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include directml/* --local-dir .
```

This command downloads the model into a folder called `directml`.
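
If you want the long (128k) context model instead, the same download pattern should work against the 128k repository. The exact subfolder layout of the 128k repository is an assumption here; check its file listing on HuggingFace if the path differs:

```bash
# Long-context variant (assumes the 128k repo mirrors the 4k repo's directml/ layout)
huggingface-cli download microsoft/Phi-3-mini-128k-instruct-onnx --include directml/* --local-dir .
```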


2. Install the generate() API

```
pip install numpy
pip install --pre onnxruntime-genai-directml
```

You should now see `onnxruntime-genai-directml` in your `pip list`.
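
For example, the following optional check filters the package list on Windows; any shell-specific filter works equally well:

```
pip list | findstr onnxruntime
```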

3. Run the model

Run the model with [phi3-qa.py](https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py).

```cmd
curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m directml\directml-int4-awq-block-128
```

Once the script has loaded the model, it will ask you for input in a loop, streaming the output as it is produced by the model. For example:

```bash
Input: Tell me a joke about GPUs

Certainly! Here's a light-hearted joke about GPUs:

Why did the GPU go to school? Because it wanted to improve its "processing power"!

This joke plays on the double meaning of "processing power," referring both to the computational abilities of a GPU and the idea of a student wanting to improve their academic skills.
```
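
Under the hood, phi3-qa.py is a thin wrapper around the generate() API's tokenize-and-generate loop. The sketch below shows roughly what that loop looks like if you call the API directly; it assumes the onnxruntime-genai Python API at the time of writing (names such as `GeneratorParams` and `Generator`) and the Phi-3 chat template used by the example script, so treat it as illustrative rather than a drop-in replacement. The same pattern applies to the CUDA and CPU models if you pass their model folders instead.

```python
# Illustrative sketch of the loop phi3-qa.py runs; API names assume the
# onnxruntime-genai Python package available at the time of writing.
import onnxruntime_genai as og

model = og.Model("directml/directml-int4-awq-block-128")  # or the cuda/cpu model folder
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# Phi-3 chat template, as used by the example script
prompt = "<|user|>\nTell me a joke about GPUs <|end|>\n<|assistant|>"

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)
params.input_ids = tokenizer.encode(prompt)

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    # Decode and print each token as soon as it is generated
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```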

## Run with NVIDIA CUDA

1. Download the model

```bash
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .
```

This command downloads the model into a folder called `cuda`.

2. Install the generate() API

```
pip install numpy
pip install --pre onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/
```

3. Run the model

Run the model with [phi3-qa.py](https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py).

```bash
curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m cuda/cuda-int4-rtn-block-32
```

Once the script has loaded the model, it will ask you for input in a loop, streaming the output as it is produced by the model. For example:

```bash
Input: Tell me a joke about creative writing

Output: Why don't writers ever get lost? Because they always follow the plot!
```
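
The script also accepts generation options on the command line; for example, `-m` points at the model folder and `-l` sets the length of output to generate. The flag below is taken from the example script; run `python phi3-qa.py --help` to see the full set for your copy:

```bash
python phi3-qa.py -m cuda/cuda-int4-rtn-block-32 -l 2048
```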

## Run on CPU

1. Download the model

```bash
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
```

This command downloads the model into a folder called `cpu_and_mobile`.

2. Install the generate() API for CPU

```
pip install numpy
pip install --pre onnxruntime-genai
```

3. Run the model

Run the model with [phi3-qa.py](https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py).

```bash
curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4
```

Once the script has loaded the model, it will ask you for input in a loop, streaming the output as it is produced by the model. For example:

```bash
Input: Tell me a joke about generative AI

Output: Why did the generative AI go to school?

To improve its "creativity" algorithm!


This joke plays on the double meaning of "creativity" in the context of AI. Generative AI is often associated with its ability to produce creative content, but in this joke, it's humorously suggested that the AI is going to school to enhance its creative skills, as if it were a human student.
```