diff --git a/src/images/blogs/accelerating-phi-3-medium-thumbnail.png b/src/images/blogs/accelerating-phi-3-medium-thumbnail.png
new file mode 100644
index 0000000000000..dcfdcc599de88
Binary files /dev/null and b/src/images/blogs/accelerating-phi-3-medium-thumbnail.png differ
diff --git a/src/routes/blogs/+page.svelte b/src/routes/blogs/+page.svelte
index 93f95d2116059..f793196483d7a 100644
--- a/src/routes/blogs/+page.svelte
+++ b/src/routes/blogs/+page.svelte
@@ -16,6 +16,7 @@
 	import WebGPUImage from '../../images/blogs/webgpu_blog_thumbnail.jpg';
 	import WebTrainingImage from '../../images/blogs/webtraining_blog_thumbnail.png';
 	import Phi3OnDeviceImage from '../../images/blogs/phi-3-on-device_blog_thumbnail.png';
+	import Phi3SmallMediumImage from '../../images/blogs/accelerating-phi-3-medium-thumbnail.png';
 	onMount(() => {
 		anime({
 			targets: '.border-primary',
@@ -43,6 +44,16 @@
 		dispatch('switchTab', tab);
 	}
 	let featuredblog = [
+		{
+			title: 'Phi-3 Small and Medium Models are now Optimized with ONNX Runtime and DirectML',
+			date: 'May 21st, 2024',
+			blurb:
+				"You can now run the Phi-3 medium and small models on the device of your choice.",
+			link: 'blogs/accelerating-phi-3-small-medium',
+			image: Phi3SmallMediumImage,
+			imgalt:
+				'Chart comparing model size (in GB) of ONNX Phi-3-medium between PyTorch and ONNX Runtime'
+		},
 		{
 			title: 'Enjoy the Power of Phi-3 with ONNX Runtime on your device',
 			date: 'May 20th, 2024',
@@ -62,7 +73,9 @@
 			image: Phi3Image,
 			imgalt:
 				'Phi-3 + ONNX Runtime with the prompt "Tell me a joke" and Phi-3 answering: "Why don\'t scientists trust atoms?" "Because they make up everything!"'
-		},
+		}
+	];
+	let blogs = [
 		{
 			title: 'ONNX Runtime Web unleashes generative AI in the browser using WebGPU',
 			date: 'February 29th, 2024',
@@ -72,9 +85,7 @@
 			image: WebGPUImage,
 			imgalt:
 				'Comparison of ONNX Runtime Web with WebGPU EP on GPU vs. WASM EP on CPU for segment anything example'
-		}
-	];
-	let blogs = [
+		},
 		{
 			title: 'ONNX Runtime 1.17: CUDA 12 support, Phi-2 optimizations, WebGPU, and more!',
 			date: 'February 28th, 2024',
diff --git a/src/routes/blogs/accelerating-phi-3-small-medium/+page.svx b/src/routes/blogs/accelerating-phi-3-small-medium/+page.svx
new file mode 100644
index 0000000000000..17670c8a12e1e
--- /dev/null
+++ b/src/routes/blogs/accelerating-phi-3-small-medium/+page.svx
@@ -0,0 +1,116 @@
+---
+title: 'Phi-3 Small and Medium Models are now optimized with ONNX Runtime and DirectML'
+date: '21st May, 2024'
+description: 'Introducing optimized ONNX variants of the new Phi-3 models'
+keywords: 'ORT, ONNX Runtime, ONNX, machine learning, deep learning, phi 3, phi-3, phi-3-small, phi-3-medium, phi 3 small, phi 3 medium, phi-3 small, phi-3 medium'
+authors:
+  [
+  ]
+authorsLink:
+  [
+  ]
+image: ''
+url: 'https://onnxruntime.ai/blogs/accelerating-phi-3-small-medium'
+---
+
+# Phi-3 Small and Medium Models are now optimized with ONNX Runtime and DirectML
+
+We previously shared optimization support for [Phi-3 mini](https://onnxruntime.ai/blogs/accelerating-phi-3). We now introduce optimized [ONNX](https://onnx.ai/) variants of the [newly introduced Phi-3 models](https://aka.ms/Phi-3Build2024). The new **Phi-3-Small** and **Phi-3-Medium** outperform language models of the same size as well as those that are much larger. Phi-3-small beats GPT-3.5T across a variety of language, reasoning, coding, and math benchmarks. The new models empower developers with a building block for generative AI applications that require strong reasoning under limited compute and in latency-bound scenarios.
+
+**Phi-3-Medium** is a 14B parameter language model, available in short (4K) and long (128K) context variants. You can now find the **Phi-3-medium-4k-instruct-onnx** and **Phi-3-medium-128k-instruct-onnx** optimized models with **ONNX Runtime and DML** on Hugging Face! Check the [Phi-3 Collection](https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3) for the ONNX models.
+
+We have also added support for **Phi-3 Small** models on CUDA-capable NVIDIA GPUs, with other variants coming soon. This includes support for the block sparse attention kernel in the newly released [ONNX Runtime 1.18](https://github.com/microsoft/onnxruntime/releases/tag/v1.18.0) via the ONNX Runtime generate() API.
+
+**ONNX Runtime 1.18** also adds new features such as improved 4-bit quantization support, improved MultiHeadAttention performance on CPU, and ONNX Runtime generate() API enhancements that make it easier and more efficient to run models across devices.
+
+We are also happy to share that the new optimized ONNX Phi-3-mini for web deployment is available now: you can run Phi-3-mini-4k entirely in the browser! Please check out the model [here](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx-web). What’s more, we have updated the optimized ONNX version for [CPU and mobile](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/tree/main/cpu_and_mobile) with even better performance. And don’t miss [this blog](https://onnxruntime.ai/blogs/phi-3-on-device) about how to run Phi-3 on your phone and in the browser.
+
+## How to run Phi-3-Medium with ONNX Runtime
+
+You can use the ONNX Runtime generate() API to run these models seamlessly on any hardware. See the detailed instructions [here](https://aka.ms/run-phi3-med-onnx). You can also run the [chat app](https://github.com/microsoft/onnxruntime-genai/tree/main/examples/chat_app) locally.
+
+Only one package and model combination is required, based on your hardware.
+
+## 3 easy steps to run
+
+1. Download the model
+2. Install the generate() API
+3. Run the model with [phi3-qa.py](https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py)
+
+Only execute the steps needed for your hardware; a minimal end-to-end sketch is shown below.
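+
+As an illustrative example, here is what the three steps can look like in Python, loosely following the [phi3-qa.py](https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py) example. The model path, prompt, and search options are placeholders to adjust for your setup, and the generate() package you install determines the execution provider: `onnxruntime-genai` for CPU, `onnxruntime-genai-cuda` for CUDA GPUs, or `onnxruntime-genai-directml` for DirectML.
+
+```python
+# 1. Download the model, e.g. with the Hugging Face CLI (DirectML variant shown):
+#      huggingface-cli download microsoft/Phi-3-medium-4k-instruct-onnx-directml --local-dir ./phi3-medium-dml
+# 2. Install the generate() API package for your hardware, e.g.:
+#      pip install onnxruntime-genai-directml
+# 3. Run the model:
+import onnxruntime_genai as og
+
+# Placeholder path -- point this at the downloaded folder containing the ONNX model files.
+model = og.Model('./phi3-medium-dml')
+tokenizer = og.Tokenizer(model)
+tokenizer_stream = tokenizer.create_stream()
+
+# Wrap the user input in the Phi-3 chat template.
+prompt = '<|user|>\nWhat is the golden ratio?<|end|>\n<|assistant|>'
+
+params = og.GeneratorParams(model)
+params.set_search_options(max_length=512)
+params.input_ids = tokenizer.encode(prompt)
+
+# Stream tokens to the console as they are generated.
+generator = og.Generator(model, params)
+while not generator.is_done():
+    generator.compute_logits()
+    generator.generate_next_token()
+    print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end='', flush=True)
+```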
+
+## Optimized for your Platform
+
+The mapping below shows which model to use based on your hardware.
+
+Phi-3 Small ONNX Models:
+- [microsoft/Phi-3-small-8k-instruct-onnx-cuda](https://huggingface.co/microsoft/Phi-3-small-8k-instruct-onnx-cuda)
+- [microsoft/Phi-3-small-128k-instruct-onnx-cuda](https://huggingface.co/microsoft/Phi-3-small-128k-instruct-onnx-cuda)
+
+Phi-3 Medium 4K ONNX Models:
+- [microsoft/Phi-3-medium-4k-instruct-onnx-cpu](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-onnx-cpu)
+- [microsoft/Phi-3-medium-4k-instruct-onnx-cuda](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-onnx-cuda)
+- [microsoft/Phi-3-medium-4k-instruct-onnx-directml](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-onnx-directml)
+
+Phi-3 Medium 128K ONNX Models:
+- [microsoft/Phi-3-medium-128k-instruct-onnx-cpu](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-cpu)
+- [microsoft/Phi-3-medium-128k-instruct-onnx-cuda](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-cuda)
+- [microsoft/Phi-3-medium-128k-instruct-onnx-directml](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-directml)
+
+![Mapping of which model to use based on hardware](./platform-optimization-map.png)
+
+## Performance
+
+The ONNX Runtime models can run up to 10X faster than the PyTorch variants. Token generation throughput (tokens/sec) is listed below for the different variants.
+
+| Model                           | Batch Size, Prompt Length | Precision and Device                | Token Generation Throughput (tokens/sec) |
+| ------------------------------- | ------------------------- | ----------------------------------- | ---------------------------------------- |
+| **Phi-3 Medium 4K**             |                           |                                     |                                           |
+| Phi-3 Medium 4K 14B ONNX CUDA   | 1, 16                     | FP16 CUDA GPU with ONNX Runtime     | 47.32                                     |
+| Phi-3 Medium 4K 14B ONNX CUDA   | 16, 64                    | FP16 CUDA GPU with ONNX Runtime     | 698.22                                    |
+| Phi-3 Medium 4K 14B ONNX CUDA   | 1, 16                     | INT4 RTN CUDA GPU with ONNX Runtime | 115.68                                    |
+| Phi-3 Medium 4K 14B ONNX CUDA   | 16, 64                    | INT4 RTN CUDA GPU with ONNX Runtime | 339.45                                    |
+| Phi-3 Medium 4K 14B ONNX DML    | 1, 16                     | DML INT4 AWQ with ONNX Runtime      | 72.39                                     |
+| Phi-3 Medium 4K 14B ONNX CPU    | 16, 64                    | INT4 RTN CPU with ONNX Runtime      | 20.77                                     |
+| **Phi-3 Medium 128K**           |                           |                                     |                                           |
+| Phi-3 Medium 128K 14B ONNX CUDA | 1, 16                     | FP16 CUDA GPU with ONNX Runtime     | 46.27                                     |
+| Phi-3 Medium 128K 14B ONNX CUDA | 16, 64                    | FP16 CUDA GPU with ONNX Runtime     | 662.23                                    |
+| Phi-3 Medium 128K 14B ONNX CUDA | 1, 16                     | INT4 RTN CUDA GPU with ONNX Runtime | 108.59                                    |
+| Phi-3 Medium 128K 14B ONNX CUDA | 16, 64                    | INT4 RTN CUDA GPU with ONNX Runtime | 332.57                                    |
+| Phi-3 Medium 128K 14B ONNX DML  | 1, 16                     | DML INT4 AWQ with ONNX Runtime      | 72.26                                     |
+
+| Model                           | Batch Size, Prompt Length | Precision and Device                | Token Generation Throughput (tokens/sec) |
+| ------------------------------- | ------------------------- | ----------------------------------- | ---------------------------------------- |
+| **Phi-3 Small 8K**              |                           |                                     |                                           |
+| Phi-3 Small 8K 7B ONNX CUDA     | 1, 16                     | FP16 CUDA GPU with ONNX Runtime     | 74.62                                     |
+| Phi-3 Small 8K 7B ONNX CUDA     | 16, 64                    | FP16 CUDA GPU with ONNX Runtime     | 1036.93                                   |
+| Phi-3 Small 8K 7B ONNX CUDA     | 1, 16                     | INT4 RTN CUDA GPU with ONNX Runtime | 140.68                                    |
+| Phi-3 Small 8K 7B ONNX CUDA     | 16, 64                    | INT4 RTN CUDA GPU with ONNX Runtime | 582.07                                    |
+| **Phi-3 Small 128K**            |                           |                                     |                                           |
+| Phi-3 Small 128K 7B ONNX CUDA   | 1, 16                     | FP16 CUDA GPU with ONNX Runtime     | 68.26                                     |
+| Phi-3 Small 128K 7B ONNX CUDA   | 16, 64                    | FP16 CUDA GPU with ONNX Runtime     | 577.41                                    |
+| Phi-3 Small 128K 7B ONNX CUDA   | 1, 16                     | INT4 RTN CUDA GPU with ONNX Runtime | 73.60                                     |
+| Phi-3 Small 128K 7B ONNX CUDA   | 16, 64                    | INT4 RTN CUDA GPU with ONNX Runtime | 1008.35                                   |
+
+*Devices:*
+
+- *CUDA: A100 GPU, SKU: Standard_ND96amsr_A100_v4*
+- *DML: NVIDIA GeForce RTX 4080 (Dedicated Mem 16GB / Shared Mem 24GB)*
+- *CPU: Intel(R) Core(TM) i9-10920X CPU @ 3.50GHz*
+
+*Packages:*
+
+- *onnxruntime-gpu: 1.18.0*
+
+## Get started today
+
+To experience optimized [Phi-3](https://aka.ms/Phi-3Build2024) for yourself, you can now easily run these models by following the ONNX Runtime generate() [API instructions](https://aka.ms/run-phi3-med-onnx). To learn more, join the ONNX Runtime, DirectML, and Phi-3 sessions at [Build](https://build.microsoft.com/en-US/sessions?search=ONNX&sortBy=relevance)!
diff --git a/src/routes/blogs/accelerating-phi-3-small-medium/platform-optimization-map.png b/src/routes/blogs/accelerating-phi-3-small-medium/platform-optimization-map.png
new file mode 100644
index 0000000000000..82043e3fd8679
Binary files /dev/null and b/src/routes/blogs/accelerating-phi-3-small-medium/platform-optimization-map.png differ