Adding Phi3 Blog. #20435

Merged: 19 commits, merged on Apr 23, 2024
Changes from 3 commits
18 changes: 15 additions & 3 deletions src/routes/blogs/+page.svelte
@@ -10,6 +10,7 @@
import LlamaImage from '../../images/blogs/accelerating-llama-2/Figure1-LLaMA-2-7B-E2E-Throughput.png';
import SDXLTurboImage from '../../images/blogs/sdxl_blog_thumbnail.png';
import Phi2Image from '../../routes/blogs/accelerating-phi-2/Phi2_Int4_TokenGenerationTP.png';
import Phi3Image from '../../routes/blogs/accelerating-phi-3/Phi3_Thumbnail.png';
import { createEventDispatcher } from 'svelte';
import ORT117Thumbnail from '../../images/blogs/ort-1-17-thumbnail.png';
import WebGPUImage from '../../images/blogs/webgpu_blog_thumbnail.jpg';
@@ -41,6 +42,16 @@
dispatch('switchTab', tab);
}
let featuredblog = [
{
title: 'ONNX Runtime supports Phi-3 mini models across platforms and devices',
date: 'April 22nd, 2024',
blurb:
"You can now run Microsoft's latest home-grown Phi-3 models across a huge range of devices and platforms thanks to ONNX Runtime and DirectML.",
link: 'blogs/accelerating-phi-3',
image: Phi3Image,
imgalt:
'Phi-3 + ONNX Runtime with the prompt "Tell me a joke" and Phi-3 answering: "Why don\'t scientists trust atoms?" "Because they make up everything!"'
},
{
title: 'ONNX Runtime Web unleashes generative AI in the browser using WebGPU',
date: 'February 29th, 2024',
@@ -60,16 +71,17 @@
image: ORT117Thumbnail,
imgalt: 'ONNX Runtime 1.17 release logo'
},

];
let blogs = [
{
title: 'Accelerating Phi-2, CodeLlama, Gemma and other Gen AI models with ONNX Runtime',
date: 'February 26th, 2024',
blurb: 'Improvements with ONNX Runtime for inferencing popular Gen AI models.',
link: 'blogs/accelerating-phi-2',
image: Phi2Image,
imgalt: 'Phi2 float16 token generation throughput comparison'
}
];
let blogs = [
},
{
title: 'On-Device Training: Training a model in browser',
date: 'February 6th, 2024',
131 changes: 131 additions & 0 deletions src/routes/blogs/accelerating-phi-3/+page.svx
@@ -0,0 +1,131 @@
---
title: ONNX Runtime supports Phi-3 mini models across platforms and devices
date: '22nd April, 2024'
description: 'Thanks to day one ONNX Runtime and DirectML support, developers can now deploy Phi-3 Mini at scale'
keywords: 'GenAI, LLM, ONNXRuntime, ORT, Phi, DirectML, Windows'
authors:
[
]
authorsLink:
[
]
image: 'Phi3_Thumbnail.png'
url: 'https://onnxruntime.ai/blogs/accelerating-phi-3'
---

You can now run Microsoft's latest home-grown [Phi-3 models](https://aka.ms/phi3blog-april) across a huge range of devices and platforms thanks to ONNX Runtime and DirectML. Today we're proud to announce day 1 support for both flavors of Phi-3, [phi3-mini-4k-instruct](https://aka.ms/phi3-mini-4k-instruct) and [phi3-mini-128k-instruct](https://aka.ms/phi3-mini-128k-instruct). The optimized ONNX models are available at [phi3-mini-4k-instruct-onnx](https://aka.ms/phi3-mini-4k-instruct-onnx) and [phi3-mini-128k-instruct-onnx](https://aka.ms/phi3-mini-128k-instruct-onnx).

Many language models are too large to run locally on most devices, but Phi-3 represents a significant exception to this rule: this small but mighty suite of models achieves comparable performance to models 10 times larger! Phi-3-mini is also the first model in its weight class to support long contexts of up to 128K tokens. To learn more about how Microsoft's strategic data curation and innovative scaling achieved these remarkable results, see [here](https://aka.ms/phi3-tech-report).

## DirectML and ONNX Runtime scale Phi-3 Mini on Windows

By itself, Phi-3 is already small enough to run on many Windows devices, but why stop there? Making Phi-3 even smaller with quantization would dramatically expand the model's reach on Windows, but not all quantization techniques are created equal. We wanted to ensure scalability while also maintaining model accuracy.

Quantizing Phi-3 Mini with Activation-Aware Quantization (AWQ) lets us reap the memory savings of quantization with only a minimal impact on accuracy. AWQ achieves this by identifying the top 1% of salient weights that are necessary for maintaining model accuracy and quantizing the remaining 99% of weights. This leads to much less accuracy loss from quantization with AWQ compared to many other quantization techniques. For more on AWQ, see [here](https://arxiv.org/abs/2306.00978).
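
To make the intuition concrete, here is a small, self-contained NumPy sketch of the salient-weight idea behind AWQ. It is purely illustrative (not the actual AWQ algorithm or the ONNX Runtime implementation), and the fixed per-channel scale, the 1% threshold, and the simple symmetric rounding are all simplifying assumptions.

```python
import numpy as np

def awq_style_quantize(W, X, bits=4, protect_frac=0.01):
    """Illustrative sketch: protect the ~1% most activation-salient
    input channels with a scale before round-to-nearest quantization."""
    # Salience proxy: average activation magnitude per input channel.
    salience = np.abs(X).mean(axis=0)                   # shape: (in_features,)
    n_protect = max(1, int(protect_frac * W.shape[1]))
    salient = np.argsort(salience)[-n_protect:]         # top ~1% channels

    # Scale salient channels up so they lose less precision when rounded.
    scale = np.ones(W.shape[1])
    scale[salient] = 2.0                                # simplified per-channel scale
    W_scaled = W * scale

    # Plain symmetric round-to-nearest int4 quantization of the scaled weights.
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(W_scaled).max() / qmax
    W_q = np.clip(np.round(W_scaled / step), -qmax - 1, qmax)

    # Dequantize and undo the channel scale (folded into the activations in practice).
    return (W_q * step) / scale

W = np.random.randn(16, 64)      # (out_features, in_features)
X = np.random.randn(128, 64)     # calibration activations
W_hat = awq_style_quantize(W, X)
print("mean abs error:", np.abs(W - W_hat).mean())
```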

Every GPU that supports DirectX 12 on Windows can run DirectML, regardless of whether it's an AMD, Intel, or NVIDIA GPU. DirectML and ONNX Runtime now support INT4 AWQ, which means developers can run and deploy this quantized version of Phi-3 across hundreds of millions of Windows devices!
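
If you want to confirm that the DirectML execution provider is visible to ONNX Runtime before loading an INT4 model, a quick check with the standard Python API looks roughly like this. The model filename is a placeholder, and the snippet assumes the onnxruntime-directml package is installed on Windows.

```python
import onnxruntime as ort

# DirectML surfaces as 'DmlExecutionProvider' when onnxruntime-directml is installed.
available = ort.get_available_providers()
print(available)

if "DmlExecutionProvider" in available:
    # Placeholder filename: point this at your INT4 AWQ Phi-3 ONNX model.
    session = ort.InferenceSession(
        "phi-3-mini-4k-instruct-int4-awq.onnx",
        providers=["DmlExecutionProvider", "CPUExecutionProvider"],
    )
    print("Session is running on:", session.get_providers())
```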

We're working with our hardware vendor partners to provide driver updates that will further improve performance in the coming weeks. Attend our [Build Talk](https://build.microsoft.com/en-US/sessions/65c11f47-56d8-442b-ae52-48df62b7b542?source=sessions) in late May to learn more!

See the Performance Metrics section below for detailed performance numbers.

## ONNX Runtime for server and edge scenarios

For Linux developers and beyond, ONNX Runtime with CUDA is a great solution that supports a wide range of NVIDIA GPUs, including both consumer and data center GPUs. Phi-3 Mini-128K-Instruct performs better with ONNX Runtime and CUDA than with PyTorch for all batch size and prompt length combinations.

For FP16 CUDA and INT4 CUDA, ORT performs up to 5X faster and up to 9X faster than PyTorch, respectively. Phi-3 Mini-128K-Instruct is currently not supported by Llama.cpp.

For FP16 and INT4 CUDA, Phi-3 Mini-4K-Instruct with ORT performs up to 5X and up to 10X faster than PyTorch, respectively. Phi-3 Mini-4K-Instruct is also up to 3X faster than Llama.cpp for large sequence lengths.

In addition to supporting both Phi-3 Mini models on various GPUs, ONNX Runtime can help run these models on mobile, Windows, and Mac CPUs, making it a truly cross-platform framework. ONNX Runtime also supports quantization techniques like RTN to enable these models to run across many different types of hardware.

ONNX Runtime Mobile empowers developers to perform on-device inference with AI models on mobile and edge devices. By removing client-server communication, ORT Mobile provides privacy protection and incurs no server cost. Using RTN INT4 quantization, we significantly reduce the size of the state-of-the-art Phi-3 Mini models and can run both on a Samsung Galaxy S21 at a moderate speed. When applying RTN INT4 quantization, there is a tuning parameter for the int4 accuracy level. This parameter specifies the minimum accuracy level required for the activations of the MatMul in int4 quantization, balancing performance and accuracy trade-offs. Two versions of RTN-quantized models have been released: one with int4_accuracy_level=1, optimized for accuracy, and one with int4_accuracy_level=4, optimized for performance. If you prefer better performance with a slight trade-off in accuracy, we recommend using the model with int4_accuracy_level=4.
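
For readers unfamiliar with RTN, the following NumPy sketch shows what round-to-nearest block-wise INT4 quantization does to a weight tensor. It is a conceptual illustration only, not the ONNX Runtime implementation; the block size of 32 is an arbitrary choice for the example, and the accuracy-level knob described above (which governs the precision of the MatMul activations) is not modeled here.

```python
import numpy as np

def rtn_int4_blockwise(w, block_size=32):
    """Conceptual RTN: quantize each block of weights to int4 with its own scale."""
    qmax = 7                                   # symmetric int4 range is [-8, 7]
    w = w.reshape(-1, block_size)              # one scale per block
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(w / scales), -8, qmax).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = rtn_int4_blockwise(w)
w_hat = dequantize(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```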

Whether it's Windows, Linux, Android, or Mac, there's a path to infer models efficiently with ONNX Runtime!

## Try the ONNX Runtime Generate() API

We are pleased to announce our new Generate() API, which makes it easier to run the Phi-3 models across a range of devices, platforms, and EP backends by wrapping several aspects of generative AI inferencing, such as tokenization, the generation loop, and sampling.

This API makes it easy to drag and drop LLMs straight into your app. To run the early version of these models with ONNX, follow the steps [here](http://aka.ms/generate-tutorial).

Example:

- python model-qa.py -m /***YourModelPath***/onnx/cpu_and_mobile/phi-3-mini-4k-instruct-int4-cpu -k 40 -p 0.95 -t 0.8 -r 1.0

- Input: `<|user|>Tell me a joke<|end|><|assistant|>`

- Output: Why don't scientists trust atoms? Because they make up everything!

- This joke plays on the double meaning of "make up." In science, atoms are the fundamental building blocks of matter, literally making up everything. However, in a colloquial sense, "to make up" can mean to fabricate or lie, hence the humor.
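
If you prefer calling the API directly rather than using the sample script, a minimal Python sketch with the onnxruntime-genai package might look like the following. The model path and chat-template markers are placeholders, and method names may differ between early versions of the package, so treat this as a sketch and follow the tutorial linked above for the authoritative steps.

```python
import onnxruntime_genai as og

# Placeholder path: point this at the downloaded Phi-3 Mini ONNX folder.
model = og.Model("phi-3-mini-4k-instruct-int4-cpu")
tokenizer = og.Tokenizer(model)

# Phi-3 style chat prompt (template markers are an assumption; see the model card).
prompt = "<|user|>\nTell me a joke<|end|>\n<|assistant|>"
input_tokens = tokenizer.encode(prompt)

# Same sampling settings as the command-line example above (-k, -p, -t).
params = og.GeneratorParams(model)
params.set_search_options(max_length=256, top_k=40, top_p=0.95, temperature=0.8)
params.input_ids = input_tokens

# Generate and decode the full response in one call.
output_tokens = model.generate(params)
print(tokenizer.decode(output_tokens[0]))
```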

Please watch this space for more updates on AMD and additional optimizations with ORT 1.18. Also, check out our [Build Talk](https://build.microsoft.com/en-US/sessions/e6d21a49-2efb-4a39-8c26-f6eef1410c7a?source=sessions) in late May to learn more about this API!

## Performance Metrics

### DirectML:
<div class="grid lg:grid-cols-2 gap-4 grid-cols-1">
<div class="col-span-1">
We measured ONNX Runtime + DirectML performance of Phi-3 Mini (4k sequence length) quantized with AWQ and with a block size of 128 on Windows. Our test machine had an NVIDIA GeForce RTX 4090 GPU and an Intel Core i9-13900K CPU. As you can see in the table, DirectML offers high token throughput even at longer prompts and generation lengths.
<br/><br/>
DirectML not only lets developers achieve great performance, it also lets them deploy models across the entire Windows ecosystem, with support from AMD, Intel, and NVIDIA. Best of all, AWQ means that developers get this scale while also maintaining high model accuracy.
<br/><br/>
Stay tuned for further performance improvements in the coming weeks, thanks to optimized drivers from our hardware partners and additional updates to the ONNX Runtime Generate() API.

</div>
<div class="col-span-1">

<table>
<tr><th>Prompt Length (tokens)</th><th>Generation Length (tokens)</th><th>Wall-Clock Throughput (tokens/s)</th></tr>
<tr><td>16</td><td>256</td><td>266.65</td></tr>
<tr><td>16</td><td>512</td><td>251.63</td></tr>
<tr><td>16</td><td>1024</td><td>238.87</td></tr>
<tr><td>16</td><td>2048</td><td>217.5</td></tr>
<tr><td>32</td><td>256</td><td>278.53</td></tr>
<tr><td>32</td><td>512</td><td>259.73</td></tr>
<tr><td>32</td><td>1024</td><td>241.72</td></tr>
<tr><td>32</td><td>2048</td><td>219.3</td></tr>
<tr><td>64</td><td>256</td><td>308.26</td></tr>
<tr><td>64</td><td>512</td><td>272.47</td></tr>
<tr><td>64</td><td>1024</td><td>245.67</td></tr>
<tr><td>64</td><td>2048</td><td>220.55</td></tr>
</table>

</div>
</div>

### CUDA:

The chart below shows the improvement in average throughput of the first 256 tokens generated (tokens/s) for the Phi-3 Mini 128K Instruct ONNX model. The comparisons are for FP16 and INT4 precisions on CUDA, as measured on 1 A100 80GB GPU (SKU: [Standard_ND96amsr_A100_v4](https://learn.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series)).

<div class="m-auto w50">
<img src="./Phi3-Mini-128K-FP16CUDA.png" alt="Average throughput of Phi-3 Mini 128K Instruct ONNX model.">

<i>Note: PyTorch Compile and Llama.cpp do not currently support the Phi-3 Mini 128K instruct model.</i>
</div>
<br/>


The charts below show the improvement in average throughput of the first 256 tokens generated (tokens/s) for the Phi-3 Mini 4K Instruct ONNX model. The comparisons are for FP16 and INT4 precisions on CUDA, as measured on 1 A100 80GB GPU (SKU: Standard_ND96amsr_A100_v4).
<div class="grid grid-cols-1 lg:grid-cols-2 gap-4">
<img class="m-auto" src="./Phi3-4k-Int4CUDA.png" alt="Average throughput of int4 Phi-3 Mini 4K Instruct ONNX model.">

<img class="m-auto" src="./Phi3-4k-FP16CUDA.png" alt="Average throughput of fp16 Phi-3 Mini 4K Instruct ONNX model.">

</div>
<i>Pip packages: torch 2.2.0, triton 2.2.0, onnxruntime-gpu 1.18.0, transformers 4.39.0, bitsandbytes 0.42.0</i>
<br/>
<br/>
Performance is improved across CPU and other devices as well.

## Safety

Safety metrics and RAI align with the base Phi-3 models. See [here](https://aka.ms/phi3blog-april) for more details.

## Try ONNX Runtime for Phi-3

This blog post introduces how ONNX Runtime and DirectML optimize the Phi-3 model. We've included instructions for running Phi-3 across Windows and other platforms, as well as early benchmarking results. Further improvements and perf optimizations are underway, so stay tuned for the ONNX Runtime 1.18 [release](https://github.com/microsoft/onnxruntime/releases) in early May!

We encourage you to try out Phi-3 and share your feedback in the [ONNX Runtime](https://github.com/microsoft/onnxruntime) GitHub repository!
2 changes: 1 addition & 1 deletion src/routes/blogs/blog-post-featured.svelte
@@ -32,7 +32,7 @@
<div class="card-body">
<h2 class="card-title">{title}</h2>
<p>{description}</p>
<img src={image} alt={imgalt} />
<img class="rounded" src={image} alt={imgalt} />
<div class="text-right text-blue-500">
{date}
</div>
5 changes: 5 additions & 0 deletions src/routes/blogs/github-markdown-light.css
@@ -6,6 +6,11 @@ ul {
width: 35em;
}
/*light*/
@media (max-width: 768px) {
.w50 {
width: 75%;
}
}

.markdown-body {
-ms-text-size-adjust: 100%;
6 changes: 5 additions & 1 deletion src/routes/blogs/post.svelte
@@ -68,7 +68,11 @@
<article class="">
<h1 class="text-5xl pb-2">{title}</h1>
<p class="text-neutral">
By:
{#if authors.length === 0}
<br/>
{:else}
<p>By:</p>
{/if}
{#each authors as author, i}
<a href={authorsLink[i]} class="text-blue-500">{author}</a>{i + 1 === authors.length
? ''