Commit

Merge branch 'phi3_blog' of github.com:MaanavD/onnxruntime into phi3_blog
MaanavD committed Apr 23, 2024
2 parents 6ac3113 + 3b22937 commit 95514f6
Showing 1 changed file with 12 additions and 10 deletions.
22 changes: 12 additions & 10 deletions src/routes/blogs/accelerating-phi-3/+page.svx
@@ -15,7 +15,7 @@ url: 'https://onnxruntime.ai/blogs/accelerating-phi-3'

You can now run Microsoft's latest home-grown [Phi-3 models](https://aka.ms/phi3blog-april) across a huge range of devices and platforms thanks to ONNX Runtime and DirectML. Today we're proud to announce day 1 support for both flavors of Phi-3, [phi3-mini-4k-instruct](https://aka.ms/phi3-mini-4k-instruct) and [phi3-mini-128k-instruct](https://aka.ms/phi3-mini-128k-instruct). The optimized ONNX models are available at [phi3-mini-4k-instruct-onnx](https://aka.ms/phi3-mini-4k-instruct-onnx) and [phi3-mini-128k-instruct-onnx](https://aka.ms/phi3-mini-128k-instruct-onnx).

- Many language models are too large to run locally on most devices, but Phi-3 represents a significant exception to this rule: this small but mighty suite of models achieves comparable performance to models 10 times larger! Phi-3-mini is also the first model in its weight class to support long contexts of up to 128K tokens. To learn more about how Microsoft's strategic data curation and innovative scaling achieved these remarkable results, see [here](https://aka.ms/phi3-tech-report).
+ Many language models are too large to run locally on most devices, but Phi-3 represents a significant exception to this rule: this small but mighty suite of models achieves comparable performance to models 10 times larger! Phi-3 Mini is also the first model in its weight class to support long contexts of up to 128K tokens. To learn more about how Microsoft's strategic data curation and innovative scaling achieved these remarkable results, see [here](https://aka.ms/phi3-tech-report).

You can easily get started with Phi-3 using our newly introduced ONNX Runtime Generate() API, found [here](https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi-3-tutorial.md)!
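
As a quick illustration, here is a minimal sketch of generating text with the Generate() API in Python, loosely based on the onnxruntime-genai examples. The model path and prompt are placeholders, and exact method names may differ between early versions of the package, so check the linked tutorial for the authoritative steps:

<pre>
import onnxruntime_genai as og

# Path to a downloaded Phi-3 ONNX model folder (placeholder -- point this at your local copy).
model = og.Model("phi3-mini-4k-instruct-onnx/cpu-int4-rtn-block-32")
tokenizer = og.Tokenizer(model)

# Phi-3 chat prompt template (check the model card for the exact format).
prompt = "<|user|>\nWhat is the golden ratio?<|end|>\n<|assistant|>"

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode(prompt)

# Token-by-token generation loop handled by the Generate() API.
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
</pre>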

@@ -42,18 +42,20 @@ ONNX Runtime Mobile empowers developers to perform on-device inference with AI m

For Linux developers and beyond, ONNX Runtime with CUDA is a great solution that supports a wide range of NVIDIA GPUs, including both consumer and data center GPUs. Phi-3 Mini-128K-Instruct performs better with ONNX Runtime on CUDA than with PyTorch for all batch size and prompt length combinations.

- For FP16 CUDA and INT4 CUDA, ORT performs up to 5X faster and up to 9X faster than PyTorch, respectively. Phi-3 Mini-128K-Instruct is currently not supported by Llama.cpp.
+ For FP16 CUDA and INT4 CUDA, Phi-3 Mini-128K-Instruct with ORT performs up to 5X faster and up to 9X faster than PyTorch, respectively. Phi-3 Mini-128K-Instruct is currently not supported by Llama.cpp.

- For FP16 and INT4 CUDA, Phi-3 Mini-4K-Instruct performs with ORT performs up to 5X faster and up to 10X faster than PyTorch, respectively. Phi-3 Mini-4K-Instruct is also up to 3X faster than Llama.cpp for large sequence lengths.
+ For FP16 and INT4 CUDA, Phi-3 Mini-4K-Instruct with ORT performs up to 5X faster and up to 10X faster than PyTorch, respectively. Phi-3 Mini-4K-Instruct is also up to 3X faster than Llama.cpp for large sequence lengths.

In addition to supporting both Phi-3 Mini models on various GPUs, ONNX Runtime can help run these models on mobile, Windows, and Mac CPUs, making it a truly cross-platform framework. ONNX Runtime also supports quantization techniques like RTN to enable these models to run across many different types of hardware.

ONNX Runtime Mobile empowers developers to perform on-device inference with AI models on mobile and edge devices. By removing client-server communication, ORT Mobile provides privacy protection and has zero server cost. Using RTN INT4 quantization, we significantly reduce the size of the state-of-the-art Phi-3 Mini models and can run both of them on a Samsung Galaxy S21 at a moderate speed. When applying RTN INT4 quantization, there is a tuning parameter for the INT4 accuracy level. This parameter specifies the minimum accuracy level for the activations of MatMul in INT4 quantization, balancing performance and accuracy trade-offs. Two versions of RTN quantized models have been released: (1) the model optimized for accuracy with int4_accuracy_level=1 and (2) the model optimized for performance with int4_accuracy_level=4. If you prefer better performance with a slight trade-off in accuracy, we recommend using the model with int4_accuracy_level=4.
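
For illustration, the sketch below shows one way an INT4 RTN model with an explicit accuracy level could be produced using ONNX Runtime's MatMul4BitsQuantizer. The file names and block size are placeholders, and the exact constructor arguments may vary between ONNX Runtime versions, so treat this as a starting point rather than the exact recipe used for the released models:

<pre>
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

# Load the full-precision Phi-3 Mini ONNX model (placeholder path).
model = onnx.load("phi3-mini-4k-instruct-fp32.onnx")

# Blockwise RTN-style INT4 quantization of MatMul weights.
# accuracy_level=4 favors performance; accuracy_level=1 favors accuracy.
quantizer = MatMul4BitsQuantizer(
    model,
    block_size=32,        # illustrative block size
    is_symmetric=True,
    accuracy_level=4,
)
quantizer.process()

# Save the quantized model; external data is needed for multi-GB weights.
quantizer.model.save_model_to_file(
    "phi3-mini-4k-instruct-int4.onnx",
    use_external_data_format=True,
)
</pre>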

Whether it's Windows, Linux, Android, or Mac, there's a path to infer models efficiently with ONNX Runtime!

## Try the ONNX Runtime Generate() API

- We are pleased to announce our new Generate() API, which makes it easier to run the Phi-3 models across a range of devices, platforms, and EP backends by wrapping several aspects of generative AI inferencing.
+ We are pleased to announce our new Generate() API, which makes it easier to run the Phi-3 models across a range of devices, platforms, and EP backends by wrapping several aspects of generative AI inferencing. The Generate() API makes it easy to drag and drop LLMs straight into your app. To run the early version of these models with ONNX, follow the steps [here](http://aka.ms/generate-tutorial).

- This API makes it easy to drag and drop LLMs straight into your app. To run the early version of these models with ONNX, follow the steps [here](http://aka.ms/generate-tutorial).

Example:
<pre>
@@ -79,9 +81,9 @@ Please watch this space for more updates on AMD, and additional optimization wit
<div class="col-span-1">
We measured ONNX Runtime + DirectML performance of Phi-3 Mini (4k sequence length) quantized with AWQ and with a block size of 128 on Windows. Our test machine had an NVIDIA GeForce RTX 4090 GPU and an Intel Core i9-13900K CPU. As you can see in the table, DirectML offers high token throughput even at longer prompts and generation lengths.
<br/><br/>
- DirectML lets developers not only achieve great performance but also lets developers deploy models across the entire Windows ecosystem with support from AMD, Intel and NVIDIA. Best of all, AWQ means that developers get this scale while also maintaining high model accuracy.
+ DirectML lets developers not only achieve great performance but also deploy models across the entire Windows ecosystem with support from AMD, Intel and NVIDIA. Best of all, AWQ means that developers get this scale while also maintaining high model accuracy.
<br/><br/>
- Stay tuned for additional performance improvements in the coming weeks thanks to optimized drivers from our hardware partners, along with additional updates to the ONNX Generate() API.
+ Stay tuned for additional performance improvements in the coming weeks thanks to optimized drivers from our hardware partners and additional updates to the ONNX Generate() API.

</div>
<div class="col-span-1">
@@ -118,7 +120,7 @@ The table below shows improvement on the average throughput of the first 256 tok
<br/>


- The table below shows improvement on the average throughput of the first 256 tokens generated (tps) for Phi-3 Mini 4K Instruct ONNX model. The comparisons are for FP16 and INT4 precisions on CUDA, as measured on 1 A100 80GB GPU (SKU: Standard_ND96amsr_A100_v4).
+ The table below shows the improvement in average throughput of the first 256 tokens generated (tps) for the Phi-3 Mini 4K Instruct ONNX model. The comparisons are for FP16 and INT4 precisions on CUDA, as measured on 1 A100 80GB GPU (SKU: [Standard_ND96amsr_A100_v4](https://learn.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series)).
<div class="grid grid-cols-1 lg:grid-cols-2 gap-4">
<img class="m-auto" src="./Phi3-4k-Int4CUDA.png" alt="Average throughput of int4 Phi-3 Mini 4K Instruct ONNX model.">

@@ -136,6 +138,6 @@ Safety metrics and RAI align with the base Phi-3 models. See [here](https://aka.

## Try ONNX Runtime for Phi-3

- This blog post introduces how ONNX Runtime and DirectML optimize the Phi-3 model. We've included instructions for running Phi-3 across Windows and other platforms, as well as early benchmarking results. Further improvements and perf optimizations are underway, so stay tuned for ONNX Runtime 1.18 [release](https://github.com/microsoft/onnxruntime/releases), Early May!
+ This blog post introduces how ONNX Runtime and DirectML optimize the Phi-3 model. We've included instructions for running Phi-3 across Windows and other platforms, as well as early benchmarking results. Further improvements and perf optimizations are under way, so stay tuned for the ONNX Runtime 1.18 [release](https://github.com/microsoft/onnxruntime/releases) in early May!

- We encourage you to try out Phi3 and share your feedback in the [ONNX Runtime](https://github.com/microsoft/onnxruntime) GitHub repository!
+ We encourage you to try out Phi-3 and share your feedback in the [ONNX Runtime](https://github.com/microsoft/onnxruntime) GitHub repository!
