From eede60f22895fb51bc4abaf6ceeadb60b520435b Mon Sep 17 00:00:00 2001
From: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
Date: Tue, 23 Apr 2024 10:43:48 -0700
Subject: [PATCH 01/12] Update src/routes/blogs/accelerating-phi-3/+page.svx

---
 src/routes/blogs/accelerating-phi-3/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-3/+page.svx b/src/routes/blogs/accelerating-phi-3/+page.svx
index ce205cf38e8fe..777b527884f14 100644
--- a/src/routes/blogs/accelerating-phi-3/+page.svx
+++ b/src/routes/blogs/accelerating-phi-3/+page.svx
@@ -134,4 +134,4 @@ Safety metrics and RAI align with the base Phi-3 models. See [here](https://aka.
 
 This blog post introduces how ONNX Runtime and DirectML optimize the Phi-3 model. We've included instructions for running Phi-3 across Windows and other platforms, as well as early benchmarking results. Further improvements and perf optimizations are underway, so stay tuned for ONNX Runtime 1.18 [release](https://github.com/microsoft/onnxruntime/releases), Early May!
 
-We encourage you to try out Phi3 and share your feedback in the [ONNX Runtime](https://github.com/microsoft/onnxruntime) GitHub repository!
\ No newline at end of file
+We encourage you to try out Phi-3 and share your feedback in the [ONNX Runtime](https://github.com/microsoft/onnxruntime) GitHub repository!
\ No newline at end of file

From e1d5fdc86786d511340b65fa06d3fb3d69ae4651 Mon Sep 17 00:00:00 2001
From: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
Date: Tue, 23 Apr 2024 10:44:00 -0700
Subject: [PATCH 02/12] Update src/routes/blogs/accelerating-phi-3/+page.svx

---
 src/routes/blogs/accelerating-phi-3/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-3/+page.svx b/src/routes/blogs/accelerating-phi-3/+page.svx
index 777b527884f14..6914ffc3bdd4b 100644
--- a/src/routes/blogs/accelerating-phi-3/+page.svx
+++ b/src/routes/blogs/accelerating-phi-3/+page.svx
@@ -132,6 +132,6 @@ Safety metrics and RAI align with the base Phi-3 models. See [here](https://aka.
 
 ## Try ONNX Runtime for Phi3
 
-This blog post introduces how ONNX Runtime and DirectML optimize the Phi-3 model. We've included instructions for running Phi-3 across Windows and other platforms, as well as early benchmarking results. Further improvements and perf optimizations are underway, so stay tuned for ONNX Runtime 1.18 [release](https://github.com/microsoft/onnxruntime/releases), Early May!
+This blog post introduces how ONNX Runtime and DirectML optimize the Phi-3 model. We've included instructions for running Phi-3 across Windows and other platforms, as well as early benchmarking results. Further improvements and perf optimizations are under way, so stay tuned for the ONNX Runtime 1.18 [release](https://github.com/microsoft/onnxruntime/releases) in early May!
 
 We encourage you to try out Phi-3 and share your feedback in the [ONNX Runtime](https://github.com/microsoft/onnxruntime) GitHub repository!
\ No newline at end of file

From 4ba807116962d5a1044bcc48c836274ed69c7cd4 Mon Sep 17 00:00:00 2001
From: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
Date: Tue, 23 Apr 2024 10:44:13 -0700
Subject: [PATCH 03/12] Update src/routes/blogs/accelerating-phi-3/+page.svx

---
 src/routes/blogs/accelerating-phi-3/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-3/+page.svx b/src/routes/blogs/accelerating-phi-3/+page.svx
index 6914ffc3bdd4b..b4f062be788c5 100644
--- a/src/routes/blogs/accelerating-phi-3/+page.svx
+++ b/src/routes/blogs/accelerating-phi-3/+page.svx
@@ -77,7 +77,7 @@ We measured ONNX Runtime + DirectML performance of Phi-3 Mini (4k sequence lengt
 
 DirectML lets developers not only achieve great performance but also lets developers deploy models across the entire Windows ecosystem with support from AMD, Intel and NVIDIA. Best of all, AWQ means that developers get this scale while also maintaining high model accuracy.
 
-Stay tuned for additional performance improvements in the coming weeks thanks to optimized drivers from our hardware partners, along with additional updates to the ONNX Generate() API.
+Stay tuned for additional performance improvements in the coming weeks thanks to optimized drivers from our hardware partners and additional updates to the ONNX Generate() API.
From d058a3bc745b558f730954747a6d25e28588caa5 Mon Sep 17 00:00:00 2001
From: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
Date: Tue, 23 Apr 2024 10:44:19 -0700
Subject: [PATCH 04/12] Update src/routes/blogs/accelerating-phi-3/+page.svx

---
 src/routes/blogs/accelerating-phi-3/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-3/+page.svx b/src/routes/blogs/accelerating-phi-3/+page.svx
index b4f062be788c5..225f04098b400 100644
--- a/src/routes/blogs/accelerating-phi-3/+page.svx
+++ b/src/routes/blogs/accelerating-phi-3/+page.svx
@@ -75,7 +75,7 @@ Please watch this space for more updates on AMD, and additional optimization wit
 
 We measured ONNX Runtime + DirectML performance of Phi-3 Mini (4k sequence length) quantized with AWQ and with a block size of 128 on Windows. Our test machine had an NVIDIA GeForce RTX 4090 GPU and an Intel Core i9-13900K CPU. As you can see in the table, DirectML offers high token throughput even at longer prompts and generation lengths.
 
-DirectML lets developers not only achieve great performance but also lets developers deploy models across the entire Windows ecosystem with support from AMD, Intel and NVIDIA. Best of all, AWQ means that developers get this scale while also maintaining high model accuracy.
+DirectML lets developers not only achieve great performance but also deploy models across the entire Windows ecosystem with support from AMD, Intel and NVIDIA. Best of all, AWQ means that developers get this scale while also maintaining high model accuracy.
 
 Stay tuned for additional performance improvements in the coming weeks thanks to optimized drivers from our hardware partners and additional updates to the ONNX Generate() API.

From 84ee7c1cdabdf2bb477c4687033499958eafd873 Mon Sep 17 00:00:00 2001
From: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
Date: Tue, 23 Apr 2024 10:44:43 -0700
Subject: [PATCH 05/12] Update src/routes/blogs/accelerating-phi-3/+page.svx

---
 src/routes/blogs/accelerating-phi-3/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-3/+page.svx b/src/routes/blogs/accelerating-phi-3/+page.svx
index 225f04098b400..ac2f5917abc40 100644
--- a/src/routes/blogs/accelerating-phi-3/+page.svx
+++ b/src/routes/blogs/accelerating-phi-3/+page.svx
@@ -37,7 +37,7 @@ For Linux developers and beyond, ONNX Runtime with CUDA is a great solution that
 
 For FP16 CUDA and INT4 CUDA, ORT performs up to 5X faster and up to 9X faster than PyTorch, respectively. Phi-3 Mini-128K-Instruct is currently not supported by Llama.cpp.
 
-For FP16 and INT4 CUDA, Phi-3 Mini-4K-Instruct performs with ORT performs up to 5X faster and up to 10X faster than PyTorch, respectively. Phi-3 Mini-4K-Instruct is also up to 3X faster than Llama.cpp for large sequence lengths.
+For FP16 and INT4 CUDA, Phi-3 Mini-4K-Instruct with ORT performs up to 5X faster and up to 10X faster than PyTorch, respectively. Phi-3 Mini-4K-Instruct is also up to 3X faster than Llama.cpp for large sequence lengths.
 
 In addition to supporting both Phi-3 Mini models on various GPUs, ONNX Runtime can help run these models on Mobile, Windows, and Mac CPUs, making it a truly cross-platform framework. ONNX Runtime also supports quantization techniques like RTN to enable these models to run across many different hardware.

From 58be8bf3df961e0e0d9fe2dc186d897bd8fbd010 Mon Sep 17 00:00:00 2001
From: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
Date: Tue, 23 Apr 2024 10:44:51 -0700
Subject: [PATCH 06/12] Update src/routes/blogs/accelerating-phi-3/+page.svx

---
 src/routes/blogs/accelerating-phi-3/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-3/+page.svx b/src/routes/blogs/accelerating-phi-3/+page.svx
index ac2f5917abc40..a142088e919d8 100644
--- a/src/routes/blogs/accelerating-phi-3/+page.svx
+++ b/src/routes/blogs/accelerating-phi-3/+page.svx
@@ -35,7 +35,7 @@ See below for dedicated performance numbers.
 
 For Linux developers and beyond, ONNX Runtime with CUDA is a great solution that supports a wide range of NVIDIA GPUs, including both consumer and data center GPUs. Phi-3 Mini-128K-Instruct performs better for ONNX Runtime with CUDA than PyTorch for all batch size, prompt length combinations.
 
-For FP16 CUDA and INT4 CUDA, ORT performs up to 5X faster and up to 9X faster than PyTorch, respectively. Phi-3 Mini-128K-Instruct is currently not supported by Llama.cpp.
+For FP16 CUDA and INT4 CUDA, Phi-3 Mini-128K-Instruct with ORT performs up to 5X faster and up to 9X faster than PyTorch, respectively. Phi-3 Mini-128K-Instruct is currently not supported by Llama.cpp.
 
 For FP16 and INT4 CUDA, Phi-3 Mini-4K-Instruct with ORT performs up to 5X faster and up to 10X faster than PyTorch, respectively. Phi-3 Mini-4K-Instruct is also up to 3X faster than Llama.cpp for large sequence lengths.
From 797457fa6034e496e63aac0c5ff2297e9cc2bb6b Mon Sep 17 00:00:00 2001
From: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
Date: Tue, 23 Apr 2024 10:44:58 -0700
Subject: [PATCH 07/12] Update src/routes/blogs/accelerating-phi-3/+page.svx

---
 src/routes/blogs/accelerating-phi-3/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-3/+page.svx b/src/routes/blogs/accelerating-phi-3/+page.svx
index a142088e919d8..bc96ec71071b3 100644
--- a/src/routes/blogs/accelerating-phi-3/+page.svx
+++ b/src/routes/blogs/accelerating-phi-3/+page.svx
@@ -15,7 +15,7 @@ url: 'https://onnxruntime.ai/blogs/accelerating-phi-3'
 
 You can now run Microsoft's latest home-grown [Phi-3 models](https://aka.ms/phi3blog-april) across a huge range of devices and platforms thanks to ONNX Runtime and DirectML. Today we're proud to announce day 1 support for both flavors of Phi-3, [phi3-mini-4k-instruct](https://aka.ms/phi3-mini-4k-instruct) and [phi3-mini-128k-instruct](https://aka.ms/phi3-mini-128k-instruct). The optimized ONNX models are available at [phi3-mini-4k-instruct-onnx](https://aka.ms/phi3-mini-4k-instruct-onnx) and [phi3-mini-128k-instruct-onnx](https://aka.ms/phi3-mini-128k-instruct-onnx).
 
-Many language models are too large to run locally on most devices, but Phi-3 represents a significant exception to this rule: this small but mighty suite of models achieves comparable performance to models 10 times larger! Phi-3-mini is also the first model in its weight class to support long contexts of up to 128K tokens. To learn more about how Microsoft's strategic data curation and innovative scaling achieved these remarkable results, see [here](https://aka.ms/phi3-tech-report).
+Many language models are too large to run locally on most devices, but Phi-3 represents a significant exception to this rule: this small but mighty suite of models achieves comparable performance to models 10 times larger! Phi-3 Mini is also the first model in its weight class to support long contexts of up to 128K tokens. To learn more about how Microsoft's strategic data curation and innovative scaling achieved these remarkable results, see [here](https://aka.ms/phi3-tech-report).
 
 You can easily get started with Phi-3 with our newly introduced ONNX runtime Generate() API, found [here](https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi-3-tutorial.md)!

From 5aedfe14e16a0408f1f072aaca17688e6598a233 Mon Sep 17 00:00:00 2001
From: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
Date: Tue, 23 Apr 2024 10:45:02 -0700
Subject: [PATCH 08/12] Update src/routes/blogs/accelerating-phi-3/+page.svx

---
 src/routes/blogs/accelerating-phi-3/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-3/+page.svx b/src/routes/blogs/accelerating-phi-3/+page.svx
index bc96ec71071b3..168e4c630224e 100644
--- a/src/routes/blogs/accelerating-phi-3/+page.svx
+++ b/src/routes/blogs/accelerating-phi-3/+page.svx
@@ -39,7 +39,7 @@ For FP16 CUDA and INT4 CUDA, Phi-3 Mini-128K-Instruct with ORT performs up to 5X
 
 For FP16 and INT4 CUDA, Phi-3 Mini-4K-Instruct with ORT performs up to 5X faster and up to 10X faster than PyTorch, respectively. Phi-3 Mini-4K-Instruct is also up to 3X faster than Llama.cpp for large sequence lengths.
 
-In addition to supporting both Phi-3 Mini models on various GPUs, ONNX Runtime can help run these models on Mobile, Windows, and Mac CPUs, making it a truly cross-platform framework. ONNX Runtime also supports quantization techniques like RTN to enable these models to run across many different hardware.
+In addition to supporting both Phi-3 Mini models on various GPUs, ONNX Runtime can help run these models on mobile, Windows, and Mac CPUs, making it a truly cross-platform framework. ONNX Runtime also supports quantization techniques like RTN to enable these models to run across many different hardware.
 
 ONNX Runtime Mobile empowers developers to perform on-device inference with AI models on mobile and edge devices. By removing client-server communications, ORT Mobile provides privacy protection and has zero cost. Using RTN INT4 quantization, we significantly reduce the size of the state-of-the-art Phi-3 Mini models and can run both on a Samsung Galaxy S21 at a moderate speed. When applying RTN INT4 quantization, there is a tuning parameter for the int4 accuracy level. This parameter specifies the minimum accuracy level required for the activation of MatMul in int4 quantization, balancing performance and accuracy trade-offs. Two versions of RTN quantized models have been released with int4_accuracy_level=1, optimized for accuracy, and int4_accuracy_level=4, optimized for performance. If you prefer better performance with a slight trade-off in accuracy, we recommend using the model with int4_accuracy_level=4.

From 28705ff0eca1b437cab7577ade944c7be817a3b5 Mon Sep 17 00:00:00 2001
From: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
Date: Tue, 23 Apr 2024 10:45:09 -0700
Subject: [PATCH 09/12] Update src/routes/blogs/accelerating-phi-3/+page.svx

---
 src/routes/blogs/accelerating-phi-3/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-3/+page.svx b/src/routes/blogs/accelerating-phi-3/+page.svx
index 168e4c630224e..59504efb53f8a 100644
--- a/src/routes/blogs/accelerating-phi-3/+page.svx
+++ b/src/routes/blogs/accelerating-phi-3/+page.svx
@@ -41,7 +41,7 @@ For FP16 and INT4 CUDA, Phi-3 Mini-4K-Instruct with ORT performs up to 5X faster
 
 In addition to supporting both Phi-3 Mini models on various GPUs, ONNX Runtime can help run these models on mobile, Windows, and Mac CPUs, making it a truly cross-platform framework. ONNX Runtime also supports quantization techniques like RTN to enable these models to run across many different hardware.
 
-ONNX Runtime Mobile empowers developers to perform on-device inference with AI models on mobile and edge devices. By removing client-server communications, ORT Mobile provides privacy protection and has zero cost. Using RTN INT4 quantization, we significantly reduce the size of the state-of-the-art Phi-3 Mini models and can run both on a Samsung Galaxy S21 at a moderate speed. When applying RTN INT4 quantization, there is a tuning parameter for the int4 accuracy level. This parameter specifies the minimum accuracy level required for the activation of MatMul in int4 quantization, balancing performance and accuracy trade-offs. Two versions of RTN quantized models have been released with int4_accuracy_level=1, optimized for accuracy, and int4_accuracy_level=4, optimized for performance. If you prefer better performance with a slight trade-off in accuracy, we recommend using the model with int4_accuracy_level=4.
+ONNX Runtime Mobile empowers developers to perform on-device inference with AI models on mobile and edge devices. By removing client-server communications, ORT Mobile provides privacy protection and has zero cost. Using RTN INT4 quantization, we significantly reduce the size of the state-of-the-art Phi-3 Mini models and can run both on a Samsung Galaxy S21 at a moderate speed. When applying RTN INT4 quantization, there is a tuning parameter for the INT4 accuracy level. This parameter specifies the minimum accuracy level required for the activation of MatMul in INT4 quantization, balancing performance and accuracy trade-offs. Two versions of RTN quantized models have been released: (1) the model optimized for accuracy with int4_accuracy_level=1 and (2) the model optimized for performance with int4_accuracy_level=4. If you prefer better performance with a slight trade-off in accuracy, we recommend using the model with int4_accuracy_level=4.
 
 Whether it's Windows, Linux, Android, or Mac, there's a path to infer models efficiently with ONNX Runtime!

From ccab8c94733a6bf1eff2cb4958e4a2b6f0c7754f Mon Sep 17 00:00:00 2001
From: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
Date: Tue, 23 Apr 2024 10:45:18 -0700
Subject: [PATCH 10/12] Update src/routes/blogs/accelerating-phi-3/+page.svx

---
 src/routes/blogs/accelerating-phi-3/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-3/+page.svx b/src/routes/blogs/accelerating-phi-3/+page.svx
index 59504efb53f8a..bb62ffd0aa310 100644
--- a/src/routes/blogs/accelerating-phi-3/+page.svx
+++ b/src/routes/blogs/accelerating-phi-3/+page.svx
@@ -47,7 +47,7 @@ Whether it's Windows, Linux, Android, or Mac, there's a path to infer models eff
 
 ## Try the ONNX Runtime Generate() API
 
-We are pleased to announce our new Generate() API, which makes it easier to run the Phi-3 models across a range of devices, platforms, and EP backends by wrapping several aspects of generative AI inferencing.
+We are pleased to announce our new Generate() API, which makes it easier to run the Phi-3 models across a range of devices, platforms, and EP backends by wrapping several aspects of generative AI inferencing. The Generate() API makes it easy to drag and drop LLMs straight into your app. To run the early version of these models with ONNX, follow the steps [here](http://aka.ms/generate-tutorial).
 
 This API makes it easy to drag and drop LLMs straight into your app. To run the early version of these models with ONNX, follow the steps [here](http://aka.ms/generate-tutorial).

From 605a8ff6f5780e847f6d0e0f5b8c61f030f64a2d Mon Sep 17 00:00:00 2001
From: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
Date: Tue, 23 Apr 2024 10:45:33 -0700
Subject: [PATCH 11/12] Update src/routes/blogs/accelerating-phi-3/+page.svx

---
 src/routes/blogs/accelerating-phi-3/+page.svx | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-3/+page.svx b/src/routes/blogs/accelerating-phi-3/+page.svx
index bb62ffd0aa310..b2518677f5183 100644
--- a/src/routes/blogs/accelerating-phi-3/+page.svx
+++ b/src/routes/blogs/accelerating-phi-3/+page.svx
@@ -49,7 +49,6 @@ Whether it's Windows, Linux, Android, or Mac, there's a path to infer models eff
 
 We are pleased to announce our new Generate() API, which makes it easier to run the Phi-3 models across a range of devices, platforms, and EP backends by wrapping several aspects of generative AI inferencing. The Generate() API makes it easy to drag and drop LLMs straight into your app. To run the early version of these models with ONNX, follow the steps [here](http://aka.ms/generate-tutorial).
 
-This API makes it easy to drag and drop LLMs straight into your app. To run the early version of these models with ONNX, follow the steps [here](http://aka.ms/generate-tutorial).
 
 Example:

From 3b22937b1788fd459054bc26fa83156d5972e814 Mon Sep 17 00:00:00 2001
From: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
Date: Tue, 23 Apr 2024 10:48:24 -0700
Subject: [PATCH 12/12] Update src/routes/blogs/accelerating-phi-3/+page.svx

---
 src/routes/blogs/accelerating-phi-3/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-3/+page.svx b/src/routes/blogs/accelerating-phi-3/+page.svx
index b2518677f5183..cb64f7f4a6330 100644
--- a/src/routes/blogs/accelerating-phi-3/+page.svx
+++ b/src/routes/blogs/accelerating-phi-3/+page.svx
@@ -113,7 +113,7 @@ The table below shows improvement on the average throughput of the first 256 tok
 
-The table below shows improvement on the average throughput of the first 256 tokens generated (tps) for Phi-3 Mini 4K Instruct ONNX model. The comparisons are for FP16 and INT4 precisions on CUDA, as measured on 1 A100 80GB GPU (SKU: Standard_ND96amsr_A100_v4).
+The table below shows improvement on the average throughput of the first 256 tokens generated (tps) for Phi-3 Mini 4K Instruct ONNX model. The comparisons are for FP16 and INT4 precisions on CUDA, as measured on 1 A100 80GB GPU (SKU: [Standard_ND96amsr_A100_v4](https://learn.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series)).
 Average throughput of int4 Phi-3 Mini 4K Instruct ONNX model.
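
For reference, the Generate() API workflow that patches 10 and 11 describe looks roughly like the Python sketch below. It follows the pattern of the onnxruntime-genai Phi-3 examples linked from the tutorial; the model folder path, prompt template, and search options are placeholders, and exact class or method names may vary between onnxruntime-genai releases.

```python
# Hedged sketch of running a Phi-3 ONNX model with the ONNX Runtime Generate() API.
# Assumes an onnxruntime-genai package is installed and a local folder of optimized
# Phi-3 ONNX weights; "phi3-mini-4k-instruct-onnx" below is a placeholder path.
import onnxruntime_genai as og

model = og.Model("phi3-mini-4k-instruct-onnx")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Phi-3 chat-style prompt; the template and question are illustrative only.
prompt = "<|user|>\nWhat is DirectML?<|end|>\n<|assistant|>\n"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = input_tokens

# Stream tokens to stdout as they are generated.
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```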