From 1a3b85a00f4e3db9255b1003b41b8693a3b453cf Mon Sep 17 00:00:00 2001
From: Maanav Dalal
Date: Mon, 26 Feb 2024 18:05:48 -0800
Subject: [PATCH 01/15] Update src/routes/blogs/accelerating-phi-2/+page.svx

Co-authored-by: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
---
 src/routes/blogs/accelerating-phi-2/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-2/+page.svx b/src/routes/blogs/accelerating-phi-2/+page.svx
index e4d408fa3c37f..3adb1361b4c1a 100644
--- a/src/routes/blogs/accelerating-phi-2/+page.svx
+++ b/src/routes/blogs/accelerating-phi-2/+page.svx
@@ -28,7 +28,7 @@ url: 'https://onnxruntime.ai/blogs/accelerating-phi-2'
 
 In a fastmoving landscape where speed and efficiency are paramount, [ONNX Runtime](https://onnxruntime.ai/blogs/accelerating-llama-2) (ORT) allows users to easily integrate the power of generative AI models into their apps and services with improved optimizations that yield faster inferencing speeds and effectively lowers costs. These include state-of-the-art fusion and kernel optimizations to help improve model performance. The recent [ONNX Runtime 1.17 release](https://github.com/microsoft/onnxruntime/releases/tag/v1.15.0) improves inference performance of several Gen AI models including Phi-2, Mistral, CodeLlama, Orca-2 and more. ONNX Runtime is a complete solution for small language models (SLMs) from training to inference, showing significant speedups compared to other frameworks. With support for float32, float16, and int4, ONNX Runtime's inference enhancements provide maximum flexibility and performance.
 
-In this blog we will cover significant optimization speed up for both training and inference for latest GenAI models like Phi-2, Mistral, CodeLlama, SD-Turbo, SDXL-Turbo, Llama2, and Orca-2. For these model architectures ONNX Runtime significantly improves performance across a spectrum of batch size and prompt length when compared against other frameworks like PyTorch, and Llama.cpp. These optimizations using ONNX Runtime is now also available using [Olive](https://github.com/microsoft/Olive/tree/main/examples/).
+In this blog, we will cover significant optimization speed up for both training and inference for the latest GenAI models like Phi-2, Mistral, CodeLlama, SD-Turbo, SDXL-Turbo, Llama2, and Orca-2. For these model architectures, ONNX Runtime significantly improves performance across a spectrum of batch sizes and prompt lengths when compared against other frameworks like PyTorch, and Llama.cpp. These optimizations using ONNX Runtime are now also available using [Olive](https://github.com/microsoft/Olive/tree/main/examples/).
 
 # Quick Links
 
 - [Phi-2](#phi-2)
 - [Mistral](#mistral)
From a995e9a317e4e68c48e8050c0aed4d4b2040f4fb Mon Sep 17 00:00:00 2001
From: Maanav Dalal
Date: Mon, 26 Feb 2024 18:05:56 -0800
Subject: [PATCH 02/15] Update src/routes/blogs/accelerating-phi-2/+page.svx

Co-authored-by: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
---
 src/routes/blogs/accelerating-phi-2/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-2/+page.svx b/src/routes/blogs/accelerating-phi-2/+page.svx
index 3adb1361b4c1a..542ab3ba5f68b 100644
--- a/src/routes/blogs/accelerating-phi-2/+page.svx
+++ b/src/routes/blogs/accelerating-phi-2/+page.svx
@@ -26,7 +26,7 @@ url: 'https://onnxruntime.ai/blogs/accelerating-phi-2'
 
 ---
 
-In a fastmoving landscape where speed and efficiency are paramount, [ONNX Runtime](https://onnxruntime.ai/blogs/accelerating-llama-2) (ORT) allows users to easily integrate the power of generative AI models into their apps and services with improved optimizations that yield faster inferencing speeds and effectively lowers costs. These include state-of-the-art fusion and kernel optimizations to help improve model performance. The recent [ONNX Runtime 1.17 release](https://github.com/microsoft/onnxruntime/releases/tag/v1.15.0) improves inference performance of several Gen AI models including Phi-2, Mistral, CodeLlama, Orca-2 and more. ONNX Runtime is a complete solution for small language models (SLMs) from training to inference, showing significant speedups compared to other frameworks. With support for float32, float16, and int4, ONNX Runtime's inference enhancements provide maximum flexibility and performance.
+In a fast-moving landscape where speed and efficiency are paramount, [ONNX Runtime](https://onnxruntime.ai/) (ORT) allows users to easily integrate the power of generative AI models into their apps and services with improved optimizations that yield faster inferencing speeds and effectively lowers costs. These include state-of-the-art fusion and kernel optimizations to help improve model performance. The recent [ONNX Runtime 1.17 release](https://github.com/microsoft/onnxruntime/releases/tag/v1.17.0) improves inference performance of several Gen AI models including Phi-2, Mistral, CodeLlama, Orca-2 and more. ONNX Runtime is a complete solution for small language models (SLMs) from training to inference, showing significant speedups compared to other frameworks. With support for float32, float16, and int4, ONNX Runtime's inference enhancements provide maximum flexibility and performance.
 
 In this blog, we will cover significant optimization speed up for both training and inference for the latest GenAI models like Phi-2, Mistral, CodeLlama, SD-Turbo, SDXL-Turbo, Llama2, and Orca-2. For these model architectures, ONNX Runtime significantly improves performance across a spectrum of batch sizes and prompt lengths when compared against other frameworks like PyTorch, and Llama.cpp. These optimizations using ONNX Runtime are now also available using [Olive](https://github.com/microsoft/Olive/tree/main/examples/).
 
 # Quick Links
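The two patches above edit the introduction, which describes running these GenAI models through ONNX Runtime's optimized float32/float16/int4 inference paths. As a rough, editor-added illustration of what that looks like from the Python API (not part of this patch series), a single forward pass over a prompt might be sketched as below; the model path `phi2_fp16.onnx` and the input names `input_ids`/`attention_mask` are assumptions, since exported graphs differ by exporter and optimization recipe.

```python
# Illustrative sketch only: assumes an exported, ORT-optimized Phi-2 model saved as
# "phi2_fp16.onnx" whose graph inputs are "input_ids" and "attention_mask".
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "phi2_fp16.onnx",  # hypothetical path to an optimized float16 export
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Arbitrary toy token ids standing in for a tokenized prompt.
input_ids = np.array([[1, 2, 3, 4, 5, 6, 7, 8]], dtype=np.int64)
attention_mask = np.ones_like(input_ids)

outputs = session.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})
print([o.shape for o in outputs])  # logits (plus KV-cache tensors, if exported)
```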
From 69e5ca18f270c680b6b14f3f8337adaeba6fa377 Mon Sep 17 00:00:00 2001
From: Maanav Dalal
Date: Mon, 26 Feb 2024 18:06:03 -0800
Subject: [PATCH 03/15] Update src/routes/blogs/accelerating-phi-2/+page.svx

Co-authored-by: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
---
 src/routes/blogs/accelerating-phi-2/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-2/+page.svx b/src/routes/blogs/accelerating-phi-2/+page.svx
index 542ab3ba5f68b..847d9732a7eab 100644
--- a/src/routes/blogs/accelerating-phi-2/+page.svx
+++ b/src/routes/blogs/accelerating-phi-2/+page.svx
@@ -58,7 +58,7 @@ For Phi-2 inference, ORT with float16 and int4 quantization performs better than
 
 ### ORT gains with float16
 
-Optimized CUDA performance for prompt throughput (i.e., the rate at which the model processes and generates responses based on input prompts) is **up to 7.39x** faster than PyTorch Compile. We also observe ONNX Runtime is significantly faster for larger batch size and prompt lengths compared to llama.cpp, for e.g., it is **up to 13.08x faster** for batch size =16, prompt length =2048.
+Optimized CUDA performance for prompt throughput (i.e., the rate at which the model processes and generates responses based on input prompts) is **up to 7.39x** faster than PyTorch Compile. We also observe ONNX Runtime is significantly faster for larger batch size and prompt lengths compared to Llama.cpp. For example, it is **up to 13.08x faster** for batch size =16, prompt length =2048.
 
 Phi2 float16 prompt throughput comparison

From 0e1c8d94a3f8fd18d823d4f83aa2915881763736 Mon Sep 17 00:00:00 2001
From: Maanav Dalal
Date: Mon, 26 Feb 2024 18:06:08 -0800
Subject: [PATCH 04/15] Update src/routes/blogs/accelerating-phi-2/+page.svx

Co-authored-by: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
---
 src/routes/blogs/accelerating-phi-2/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-2/+page.svx b/src/routes/blogs/accelerating-phi-2/+page.svx
index 847d9732a7eab..015d124b6c599 100644
--- a/src/routes/blogs/accelerating-phi-2/+page.svx
+++ b/src/routes/blogs/accelerating-phi-2/+page.svx
@@ -165,7 +165,7 @@ We published a separate blog for Llama-2 improvements with ORT for Inference [he
 
 ## Inference
 
-[Orca 2](https://www.microsoft.com/en-us/research/blog/orca-2-teaching-small-language-models-how-to-reason/) is a research-only system that gives a one-time answer in tasks such as reasoning with user provided data, understanding texts, solving math problems, and summarizing texts. Orca 2 has two versions (7 billion and 13 billion parameters; they are both made by fine-tuning the respective LLAMA 2 base models on customized, high-quality artificial data. ONNX runtime helps optimize Orca-2 inferencing for using graph fusions and kernel optimizations like those for Llama-2.
+[Orca-2](https://www.microsoft.com/en-us/research/blog/orca-2-teaching-small-language-models-how-to-reason/) is a research-only system that gives a one-time answer in tasks such as reasoning with user-provided data, understanding texts, solving math problems, and summarizing texts. Orca-2 has two versions (7 billion and 13 billion parameters); they are both made by fine-tuning the respective Llama-2 base models on customized, high-quality artificial data. ONNX Runtime helps optimize Orca-2 inferencing for using graph fusions and kernel optimizations like those for Llama-2.
 
 Int4 performance: Orca-2 7b int4 quantization performance comparison indicated **up to 26X** increase in performance in prompt throughput, and up to 16.5X improvement in Token generation throughput over PyTorch. It also shows over **4.75X** improvement in prompt throughput, and 3.64X improvement in token generation throughput compared to Llama.cpp.
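Patches 03 and 04 quote prompt-throughput and token-generation-throughput ratios against PyTorch and Llama.cpp. As an editor-added sketch of what a prompt-throughput figure means (prompt tokens processed per second of prefill), a measurement could look roughly like the snippet below; it reuses the assumed model path and input names from the earlier sketch and is not the benchmark harness behind the published numbers.

```python
# Editor's sketch of a prompt-throughput measurement: prompt tokens divided by the
# average time of a prefill pass. Model path and input names are assumptions.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "phi2_fp16.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

batch_size, prompt_len = 16, 2048  # one of the configurations quoted above
feed = {
    "input_ids": np.random.randint(0, 50000, (batch_size, prompt_len), dtype=np.int64),
    "attention_mask": np.ones((batch_size, prompt_len), dtype=np.int64),
}

session.run(None, feed)  # warm-up run to exclude one-time setup costs

runs = 10
start = time.perf_counter()
for _ in range(runs):
    session.run(None, feed)
elapsed = (time.perf_counter() - start) / runs

print(f"prompt throughput: {batch_size * prompt_len / elapsed:,.0f} tokens/s")
```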
From a633815a5dfff737d6b9af3b0fa20ac22b635d1d Mon Sep 17 00:00:00 2001
From: Maanav Dalal
Date: Mon, 26 Feb 2024 18:06:14 -0800
Subject: [PATCH 05/15] Update src/routes/blogs/accelerating-phi-2/+page.svx

Co-authored-by: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
---
 src/routes/blogs/accelerating-phi-2/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-2/+page.svx b/src/routes/blogs/accelerating-phi-2/+page.svx
index 015d124b6c599..48fceec7a3f59 100644
--- a/src/routes/blogs/accelerating-phi-2/+page.svx
+++ b/src/routes/blogs/accelerating-phi-2/+page.svx
@@ -216,7 +216,7 @@ Orca-2 7b also benefits from training acceleration using ORT. We trained the Orc
 
 # Conclusion
 
-In conclusion, ONNX Runtime (ORT) provides significant performance improvements for several models, including Phi-2, Mistral, CodeLlama, SDXL-Turbo, Llama-2, Orca-2, and Gemma. ORT offers state-of-the-art fusion and kernel optimizations, including support for float16 and int4 quantization, resulting in faster inferencing speeds and lower costs. ORT outperforms other frameworks like PyTorch and Llama.cpp in terms of prompt and token generation throughput. ORT also shows significant benefits for training LLMs, with increasing gains for larger batch sizes and composes well with state-of-the-art techniques to enable efficient large model training.
+In conclusion, ONNX Runtime (ORT) provides significant performance improvements for several models, including Phi-2, Mistral, CodeLlama, SDXL-Turbo, Llama-2, Orca-2, and Gemma. ORT offers state-of-the-art fusion and kernel optimizations, including support for float16 and int4 quantization, resulting in faster inferencing speeds and lower costs. ORT outperforms other frameworks like PyTorch and Llama.cpp in terms of prompt and token generation throughput. ORT also shows significant benefits for training LLMs, with increasing gains for larger batch sizes, and composes well with state-of-the-art techniques to enable efficient large model training.
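The conclusion edited in patch 05 also points at training gains (its hunk context references Orca-2 7b training acceleration with ORT). As an editor-added sketch of the usual entry point for that path, and not the recipe behind the Orca-2 results, an existing PyTorch module can be wrapped with ORTModule from the onnxruntime-training package so its forward and backward passes run through ONNX Runtime:

```python
# Editor's sketch: accelerate an existing PyTorch training loop with ORTModule
# (requires the onnxruntime-training package). The toy model below stands in for
# a real LLM; the blog's Orca-2 numbers come from the team's own training setup.
import torch
from onnxruntime.training.ortmodule import ORTModule

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

model = ORTModule(model)  # the rest of the training loop is unchanged
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()  # placeholder loss for the sketch
loss.backward()
optimizer.step()
```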