From 1a3b85a00f4e3db9255b1003b41b8693a3b453cf Mon Sep 17 00:00:00 2001
From: Maanav Dalal
Date: Mon, 26 Feb 2024 18:05:48 -0800
Subject: [PATCH 01/15] Update src/routes/blogs/accelerating-phi-2/+page.svx

Co-authored-by: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
---
 src/routes/blogs/accelerating-phi-2/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-2/+page.svx b/src/routes/blogs/accelerating-phi-2/+page.svx
index e4d408fa3c37f..3adb1361b4c1a 100644
--- a/src/routes/blogs/accelerating-phi-2/+page.svx
+++ b/src/routes/blogs/accelerating-phi-2/+page.svx
@@ -28,7 +28,7 @@ url: 'https://onnxruntime.ai/blogs/accelerating-phi-2'
 
 In a fastmoving landscape where speed and efficiency are paramount, [ONNX Runtime](https://onnxruntime.ai/blogs/accelerating-llama-2) (ORT) allows users to easily integrate the power of generative AI models into their apps and services with improved optimizations that yield faster inferencing speeds and effectively lowers costs. These include state-of-the-art fusion and kernel optimizations to help improve model performance. The recent [ONNX Runtime 1.17 release](https://github.com/microsoft/onnxruntime/releases/tag/v1.15.0) improves inference performance of several Gen AI models including Phi-2, Mistral, CodeLlama, Orca-2 and more. ONNX Runtime is a complete solution for small language models (SLMs) from training to inference, showing significant speedups compared to other frameworks. With support for float32, float16, and int4, ONNX Runtime's inference enhancements provide maximum flexibility and performance.
 
-In this blog we will cover significant optimization speed up for both training and inference for latest GenAI models like Phi-2, Mistral, CodeLlama, SD-Turbo, SDXL-Turbo, Llama2, and Orca-2. For these model architectures ONNX Runtime significantly improves performance across a spectrum of batch size and prompt length when compared against other frameworks like PyTorch, and Llama.cpp. These optimizations using ONNX Runtime is now also available using [Olive](https://github.com/microsoft/Olive/tree/main/examples/).
+In this blog, we will cover significant optimization speed up for both training and inference for the latest GenAI models like Phi-2, Mistral, CodeLlama, SD-Turbo, SDXL-Turbo, Llama2, and Orca-2. For these model architectures, ONNX Runtime significantly improves performance across a spectrum of batch sizes and prompt lengths when compared against other frameworks like PyTorch, and Llama.cpp. These optimizations using ONNX Runtime are now also available using [Olive](https://github.com/microsoft/Olive/tree/main/examples/).
 
 # Quick Links
 
 - [Phi-2](#phi-2)
 - [Mistral](#mistral)
From a995e9a317e4e68c48e8050c0aed4d4b2040f4fb Mon Sep 17 00:00:00 2001
From: Maanav Dalal
Date: Mon, 26 Feb 2024 18:05:56 -0800
Subject: [PATCH 02/15] Update src/routes/blogs/accelerating-phi-2/+page.svx

Co-authored-by: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
---
 src/routes/blogs/accelerating-phi-2/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-2/+page.svx b/src/routes/blogs/accelerating-phi-2/+page.svx
index 3adb1361b4c1a..542ab3ba5f68b 100644
--- a/src/routes/blogs/accelerating-phi-2/+page.svx
+++ b/src/routes/blogs/accelerating-phi-2/+page.svx
@@ -26,7 +26,7 @@ url: 'https://onnxruntime.ai/blogs/accelerating-phi-2'
 
 ---
 
-In a fastmoving landscape where speed and efficiency are paramount, [ONNX Runtime](https://onnxruntime.ai/blogs/accelerating-llama-2) (ORT) allows users to easily integrate the power of generative AI models into their apps and services with improved optimizations that yield faster inferencing speeds and effectively lowers costs. These include state-of-the-art fusion and kernel optimizations to help improve model performance. The recent [ONNX Runtime 1.17 release](https://github.com/microsoft/onnxruntime/releases/tag/v1.15.0) improves inference performance of several Gen AI models including Phi-2, Mistral, CodeLlama, Orca-2 and more. ONNX Runtime is a complete solution for small language models (SLMs) from training to inference, showing significant speedups compared to other frameworks. With support for float32, float16, and int4, ONNX Runtime's inference enhancements provide maximum flexibility and performance.
+In a fast-moving landscape where speed and efficiency are paramount, [ONNX Runtime](https://onnxruntime.ai/) (ORT) allows users to easily integrate the power of generative AI models into their apps and services with improved optimizations that yield faster inferencing speeds and effectively lowers costs. These include state-of-the-art fusion and kernel optimizations to help improve model performance. The recent [ONNX Runtime 1.17 release](https://github.com/microsoft/onnxruntime/releases/tag/v1.17.0) improves inference performance of several Gen AI models including Phi-2, Mistral, CodeLlama, Orca-2 and more. ONNX Runtime is a complete solution for small language models (SLMs) from training to inference, showing significant speedups compared to other frameworks. With support for float32, float16, and int4, ONNX Runtime's inference enhancements provide maximum flexibility and performance.
 
 In this blog, we will cover significant optimization speed up for both training and inference for the latest GenAI models like Phi-2, Mistral, CodeLlama, SD-Turbo, SDXL-Turbo, Llama2, and Orca-2. For these model architectures, ONNX Runtime significantly improves performance across a spectrum of batch sizes and prompt lengths when compared against other frameworks like PyTorch, and Llama.cpp. These optimizations using ONNX Runtime are now also available using [Olive](https://github.com/microsoft/Olive/tree/main/examples/).
 
 # Quick Links
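The two patches above edit the introduction, which describes running these GenAI models through ONNX Runtime's optimized float32/float16/int4 inference paths. As a rough, editor-added illustration of what that looks like from the Python API (not part of this patch series), a single forward pass over a prompt might be sketched as below; the model path `phi2_fp16.onnx` and the input names `input_ids`/`attention_mask` are assumptions, since exported graphs differ by exporter and optimization recipe.

```python
# Illustrative sketch only: assumes an exported, ORT-optimized Phi-2 model saved as
# "phi2_fp16.onnx" whose graph inputs are "input_ids" and "attention_mask".
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "phi2_fp16.onnx",  # hypothetical path to an optimized float16 export
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Arbitrary toy token ids standing in for a tokenized prompt.
input_ids = np.array([[1, 2, 3, 4, 5, 6, 7, 8]], dtype=np.int64)
attention_mask = np.ones_like(input_ids)

outputs = session.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})
print([o.shape for o in outputs])  # logits (plus KV-cache tensors, if exported)
```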
From 69e5ca18f270c680b6b14f3f8337adaeba6fa377 Mon Sep 17 00:00:00 2001
From: Maanav Dalal
Date: Mon, 26 Feb 2024 18:06:03 -0800
Subject: [PATCH 03/15] Update src/routes/blogs/accelerating-phi-2/+page.svx

Co-authored-by: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
---
 src/routes/blogs/accelerating-phi-2/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-2/+page.svx b/src/routes/blogs/accelerating-phi-2/+page.svx
index 542ab3ba5f68b..847d9732a7eab 100644
--- a/src/routes/blogs/accelerating-phi-2/+page.svx
+++ b/src/routes/blogs/accelerating-phi-2/+page.svx
@@ -58,7 +58,7 @@ For Phi-2 inference, ORT with float16 and int4 quantization performs better than
 
 ### ORT gains with float16
 
-Optimized CUDA performance for prompt throughput (i.e., the rate at which the model processes and generates responses based on input prompts) is **up to 7.39x** faster than PyTorch Compile. We also observe ONNX Runtime is significantly faster for larger batch size and prompt lengths compared to llama.cpp, for e.g., it is **up to 13.08x faster** for batch size =16, prompt length =2048.
+Optimized CUDA performance for prompt throughput (i.e., the rate at which the model processes and generates responses based on input prompts) is **up to 7.39x** faster than PyTorch Compile. We also observe ONNX Runtime is significantly faster for larger batch size and prompt lengths compared to Llama.cpp. For example, it is **up to 13.08x faster** for batch size =16, prompt length =2048.
 
 Phi2 float16 prompt throughput comparison

From 0e1c8d94a3f8fd18d823d4f83aa2915881763736 Mon Sep 17 00:00:00 2001
From: Maanav Dalal
Date: Mon, 26 Feb 2024 18:06:08 -0800
Subject: [PATCH 04/15] Update src/routes/blogs/accelerating-phi-2/+page.svx

Co-authored-by: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
---
 src/routes/blogs/accelerating-phi-2/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-2/+page.svx b/src/routes/blogs/accelerating-phi-2/+page.svx
index 847d9732a7eab..015d124b6c599 100644
--- a/src/routes/blogs/accelerating-phi-2/+page.svx
+++ b/src/routes/blogs/accelerating-phi-2/+page.svx
@@ -165,7 +165,7 @@ We published a separate blog for Llama-2 improvements with ORT for Inference [he
 
 ## Inference
 
-[Orca 2](https://www.microsoft.com/en-us/research/blog/orca-2-teaching-small-language-models-how-to-reason/) is a research-only system that gives a one-time answer in tasks such as reasoning with user provided data, understanding texts, solving math problems, and summarizing texts. Orca 2 has two versions (7 billion and 13 billion parameters; they are both made by fine-tuning the respective LLAMA 2 base models on customized, high-quality artificial data. ONNX runtime helps optimize Orca-2 inferencing for using graph fusions and kernel optimizations like those for Llama-2.
+[Orca-2](https://www.microsoft.com/en-us/research/blog/orca-2-teaching-small-language-models-how-to-reason/) is a research-only system that gives a one-time answer in tasks such as reasoning with user-provided data, understanding texts, solving math problems, and summarizing texts. Orca-2 has two versions (7 billion and 13 billion parameters); they are both made by fine-tuning the respective Llama-2 base models on customized, high-quality artificial data. ONNX Runtime helps optimize Orca-2 inferencing for using graph fusions and kernel optimizations like those for Llama-2.
 
 Int4 performance: Orca-2 7b int4 quantization performance comparison indicated **up to 26X** increase in performance in prompt throughput, and up to 16.5X improvement in Token generation throughput over PyTorch. It also shows over **4.75X** improvement in prompt throughput, and 3.64X improvement in token generation throughput compared to Llama.cpp.
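Patches 03 and 04 quote prompt-throughput and token-generation-throughput ratios against PyTorch and Llama.cpp. As an editor-added sketch of what a prompt-throughput figure means (prompt tokens processed per second of prefill), a measurement could look roughly like the snippet below; it reuses the assumed model path and input names from the earlier sketch and is not the benchmark harness behind the published numbers.

```python
# Editor's sketch of a prompt-throughput measurement: prompt tokens divided by the
# average time of a prefill pass. Model path and input names are assumptions.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "phi2_fp16.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

batch_size, prompt_len = 16, 2048  # one of the configurations quoted above
feed = {
    "input_ids": np.random.randint(0, 50000, (batch_size, prompt_len), dtype=np.int64),
    "attention_mask": np.ones((batch_size, prompt_len), dtype=np.int64),
}

session.run(None, feed)  # warm-up run to exclude one-time setup costs

runs = 10
start = time.perf_counter()
for _ in range(runs):
    session.run(None, feed)
elapsed = (time.perf_counter() - start) / runs

print(f"prompt throughput: {batch_size * prompt_len / elapsed:,.0f} tokens/s")
```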
From a633815a5dfff737d6b9af3b0fa20ac22b635d1d Mon Sep 17 00:00:00 2001
From: Maanav Dalal
Date: Mon, 26 Feb 2024 18:06:14 -0800
Subject: [PATCH 05/15] Update src/routes/blogs/accelerating-phi-2/+page.svx

Co-authored-by: Sophie Schoenmeyer <107952697+sophies927@users.noreply.github.com>
---
 src/routes/blogs/accelerating-phi-2/+page.svx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blogs/accelerating-phi-2/+page.svx b/src/routes/blogs/accelerating-phi-2/+page.svx
index 015d124b6c599..48fceec7a3f59 100644
--- a/src/routes/blogs/accelerating-phi-2/+page.svx
+++ b/src/routes/blogs/accelerating-phi-2/+page.svx
@@ -216,7 +216,7 @@ Orca-2 7b also benefits from training acceleration using ORT. We trained the Orc
 
 # Conclusion
 
-In conclusion, ONNX Runtime (ORT) provides significant performance improvements for several models, including Phi-2, Mistral, CodeLlama, SDXL-Turbo, Llama-2, Orca-2, and Gemma. ORT offers state-of-the-art fusion and kernel optimizations, including support for float16 and int4 quantization, resulting in faster inferencing speeds and lower costs. ORT outperforms other frameworks like PyTorch and Llama.cpp in terms of prompt and token generation throughput. ORT also shows significant benefits for training LLMs, with increasing gains for larger batch sizes and composes well with state-of-the-art techniques to enable efficient large model training.
+In conclusion, ONNX Runtime (ORT) provides significant performance improvements for several models, including Phi-2, Mistral, CodeLlama, SDXL-Turbo, Llama-2, Orca-2, and Gemma. ORT offers state-of-the-art fusion and kernel optimizations, including support for float16 and int4 quantization, resulting in faster inferencing speeds and lower costs. ORT outperforms other frameworks like PyTorch and Llama.cpp in terms of prompt and token generation throughput. ORT also shows significant benefits for training LLMs, with increasing gains for larger batch sizes, and composes well with state-of-the-art techniques to enable efficient large model training.
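The conclusion edited in patch 05 also points at training gains (its hunk context references Orca-2 7b training acceleration with ORT). As an editor-added sketch of the usual entry point for that path, and not the recipe behind the Orca-2 results, an existing PyTorch module can be wrapped with ORTModule from the onnxruntime-training package so its forward and backward passes run through ONNX Runtime:

```python
# Editor's sketch: accelerate an existing PyTorch training loop with ORTModule
# (requires the onnxruntime-training package). The toy model below stands in for
# a real LLM; the blog's Orca-2 numbers come from the team's own training setup.
import torch
from onnxruntime.training.ortmodule import ORTModule

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

model = ORTModule(model)  # the rest of the training loop is unchanged
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()  # placeholder loss for the sketch
loss.backward()
optimizer.step()
```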