In a fast-moving landscape where speed and efficiency are paramount, [ONNX Runtime](https://onnxruntime.ai/) (ORT) allows users to easily integrate the power of generative AI models into their apps and services with improved optimizations that yield faster inferencing speeds and effectively lower costs. These include state-of-the-art fusion and kernel optimizations that help improve model performance. The recent [ONNX Runtime 1.17 release](https://github.com/microsoft/onnxruntime/releases/tag/v1.17.0) improves inference performance of several Gen AI models, including Phi-2, Mistral, CodeLlama, Orca-2, and more. ONNX Runtime is a complete solution for small language models (SLMs) from training to inference, showing significant speedups compared to other frameworks. With support for float32, float16, and int4, ONNX Runtime's inference enhancements provide maximum flexibility and performance.

In this blog, we will cover significant optimization speedups for both training and inference for the latest GenAI models like Phi-2, Mistral, CodeLlama, SD-Turbo, SDXL-Turbo, Llama-2, and Orca-2. For these model architectures, ONNX Runtime significantly improves performance across a spectrum of batch sizes and prompt lengths when compared against other frameworks like PyTorch and Llama.cpp. These optimizations using ONNX Runtime are now also available using [Olive](https://github.com/microsoft/Olive/tree/main/examples/).
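For reference, a minimal sketch of launching an Olive optimization workflow from Python is shown below; the config file name is a placeholder for one of the model configs in the Olive examples repo, and the exact entry point may vary with the Olive version.

```python
# Minimal sketch: launch an Olive optimization workflow from Python.
# "phi2_config.json" is a placeholder for a config from the Olive examples repo.
from olive.workflows import run as olive_run

olive_run("phi2_config.json")
```

The same workflow can typically also be launched from the command line with `python -m olive.workflows.run --config phi2_config.json`.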
# Quick Links
- [Phi-2](#phi-2)
- [Mistral](#mistral)

[Phi-2](https://huggingface.co/microsoft/phi-2) is a 2.7 billion parameter transformer model developed by Microsoft. It is an SLM that exhibits excellent reasoning and language comprehension skills. With its small size, Phi-2 is a great platform for researchers, who can explore various aspects such as mechanistic interpretability, safety improvements, and fine-tuning experiments on different tasks.

ONNX Runtime 1.17 introduces kernel changes that support the Phi-2 model, including optimizations for Attention, Multi-Head Attention, Grouped-Query Attention, and Rotary Embedding for Phi-2. Specifically, support has been added for the following:

- causal mask in the Multi-Head Attention CPU kernel
- rotary_embedding_dim in the Attention and Rotary Embedding kernels
For Phi-2 inference, ORT with float16 and int4 quantization performs better than PyTorch and Llama.cpp.
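As a rough illustration of what consuming these optimized models looks like (the model path and input names below are placeholders that depend on how the model was exported, not something defined by the release), here is a minimal ONNX Runtime inference sketch:

```python
# Minimal sketch: run an exported Phi-2 ONNX model with ONNX Runtime on GPU.
# "phi2_fp16.onnx" and the input names are placeholders; real exports may also
# require past key/value cache inputs.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "phi2_fp16.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Token IDs would normally come from the Phi-2 tokenizer.
input_ids = np.array([[1, 2, 3, 4]], dtype=np.int64)
attention_mask = np.ones_like(input_ids)

outputs = sess.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})
logits = outputs[0]
print(logits.shape)  # (batch, sequence, vocab) for a decoder-only LM
```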

### ORT gains with float16

Optimized CUDA performance for prompt throughput (i.e., the rate at which the model processes and generates responses based on input prompts) is **up to 7.39x** faster than PyTorch Compile. We also observe that ONNX Runtime is significantly faster than Llama.cpp for larger batch sizes and prompt lengths. For example, it is **up to 13.08x faster** at batch size 16 and prompt length 2048.

<img class="m-auto w50" src="./Phi2_Float16_PromptThroughput.png" alt="Phi2 float16 prompt throughput comparison">

Token generation throughput is the average throughput of the first 256 tokens generated. ONNX Runtime with float16 is **on average 6.6x faster** than torch.compile and **as high as 18.55x** faster. It also performs **up to 1.64x** faster than Llama.cpp.
<img class="m-auto w50" src="./Phi2_Float16_TokenGenerationThroughput.png" alt="Phi2 float16 token generation throughput comparison">

### ORT gains with int4
Here is an example of Phi-2 optimizations with Olive, available in the [Olive examples repository](https://github.com/microsoft/Olive/tree/main/examples/).
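For readers who want to see roughly how an int4 model is produced, the sketch below uses ONNX Runtime's block-wise weight-only quantizer; the module path and arguments reflect 1.17-era APIs and the file names are placeholders, so treat it as an outline rather than the exact recipe used for these benchmarks.

```python
# Minimal sketch: int4 (block-wise, weight-only) quantization of an ONNX model's MatMul weights.
# Module path and arguments reflect ONNX Runtime 1.17-era APIs; file names are placeholders.
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("phi2_fp16.onnx")
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()
quantizer.model.save_model_to_file("phi2_int4.onnx", use_external_data_format=True)
```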

## Training

In addition to inference, ONNX Runtime also provides training speedup for Phi-2 and other LLMs. ORT training is part of the PyTorch ecosystem and is available via the torch-ort Python package, as part of the [Azure Container for PyTorch (ACPT)](https://learn.microsoft.com/en-us/azure/machine-learning/resource-azure-container-for-pytorch?view=azureml-api-2). It provides flexible and extensible hardware support, where the same model and APIs work with both NVIDIA and AMD GPUs. ORT accelerates training through optimized kernels and memory optimizations that show significant gains in reducing end-to-end training time for large model training. Enabling it involves changing a few lines of code in the model to wrap it with the ORTModule API. It is also composable with popular acceleration libraries like DeepSpeed and Megatron for faster and more efficient training.
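A minimal sketch of that ORTModule change is shown below, using a toy model and training step in place of a real Phi-2 fine-tuning script; everything except the ORTModule wrapper itself is illustrative.

```python
# Minimal sketch: accelerate PyTorch training with ORT by wrapping the model in ORTModule.
# The toy model, data, and optimizer stand in for a real Phi-2 fine-tuning setup;
# assumes a CUDA-capable device is available.
import torch
from torch_ort import ORTModule

model = torch.nn.Sequential(
    torch.nn.Linear(2048, 2048),
    torch.nn.GELU(),
    torch.nn.Linear(2048, 2048),
).cuda()

model = ORTModule(model)  # the only change to an existing training script
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

inputs = torch.randn(8, 2048, device="cuda")
loss = model(inputs).pow(2).mean()  # placeholder loss
loss.backward()
optimizer.step()
```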

[OpenAI's Triton](https://openai.com/research/triton) is a domain-specific language and compiler for writing highly efficient custom deep learning primitives. ORT supports OpenAI Triton integration (ORT+Triton), where all element-wise operators are converted to Triton ops and ORT creates custom fused kernels in Triton.

The training benchmarks below were run on 2 A100 GPUs and measured throughput in iterations per second.

## Inferencing

[Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) is a pretrained generative text LLM with 7 billion parameters. ONNX Runtime improves inference performance significantly for Mistral with both float16 and int4 models. With float16, ONNX Runtime is **up to 9.46x** faster than Llama.cpp. Token generation throughput significantly improves with int4 quantization for batch size 1 and is **up to 18.25x** faster than PyTorch Eager.
<div class="grid grid-cols-1 lg:grid-cols-2">
<img class="m-auto" src="./Mistral_float16_PromptTP.png" alt="Mistral float16 prompt throughput comparison">

You can now access the optimized Mistral model on Hugging Face.

## Training

Similar to Phi-2, Mistral also benefits from training acceleration using ORT. We trained Mistral-7B using the following configuration to see gains with ORT, including when composed with LoRA and QLoRA. The model was trained using DeepSpeed Stage-2 for 5 epochs, with batch size 1 on the wikitext dataset.
<img class="m-auto w50" src="./Mistral_Training.png" alt="Mistral training benchmarks">

<div class="anchor" id="codellama"/>

# SD-Turbo and SDXL-Turbo

ONNX Runtime provides inference performance benefits when used with [SD Turbo](https://huggingface.co/stabilityai/sd-turbo) and [SDXL Turbo](https://huggingface.co/stabilityai/sdxl-turbo), and it also makes the models accessible in languages other than Python, like C# and Java. ONNX Runtime achieved a higher throughput than PyTorch for all (batch size, number of steps) combinations evaluated, with throughput improvements **up to 229%** for the SDXL Turbo model and **120%** for the SD Turbo model. ONNX Runtime CUDA is especially good at handling dynamic shape, but it also shows a significant advantage over PyTorch for static shape.

<img class="m-auto" src="./SDXL.jpg" alt="Stable Diffusion XL Turbo Speedup">
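One convenient way to run these models through ONNX Runtime from Python is via Optimum's ORT diffusion pipelines; the sketch below is an assumption-laden outline (the class name, `export`, and `provider` arguments depend on the installed optimum/onnxruntime versions), not the exact setup used for the benchmark above.

```python
# Minimal sketch: run SDXL-Turbo with ONNX Runtime through Optimum's ORT diffusion pipeline.
# Class and arguments depend on the installed optimum and onnxruntime-gpu versions.
from optimum.onnxruntime import ORTStableDiffusionXLPipeline

pipe = ORTStableDiffusionXLPipeline.from_pretrained(
    "stabilityai/sdxl-turbo",
    export=True,                       # export the PyTorch weights to ONNX on the fly
    provider="CUDAExecutionProvider",  # run the ONNX graphs on GPU
)

image = pipe(
    "A photo of a cat wearing sunglasses",
    num_inference_steps=1,             # SDXL-Turbo is designed for very few steps
    guidance_scale=0.0,                # Turbo models are run without classifier-free guidance
).images[0]
image.save("sdxl_turbo_ort.png")
```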

To read more about accelerating SD-Turbo and SDXL-Turbo inference with ONNX Runtime, check out the dedicated blog post.

# Llama-2

We published a separate blog on Llama-2 inference improvements with ORT [here](https://onnxruntime.ai/blogs/accelerating-llama-2). Additionally, Llama-2-7B and Llama-2-13B show good gains with ORT for training, especially when combined with LoRA and QLoRA. [These](https://github.com/huggingface/optimum/blob/main/examples/onnxruntime/training/text-classification/README.md#onnx-runtime-training) scripts can be used as an example to finetune Llama-2 with ORT using Optimum. The numbers below are for Llama-2 models trained with ORT using DeepSpeed Stage-2 for 5 epochs, with batch size 1 on the wikitext dataset.

<img class="m-auto w50" src="./Llama2_Training.png" alt="Llama2 training benchmarks">
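For orientation, a heavily simplified sketch of that kind of Optimum-based fine-tuning script is shown below; the dataset slice, LoRA settings, and hyperparameters are illustrative placeholders rather than the benchmark configuration, and access to the Llama-2 checkpoint is assumed.

```python
# Minimal sketch: LoRA fine-tuning of Llama-2-7B with ONNX Runtime via Optimum's ORTTrainer.
# Hyperparameters and the tiny dataset slice are illustrative only; assumes access to the
# meta-llama/Llama-2-7b-hf checkpoint and the optimum[onnxruntime-training] extras.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling
from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Attach LoRA adapters so only a small set of parameters is trained.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.filter(lambda ex: len(ex["text"].strip()) > 0)
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = ORTTrainer(
    model=model,
    args=ORTTrainingArguments(
        output_dir="llama2-ort-lora",
        per_device_train_batch_size=1,
        num_train_epochs=5,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```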


## Inference

[Orca-2](https://www.microsoft.com/en-us/research/blog/orca-2-teaching-small-language-models-how-to-reason/) is a research-only system that gives a one-time answer in tasks such as reasoning with user-provided data, understanding texts, solving math problems, and summarizing texts. Orca-2 has two versions (7 billion and 13 billion parameters); both are made by fine-tuning the respective Llama-2 base models on customized, high-quality artificial data. ONNX Runtime helps optimize Orca-2 inferencing through graph fusions and kernel optimizations like those used for Llama-2.

### ORT gains with int4

Orca-2-7B int4 quantization performance comparison indicated **up to 26X** increase in prompt throughput and **up to 16.5X** improvement in token generation throughput over PyTorch. It also shows over **4.75X** improvement in prompt throughput and **3.64X** improvement in token generation throughput compared to Llama.cpp.

<div class="grid grid-cols-1 lg:grid-cols-2">
<img class="m-auto" src="./Orca2_7b_int4_promptTP.png" alt="Orca2 7b int4 prompt throughput comparison">
_Orca-2 benchmarking done on 1 A100 GPU, SKU: Standard_ND96amsr_A100_v4._

## Training

Orca-2-7B also benefits from training acceleration using ORT. We trained the Orca-2-7B model for a sequence length of 512 with LoRA and with the sparsity optimization enabled and saw good gains in performance. The numbers below are for Orca-2-7B models trained with ORT using DeepSpeed Stage-2 for 5 epochs, with batch size 1 on the wikitext dataset.

<img class="m-auto w50" src="./Orca2_Training.png" alt="Orca2 training benchmarks">
<i>Uses ACPT image: nightly-ubuntu2004-cu118-py38-torch230dev:20240131</i>

# Gemma

[Gemma](https://ai.google.dev/gemma/docs) is a family of lightweight, open models built from the research and technology that Google used to create Gemini models. It is available in two sizes: 2B and 7B. Each size is released with pre-trained and instruction-tuned variants. ONNX Runtime can be used to optimize and efficiently run any open-source model. We benchmarked against the [Gemma-2B](https://huggingface.co/google/gemma-2b) model, and ONNX Runtime with float16 is **up to 7.47x** faster than PyTorch Compile and **up to 3.47x** faster than Llama.cpp. ORT with int4 quantization is **up to 19.81x** faster than PyTorch Eager and **2.62x** faster than Llama.cpp.
<div class="grid grid-cols-1 lg:grid-cols-2">
<img class="m-auto" src="./Gemma2_int4_tokengenTP.png" alt="Gemma2b int4 token generation throughput comparison">


# Conclusion

In conclusion, ONNX Runtime (ORT) provides significant performance improvements for several models, including Phi-2, Mistral, CodeLlama, SDXL-Turbo, Llama-2, Orca-2, and Gemma. ORT offers state-of-the-art fusion and kernel optimizations, including support for float16 and int4 quantization, resulting in faster inferencing speeds and lower costs. ORT outperforms other frameworks like PyTorch and Llama.cpp in terms of prompt and token generation throughput. ORT also shows significant benefits for training LLMs, with increasing gains for larger batch sizes, and composes well with state-of-the-art techniques to enable efficient large model training.
<style>
.anchor {
scroll-margin-top: 40px;
  }
</style>