diff --git a/src/routes/blogs/accelerating-llama-2/+page.svelte b/src/routes/blogs/accelerating-llama-2/+page.svelte
index 5854bfcb489e8..aeb3bf6b83fae 100644
--- a/src/routes/blogs/accelerating-llama-2/+page.svelte
+++ b/src/routes/blogs/accelerating-llama-2/+page.svelte
@@ -45,11 +45,11 @@
-        By: Kunal Vaishnavi and
-        Parinita Rahi
+        Parinita Rahi
         14TH NOVEMBER, 2023 (Updated 22nd November)
@@ -76,7 +76,7 @@
         Llama2 is a state-of-the-art open source LLM from Meta ranging in scale from 7B to 70B
         parameters (7B, 13B, 70B). Microsoft and Meta
-        announced
+        announced
         their AI on Azure and Windows collaboration in July 2023. As part of the announcement,
         Llama2 was added to the Azure AI model catalog, which serves as a hub of foundation models
         that empower developers and machine learning (ML) professionals to easily discover,
         evaluate, customize, and
@@ -152,7 +152,7 @@
         More details on these metrics can be found
-        here
+        here
         .
@@ -165,7 +165,7 @@
-        ONNX Runtime applied Megatron-LM Tensor Parallelism on the 70B model to split the original
+        ONNX Runtime applied Megatron-LM Tensor Parallelism on the 70B model to split the original
         model weight onto different GPUs. Megatron
@@ -176,7 +176,7 @@
         You can find additional example scripts
-        here
+        here
         .
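For readers who want intuition for what Megatron-LM-style tensor parallelism does to the 70B weights, below is a minimal NumPy sketch of column-wise weight sharding. The world size, shapes, and the concatenation standing in for an all-gather are illustrative assumptions, not ONNX Runtime's actual implementation.

```python
# Minimal sketch of Megatron-LM-style tensor parallelism (illustrative only).
# A linear layer's weight is split column-wise across "devices"; each device
# computes a partial projection independently, and the shards are reassembled.
import numpy as np

world_size = 4                     # assumed number of GPUs sharing the weight
hidden, out_features = 1024, 1024  # illustrative layer dimensions

x = np.random.randn(1, hidden).astype(np.float32)              # activation
W = np.random.randn(hidden, out_features).astype(np.float32)   # full weight

# Column-parallel split: each rank holds out_features / world_size columns.
shards = np.split(W, world_size, axis=1)

# Each rank computes its partial result with no communication needed.
partials = [x @ shard for shard in shards]

# An all-gather (here: a simple concatenation) reassembles the full output.
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_parallel, x @ W, atol=1e-3)  # matches the unsharded result
```

The benefit is that no single GPU ever has to hold the full weight matrix, which is what makes 70B-scale inference feasible across multiple devices.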
@@ -252,7 +252,7 @@
         calculate the rotary embeddings more efficiently with less memory usage. The rotary
         embedding compute kernels also support interleaved and non-interleaved formats to support
         both the
-        Microsoft version of LLaMA-2
+        Microsoft version of LLaMA-2
         and the Hugging Face version of LLaMA-2 respectively while sharing the same calculations.
@@ -260,11 +260,11 @@
         The optimizations work for the
-        Hugging Face versions
+        Hugging Face versions
         (models ending with -hf) and the Microsoft versions. You can download the optimized HF
         versions from
-        Microsoft's LLaMA-2 ONNX repository.
+        Microsoft's LLaMA-2 ONNX repository.
         Stay tuned for newer Microsoft versions coming soon!
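As a rough illustration of the interleaved vs. non-interleaved distinction mentioned above, here is a hedged NumPy sketch of rotary embeddings in both layouts. The base frequency of 10000 and the pairing conventions are the standard ones from the RoPE literature and may not match the kernels' internals exactly.

```python
# Sketch of rotary position embeddings in two layouts. Interleaved pairs
# adjacent elements (x0,x1), (x2,x3), ...; non-interleaved pairs the first
# half with the second half (x0, x_{d/2}), ... The rotation math is identical,
# which is why one kernel can share the calculations across both formats.
import numpy as np

def rotary(x, pos, interleaved):
    d = x.shape[-1]
    inv_freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = pos * inv_freq                  # one angle per dimension pair
    cos, sin = np.cos(angles), np.sin(angles)
    if interleaved:
        x1, x2 = x[..., 0::2], x[..., 1::2]            # adjacent-element pairs
    else:
        x1, x2 = x[..., : d // 2], x[..., d // 2 :]    # half-split pairs
    r1 = x1 * cos - x2 * sin                 # the same 2D rotation either way
    r2 = x1 * sin + x2 * cos
    if interleaved:
        out = np.empty_like(x)
        out[..., 0::2] = r1
        out[..., 1::2] = r2
        return out
    return np.concatenate([r1, r2], axis=-1)

q = np.random.randn(1, 64).astype(np.float32)  # one head's query vector
print(rotary(q, pos=5, interleaved=True)[:, :4])
print(rotary(q, pos=5, interleaved=False)[:, :4])
```

Only the element pairing differs between the two branches; the rotation itself is shared, mirroring how the kernels serve both the Microsoft and Hugging Face variants.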
@@ -281,7 +281,7 @@
         Here is an example of
-        Llama2 optimization with Olive
+        Llama2 optimization with Olive
         , which harnesses ONNX Runtime optimizations highlighted in this blog. Distinct
         optimization flows cater to various requirements. For instance, you have the flexibility
         to choose different data types for quantization in CPU and GPU inference, based on your
         accuracy
@@ -294,7 +294,7 @@
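To make the "distinct optimization flows" idea concrete, here is a hypothetical sketch of assembling an Olive-style workflow configuration in Python. The pass names, fields, and model path are illustrative assumptions, so rely on the linked Olive example for the real configuration.

```python
# Hypothetical sketch of an Olive workflow config for Llama2 (field and pass
# names are assumptions -- see the linked Olive example for the real one).
# The idea: declare the input model, then pick quantization data types per
# target, e.g. int8 for CPU inference vs. fp16 for GPU inference.
import json

config = {
    "input_model": {"model_path": "meta-llama/Llama-2-7b-hf"},  # assumed id
    "passes": {
        # CPU flow: integer quantization chosen for accuracy/size trade-off.
        "quantize_cpu": {"type": "OnnxQuantization", "data_type": "int8"},
        # GPU flow: half precision instead of integer quantization.
        "convert_gpu": {"type": "OnnxFloatToFloat16"},
    },
}

with open("llama2_olive_config.json", "w") as f:
    json.dump(config, f, indent=2)
# A config like this would then be handed to Olive's workflow runner.
```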
         Here is a
-        sample notebook
+        sample notebook
         that shows you an end-to-end example of how you can use the above ONNX Runtime
         optimizations in your application.
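In the same spirit as that notebook, here is a hedged end-to-end sketch of loading an optimized Llama2 ONNX model with ONNX Runtime and picking a greedy next token. The file name and input/output names are assumptions; a real application would also handle tokenization and past key/value cache inputs.

```python
# Hedged sketch: running an optimized Llama2 ONNX model with ONNX Runtime.
# The model path and tensor names below are assumptions for illustration.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "llama2-7b-optimized.onnx",  # assumed file name
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Tokenized prompt (illustrative ids; a real app would use the tokenizer).
input_ids = np.array([[1, 15043, 29892, 3186]], dtype=np.int64)
attention_mask = np.ones_like(input_ids)

# Assuming the first output is logits of shape [batch, sequence, vocab].
logits = session.run(
    None, {"input_ids": input_ids, "attention_mask": attention_mask}
)[0]

next_token = int(np.argmax(logits[0, -1]))  # greedy next-token pick
print("next token id:", next_token)
```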
diff --git a/src/routes/training/+page.svelte b/src/routes/training/+page.svelte
index 44fd288350c49..a51093a9cb397 100644
--- a/src/routes/training/+page.svelte
+++ b/src/routes/training/+page.svelte
@@ -221,8 +221,8 @@
         Personalization tasks where the model needs to be trained on the user's data
+        Examples: