From 1abae98815332621f5d07bc3df9fa19e8c77a5c4 Mon Sep 17 00:00:00 2001
From: Parinita Rahi <101819959+parinitarahi@users.noreply.github.com>
Date: Wed, 15 Nov 2023 07:01:16 +0000
Subject: [PATCH] checked on changes

---
 src/routes/blogs/accelerating-llama-2/+page.svelte | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/routes/blogs/accelerating-llama-2/+page.svelte b/src/routes/blogs/accelerating-llama-2/+page.svelte
index 1905c99943723..cb83a865cbeb9 100644
--- a/src/routes/blogs/accelerating-llama-2/+page.svelte
+++ b/src/routes/blogs/accelerating-llama-2/+page.svelte
@@ -22,7 +22,7 @@
 Accelerating LLaMA-2 Inference with ONNX Runtime
-By: Parinita Rahi and Kunal Vaishnavi
+By: Kunal Vaishnavi and Parinita Rahi
 14TH NOVEMBER, 2023
@@ -142,7 +142,7 @@
-70B Llama2 Model Throughput
+70B Llama2 Model Throughput
 Figure 4: 70B Llama2 Model Throughput
@@ -208,15 +208,15 @@ to decide which approach is best for them.
 In addition to these fusions and kernel optimizations, ONNX Runtime reduces the model’s memory usage. Besides quantization improvements (which will be covered in a future post), ONNX Runtime compresses the size of the cosine and sine caches used in each of the rotary embeddings by 50%. The compute kernels in ONNX Runtime that run the rotary embedding computations can then recognize this format and use their parallelized implementations to calculate the rotary embeddings more efficiently with less memory usage. The rotary embedding compute kernels also support interleaved and non-interleaved formats to support both
-the Microsoft version of LLaMA-2 and the Hugging Face version of LLaMA-2 respectively while sharing the
-same calculations.
+the Microsoft version of LLaMA-2
+and the Hugging Face version of LLaMA-2 respectively while sharing the same calculations.
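
The paragraph in the hunk above describes the rotary-embedding optimizations in prose only. Below is a minimal NumPy sketch of the two ideas it mentions: caching only head_dim/2 cosine and sine columns (each angle is reused by a pair of elements, which is where the 50% cache compression comes from), and applying the same rotation math to both the interleaved layout (Microsoft LLaMA-2) and the non-interleaved layout (Hugging Face LLaMA-2). This is an illustration under those assumptions, not ONNX Runtime's kernel code; the names build_rope_cache and apply_rope are hypothetical.

import numpy as np

# Hypothetical helpers: these sketch the math, not ONNX Runtime's actual kernels.
def build_rope_cache(max_seq_len, head_dim, base=10000.0):
    # Each angle is shared by a pair of elements, so the cache needs only
    # head_dim // 2 columns -- half the size of a full head_dim cache.
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.outer(np.arange(max_seq_len), inv_freq)  # (seq, head_dim // 2)
    return np.cos(angles), np.sin(angles)

def apply_rope(x, cos, sin, interleaved):
    # x: (seq_len, head_dim); cos/sin: (seq_len, head_dim // 2)
    half = x.shape[-1] // 2
    if interleaved:
        # Interleaved layout: rotated pairs are adjacent -> (x0, x1), (x2, x3), ...
        x1, x2 = x[:, 0::2], x[:, 1::2]
    else:
        # Non-interleaved layout: pairs are split halves -> (x_i, x_{i + half})
        x1, x2 = x[:, :half], x[:, half:]
    # Both layouts share the same rotation calculations.
    r1 = x1 * cos - x2 * sin
    r2 = x1 * sin + x2 * cos
    out = np.empty_like(x)
    if interleaved:
        out[:, 0::2], out[:, 1::2] = r1, r2
    else:
        out[:, :half], out[:, half:] = r1, r2
    return out

# Example: rotate a (seq_len=4, head_dim=8) block of query states in both layouts.
cos, sin = build_rope_cache(max_seq_len=4, head_dim=8)
x = np.random.randn(4, 8)
y_interleaved = apply_rope(x, cos, sin, interleaved=True)
y_split = apply_rope(x, cos, sin, interleaved=False)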