From ba05e2c29a7a94e765a4b838340e3a0cae005643 Mon Sep 17 00:00:00 2001
From: MaanavD
Date: Fri, 17 Nov 2023 13:55:54 -0800
Subject: [PATCH] Small llama blog change, readme to contributing.

---
 README.md => CONTRIBUTING.md                       | 0
 src/routes/blogs/accelerating-llama-2/+page.svelte | 2 +-
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename README.md => CONTRIBUTING.md (100%)

diff --git a/README.md b/CONTRIBUTING.md
similarity index 100%
rename from README.md
rename to CONTRIBUTING.md
diff --git a/src/routes/blogs/accelerating-llama-2/+page.svelte b/src/routes/blogs/accelerating-llama-2/+page.svelte
index 0f5add02a8d4b..c5adea9cbd88b 100644
--- a/src/routes/blogs/accelerating-llama-2/+page.svelte
+++ b/src/routes/blogs/accelerating-llama-2/+page.svelte
@@ -137,7 +137,7 @@
   shards the PyTorch model with FP16 precision into 4 partitions, converts each partition into
   ONNX format, and then applies a new ONNX Runtime graph fusion on the converted ONNX model.
   The 70B model has ~30 tokens per second throughput for token generation at batch size 1, and
-  end-to-end throughput starts at 30 ms for smaller sequence lengths with these optimizations.
+  end-to-end throughput starts at 30 tps for smaller sequence lengths with these optimizations.
   You can find additional example scripts here.