Update link to E2E notebook in LLaMA-2 blog #20724

Merged 1 commit on May 20, 2024

src/routes/blogs/accelerating-llama-2/+page.svelte (4 changes: 2 additions & 2 deletions)
@@ -102,7 +102,7 @@
>batch size * (prompt length + token generation length) / wall-clock latency</i
> where wall-clock latency = the latency from running end-to-end and token generation length =
256 generated tokens. The E2E throughput is 2.4X more (13B) and 1.8X more (7B) when compared to
- PyTorch compile. For higher batch size, sequence length like 16, 2048 pytorch eager times out,
+ PyTorch compile. For higher batch size, sequence length pairs such as (16, 2048), PyTorch eager times out,
while ORT shows better performance than compile mode.
</p>
<div class="grid grid-cols-1 lg:grid-cols-2 gap-4">
@@ -151,7 +151,7 @@

<p class="mb-4">
More details on these metrics can be found <a
href="https://github.com/microsoft/onnxruntime-inference-examples/blob/main/python/models/llama2/README.md"
href="https://github.com/microsoft/onnxruntime-inference-examples/blob/main/python/models/llama/README.md"
class="text-blue-500">here</a
>.
</p>
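
Note for readers skimming this diff: the E2E throughput metric quoted in the first hunk is simply the number of tokens processed end to end divided by the end-to-end wall-clock latency. The sketch below is illustrative only and is not part of this PR or the blog's code; the function name and the sample latency value are hypothetical, and the token generation length is fixed at 256 as described in the blog.

# Illustrative only: restates the E2E throughput formula quoted in the first hunk,
# i.e. tokens processed end to end per second of wall-clock latency.
def e2e_throughput(batch_size, prompt_length, generation_length, wall_clock_latency_s):
    # throughput = batch size * (prompt length + token generation length) / wall-clock latency
    return batch_size * (prompt_length + generation_length) / wall_clock_latency_s

# Hypothetical numbers for illustration; the blog fixes generation_length at 256 tokens.
print(e2e_throughput(batch_size=16, prompt_length=2048, generation_length=256,
                     wall_clock_latency_s=30.0))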