Merge branch 'gh-pages' into nimbleedge_blog
MaanavD authored Jun 17, 2024
2 parents bb17882 + 82bb41a commit c5e7872
Showing 24 changed files with 132 additions and 104 deletions.
5 changes: 2 additions & 3 deletions docs/performance/device-tensor.md
@@ -8,7 +8,7 @@ nav_order: 6

Using device tensors can be a crucial part of building efficient AI pipelines, especially on heterogeneous memory systems.
A typical example of such systems is any PC with a dedicated GPU.
While a [recent GPU](https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889) itself has a memory bandwidth of about 1TB/s, the interconnect [PCI 4.0 x16](https://de.wikipedia.org/wiki/PCI_Express) to the CPU can often be the limiting factor with only ~32GB/s.
While a [recent GPU](https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889) itself has a memory bandwidth of about 1TB/s, the interconnect [PCI 4.0 x16](https://en.wikipedia.org/wiki/PCI_Express) to the CPU can often be the limiting factor with only ~32GB/s.
Therefore it is often best to keep data local to the GPU as much as possible or hide slow memory traffic behind computation as the GPU is able to execute compute and PCI memory traffic simultaneously.

A typical use case in which memory is already local to the inference device is GPU-accelerated processing of an encoded video stream that can be decoded with GPU decoders.
@@ -20,7 +20,7 @@ Tile based inference for high resolution images is another use-case where custom
## CUDA

CUDA in ONNX Runtime has two custom memory types.
`"CudaPinned"` and `"Cuda"` memory where [CUDA pinned](https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/) is actually CPU memory which is directly accesible by the GPU allowing for fully asynchronous up and download of memory using [`cudaMemcpyAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79).
`"CudaPinned"` and `"Cuda"` memory where [CUDA pinned](https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/) is actually CPU memory which is directly accessible by the GPU allowing for fully asynchronous up and download of memory using [`cudaMemcpyAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79).
Normal CPU tensors only allow for synchronous downloads from GPU to CPU, while CPU to GPU copies can always be executed asynchronously.
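As a rough sketch (not an excerpt from these docs), the two memory types correspond to `Ort::MemoryInfo` descriptors along these lines; the device id 0, the arena allocator, and the `OrtMemTypeCPU` flag for the pinned pool are assumptions:

```cpp
#include <onnxruntime_cxx_api.h>

// Device memory on the GPU itself ("Cuda").
inline Ort::MemoryInfo CudaMemoryInfo() {
  return Ort::MemoryInfo("Cuda", OrtArenaAllocator, /*device_id=*/0, OrtMemTypeDefault);
}

// Page-locked host memory ("CudaPinned"): it lives on the CPU but is directly
// accessible by the GPU, which is what allows fully asynchronous copies.
inline Ort::MemoryInfo CudaPinnedMemoryInfo() {
  return Ort::MemoryInfo("CudaPinned", OrtArenaAllocator, /*device_id=*/0, OrtMemTypeCPU);
}
```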

Allocating a tensor using the `Ort::Session`'s allocator is very straightforward using the [C++ API](https://onnxruntime.ai/docs/api/c/struct_ort_1_1_value.html#a5d35080239ae47cdbc9e505666dc32ec), which directly maps to the C API.
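A minimal sketch of that allocation path, assuming a session created with the CUDA execution provider and an illustrative shape:

```cpp
#include <array>
#include <onnxruntime_cxx_api.h>

// Allocate a float tensor directly in GPU memory via the session's CUDA allocator.
Ort::Value MakeDeviceTensor(Ort::Session& session) {
  Ort::MemoryInfo memory_info_cuda("Cuda", OrtArenaAllocator, /*device_id=*/0, OrtMemTypeDefault);
  Ort::Allocator cuda_allocator(session, memory_info_cuda);

  std::array<int64_t, 4> shape{1, 3, 224, 224};  // example shape, an assumption
  return Ort::Value::CreateTensor(cuda_allocator, shape.data(), shape.size(),
                                  ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT);
}
```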
@@ -132,5 +132,4 @@ binding.bind_output("out", "dml")
# binding.bind_ortvalue_output("out", dml_array_out)
session.run_with_iobinding(binding)
```
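The same IOBinding pattern is also available through the C++ API; below is a hedged sketch assuming a CUDA session and the tensor names "input" and "out" (both placeholders), so that inputs and outputs stay resident on the GPU:

```cpp
#include <vector>
#include <onnxruntime_cxx_api.h>

// Bind a GPU-resident input and request a GPU-resident output, then run.
void RunOnDevice(Ort::Session& session, Ort::Value& input_on_gpu) {
  Ort::MemoryInfo memory_info_cuda("Cuda", OrtArenaAllocator, /*device_id=*/0, OrtMemTypeDefault);

  Ort::IoBinding binding(session);
  binding.BindInput("input", input_on_gpu);     // tensor already on the GPU
  binding.BindOutput("out", memory_info_cuda);  // let ORT allocate the output on the GPU

  session.Run(Ort::RunOptions{}, binding);

  // Outputs remain on the device; fetch them only if CPU access is needed.
  std::vector<Ort::Value> outputs = binding.GetOutputValues();
}
```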
@@ -57,14 +57,29 @@
}
}
};
const toggleScroll = () => {
if (scrollerRef) {
const currentState = window.getComputedStyle(scrollerRef).animationPlayState;
scrollerRef.style.animationPlayState = currentState === 'running' ? 'paused' : 'running';
}
};
const handleKeyDown = (event: { key: string; preventDefault: () => void; }) => {
if (event.key === 'Enter' || event.key === ' ') {
event.preventDefault(); // Prevent default spacebar scrolling behavior
toggleScroll();
}
};
</script>

<div bind:this={containerRef} class={cn('scroller relative z-2 overflow-hidden ', className)}>
<button class="hover:bg-primary focus:bg-primary menu-item py-2 sr-only focus:not-sr-only" on:keydown={handleKeyDown} on:click={toggleScroll}>Toggle scrolling</button>
<ul
bind:this={scrollerRef}
class={cn(
' flex w-max min-w-full shrink-0 flex-nowrap gap-4 py-4',
start && 'animate-scroll ',
start && 'animate-scroll',
pauseOnHover && 'hover:[animation-play-state:paused]'
)}
>
2 changes: 1 addition & 1 deletion src/routes/+layout.svelte
@@ -46,7 +46,7 @@
<Header />
{/if}
{#key data.pathname}
<div in:fade={{ duration: 300, delay: 400 }} out:fade={{ duration: 300 }}>
<div id="main-content" in:fade={{ duration: 300, delay: 400 }} out:fade={{ duration: 300 }}>
<slot />
</div>
{/key}
38 changes: 19 additions & 19 deletions src/routes/blogs/accelerating-llama-2/+page.svelte
@@ -45,11 +45,11 @@
<div class="container mx-auto px-4 md:px-8 lg:px-48 pt-8">
<h1 class="text-5xl pb-2">Accelerating LLaMA-2 Inference with ONNX Runtime</h1>
<p class="text-neutral">
By: <a href="https://www.linkedin.com/in/kunal-v-16315b94" class="text-blue-500"
By: <a href="https://www.linkedin.com/in/kunal-v-16315b94" class="text-blue-700"
>Kunal Vaishnavi</a
>
and
<a href="https://www.linkedin.com/in/parinitaparinita/" class="text-blue-500">Parinita Rahi</a>
<a href="https://www.linkedin.com/in/parinitaparinita/" class="text-blue-700">Parinita Rahi</a>
</p>
<p class="text-neutral">
14TH NOVEMBER, 2023 <span class="italic text-stone-500">(Updated 22nd November)</span>
@@ -70,13 +70,13 @@
quantization updates, and cross-platform usage scenarios.
</p>

<h2 class="text-blue-500 text-3xl mb-4">Background: Llama2 and Microsoft</h2>
<h2 class="text-blue-700 text-3xl mb-4">Background: Llama2 and Microsoft</h2>

<p class="mb-4">
Llama2 is a state-of-the-art open source LLM from Meta ranging in scale from 7B to 70B
parameters (7B, 13B, 70B). Microsoft and Meta <a
href="https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/"
class="text-blue-500">announced</a
class="text-blue-700">announced</a
> their AI on Azure and Windows collaboration in July 2023. As part of the announcement, Llama2
was added to the Azure AI model catalog, which serves as a hub of foundation models that empower
developers and machine learning (ML) professionals to easily discover, evaluate, customize, and
@@ -89,7 +89,7 @@
your costs.
</p>

<h2 class="text-blue-500 text-3xl mb-4">
<h2 class="text-blue-700 text-3xl mb-4">
Faster Inferencing with New ONNX Runtime Optimizations
</h2>

@@ -115,7 +115,7 @@
</div>
<div class="mt-2 mb-4 text-center">Figure 1: E2E Throughput Comparisons</div>

<h2 class="text-blue-500 text-3xl mb-4">Latency and Throughput</h2>
<h2 class="text-blue-700 text-3xl mb-4">Latency and Throughput</h2>

<p class="mb-4">
The graphs below show latency comparisons between the ONNX Runtime and PyTorch variants of the
@@ -152,11 +152,11 @@
<p class="mb-4">
More details on these metrics can be found <a
href="https://github.com/microsoft/onnxruntime-inference-examples/blob/main/python/models/llama/README.md"
class="text-blue-500">here</a
class="text-blue-700">here</a
>.
</p>

<h2 class="text-blue-500 text-3xl mb-4">ONNX Runtime with Multi-GPU Inference</h2>
<h2 class="text-blue-700 text-3xl mb-4">ONNX Runtime with Multi-GPU Inference</h2>

<p class="mb-4">
ONNX Runtime supports multi-GPU inference to enable serving large models. Even in FP16
@@ -165,7 +165,7 @@
</p>

<p class="mb-4">
ONNX Runtime applied <a href="https://arxiv.org/pdf/1909.08053.pdf" class="text-blue-500"
ONNX Runtime applied <a href="https://arxiv.org/pdf/1909.08053.pdf" class="text-blue-700"
>Megatron-LM</a
>
Tensor Parallelism on the 70B model to split the original model weight onto different GPUs. Megatron
@@ -176,7 +176,7 @@
You can find additional example scripts
<a
href="https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/llama/"
class="text-blue-500">here</a
class="text-blue-700">here</a
>.
</p>

@@ -185,7 +185,7 @@
<figcaption class="mt-2 mb-4 text-center">Figure 4: 70B Llama2 Model Throughput</figcaption>
</figure>

<h2 class="text-blue-500 text-3xl mb-4">ONNX Runtime Optimizations</h2>
<h2 class="text-blue-700 text-3xl mb-4">ONNX Runtime Optimizations</h2>
<figure class="px-10 pt-4">
<img src={figure5} alt="LLaMA-2 Optimization Diagram" />
<figcaption class="mt-2 mb-4 text-center">Figure 5: LLaMA-2 Optimization Diagram</figcaption>
@@ -252,24 +252,24 @@
calculate the rotary embeddings more efficiently with less memory usage. The rotary embedding
compute kernels also support interleaved and non-interleaved formats to support both the <a
href="https://github.com/microsoft/Llama-2-Onnx"
class="text-blue-500">Microsoft version of LLaMA-2</a
class="text-blue-700">Microsoft version of LLaMA-2</a
>
and the Hugging Face version of LLaMA-2 respectively while sharing the same calculations.
</p>

<p class="mb-4">
The optimizations work for the <a
href="https://huggingface.co/meta-llama"
class="text-blue-500">Hugging Face versions</a
class="text-blue-700">Hugging Face versions</a
>
(models ending with <i>-hf</i>) and the Microsoft versions. You can download the optimized HF
versions from
<a href="https://github.com/microsoft/Llama-2-Onnx/tree/main-CUDA_CPU" class="text-blue-500"
<a href="https://github.com/microsoft/Llama-2-Onnx/tree/main-CUDA_CPU" class="text-blue-700"
>Microsoft's LLaMA-2 ONNX repository</a
>. Stay tuned for newer Microsoft versions coming soon!
</p>

<h2 class="text-blue-500 text-3xl mb-4">Optimize your own model using Olive</h2>
<h2 class="text-blue-700 text-3xl mb-4">Optimize your own model using Olive</h2>

<p class="mb-4">
Olive is a hardware-aware model optimization tool that incorporates advanced techniques such
@@ -281,25 +281,25 @@
<p class="mb-4">
Here is an example of <a
href="https://github.com/microsoft/Olive/tree/main/examples/llama2"
class="text-blue-500">Llama2 optimization with Olive</a
class="text-blue-700">Llama2 optimization with Olive</a
>, which harnesses ONNX Runtime optimizations highlighted in this blog. Distinct optimization
flows cater to various requirements. For instance, you have the flexibility to choose
different data types for quantization in CPU and GPU inference, based on your accuracy
tolerance. Additionally, you can fine-tune your own Llama2 model with Olive-QLoRa on client
GPUs and perform inference with ONNX Runtime optimizations.
</p>

<h2 class="text-blue-500 text-3xl mb-4">Usage Example</h2>
<h2 class="text-blue-700 text-3xl mb-4">Usage Example</h2>

<p class="mb-4">
Here is a <a
href="https://github.com/microsoft/onnxruntime-inference-examples/blob/main/python/models/llama/LLaMA-2%20E2E%20Notebook.ipynb"
class="text-blue-500">sample notebook</a
class="text-blue-700">sample notebook</a
> that shows you an end-to-end example of how you can use the above ONNX Runtime optimizations
in your application.
</p>

<h2 class="text-blue-500 text-3xl mb-4">Conclusion</h2>
<h2 class="text-blue-700 text-3xl mb-4">Conclusion</h2>

<p class="mb-4">
The advancements discussed in this blog provide faster Llama2 inferencing with ONNX Runtime,
2 changes: 1 addition & 1 deletion src/routes/blogs/blog-post-featured.svelte
@@ -33,7 +33,7 @@
<h2 class="card-title">{title}</h2>
<p>{description}</p>
<img class="rounded" src={image} alt={imgalt} />
<div class="text-right text-blue-500">
<div class="text-right text-blue-700">
{date}
</div>
</div>
2 changes: 1 addition & 1 deletion src/routes/blogs/blog-post.svelte
@@ -30,7 +30,7 @@
<div class="card-body">
<h2 class="card-title">{title}</h2>
<p>{description}</p>
<p class="text-blue-500 text-right">
<p class="text-blue-700 text-right">
{date}
</p>
</div>
2 changes: 1 addition & 1 deletion src/routes/blogs/post.svelte
@@ -82,7 +82,7 @@
<p class="inline">By:</p>
{/if}
{#each authors as author, i}
<a href={authorsLink[i]} class="text-blue-500">{author}</a>{i + 1 === authors.length
<a href={authorsLink[i]} class="text-blue-700">{author}</a>{i + 1 === authors.length
? ''
: ', '}
{/each}
30 changes: 15 additions & 15 deletions src/routes/blogs/pytorch-on-the-edge/+page.svelte
@@ -179,9 +179,9 @@ fun run(audioTensor: OnnxTensor): Result {
<div class="container mx-auto px-4 md:px-8 lg:px-48 pt-8">
<h1 class="text-5xl pb-2">Run PyTorch models on the edge</h1>
<p class="text-neutral">
By: <a href="https://www.linkedin.com/in/natkershaw/" class="text-blue-500">Natalie Kershaw</a>
By: <a href="https://www.linkedin.com/in/natkershaw/" class="text-blue-700">Natalie Kershaw</a>
and
<a href="https://www.linkedin.com/in/prasanthpulavarthi/" class="text-blue-500"
<a href="https://www.linkedin.com/in/prasanthpulavarthi/" class="text-blue-700"
>Prasanth Pulavarthi</a
>
</p>
@@ -217,12 +217,12 @@ fun run(audioTensor: OnnxTensor): Result {
anywhere that is outside of the cloud, ranging from large, well-resourced personal computers
to small footprint devices such as mobile phones. This has been a challenging task to
accomplish in the past, but new advances in model optimization and software like
<a href="https://onnxruntime.ai/pytorch" class="text-blue-500">ONNX Runtime</a>
<a href="https://onnxruntime.ai/pytorch" class="text-blue-700">ONNX Runtime</a>
make it more feasible - even for new generative AI and large language models like Stable Diffusion,
Whisper, and Llama2.
</p>

<h2 class="text-blue-500 text-3xl mb-4">Considerations for PyTorch models on the edge</h2>
<h2 class="text-blue-700 text-3xl mb-4">Considerations for PyTorch models on the edge</h2>

<p class="mb-4">
There are several factors to keep in mind when thinking about running a PyTorch model on the
@@ -292,7 +292,7 @@ fun run(audioTensor: OnnxTensor): Result {
</li>
</ul>

<h2 class="text-blue-500 text-3xl mb-4">Tools for PyTorch models on the edge</h2>
<h2 class="text-blue-700 text-3xl mb-4">Tools for PyTorch models on the edge</h2>

<p class="mb-4">
We mentioned ONNX Runtime several times above. ONNX Runtime is a compact, standards-based
@@ -305,7 +305,7 @@ fun run(audioTensor: OnnxTensor): Result {
format that doesn't require the PyTorch framework and its gigabytes of dependencies. PyTorch
has thought about this and includes an API that enables exactly this - <a
href="https://pytorch.org/docs/stable/onnx.html"
class="text-blue-500">torch.onnx</a
class="text-blue-700">torch.onnx</a
>. <a href="https://onnx.ai/">ONNX</a> is an open standard that defines the operators that make
up models. The PyTorch ONNX APIs take the Pythonic PyTorch code and turn it into a functional
graph that captures the operators that are needed to run the model without Python. As with everything
@@ -318,7 +318,7 @@ fun run(audioTensor: OnnxTensor): Result {
The popular Hugging Face library also has APIs that build on top of this torch.onnx
functionality to export models to the ONNX format. Over <a
href="https://huggingface.co/blog/ort-accelerating-hf-models"
class="text-blue-500">130,000 models</a
class="text-blue-700">130,000 models</a
> are supported making it very likely that the model you care about is one of them.
</p>

Expand All @@ -328,7 +328,7 @@ fun run(audioTensor: OnnxTensor): Result {
and web browsers) via various languages (from C# to JavaScript to Swift).
</p>

<h2 class="text-blue-500 text-3xl mb-4">Examples of PyTorch models on the edge</h2>
<h2 class="text-blue-700 text-3xl mb-4">Examples of PyTorch models on the edge</h2>

<h3 class=" text-2xl mb-2">Stable Diffusion on Windows</h3>

@@ -345,7 +345,7 @@ fun run(audioTensor: OnnxTensor): Result {
<p class="mb-4">
You don't have to export the fifth model, ClipTokenizer, as it is available in <a
href="https://onnxruntime.ai/docs/extensions"
class="text-blue-500">ONNX Runtime extensions</a
class="text-blue-700">ONNX Runtime extensions</a
>, a library for pre and post processing PyTorch models.
</p>

@@ -366,15 +366,15 @@ fun run(audioTensor: OnnxTensor): Result {
<p class="mb-4">
You can build the application and run it on Windows with the detailed steps shown in this <a
href="https://onnxruntime.ai/docs/tutorials/csharp/stable-diffusion-csharp.html"
class="text-blue-500">tutorial</a
class="text-blue-700">tutorial</a
>.
</p>

<h3 class=" text-2xl mb-2">Text generation in the browser</h3>

<p class="mb-4">
Running a PyTorch model locally in the browser is not only possible but super simple with
the <a href="https://huggingface.co/docs/transformers.js/index" class="text-blue-500"
the <a href="https://huggingface.co/docs/transformers.js/index" class="text-blue-700"
>transformers.js</a
> library. Transformers.js uses ONNX Runtime Web as its backend. Many models are already converted
to ONNX and served by the transformers.js CDN, making inference in the browser a matter of writing
@@ -407,7 +407,7 @@ fun run(audioTensor: OnnxTensor): Result {
All components of the Whisper Tiny model (audio decoder, encoder, decoder, and text sequence
generation) can be composed and exported to a single ONNX model using the <a
href="https://github.com/microsoft/Olive/tree/main/examples/whisper"
class="text-blue-500">Olive framework</a
class="text-blue-700">Olive framework</a
>. To run this model as part of a mobile application, you can use ONNX Runtime Mobile, which
supports Android, iOS, react-native, and MAUI/Xamarin.
</p>
@@ -420,7 +420,7 @@ fun run(audioTensor: OnnxTensor): Result {
<p class="mb-4">
The relevant snippet of an example <a
href="https://github.com/microsoft/onnxruntime-inference-examples/tree/main/mobile/examples/speech_recognition"
class="text-blue-500">Android mobile app</a
class="text-blue-700">Android mobile app</a
> that performs speech transcription on short samples of audio is shown below:
</p>
<Highlight language={kotlin} code={mobilecode} />
@@ -476,11 +476,11 @@ fun run(audioTensor: OnnxTensor): Result {
<p class="mb-4">
You can read the full <a
href="https://onnxruntime.ai/docs/tutorials/on-device-training/ios-app.html"
class="text-blue-500">Speaker Verification tutorial</a
class="text-blue-700">Speaker Verification tutorial</a
>, and
<a
href="https://github.com/microsoft/onnxruntime-training-examples/tree/master/on_device_training/mobile/ios"
class="text-blue-500">build and run the application from source</a
class="text-blue-700">build and run the application from source</a
>.
</p>
