Merge branch 'gh-pages' into nimbleedge_blog
MaanavD authored Jun 17, 2024
2 parents bb17882 + 82bb41a commit c5e7872
Showing 24 changed files with 132 additions and 104 deletions.
5 changes: 2 additions & 3 deletions docs/performance/device-tensor.md
@@ -8,7 +8,7 @@ nav_order: 6

Using device tensors can be a crucial part of building efficient AI pipelines, especially on heterogeneous memory systems.
A typical example of such systems is any PC with a dedicated GPU.
While a [recent GPU](https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889) itself has a memory bandwidth of about 1TB/s, the interconnect [PCI 4.0 x16](https://de.wikipedia.org/wiki/PCI_Express) to the CPU can often be the limiting factor with only ~32GB/s.
While a [recent GPU](https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889) itself has a memory bandwidth of about 1TB/s, the interconnect [PCI 4.0 x16](https://en.wikipedia.org/wiki/PCI_Express) to the CPU can often be the limiting factor with only ~32GB/s.
Therefore it is often best to keep data local to the GPU as much as possible or hide slow memory traffic behind computation as the GPU is able to execute compute and PCI memory traffic simultaneously.

A typical use case in which memory is already local to the inference device is GPU-accelerated processing of an encoded video stream that can be decoded with GPU decoders.
@@ -20,7 +20,7 @@ Tile based inference for high resolution images is another use-case where custom
## CUDA

CUDA in ONNX Runtime has two custom memory types.
`"CudaPinned"` and `"Cuda"` memory where [CUDA pinned](https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/) is actually CPU memory which is directly accesible by the GPU allowing for fully asynchronous up and download of memory using [`cudaMemcpyAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79).
`"CudaPinned"` and `"Cuda"` memory where [CUDA pinned](https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/) is actually CPU memory which is directly accessible by the GPU allowing for fully asynchronous up and download of memory using [`cudaMemcpyAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79).
Normal CPU tensors only allow for synchronous downloads from GPU to CPU, while CPU to GPU copies can always be executed asynchronously.
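As a rough sketch (not an excerpt from these docs), the two memory types correspond to `Ort::MemoryInfo` descriptors along these lines; the device id 0, the arena allocator, and the `OrtMemTypeCPU` flag for the pinned pool are assumptions:

```cpp
#include <onnxruntime_cxx_api.h>

// Device memory on the GPU itself ("Cuda").
inline Ort::MemoryInfo CudaMemoryInfo() {
  return Ort::MemoryInfo("Cuda", OrtArenaAllocator, /*device_id=*/0, OrtMemTypeDefault);
}

// Page-locked host memory ("CudaPinned"): it lives on the CPU but is directly
// accessible by the GPU, which is what allows fully asynchronous copies.
inline Ort::MemoryInfo CudaPinnedMemoryInfo() {
  return Ort::MemoryInfo("CudaPinned", OrtArenaAllocator, /*device_id=*/0, OrtMemTypeCPU);
}
```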

Allocating a tensor using the `Ort::Session`'s allocator is very straightforward using the [C++ API](https://onnxruntime.ai/docs/api/c/struct_ort_1_1_value.html#a5d35080239ae47cdbc9e505666dc32ec), which directly maps to the C API.
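A minimal sketch of that allocation path, assuming a session created with the CUDA execution provider and an illustrative shape:

```cpp
#include <array>
#include <onnxruntime_cxx_api.h>

// Allocate a float tensor directly in GPU memory via the session's CUDA allocator.
Ort::Value MakeDeviceTensor(Ort::Session& session) {
  Ort::MemoryInfo memory_info_cuda("Cuda", OrtArenaAllocator, /*device_id=*/0, OrtMemTypeDefault);
  Ort::Allocator cuda_allocator(session, memory_info_cuda);

  std::array<int64_t, 4> shape{1, 3, 224, 224};  // example shape, an assumption
  return Ort::Value::CreateTensor(cuda_allocator, shape.data(), shape.size(),
                                  ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT);
}
```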
@@ -132,5 +132,4 @@ binding.bind_output("out", "dml")
# binding.bind_ortvalue_output("out", dml_array_out)
session.run_with_iobinding(binding)
```
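The same IOBinding pattern is also available through the C++ API; below is a hedged sketch assuming a CUDA session and the tensor names "input" and "out" (both placeholders), so that inputs and outputs stay resident on the GPU:

```cpp
#include <vector>
#include <onnxruntime_cxx_api.h>

// Bind a GPU-resident input and request a GPU-resident output, then run.
void RunOnDevice(Ort::Session& session, Ort::Value& input_on_gpu) {
  Ort::MemoryInfo memory_info_cuda("Cuda", OrtArenaAllocator, /*device_id=*/0, OrtMemTypeDefault);

  Ort::IoBinding binding(session);
  binding.BindInput("input", input_on_gpu);     // tensor already on the GPU
  binding.BindOutput("out", memory_info_cuda);  // let ORT allocate the output on the GPU

  session.Run(Ort::RunOptions{}, binding);

  // Outputs remain on the device; fetch them only if CPU access is needed.
  std::vector<Ort::Value> outputs = binding.GetOutputValues();
}
```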
@@ -57,14 +57,29 @@
}
}
};
const toggleScroll = () => {
if (scrollerRef) {
const currentState = window.getComputedStyle(scrollerRef).animationPlayState;
scrollerRef.style.animationPlayState = currentState === 'running' ? 'paused' : 'running';
}
};
const handleKeyDown = (event: { key: string; preventDefault: () => void; }) => {
if (event.key === 'Enter' || event.key === ' ') {
event.preventDefault(); // Prevent default spacebar scrolling behavior
toggleScroll();
}
};
</script>

<div bind:this={containerRef} class={cn('scroller relative z-2 overflow-hidden ', className)}>
<button class="hover:bg-primary focus:bg-primary menu-item py-2 sr-only focus:not-sr-only" on:keydown={handleKeyDown} on:click={toggleScroll}>Toggle scrolling</button>
<ul
bind:this={scrollerRef}
class={cn(
' flex w-max min-w-full shrink-0 flex-nowrap gap-4 py-4',
start && 'animate-scroll ',
start && 'animate-scroll',
pauseOnHover && 'hover:[animation-play-state:paused]'
)}
>
2 changes: 1 addition & 1 deletion src/routes/+layout.svelte
@@ -46,7 +46,7 @@
<Header />
{/if}
{#key data.pathname}
<div in:fade={{ duration: 300, delay: 400 }} out:fade={{ duration: 300 }}>
<div id="main-content" in:fade={{ duration: 300, delay: 400 }} out:fade={{ duration: 300 }}>
<slot />
</div>
{/key}
38 changes: 19 additions & 19 deletions src/routes/blogs/accelerating-llama-2/+page.svelte
@@ -45,11 +45,11 @@
<div class="container mx-auto px-4 md:px-8 lg:px-48 pt-8">
<h1 class="text-5xl pb-2">Accelerating LLaMA-2 Inference with ONNX Runtime</h1>
<p class="text-neutral">
By: <a href="https://www.linkedin.com/in/kunal-v-16315b94" class="text-blue-500"
By: <a href="https://www.linkedin.com/in/kunal-v-16315b94" class="text-blue-700"
>Kunal Vaishnavi</a
>
and
<a href="https://www.linkedin.com/in/parinitaparinita/" class="text-blue-500">Parinita Rahi</a>
<a href="https://www.linkedin.com/in/parinitaparinita/" class="text-blue-700">Parinita Rahi</a>
</p>
<p class="text-neutral">
14TH NOVEMBER, 2023 <span class="italic text-stone-500">(Updated 22nd November)</span>
@@ -70,13 +70,13 @@
quantization updates, and cross-platform usage scenarios.
</p>

<h2 class="text-blue-500 text-3xl mb-4">Background: Llama2 and Microsoft</h2>
<h2 class="text-blue-700 text-3xl mb-4">Background: Llama2 and Microsoft</h2>

<p class="mb-4">
Llama2 is a state-of-the-art open source LLM from Meta ranging in scale from 7B to 70B
parameters (7B, 13B, 70B). Microsoft and Meta <a
href="https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/"
class="text-blue-500">announced</a
class="text-blue-700">announced</a
> their AI on Azure and Windows collaboration in July 2023. As part of the announcement, Llama2
was added to the Azure AI model catalog, which serves as a hub of foundation models that empower
developers and machine learning (ML) professionals to easily discover, evaluate, customize, and
@@ -89,7 +89,7 @@
your costs.
</p>

<h2 class="text-blue-500 text-3xl mb-4">
<h2 class="text-blue-700 text-3xl mb-4">
Faster Inferencing with New ONNX Runtime Optimizations
</h2>

@@ -115,7 +115,7 @@
</div>
<div class="mt-2 mb-4 text-center">Figure 1: E2E Throughput Comparisons</div>

<h2 class="text-blue-500 text-3xl mb-4">Latency and Throughput</h2>
<h2 class="text-blue-700 text-3xl mb-4">Latency and Throughput</h2>

<p class="mb-4">
The graphs below show latency comparisons between the ONNX Runtime and PyTorch variants of the
@@ -152,11 +152,11 @@
<p class="mb-4">
More details on these metrics can be found <a
href="https://github.com/microsoft/onnxruntime-inference-examples/blob/main/python/models/llama/README.md"
class="text-blue-500">here</a
class="text-blue-700">here</a
>.
</p>

<h2 class="text-blue-500 text-3xl mb-4">ONNX Runtime with Multi-GPU Inference</h2>
<h2 class="text-blue-700 text-3xl mb-4">ONNX Runtime with Multi-GPU Inference</h2>

<p class="mb-4">
ONNX Runtime supports multi-GPU inference to enable serving large models. Even in FP16
@@ -165,7 +165,7 @@
</p>

<p class="mb-4">
ONNX Runtime applied <a href="https://arxiv.org/pdf/1909.08053.pdf" class="text-blue-500"
ONNX Runtime applied <a href="https://arxiv.org/pdf/1909.08053.pdf" class="text-blue-700"
>Megatron-LM</a
>
Tensor Parallelism on the 70B model to split the original model weight onto different GPUs. Megatron
@@ -176,7 +176,7 @@
You can find additional example scripts
<a
href="https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/llama/"
class="text-blue-500">here</a
class="text-blue-700">here</a
>.
</p>

@@ -185,7 +185,7 @@
<figcaption class="mt-2 mb-4 text-center">Figure 4: 70B Llama2 Model Throughput</figcaption>
</figure>

<h2 class="text-blue-500 text-3xl mb-4">ONNX Runtime Optimizations</h2>
<h2 class="text-blue-700 text-3xl mb-4">ONNX Runtime Optimizations</h2>
<figure class="px-10 pt-4">
<img src={figure5} alt="LLaMA-2 Optimization Diagram" />
<figcaption class="mt-2 mb-4 text-center">Figure 5: LLaMA-2 Optimization Diagram</figcaption>
@@ -252,24 +252,24 @@
calculate the rotary embeddings more efficiently with less memory usage. The rotary embedding
compute kernels also support interleaved and non-interleaved formats to support both the <a
href="https://github.com/microsoft/Llama-2-Onnx"
class="text-blue-500">Microsoft version of LLaMA-2</a
class="text-blue-700">Microsoft version of LLaMA-2</a
>
and the Hugging Face version of LLaMA-2 respectively while sharing the same calculations.
</p>

<p class="mb-4">
The optimizations work for the <a
href="https://huggingface.co/meta-llama"
class="text-blue-500">Hugging Face versions</a
class="text-blue-700">Hugging Face versions</a
>
(models ending with <i>-hf</i>) and the Microsoft versions. You can download the optimized HF
versions from
<a href="https://github.com/microsoft/Llama-2-Onnx/tree/main-CUDA_CPU" class="text-blue-500"
<a href="https://github.com/microsoft/Llama-2-Onnx/tree/main-CUDA_CPU" class="text-blue-700"
>Microsoft's LLaMA-2 ONNX repository</a
>. Stay tuned for newer Microsoft versions coming soon!
</p>

<h2 class="text-blue-500 text-3xl mb-4">Optimize your own model using Olive</h2>
<h2 class="text-blue-700 text-3xl mb-4">Optimize your own model using Olive</h2>

<p class="mb-4">
Olive is a hardware-aware model optimization tool that incorporates advanced techniques such
@@ -281,25 +281,25 @@
<p class="mb-4">
Here is an example of <a
href="https://github.com/microsoft/Olive/tree/main/examples/llama2"
class="text-blue-500">Llama2 optimization with Olive</a
class="text-blue-700">Llama2 optimization with Olive</a
>, which harnesses ONNX Runtime optimizations highlighted in this blog. Distinct optimization
flows cater to various requirements. For instance, you have the flexibility to choose
different data types for quantization in CPU and GPU inference, based on your accuracy
tolerance. Additionally, you can fine-tune your own Llama2 model with Olive-QLoRa on client
GPUs and perform inference with ONNX Runtime optimizations.
</p>

<h2 class="text-blue-500 text-3xl mb-4">Usage Example</h2>
<h2 class="text-blue-700 text-3xl mb-4">Usage Example</h2>

<p class="mb-4">
Here is a <a
href="https://github.com/microsoft/onnxruntime-inference-examples/blob/main/python/models/llama/LLaMA-2%20E2E%20Notebook.ipynb"
class="text-blue-500">sample notebook</a
class="text-blue-700">sample notebook</a
> that shows you an end-to-end example of how you can use the above ONNX Runtime optimizations
in your application.
</p>

<h2 class="text-blue-500 text-3xl mb-4">Conclusion</h2>
<h2 class="text-blue-700 text-3xl mb-4">Conclusion</h2>

<p class="mb-4">
The advancements discussed in this blog provide faster Llama2 inferencing with ONNX Runtime,
2 changes: 1 addition & 1 deletion src/routes/blogs/blog-post-featured.svelte
@@ -33,7 +33,7 @@
<h2 class="card-title">{title}</h2>
<p>{description}</p>
<img class="rounded" src={image} alt={imgalt} />
<div class="text-right text-blue-500">
<div class="text-right text-blue-700">
{date}
</div>
</div>
2 changes: 1 addition & 1 deletion src/routes/blogs/blog-post.svelte
@@ -30,7 +30,7 @@
<div class="card-body">
<h2 class="card-title">{title}</h2>
<p>{description}</p>
<p class="text-blue-500 text-right">
<p class="text-blue-700 text-right">
{date}
</p>
</div>
2 changes: 1 addition & 1 deletion src/routes/blogs/post.svelte
@@ -82,7 +82,7 @@
<p class="inline">By:</p>
{/if}
{#each authors as author, i}
<a href={authorsLink[i]} class="text-blue-500">{author}</a>{i + 1 === authors.length
<a href={authorsLink[i]} class="text-blue-700">{author}</a>{i + 1 === authors.length
? ''
: ', '}
{/each}
30 changes: 15 additions & 15 deletions src/routes/blogs/pytorch-on-the-edge/+page.svelte
@@ -179,9 +179,9 @@ fun run(audioTensor: OnnxTensor): Result {
<div class="container mx-auto px-4 md:px-8 lg:px-48 pt-8">
<h1 class="text-5xl pb-2">Run PyTorch models on the edge</h1>
<p class="text-neutral">
By: <a href="https://www.linkedin.com/in/natkershaw/" class="text-blue-500">Natalie Kershaw</a>
By: <a href="https://www.linkedin.com/in/natkershaw/" class="text-blue-700">Natalie Kershaw</a>
and
<a href="https://www.linkedin.com/in/prasanthpulavarthi/" class="text-blue-500"
<a href="https://www.linkedin.com/in/prasanthpulavarthi/" class="text-blue-700"
>Prasanth Pulavarthi</a
>
</p>
@@ -217,12 +217,12 @@ fun run(audioTensor: OnnxTensor): Result {
anywhere that is outside of the cloud, ranging from large, well-resourced personal computers
to small footprint devices such as mobile phones. This has been a challenging task to
accomplish in the past, but new advances in model optimization and software like
<a href="https://onnxruntime.ai/pytorch" class="text-blue-500">ONNX Runtime</a>
<a href="https://onnxruntime.ai/pytorch" class="text-blue-700">ONNX Runtime</a>
make it more feasible - even for new generative AI and large language models like Stable Diffusion,
Whisper, and Llama2.
</p>

<h2 class="text-blue-500 text-3xl mb-4">Considerations for PyTorch models on the edge</h2>
<h2 class="text-blue-700 text-3xl mb-4">Considerations for PyTorch models on the edge</h2>

<p class="mb-4">
There are several factors to keep in mind when thinking about running a PyTorch model on the
@@ -292,7 +292,7 @@ fun run(audioTensor: OnnxTensor): Result {
</li>
</ul>

<h2 class="text-blue-500 text-3xl mb-4">Tools for PyTorch models on the edge</h2>
<h2 class="text-blue-700 text-3xl mb-4">Tools for PyTorch models on the edge</h2>

<p class="mb-4">
We mentioned ONNX Runtime several times above. ONNX Runtime is a compact, standards-based
@@ -305,7 +305,7 @@ fun run(audioTensor: OnnxTensor): Result {
format that doesn't require the PyTorch framework and its gigabytes of dependencies. PyTorch
has thought about this and includes an API that enables exactly this - <a
href="https://pytorch.org/docs/stable/onnx.html"
class="text-blue-500">torch.onnx</a
class="text-blue-700">torch.onnx</a
>. <a href="https://onnx.ai/">ONNX</a> is an open standard that defines the operators that make
up models. The PyTorch ONNX APIs take the Pythonic PyTorch code and turn it into a functional
graph that captures the operators that are needed to run the model without Python. As with everything
@@ -318,7 +318,7 @@ fun run(audioTensor: OnnxTensor): Result {
The popular Hugging Face library also has APIs that build on top of this torch.onnx
functionality to export models to the ONNX format. Over <a
href="https://huggingface.co/blog/ort-accelerating-hf-models"
class="text-blue-500">130,000 models</a
class="text-blue-700">130,000 models</a
> are supported making it very likely that the model you care about is one of them.
</p>

Expand All @@ -328,7 +328,7 @@ fun run(audioTensor: OnnxTensor): Result {
and web browsers) via various languages (from C# to JavaScript to Swift).
</p>

<h2 class="text-blue-500 text-3xl mb-4">Examples of PyTorch models on the edge</h2>
<h2 class="text-blue-700 text-3xl mb-4">Examples of PyTorch models on the edge</h2>

<h3 class=" text-2xl mb-2">Stable Diffusion on Windows</h3>

@@ -345,7 +345,7 @@ fun run(audioTensor: OnnxTensor): Result {
<p class="mb-4">
You don't have to export the fifth model, ClipTokenizer, as it is available in <a
href="https://onnxruntime.ai/docs/extensions"
class="text-blue-500">ONNX Runtime extensions</a
class="text-blue-700">ONNX Runtime extensions</a
>, a library for pre and post processing PyTorch models.
</p>

@@ -366,15 +366,15 @@ fun run(audioTensor: OnnxTensor): Result {
<p class="mb-4">
You can build the application and run it on Windows with the detailed steps shown in this <a
href="https://onnxruntime.ai/docs/tutorials/csharp/stable-diffusion-csharp.html"
class="text-blue-500">tutorial</a
class="text-blue-700">tutorial</a
>.
</p>

<h3 class=" text-2xl mb-2">Text generation in the browser</h3>

<p class="mb-4">
Running a PyTorch model locally in the browser is not only possible but super simple with
the <a href="https://huggingface.co/docs/transformers.js/index" class="text-blue-500"
the <a href="https://huggingface.co/docs/transformers.js/index" class="text-blue-700"
>transformers.js</a
> library. Transformers.js uses ONNX Runtime Web as its backend. Many models are already converted
to ONNX and served by the transformers.js CDN, making inference in the browser a matter of writing
@@ -407,7 +407,7 @@ fun run(audioTensor: OnnxTensor): Result {
All components of the Whisper Tiny model (audio decoder, encoder, decoder, and text sequence
generation) can be composed and exported to a single ONNX model using the <a
href="https://github.com/microsoft/Olive/tree/main/examples/whisper"
class="text-blue-500">Olive framework</a
class="text-blue-700">Olive framework</a
>. To run this model as part of a mobile application, you can use ONNX Runtime Mobile, which
supports Android, iOS, react-native, and MAUI/Xamarin.
</p>
@@ -420,7 +420,7 @@ fun run(audioTensor: OnnxTensor): Result {
<p class="mb-4">
The relevant snippet of an example <a
href="https://github.com/microsoft/onnxruntime-inference-examples/tree/main/mobile/examples/speech_recognition"
class="text-blue-500">Android mobile app</a
class="text-blue-700">Android mobile app</a
> that performs speech transcription on short samples of audio is shown below:
</p>
<Highlight language={kotlin} code={mobilecode} />
@@ -476,11 +476,11 @@ fun run(audioTensor: OnnxTensor): Result {
<p class="mb-4">
You can read the full <a
href="https://onnxruntime.ai/docs/tutorials/on-device-training/ios-app.html"
class="text-blue-500">Speaker Verification tutorial</a
class="text-blue-700">Speaker Verification tutorial</a
>, and
<a
href="https://github.com/microsoft/onnxruntime-training-examples/tree/master/on_device_training/mobile/ios"
class="text-blue-500">build and run the application from source</a
class="text-blue-700">build and run the application from source</a
>.
</p>
