Merge branch 'patch-3' of https://github.com/prasanthpul/onnxruntime into add-edge-blog
natke committed Oct 13, 2023
2 parents e265292 + c72d507 commit 44baa21
Showing 1 changed file with 12 additions and 12 deletions.
24 changes: 12 additions & 12 deletions blogs/pytorch-on-the-edge.html
@@ -97,12 +97,12 @@ <h2 class="blue-text">Considerations for PyTorch models on the edge</h2>

<p>There are several factors to keep in mind when thinking about running a PyTorch model on the edge:</p>
<ul>
<li><strong>Size</strong>: modern models can be several gigabytes (hence the name Large Language Models!). On the cloud, size is usually not a consideration until it becomes too large to fit on a single GPU. At that point there are various well-known solutions for running across multiple GPUs. For edge devices, we need to find models that can fit within the constraints of the device. This sometimes requires a tradeoff with quality. Most modern models come in several sizes (1 billion parameters, 13 billion parameters, 70 billion parameters, etc) so you can select a variant that fits on your device. Techniques such as quantization are usually applied to reduce the number of bits representing parameters, further reducing the model size. The size of the application is also limited, especially for apps distributed through stores, so bringing in gigabytes of dependencies won't work on the edge.</li>
<li><strong>Size</strong>: modern models can be several gigabytes (hence the name Large Language Models!). On the cloud, size is usually not a consideration until it becomes too large to fit on a single GPU. At that point there are various well-known solutions for running across multiple GPUs. For edge devices, we need to find models that can fit within the constraints of the device. This sometimes requires a tradeoff with quality. Most modern models come in several sizes (1 billion parameters, 13 billion parameters, 70 billion parameters, etc.) so you can select a variant that fits on your device. Techniques such as quantization are usually applied to reduce the number of bits representing parameters, further reducing the model size (see the sketch after this list). The size of the application is also constrained by the app stores, so bringing in gigabytes of libraries won't work on the edge.</li>
<li><strong>API for application integration</strong>: on the cloud, models are usually packaged as Docker containers that expose an endpoint that is called by an application or service. On edge devices, Docker containers may take up too many resources or may not even be supported. By using an optimized engine, like ONNX Runtime, the dependency on Python and Docker containers can be eliminated. ONNX Runtime also has APIs in many languages including C, C++, C#, Rust, Java, JavaScript, Objective-C and Swift, making it easier to integrate natively with the hosting application.</li>
<li><strong>Performance</strong>: with large amounts of memory, no power restrictions, and hefty compute capabilities, running non-optimized models on the cloud is possible. On edge, these luxuries do not exist and optimization is crucial. For example, ONNX Runtime optimizes memory allocations, fuses model operators, reduces kernel launch times, minimizes tensor transfers between processing units, and applies tuned matrix math algorithms. It's also able to make use of compilers and engines that are device-specific, providing a common interface for your application while harnessing the best approach on each device.</li>
<li><strong>Maintainability</strong>: on the cloud, updating a model is as simple as deploying a new container image and ramping up traffic. On the edge, you need to consider how you will distribute model updates. Sometimes this involves publishing updates to an app store, sometimes it might be possible to implement a data update mechanism within your app and download new model files or maybe even deltas. There are many possible paths, so we won't go into much depth on this topic in this article but it's an aspect to keep in mind as you plan for production.</li>
<li><strong>Hybrid</strong>: instead of cloud versus device, you can choose to utilize both. There are several hybrid patterns that are used in production today by applications such as Office. One pattern is to dynamically decide whether to run on the device or in the cloud based on network conditions or input characteristics. Another pattern is to run part of the model pipeline on the device and part on the cloud. This is especially useful with modern model pipelines that have separate encoder and decoder stages. Using an engine like ONNX Runtime that works on both cloud and device simplifies development. We'll discuss hybrid scenarios in more detail in a forthcoming article.</li>
<li><strong>Personalization</strong>: in many cases, the PyTorch model is simply being run on the device. However, you may also have scenarios where you need to personalize the model on the device without sending data to the cloud. Recommendation and content targeting are example scenarios that can improve their quality by updating models based on activity on the device. Fine tuning and training with PyTorch on the device may not be feasible (due to performance and size concerns) but using an engine like ONNX Runtime allows PyTorch models to be updated and personalized locally. The same mechanism also enables federated learning, which can help mitigate user data exposure. Most of this article focuses on inference but this is an important scenario to be aware of – we'll have a future article that deep dives into this use case.</li>
<li><strong>Performance</strong>: with large amounts of memory, no power restrictions, and hefty compute capabilities, running non-optimized models on the cloud is possible. On edge devices, these luxuries do not exist and optimization is crucial. For example, ONNX Runtime optimizes memory allocations, fuses model operators, reduces kernel launch times, minimizes tensor transfers between processing units, and applies tuned matrix math algorithms. It's also able to make use of compilers and engines that are device-specific, providing a common interface for your application while harnessing the best approach on each device.</li>
<li><strong>Maintainability</strong>: on the cloud, updating a model is as simple as deploying a new container image and ramping up traffic. On the edge, you need to consider how you will distribute model updates. Sometimes this involves publishing updates to an app store, sometimes it might be possible to implement a data update mechanism within your app and download new model files or maybe even deltas. There are many possible paths, so we won't go into much depth on this topic in this article but it's an aspect to keep in mind as you plan for production.</li>
<li><strong>Hybrid</strong>: instead of cloud versus device, you can choose to utilize both. There are several hybrid patterns that are used in production today by applications such as Office. One pattern is to dynamically decide whether to run on the device or in the cloud based on network conditions or input characteristics. Another pattern is to run part of the model pipeline on the device and part on the cloud. This is especially useful with modern model pipelines that have separate encoder and decoder stages. Using an engine like ONNX Runtime that works on both cloud and device simplifies development. We'll discuss hybrid scenarios in more detail in a forthcoming article.</li>
<li><strong>Personalization</strong>: in many cases, the PyTorch model is simply being run on the device. However, you may also have scenarios where you need to personalize the model on the device without sending data to the cloud. Recommendation and content targeting are example scenarios that can improve their quality by updating models based on activity on the device. Fine tuning and training with PyTorch on the device may not be feasible (due to performance and size concerns) but using an engine like ONNX Runtime allows PyTorch models to be updated and personalized locally. The same mechanism also enables federated learning, which can help mitigate user data exposure.</li>
</ul>
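
<p>As a rough illustration of the quantization technique mentioned in the "Size" bullet above (a minimal sketch, not taken from this article; the file names are placeholders), ONNX Runtime can shrink an exported model offline and load the smaller copy through the same inference API:</p>

<pre><code class="language-python">
# Minimal sketch: reduce an exported ONNX model to 8-bit weights with
# dynamic quantization, then load it with the regular ONNX Runtime API.
# The file names below are placeholders.
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model_fp32.onnx",   # exported full-precision model
    model_output="model_int8.onnx",  # quantized copy with 8-bit weights
    weight_type=QuantType.QInt8,
)

# The quantized model is a drop-in replacement for the original.
session = ort.InferenceSession("model_int8.onnx")
</code></pre>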

<h2 class="blue-text">Tools for PyTorch models on the edge</h2>
@@ -113,13 +113,13 @@ <h2 class="blue-text">Tools for PyTorch models on the edge</h2>

<p>The popular Hugging Face library also has APIs that build on top of this torch.onnx functionality to export models to the ONNX format. Over <a href="https://huggingface.co/blog/ort-accelerating-hf-models">130,000 models</a> are supported, making it very likely that the model you care about is one of them.</p>
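
<p>To make the export step concrete, here is a minimal sketch (not from this article) showing both paths: plain torch.onnx.export for any PyTorch module, and the Hugging Face Optimum wrapper that builds on it. The Optimum class name, export flag, and model id are assumptions to verify against the Optimum documentation.</p>

<pre><code class="language-python">
# Minimal sketch of the two export paths described above.
import torch

# 1. Export any torch.nn.Module with torch.onnx.export.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
).eval()

torch.onnx.export(
    model,
    torch.randn(1, 128),                   # example input used to trace the model
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},  # keep the batch dimension dynamic
)

# 2. Hugging Face Optimum wraps the same export for hub models (assumed API;
#    check the Optimum docs for the exact class and flags).
# from optimum.onnxruntime import ORTModelForSequenceClassification
# ort_model = ORTModelForSequenceClassification.from_pretrained(
#     "distilbert-base-uncased-finetuned-sst-2-english", export=True)
# ort_model.save_pretrained("distilbert-onnx")
</code></pre>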

<p>In this article, we'll show you several examples involving state-of-the-art PyTorch models (like Whisper and Stable Diffusion) on popular devices (like Windows laptops, mobile phones, and web browsers) via various languages (from C# to Javascript to Swift).</p>
<p>In this article, we'll show you several examples involving state-of-the-art PyTorch models (like Whisper and Stable Diffusion) on popular devices (like Windows laptops, mobile phones, and web browsers) via various languages (from C# to JavaScript to Swift).</p>

<h2 class="blue-text">PyTorch models on the edge</h2>
<h2 class="blue-text">Examples of PyTorch models on the edge</h2>

<h3 class="r-heading">Stable Diffusion on Windows</h3>

<p>The Stable Diffusion pipeline consists of five PyTorch models that build an image from a description. The diffusion process iterates on random pixels until the output image matches the description. </p>
<p>The Stable Diffusion pipeline consists of five PyTorch models that build an image from a text description. The diffusion process iterates on random pixels until the output image matches the description.</p>

<p>To run on the edge, four of the models can be exported to ONNX format from HuggingFace.</p>
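
<p>As a rough sketch of what that export can look like with Hugging Face Optimum (an illustration only; the ORTStableDiffusionPipeline class, the export flag, and the model id are assumptions to check against the Optimum documentation):</p>

<pre><code class="language-python">
# Minimal sketch (assumed Optimum API): export the Stable Diffusion sub-models
# (text encoder, UNet, VAE) to ONNX and save them locally for use on the edge.
from optimum.onnxruntime import ORTStableDiffusionPipeline

pipeline = ORTStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed model id
    export=True,                       # convert the PyTorch weights to ONNX on load
)
pipeline.save_pretrained("./stable-diffusion-onnx")  # one .onnx file per sub-model
</code></pre>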

@@ -131,7 +131,7 @@ <h3 class="r-heading">Stable Diffusion on Windows</h3>

<p>You don't have to export the fifth model, ClipTokenizer, as it is available in <a href="https://onnxruntime.ai/docs/extensions">ONNX Runtime extensions</a>, a library for pre- and post-processing PyTorch models.</p>
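
<p>To show how the extensions library plugs in, here is a minimal sketch (in Python for brevity; the tokenizer file name is a placeholder) of registering the ONNX Runtime extensions custom operators before loading a model that uses them:</p>

<pre><code class="language-python">
# Minimal sketch: models that use operators from onnxruntime-extensions
# (such as the CLIP tokenizer) need the extensions library registered
# with the session before they can load. The file name is a placeholder.
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

options = ort.SessionOptions()
options.register_custom_ops_library(get_library_path())
session = ort.InferenceSession("cliptokenizer.onnx", sess_options=options)
</code></pre>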

<p>To run this pipeline of models as a .NET application, we built the pipeline code in C#. This code can be run on CPU, GPU, or NPU, if they are available on your machine, using ONNX Runtime's device-specific hardware accelerators. This is configured with the ExecutionProviderTarget below.</p>
<p>To run this pipeline of models as a .NET application, we build the pipeline code in C#. This code can be run on CPU, GPU, or NPU, if they are available on your machine, using ONNX Runtime's device-specific hardware accelerators. This is configured with the <code>ExecutionProviderTarget</code> below.</p>

<pre><code class="language-csharp">
static void Main(string[] args)
@@ -159,15 +159,15 @@ <h3 class="r-heading">Stable Diffusion on Windows</h3>
}
</code></pre>

<p>This is the output of the model pipelines, running with 50 inference iterations</p>
<p>This is the output of the model pipeline, running with 50 inference iterations:</p>

<img src="../images/pytorch-on-the-edge-puppies.png" alt="Two golden retriever puppies playing in the grass" class="img-fluid">

<p>You can build the application and run it on Windows with the detailed steps shown in this <a href="https://onnxruntime.ai/docs/tutorials/csharp/stable-diffusion-csharp.html">tutorial</a>.</p>

<h3 class="r-heading">Text generation in the browser </h3>

<p>Running a PyTorch model locally in the browser is not only possible but super simple with the <a href="https://huggingface.co/docs/transformers.js/index">transformers.js</a> library. Transformers.js uses ONNX Runtime Web as a backend. Many models are already converted to ONNX and served by the transformers.js CDN, making inference in the browser a matter of writing a few lines of HTML.</p>
<p>Running a PyTorch model locally in the browser is not only possible but super simple with the <a href="https://huggingface.co/docs/transformers.js/index">transformers.js</a> library. Transformers.js uses ONNX Runtime Web as its backend. Many models are already converted to ONNX and served by the transformers.js CDN, making inference in the browser a matter of writing a few lines of HTML:</p>

<pre><code class="language-html">
&lt;html&gt;
@@ -354,4 +354,4 @@ <h3 class="r-heading">Where to next?</h3>

</body>

</html>
</html>
