diff --git a/blogs/pytorch-on-the-edge.html b/blogs/pytorch-on-the-edge.html
index 90eb65a8e5667..89532f8d5af44 100644
--- a/blogs/pytorch-on-the-edge.html
+++ b/blogs/pytorch-on-the-edge.html
@@ -90,30 +90,30 @@
Run PyTorch models on the edge
High cost of cloud resources (especially when device capabilities are underutilized)
Application requirements to operate without internet connectivity
- In this article, we’ll demystify running PyTorch models on the edge. We define ‘edge’ as anywhere that is outside of the cloud, ranging from large, well-resourced personal computers to small footprint devices such as mobile phones. This has been a challenging task to accomplish in the past, but new advances in model optimization and software like ONNX Runtime make it more feasible – even for new generative AI and large language models like Stable Diffusion, Whisper, and Llama2.
+ In this article, we'll demystify running PyTorch models on the edge. We define 'edge' as anywhere that is outside of the cloud, ranging from large, well-resourced personal computers to small footprint devices such as mobile phones. This has been a challenging task to accomplish in the past, but new advances in model optimization and software like ONNX Runtime make it more feasible – even for new generative AI and large language models like Stable Diffusion, Whisper, and Llama2.
Considerations for PyTorch models on the edge
There are several factors to keep in mind when thinking about running a PyTorch model on the edge:
- - Size: modern models can be several gigabytes (hence the name Large Language Models!). On the cloud, size is usually not a consideration until it becomes too large to fit on a single GPU. At that point there are various well-known solutions for running across multiple GPUs. For edge devices, we need to find models that can fit within the constraints of the device. This sometimes requires a tradeoff with quality. Most modern models come in several sizes (1 billion parameters, 13 billion parameters, 70 billion parameters, etc) so you can select a variant that fits on your device. Techniques such as quantization are usually applied to reduce the number of bits representing parameters, further reducing the model size. The size of the application is also limited, especially for apps distributed through stores, so bringing in gigabytes of dependencies won’t work on the edge.
+ - Size: modern models can be several gigabytes (hence the name Large Language Models!). On the cloud, size is usually not a consideration until it becomes too large to fit on a single GPU. At that point there are various well-known solutions for running across multiple GPUs. For edge devices, we need to find models that can fit within the constraints of the device. This sometimes requires a tradeoff with quality. Most modern models come in several sizes (1 billion parameters, 13 billion parameters, 70 billion parameters, etc) so you can select a variant that fits on your device. Techniques such as quantization are usually applied to reduce the number of bits representing parameters, further reducing the model size. The size of the application is also limited, especially for apps distributed through stores, so bringing in gigabytes of dependencies won't work on the edge.
- API for application integration: on the cloud, models are usually packaged as Docker containers that expose an endpoint that is called by an application or service. On edge devices, Docker containers may take up too many resources or may not even be supported. By using an optimized engine, like ONNX Runtime, the dependency on Python and Docker containers can be eliminated. ONNX Runtime also has APIs in many languages including C, C++, C#, Rust, Java, JavaScript, Objective-C and Swift for making it easier to integrate natively with the hosting application.
- - Performance: with large amounts of memory, no power restrictions, and hefty compute capabilities, running non-optimized models on the cloud is possible. On edge, these luxuries do not exist and optimization is crucial. For example, ONNX Runtime optimizes memory allocations, fuses model operators, reduces kernel launch times, minimizes tensor transfers between processing units, and applies tuned matrix math algorithms. It’s also able to make use of compilers and engines that are device-specific, providing a common interface for your application while harnessing the best approach on each device.
- - Maintainability: on the cloud, updating a model is as simple as deploying a new container image and ramping up traffic. On the edge, you need to consider how you will distribute model updates. Sometimes this involves publishing updates to an app store, sometimes it might be possible to implement a data update mechanism within your app and download new model files or maybe even deltas. There are many possible paths, so we won’t go into much depth on this topic in this article but it’s an aspect to keep in mind as you plan for production.
- - Hybrid: instead of cloud versus device, you can choose to utilize both. There are several hybrid patterns that are used in production today by applications such as Office. One pattern is to dynamically decide whether to run on the device or in the cloud based on network conditions or input characteristics. Another pattern is to run part of the model pipeline on the device and part on the cloud. This is especially useful with modern model pipelines that have separate encoder and decoder stages. Using an engine like ONNX Runtime that works on both cloud and device simplifies development. We’ll discuss hybrid scenarios in more detail in a forthcoming article.
- - Personalization: in many cases, the PyTorch model is simply being run on the device. However, you may also have scenarios where you need to personalize the model on the device without sending data to the cloud. Recommendation and content targeting are example scenarios that can improve their quality by updating models based on activity on the device. Fine tuning and training with PyTorch on the device may not feasible (due to performance and size concerns) but using an engine like ONNX Runtime allows PyTorch models to be updated and personalized locally. The same mechanism also enabled federated learning, which can help mitigate user data exposure. Most of this article focuses on inference but this is an important scenario to be aware of – we’ll have a future article that deep dives into this use case.
+ - Performance: with large amounts of memory, no power restrictions, and hefty compute capabilities, running non-optimized models on the cloud is possible. On edge, these luxuries do not exist and optimization is crucial. For example, ONNX Runtime optimizes memory allocations, fuses model operators, reduces kernel launch times, minimizes tensor transfers between processing units, and applies tuned matrix math algorithms. It's also able to make use of compilers and engines that are device-specific, providing a common interface for your application while harnessing the best approach on each device.
+ - Maintainability: on the cloud, updating a model is as simple as deploying a new container image and ramping up traffic. On the edge, you need to consider how you will distribute model updates. Sometimes this involves publishing updates to an app store, sometimes it might be possible to implement a data update mechanism within your app and download new model files or maybe even deltas. There are many possible paths, so we won't go into much depth on this topic in this article but it's an aspect to keep in mind as you plan for production.
+ - Hybrid: instead of cloud versus device, you can choose to utilize both. There are several hybrid patterns that are used in production today by applications such as Office. One pattern is to dynamically decide whether to run on the device or in the cloud based on network conditions or input characteristics. Another pattern is to run part of the model pipeline on the device and part on the cloud. This is especially useful with modern model pipelines that have separate encoder and decoder stages. Using an engine like ONNX Runtime that works on both cloud and device simplifies development. We'll discuss hybrid scenarios in more detail in a forthcoming article.
+ - Personalization: in many cases, the PyTorch model is simply being run on the device. However, you may also have scenarios where you need to personalize the model on the device without sending data to the cloud. Recommendation and content targeting are example scenarios that can improve their quality by updating models based on activity on the device. Fine tuning and training with PyTorch on the device may not feasible (due to performance and size concerns) but using an engine like ONNX Runtime allows PyTorch models to be updated and personalized locally. The same mechanism also enabled federated learning, which can help mitigate user data exposure. Most of this article focuses on inference but this is an important scenario to be aware of – we'll have a future article that deep dives into this use case.
Tools for PyTorch models on the edge
- We mentioned ONNX Runtime several times above. ONNX Runtime is a compact, standards-based engine that has deep integration with PyTorch. By using PyTorch’s ONNX APIs, your PyTorch models can run on a spectrum of edge devices with ONNX Runtime.
+ We mentioned ONNX Runtime several times above. ONNX Runtime is a compact, standards-based engine that has deep integration with PyTorch. By using PyTorch's ONNX APIs, your PyTorch models can run on a spectrum of edge devices with ONNX Runtime.
- The first step for running PyTorch models on the edge is to get them into a lightweight format that doesn’t require the PyTorch framework and its gigabytes of dependencies. PyTorch has thought about this and includes an API that enables exactly this - torch.onnx. ONNX is an open standard that defines the operators that make up models. The PyTorch ONNX APIs take the Pythonic PyTorch code and turn it into a functional graph that captures the operators that are needed to run the model without Python. As with everything in machine learning, there are some limitations to be aware of. Some PyTorch models cannot be represented as a single graph – in this case you may need to output several graphs and stitch them together in your own pipeline.
+ The first step for running PyTorch models on the edge is to get them into a lightweight format that doesn't require the PyTorch framework and its gigabytes of dependencies. PyTorch has thought about this and includes an API that enables exactly this - torch.onnx. ONNX is an open standard that defines the operators that make up models. The PyTorch ONNX APIs take the Pythonic PyTorch code and turn it into a functional graph that captures the operators that are needed to run the model without Python. As with everything in machine learning, there are some limitations to be aware of. Some PyTorch models cannot be represented as a single graph – in this case you may need to output several graphs and stitch them together in your own pipeline.
The popular Hugging Face library also has APIs that build on top of this torch.onnx functionality to export models to the ONNX format. Over 130,000 models are supported making it very likely that the model you care about is one of them.
- In this article, we’ll show you several examples involving state-of-the-art PyTorch models (like Whisper and Stable Diffusion) on popular devices (like Windows laptops, mobile phones, and web browsers) via various languages (from C# to Javascript to Swift).
+ In this article, we'll show you several examples involving state-of-the-art PyTorch models (like Whisper and Stable Diffusion) on popular devices (like Windows laptops, mobile phones, and web browsers) via various languages (from C# to Javascript to Swift).
PyTorch models on the edge
@@ -129,9 +129,9 @@ Stable Diffusion on Windows
pipeline.save_pretrained("./onnx-stable-diffusion")
- You don’t have to export the fifth model, ClipTokenizer, as it is available in ONNX Runtime extensions, a library for pre and post processing PyTorch models.
+ You don't have to export the fifth model, ClipTokenizer, as it is available in ONNX Runtime extensions, a library for pre and post processing PyTorch models.
- To run this pipeline of models as a .NET application, we built the pipeline code in C#. This code can be run on CPU, GPU, or NPU, if they are available on your machine, using ONNX Runtime’s device-specific hardware accelerators. This is configured with the ExecutionProviderTarget below.
+ To run this pipeline of models as a .NET application, we built the pipeline code in C#. This code can be run on CPU, GPU, or NPU, if they are available on your machine, using ONNX Runtime's device-specific hardware accelerators. This is configured with the ExecutionProviderTarget below.
static void Main(string[] args)
@@ -301,7 +301,7 @@ Train a model to recognize your voice on mobile
)
- This set of artifacts is now ready to be loaded by the mobile application, shown here as iOS Swift code. Within the application, a number of samples of the speaker’s audio are provided to the application and the model is trained with the samples.
+ This set of artifacts is now ready to be loaded by the mobile application, shown here as iOS Swift code. Within the application, a number of samples of the speaker's audio are provided to the application and the model is trained with the samples.
func trainStep(inputData: [Data], labels: [Int64]) throws {
@@ -323,9 +323,7 @@ Train a model to recognize your voice on mobile
Where to next?
- In this article we’ve shown why you would run PyTorch models on the edge and what aspects to consider. We also shared several examples with code that you can use for running state-of-the-art PyTorch model on the edge with ONNX Runtime. We also showed how ONNX Runtime was built for performance and cross-platform execution, making it the ideal way to run PyTorch models on the edge. You may have noticed that we didn’t include a Llama2 example even though ONNX Runtime is optimized to run it. That’s because the amazing Llama2 model deserves its own article, so stay tuned for that!
-
- You can read more about how to run your PyTorch model on the edge here: https://onnxruntime.ai/docs/
+ In this article we've shown why you would run PyTorch models on the edge and what aspects to consider. We also shared several examples with code that you can use for running state-of-the-art PyTorch model on the edge with ONNX Runtime. We also showed how ONNX Runtime was built for performance and cross-platform execution, making it the ideal way to run PyTorch models on the edge. You may have noticed that we didn't include a Llama2 example even though ONNX Runtime is optimized to run it. That's because the amazing Llama2 model deserves its own article, so stay tuned for that!