From 7c520f56c2e0494baed2c90302323ef099393ddc Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Maximilian=20M=C3=BCller?=
Date: Fri, 31 May 2024 12:53:51 +0200
Subject: [PATCH] fixing typos

---
 docs/performance/device-tensor.md              | 16 ++++++++--------
 docs/performance/tune-performance/iobinding.md |  2 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/performance/device-tensor.md b/docs/performance/device-tensor.md
index 64977c98e62cc..0ddcd8457f1ef 100644
--- a/docs/performance/device-tensor.md
+++ b/docs/performance/device-tensor.md
@@ -9,10 +9,10 @@ nav_order: 6
 Using device tensors can be a crucial part in building efficient AI pipelines, especially on heterogenous memory systems.
 A typical example of such systems is any PC with a dedicated GPU.
 While a [recent GPU](https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889) itself has a memory bandwidth of about 1TB/s, the interconnect [PCI 4.0 x16](https://de.wikipedia.org/wiki/PCI_Express) to the CPU can often be the limiting factor with only ~32GB/s.
-Therefore it is often best to keep data local to the GPU as much as possible or hide slow memory traffic behind computation as the GPU is able to execute compute and PCI memory traffic simultaneous.
+Therefore it is often best to keep data local to the GPU as much as possible or hide slow memory traffic behind computation as the GPU is able to execute compute and PCI memory traffic simultaneously.
 
 A typical use case for these scenarios where memory is already local to the inference device would be a GPU accelerated video processing of an encoded video stream which can be decoded with GPU decoders.
-Another common case are iterative network like diffusion networks or large language models for which intermediate tensors do not have to be copied back to CPU.
+Another common case are iterative networks like diffusion networks or large language models for which intermediate tensors do not have to be copied back to CPU.
 Tile based inference for high resolution images is another use-case where custom memory management is important to reduce GPU idle times during PCI copies.
 Rather than doing sequential processing of each tile it is possible to overlap PCI copies and processing on the GPU and pipeline work in that matter.
 Image of sequential PCI->Processing->PCI and another image of it being interleaved.
@@ -78,12 +78,12 @@ Enabling asynchronous execution in python is possible through the same [run opti
 
 ## DirectML
 
-Acheiving the same behaviour is possible through DirectX resources.
-To run asynchronous processing it is crucial to do the same manamgemnt of execution streams as needed with CUDA.
-For DirectX this means managing the device and it's command queue which is possible through the C API.
-Details of how to set the compute command queue are documented with the usaged of [`SessionOptionsAppendExecutionProvider_DML1`](../execution-providers/DirectML-ExecutionProvider.md#usage).
+Achieving the same behavior is possible through DirectX resources.
+To run asynchronous processing, it is crucial to do the same management of execution streams as needed with CUDA.
+For DirectX, this means managing the device and its command queue, which is possible through the C API.
+Details of how to set the compute command queue are documented with the usage of [`SessionOptionsAppendExecutionProvider_DML1`](../execution-providers/DirectML-ExecutionProvider.md#usage).
 
-If separate command queues are used for copy and compute it is possible to overlap PCI copies and execution as well as make execution asynchronous.
+If separate command queues are used for copy and compute, it is possible to overlap PCI copies and execution as well as make execution asynchronous.
 
 ```c++
 #include
@@ -113,7 +113,7 @@ A [single file sample](https://github.com/ankan-ban/HelloOrtDml/blob/main/Main.c
 
 ### Python API
 
-Although allocating DirectX inputs from python might not be a big use case the API is available as well. For intermediate network caches e.g. KV caching in LLMs this can be very beneficial.
+Although allocating DirectX inputs from Python might not be a major use case, the API is available. This can prove to be very beneficial, especially for intermediate network caches, such as key-value caching in large language models (LLMs).
 
 ```python
 import onnxruntime as ort
diff --git a/docs/performance/tune-performance/iobinding.md b/docs/performance/tune-performance/iobinding.md
index 44766bb9fc018..53c1bbca6b367 100644
--- a/docs/performance/tune-performance/iobinding.md
+++ b/docs/performance/tune-performance/iobinding.md
@@ -31,7 +31,7 @@ Following are code snippets in various languages demonstrating the usage of this
 
 Notice that in the above code sample the output tensor is not allocated before binding it, rather an `Ort::MemoryInfo` is bound as output.
 This is an effective way to let the session allocate the tensor depending on the needed shapes.
-Especially for date dependent shapes or dynamic shapes this can be a great solution to get the right allocation.
+Especially for data dependent shapes or dynamic shapes this can be a great solution to get the right allocation.
 However in case the output shape is known and the output tensor should be reused it is beneficial to bind an `Ort::Value` to the output as well. This can be allocated using the session allocator or external memory. Please refer to the [device tensor docs](../device-tensor.md) for more details:
 
 ```c++