From 7e64928b060b8eac133733c3d8fb18e73472c78a Mon Sep 17 00:00:00 2001 From: ivberg <ivberg@microsoft.com> Date: Wed, 7 Feb 2024 10:47:15 -0800 Subject: [PATCH] Added docs for ONNX 1.17 covering logging, tracing, and QNN EP Profiling (#19428) ### Description Added docs for ONNX 1.17 covering logging, tracing, and QNN EP Profiling ### Motivation and Context - ONNX Logging has not been documented - ONNX Tracing with Windows has barely been documented - ONNX 1.17 has new tracing and QNN EP Profiling PRs: #16259, #18201, #18882, #19397 --- docs/build/custom.md | 8 +- docs/build/eps.md | 2 +- .../QNN-ExecutionProvider.md | 3 + .../performance/tune-performance/iobinding.md | 2 +- .../tune-performance/logging_tracing.md | 95 +++++++++++++++++++ docs/performance/tune-performance/memory.md | 2 +- .../tune-performance/profiling-tools.md | 30 +++++- .../performance/tune-performance/threading.md | 2 +- .../tune-performance/troubleshooting.md | 2 +- docs/tutorials/csharp/csharp-gpu.md | 2 +- 10 files changed, 137 insertions(+), 11 deletions(-) create mode 100644 docs/performance/tune-performance/logging_tracing.md diff --git a/docs/build/custom.md b/docs/build/custom.md index 93e1c1bfa221e..e270feac445a1 100644 --- a/docs/build/custom.md +++ b/docs/build/custom.md @@ -165,7 +165,7 @@ _[This section is coming soon]_ ### iOS -To produce pods for an iOS build, use the [build_and_assemble_ios_pods.py](https://github.com/microsoft/onnxruntime/blob/main/tools/ci_build/github/apple/build_and_assemble_ios_pods.py) script from the ONNX Runtime repo. +To produce pods for an iOS build, use the [build_and_assemble_apple_pods.py](https://github.com/microsoft/onnxruntime/blob/main/tools/ci_build/github/apple/build_and_assemble_apple_pods.py) script from the ONNX Runtime repo. 1. Check out the version of ONNX Runtime you want to use. 
@@ -174,7 +174,7 @@ To produce pods for an iOS build, use the [build_and_assemble_ios_pods.py](https For example: ```bash - python3 tools/ci_build/github/apple/build_and_assemble_ios_pods.py \ + python3 tools/ci_build/github/apple/build_and_assemble_apple_pods.py \ --staging-dir /path/to/staging/dir \ --include-ops-by-config /path/to/ops.config \ --build-settings-file /path/to/build_settings.json @@ -186,14 +186,14 @@ To produce pods for an iOS build, use the [build_and_assemble_ios_pods.py](https The reduced set of ops in the custom build is specified with the file provided to the `--include_ops_by_config` option. See the current op config used by the pre-built mobile package at [tools/ci_build/github/android/mobile_package.required_operators.config](https://github.com/microsoft/onnxruntime/blob/main/tools/ci_build/github/android/mobile_package.required_operators.config) (Android and iOS pre-built mobile packages share the same config file). You can use this file directly. - The default package does not include the training APIs. To create a training package, add `--enable_training_apis` in the build options file provided to `--build-settings-file` and add the `--variant Training` option when calling `build_and_assemble_ios_pods.py`. + The default package does not include the training APIs. To create a training package, add `--enable_training_apis` in the build options file provided to `--build-settings-file` and add the `--variant Training` option when calling `build_and_assemble_apple_pods.py`. 
For example: ```bash # /path/to/build_settings.json is a file that includes the `--enable_training_apis` option - python3 tools/ci_build/github/apple/build_and_assemble_ios_pods.py \ + python3 tools/ci_build/github/apple/build_and_assemble_apple_pods.py \ --staging-dir /path/to/staging/dir \ --include-ops-by-config /path/to/ops.config \ --build-settings-file /path/to/build_settings.json \ diff --git a/docs/build/eps.md b/docs/build/eps.md index 2c6e2c894824a..410944953d009 100644 --- a/docs/build/eps.md +++ b/docs/build/eps.md @@ -104,7 +104,7 @@ See more information on the TensorRT Execution Provider [here](../execution-prov * The path to the CUDA installation must be provided via the CUDA_PATH environment variable, or the `--cuda_home` parameter. The CUDA path should contain `bin`, `include` and `lib` directories. * The path to the CUDA `bin` directory must be added to the PATH environment variable so that `nvcc` is found. * The path to the cuDNN installation (path to cudnn bin/include/lib) must be provided via the cuDNN_PATH environment variable, or `--cudnn_home` parameter. - * On Windows, cuDNN requires [zlibwapi.dll](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-zlib-windows). Feel free to place this dll under `path_to_cudnn/bin` + * On Windows, cuDNN requires [zlibwapi.dll](https://docs.nvidia.com/deeplearning/cudnn/installation/windows.html). Feel free to place this dll under `path_to_cudnn/bin` * Follow [instructions for installing TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html) * The TensorRT execution provider for ONNX Runtime is built and tested with TensorRT 8.6. * The path to TensorRT installation must be provided via the `--tensorrt_home` parameter. 
diff --git a/docs/execution-providers/QNN-ExecutionProvider.md b/docs/execution-providers/QNN-ExecutionProvider.md index 6ad125c231bef..61d0aae21c50b 100644 --- a/docs/execution-providers/QNN-ExecutionProvider.md +++ b/docs/execution-providers/QNN-ExecutionProvider.md @@ -55,6 +55,9 @@ The QNN Execution Provider supports a number of configuration options. These pro |'basic'|| |'detailed'|| +See [profiling-tools](../performance/tune-performance/profiling-tools.md) for more info on profiling. +As an alternative to setting profiling_level at compile time, profiling can be enabled dynamically with ETW (Windows). See [tracing](../performance/tune-performance/logging_tracing.md) for more details. + |`"rpc_control_latency"`|Description| |---|---| |microseconds (string)|allows client to set up RPC control latency in microseconds| diff --git a/docs/performance/tune-performance/iobinding.md b/docs/performance/tune-performance/iobinding.md index baa97420d185f..d8c433fa9d7ad 100644 --- a/docs/performance/tune-performance/iobinding.md +++ b/docs/performance/tune-performance/iobinding.md @@ -2,7 +2,7 @@ title: I/O Binding grand_parent: Performance parent: Tune performance -nav_order: 4 +nav_order: 5 --- # I/O Binding diff --git a/docs/performance/tune-performance/logging_tracing.md b/docs/performance/tune-performance/logging_tracing.md new file mode 100644 index 0000000000000..ff0aa7684befd --- /dev/null +++ b/docs/performance/tune-performance/logging_tracing.md @@ -0,0 +1,95 @@ +--- +title: Logging & Tracing +grand_parent: Performance +parent: Tune performance +nav_order: 2 +--- + +# Logging & Tracing + +## Contents +{: .no_toc } + +* TOC placeholder +{:toc} + + +## Developer Logging + +ONNX Runtime has built-in cross-platform internal [printf style logging LOGS()](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/common/logging/macros.h). This logging can be configured in *production builds* by a developer **using the API**. 
+ +There will likely be a performance penalty for using the default sink output (stdout) with more verbose log severity levels. + +### log_severity_level +[Python](https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.SessionOptions.log_severity_level) (below) - [C/C++ CreateEnv](https://onnxruntime.ai/docs/api/c/struct_ort_api.html#a22085f699a2d1adb52f809383f475ed1) / [OrtLoggingLevel](https://onnxruntime.ai/docs/api/c/group___global.html#ga1c0fbcf614dbd0e2c272ae1cc04c629c) - [.NET/C#](https://onnxruntime.ai/docs/api/csharp/api/Microsoft.ML.OnnxRuntime.SessionOptions.html#Microsoft_ML_OnnxRuntime_SessionOptions_LogSeverityLevel) +```python +sess_opt = ort.SessionOptions() +sess_opt.log_severity_level = 0 # Verbose +sess = ort.InferenceSession('model.onnx', sess_opt) +``` + +### Note +Note that [log_verbosity_level](https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.SessionOptions.log_verbosity_level) is a separate setting and only available in DEBUG custom builds. + +## Tracing About + +Tracing is a superset of logging in that tracing +- Includes the previously mentioned logging +- Adds tracing events that are more structured than printf style logging +- Can be integrated with a larger tracing ecosystem of the OS, such that + - Tracing from multiple systems with ONNX, OS system level, and user-mode software that uses ONNX can be combined + - Timestamps are high resolution and consistent with other traced components + - Can log at high performance with a high number of events / second. + - Events are not logged via stdout, but instead usually via a high-performance in-memory sink + - Can be enabled dynamically at run-time to investigate issues including in production systems + +Currently, only TraceLogging combined with Windows ETW is supported, although [TraceLogging](https://github.com/microsoft/tracelogging) is cross-platform and support for other OS instrumentation systems could be added. 
+ +## Tracing - Windows + +Two main ONNX Runtime TraceLogging providers can be enabled at run-time and captured with Windows [ETW](https://learn.microsoft.com/en-us/windows-hardware/test/weg/instrumenting-your-code-with-etw) + +### Quickstart Tracing with WPR + +On Windows, you can use Windows Performance Recorder ([WPR](https://learn.microsoft.com/en-us/windows-hardware/test/wpt/wpr-command-line-options)) to capture a trace. The two providers covered below are already configured in the WPR profiles linked here. + +- Download [ort.wprp](https://github.com/microsoft/onnxruntime/blob/main/ort.wprp) and [etw_provider.wprp](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/platform/windows/logging/etw_provider.wprp) (these could also be combined later) + +```dos +wpr -start ort.wprp -start etw_provider.wprp +echo Repro the issue allowing ONNX to run +wpr -stop onnx.etl -compress +``` + +### ONNXRuntimeTraceLoggingProvider +Beginning in ONNX Runtime 1.17, the [ONNXRuntimeTraceLoggingProvider](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/platform/windows/logging/HowToValidateEtwSinkOutput.md) can also be enabled. + +This provider dynamically traces, with high performance, the output of the previously mentioned LOGS() macro, which was previously controlled only by log_severity_level. When a user or developer traces with this provider, the log severity level is set dynamically from the ETW level they provide at run-time. 
+ +Provider Name: ONNXRuntimeTraceLoggingProvider +Provider GUID: 929DD115-1ECB-4CB5-B060-EBD4983C421D +Keyword: Logs (0x2) per [logging.h](https://github.com/ivberg/onnxruntime/blob/user/ivberg/ETWRundown/include/onnxruntime/core/common/logging/logging.h#L83) +Level: 1 (CRITICAL) through 5 (VERBOSE) per [TraceLoggingLevel](https://learn.microsoft.com/en-us/windows/win32/api/traceloggingprovider/nf-traceloggingprovider-tracelogginglevel#remarks) + +### Microsoft.ML.ONNXRuntime + +The [Microsoft.ML.ONNXRuntime](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/platform/windows/telemetry.cc#L47) provider provides structured logging. + +Provider Name: Microsoft.ML.ONNXRuntime +Provider GUID: 3a26b1ff-7484-7484-7484-15261f42614d +Keywords: Multiple per [logging.h](https://github.com/ivberg/onnxruntime/blob/user/ivberg/ETWRundown/include/onnxruntime/core/common/logging/logging.h#L81) +Level: 1 (CRITICAL) through 5 (VERBOSE) per [TraceLoggingLevel](https://learn.microsoft.com/en-us/windows/win32/api/traceloggingprovider/nf-traceloggingprovider-tracelogginglevel#remarks) +Note: This provider supports ETW [CaptureState](https://learn.microsoft.com/en-us/windows-hardware/test/wpt/capturestateonsave) (Rundown) for logging state, for example when a trace is saved + +ORT 1.17 includes new events that log session options and EP provider options + +#### Profiling + +Microsoft.ML.ONNXRuntime can also output profiling events. 
That is covered in [profiling](profiling-tools.md) + +### WinML + +WindowsML has its own tracing providers that can be enabled in addition to the providers above + +- Microsoft.Windows.WinML - d766d9ff-112c-4dac-9247-241cf99d123f +- Microsoft.Windows.AI.MachineLearning - BCAD6AEE-C08D-4F66-828C-4C43461A033D \ No newline at end of file diff --git a/docs/performance/tune-performance/memory.md b/docs/performance/tune-performance/memory.md index f868df4501f53..54b92c91a69ef 100644 --- a/docs/performance/tune-performance/memory.md +++ b/docs/performance/tune-performance/memory.md @@ -2,7 +2,7 @@ title: Memory consumption grand_parent: Performance parent: Tune performance -nav_order: 2 +nav_order: 3 --- # Reduce memory consumption diff --git a/docs/performance/tune-performance/profiling-tools.md b/docs/performance/tune-performance/profiling-tools.md index 66decc518cd52..3d1973508fb88 100644 --- a/docs/performance/tune-performance/profiling-tools.md +++ b/docs/performance/tune-performance/profiling-tools.md @@ -38,6 +38,34 @@ In both cases, you will get a JSON file which contains the detailed performance * Type chrome://tracing in the address bar * Load the generated JSON file +## Execution Provider (EP) Profiling + +Starting with ONNX Runtime 1.17, support has been added to profile EPs or Neural Processing Units (NPUs), if the EP supports profiling in its SDK + +## Qualcomm QNN EP + +As mentioned in the [QNN EP Doc](../../execution-providers/QNN-ExecutionProvider.md), profiling is supported + +### Cross-Platform CSV Tracing + +The Qualcomm AI Engine Direct SDK (QNN SDK) supports profiling. QNN outputs profiling data as CSV text when a developer uses the QNN SDK directly, outside ONNX Runtime. To provide equivalent functionality, ONNX Runtime mimics this support and outputs the same CSV format. 
+ +If profiling_level is provided, ONNX Runtime appends the profiling log to a qnn-profiling-data.csv [file](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/qnn/builder/qnn_backend_manager.cc#L911) in the current working directory + +### TraceLogging ETW (Windows) Profiling + +As covered in [logging](logging_tracing.md), ONNX Runtime supports dynamic enablement of tracing ETW providers, specifically with the following settings. If the TraceLogging provider is enabled and profiling_level was provided, then CSV support is automatically disabled + +- Provider Name: Microsoft.ML.ONNXRuntime +- Provider GUID: 3a26b1ff-7484-7484-7484-15261f42614d +- Keywords: Profiling = 0x100 per [logging.h](https://github.com/ivberg/onnxruntime/blob/user/ivberg/ETWRundown/include/onnxruntime/core/common/logging/logging.h#L81) +- Level: + - 5 (VERBOSE) = profiling_level=basic (good details without perf loss) + - greater than 5 = profiling_level=detailed (individual ops are logged with inference perf hit) +- Event: [QNNProfilingEvent](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/qnn/builder/qnn_backend_manager.cc#L1083) + +## GPU Profiling + To profile CUDA kernels, please add the cupti library to your PATH and use the onnxruntime binary built from source with `--enable_cuda_profiling`. To profile ROCm kernels, please add the roctracer library to your PATH and use the onnxruntime binary built from source with `--enable_rocm_profiling`. 
@@ -55,4 +83,4 @@ If an operator called multiple kernels during execution, the performance numbers {"cat":"Node", "name":<name of the node>, ...} {"cat":"Kernel", "name":<name of the kernel called first>, ...} {"cat":"Kernel", "name":<name of the kernel called next>, ...} -``` \ No newline at end of file +``` diff --git a/docs/performance/tune-performance/threading.md b/docs/performance/tune-performance/threading.md index 2b546422c080c..a6603dbadf589 100644 --- a/docs/performance/tune-performance/threading.md +++ b/docs/performance/tune-performance/threading.md @@ -2,7 +2,7 @@ title: Thread management grand_parent: Performance parent: Tune performance -nav_order: 3 +nav_order: 4 --- # Thread management diff --git a/docs/performance/tune-performance/troubleshooting.md b/docs/performance/tune-performance/troubleshooting.md index 51c7c6fd4ac1e..de481b3d38df5 100644 --- a/docs/performance/tune-performance/troubleshooting.md +++ b/docs/performance/tune-performance/troubleshooting.md @@ -2,7 +2,7 @@ title: Troubleshooting grand_parent: Performance parent: Tune performance -nav_order: 5 +nav_order: 6 --- # Troubleshooting performance issues diff --git a/docs/tutorials/csharp/csharp-gpu.md b/docs/tutorials/csharp/csharp-gpu.md index a7dd199073f7a..3f62fdd649781 100644 --- a/docs/tutorials/csharp/csharp-gpu.md +++ b/docs/tutorials/csharp/csharp-gpu.md @@ -31,7 +31,7 @@ See this table for supported versions: NOTE: Full table can be found [here](https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements) -- Follow section [2. Installing cuDNN on Windows](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-windows). NOTE: Skip step 5 in section 2.3 on updating Visual Studio settings, this is only for C++ projects. +- Follow section [2. Installing cuDNN on Windows](https://docs.nvidia.com/deeplearning/cudnn/installation/windows.html). 
NOTE: Skip step 5 in section 2.3 on updating Visual Studio settings, this is only for C++ projects. - Restart your computer and verify the installation by running the following command or in python with PyTorch: