
ORT ETW dynamic logging that improves ORT diagnosability & performance #18882

Merged
merged 22 commits into microsoft:main from user/ivberg/ORTDynamicETW
Jan 11, 2024

Conversation

ivberg
Contributor

@ivberg ivberg commented Dec 20, 2023

Description

This PR combines several ORT ETW changes that improve ORT log diagnosability & performance.

  • The existing logging behavior and Severity configured by the developer through the ORT API remain the same.
  • The PR keeps the existing design, which defines 2 TraceLogging providers (although not both were in use before this PR).
  • Inference (inf) and session load performance remain strong even with dynamic logging enabled (see below).
  • On Windows, when ONNXRuntimeTraceLoggingProvider is enabled, ORT dynamically adds a new sink reflecting the severity level requested via ETW, e.g. anywhere from Critical to Verbose, per the need at runtime.
    • This allows existing printf-style LOGS() statements, for both CPU and NPU cases, to flow to ETW via a local trace (if enabled).
  • This DOES NOT add any new Telemetry that could optionally be sent to Microsoft.
    • Telemetry consists of ETW events marked with the Measure keyword that can be sampled if a machine opts in.
  • Existing Microsoft.ML.ONNXRuntime events have appropriate keywords and levels added where they were missing.
  • If Execution Providers (EPs) can provide more detailed insight into their hardware or component, this PR allows that insight to be logged dynamically instead of only at compile time.
    • In this PR, the QNN EP for Qualcomm NPUs can have basic or detailed profiling enabled to give some insight into how the NPU is performing.
    • When the Microsoft.ML.ONNXRuntime ETW provider is enabled with the Profiling keyword at level 5, QNN basic profiling info is output to ETW (see the example command after this list).
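As a rough illustration of enabling this dynamically, a local trace can be started with xperf. This is only a sketch: the GUID below is the Microsoft.ML.ONNXRuntime provider GUID used in the xperf examples later in this thread, and the keyword mask and level are illustrative placeholders rather than the exact Profiling keyword value.

```
:: Sketch: start a local ETW session for the Microsoft.ML.ONNXRuntime provider
:: (GUID taken from the xperf examples later in this thread).
:: Keyword mask 0xFFFFFFFFFFFFFFFF and level 5 (Verbose) are illustrative values.
xperf -start ort -on 3a26b1ff-7484-7484-7484-15261f42614d:0xFFFFFFFFFFFFFFFF:5

:: ...run the ORT workload while the session is active...

:: Stop the session and merge to an .etl file that can be opened in WPA.
xperf -stop ort -d ort.etl
```

The same kind of session can equally be captured with wpr and the ort.wprp profile referenced later in this thread.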

Motivation and Context

  • This makes ORT logging and diagnosability more performant (on Windows) and available in a wider variety of runtime environments.
    • Inference times were roughly 300x (or more) faster when these logs were output to ETW rather than stdout (Verbose severity).
  • This style of dynamic ETW tracing is the widely used standard for Windows components, and is even used by some 3rd-party software, since the ETW API is open and part of the Windows API.
  • These ETW-based logs can be seamlessly combined with other ETW logs, such as those from an AI component/feature using ORT, OS CPU profiling, scheduling, and more.
  • Before this PR, ORT logging was largely printf style and output to a sink (usually stdout) only if compiled with a certain log Severity. When enabled, that logging (to stdout) would vastly slow down inference times. Once compiled for release, the internal ORT logs were not accessible to anyone except the AI model developer in their dev inner loop. That meant logs could not be enabled on a lab machine, or on a production system where runtime behavior or performance might differ for various reasons on a wide variety of HW.
  • This change was tested with performance in mind, using a MobileNet small AI model with onnxruntime_perf_test (see the example invocation after this list).
    • CPU: There was no statistical difference in inference times between the baseline (main) and this PR, whether ETW was enabled or not (both ORT providers, all keywords, level 5).
    • NPU (QNN on SP9 or Dev Kit 2023 with the Qualcomm SQ3): There was no statistical difference in inference times with this PR whether ETW (both ORT providers, all keywords) was enabled or not at level 5 (Verbose), even with QNN basic profiling turned on and outputting NPU stats to ETW.
      • As expected and designed, there was a perf slowdown when the max level 255 is enabled, which translates to QNN detailed profiling. This mirrors the expected slowdown in the NPU when individual model operations are monitored & recorded, and is similar to the QNN SDK detailed-profiling performance separate from this PR. Detailed profiling is deliberately placed above level 5 (Verbose), since that is commonly the max level used in many trace profiles, so it won't affect inf performance.
  • Other OSes such as Linux & Android are left untouched for now.
    • Out of scope for this PR, but TraceLogging is available on Linux via LTTng tracing, so in the future this optional tracing could also be made available on other OSes where a TraceLogging API exists.
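For reference, a perf comparison of this shape can be approximated with onnxruntime_perf_test. This is a sketch under assumptions: the model file name is a placeholder, and the exact flags (execution provider selection, repeat count, QNN backend path) may differ per build and device.

```
:: Sketch of a CPU baseline run; model path and repeat count are placeholders.
onnxruntime_perf_test.exe -e cpu -m times -r 200 mobilenet_small.onnx

:: Sketch of a QNN (NPU) run; the backend_path provider option is assumed to be
:: how the QNN EP is pointed at the HTP backend for this tool.
onnxruntime_perf_test.exe -e qnn -i "backend_path|QnnHtp.dll" -m times -r 200 mobilenet_small.onnx
```

Runs with and without an active ETW session (as in the xperf sketch above) can then be compared for inference-time differences.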

snnn
snnn previously approved these changes Dec 21, 2023

@ivberg ivberg requested a review from skottmckay January 8, 2024 23:51

@ivberg ivberg requested a review from snnn January 9, 2024 05:02
@skottmckay
Contributor

Has this been merged with main any time recently? I don't see anything obvious in the commits that looks like a merge. The Web CI failure may go away with the latest main.

@ivberg
Contributor Author

ivberg commented Jan 9, 2024

> Has this been merged with main any time recently? I don't see anything obvious in the commits that looks like a merge. The Web CI failure may go away with the latest main.

Last rebase to main was on 12/13. I can rebase again, although at this time I don't think the test failures are related to this code. Rebased to main now (1/9).

@ivberg ivberg force-pushed the user/ivberg/ORTDynamicETW branch from 955cac4 to 7e387c6 Compare January 9, 2024 18:58
@ivberg
Contributor Author

ivberg commented Jan 9, 2024

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

Azure Pipelines successfully started running 9 pipeline(s).

Contributor

@HectorSVC HectorSVC left a comment

:shipit:

@jywu-msft jywu-msft merged commit 4d1243b into microsoft:main Jan 11, 2024
79 of 89 checks passed
@ivberg ivberg deleted the user/ivberg/ORTDynamicETW branch January 23, 2024 18:19
ivberg added a commit that referenced this pull request Feb 7, 2024
…ing (#19428)

### Description
Added docs for ONNX Runtime 1.17 covering logging, tracing, and QNN EP profiling.

### Motivation and Context
- ONNX Runtime logging has not been documented
- ONNX Runtime tracing on Windows has barely been documented
- ONNX Runtime 1.17 has new tracing and QNN EP profiling

PRs: #16259, #18201, #18882, #19397
ivberg added a commit that referenced this pull request Feb 8, 2024
…er options (#19397)

### Description
Add capturestate / rundown ETW support logging for session and provider
options.

### Motivation and Context
Follow-up to #16259 and #18882

This is very useful when you have longer-running ONNX Runtime sessions, which will be the case for a lot of AI workloads. That means ETW tracing may start minutes or hours after a process & session has been established. When a trace is captured, you want to know the state of ONNX Runtime at that time. That state is the session and config options, so this change makes them show up in the trace.

Tested with xperf and ORT:
xperf -start ort -on 3a26b1ff-7484-7484-7484-15261f42614d
xperf -capturestate ort 3a26b1ff-7484-7484-7484-15261f42614d   <-- run this after the session has been up for some time
xperf -stop ort -d .\ort.etl   <-- the trace will now also have rundown events

These events will also show up if you use WPR with [CaptureStateOnSave](https://learn.microsoft.com/en-us/windows-hardware/test/wpt/capturestateonsave).
YUNQIUGUO pushed a commit that referenced this pull request Feb 9, 2024
…er options (#19397)

ivberg added a commit that referenced this pull request Jun 7, 2024
### Description
Windows - Fully dynamic ETW controlled logging for ORT and QNN logs

The logging support is documented here 
-
https://onnxruntime.ai/docs/performance/tune-performance/logging_tracing.html#tracing---windows
-
https://onnxruntime.ai/docs/performance/tune-performance/profiling-tools.html#tracelogging-etw-windows-profiling

Also add support for logging ORT SessionCreation on ETW CaptureState

### Motivation and Context
The previous ETW support only worked if you enabled ETW before the
session started. There can commonly be long-lived AI inference processes
that need to be traced & debugged. This enables logging fully on the
fly.

Without this support, a dev would have to kill a process or stop a service in order to get tracing. We had to do this for a recent issue with QNN; it was painful to get the logs, and it ruined the repro.

### Testing
I tested with the following cases (a quick verification sketch follows the list):
- Leaving a default ORT run as-is
- Enabling ETW prior to start and leaving it running for the entire session + inferences, then stopping
- Starting an ORT session + inferences, then enabling and stopping ETW:
  - Start an ORT session with long-running inferences
  - wpr -start [ort.wprp](https://github.com/microsoft/onnxruntime/blob/e6228575e4d5866bdb831e76cc93e6c35af4de8b/ort.wprp#L4) -start [etw_provider.wprp](https://github.com/microsoft/onnxruntime/blob/e6228575e4d5866bdb831e76cc93e6c35af4de8b/onnxruntime/test/platform/windows/logging/etw_provider.wprp)
  - Wait a few seconds
  - wpr -stop ort.etl
  - Inferences are still running
  - Verify that ONNXRuntimeLogEvent provider events are present and that the new SessionCreation_CaptureState event appears under the Microsoft.ML.ONNXRuntime provider
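As a rough way to double-check that the expected events landed in the trace (outside of WPA), the merged .etl can be dumped with tracerpt and searched. This is only a sketch; the output file name is arbitrary.

```
:: Sketch: dump the merged trace to XML and look for the new capture-state event.
tracerpt ort.etl -o ort_events.xml -of XML -y
findstr /i "SessionCreation_CaptureState" ort_events.xml
```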

Related:
#18882
#19428