
ORT ETW dynamic logging that improves ORT diagnosability & performance #18882

Merged
merged 22 commits into microsoft:main from user/ivberg/ORTDynamicETW
Jan 11, 2024

Conversation

ivberg
Contributor

@ivberg ivberg commented Dec 20, 2023

Description

This PR combines several ORT ETW changes that improve ORT log diagnosability & performance.

  • The existing logging behavior and Severity configured by the developer through the ORT API remain the same.
  • The PR keeps the existing design, which defines 2 TraceLogging providers (although not both were in use before this PR).
  • Inference (inf) and session load performance remain strong even with dynamic logging enabled (see below).
  • On Windows, when ONNXRuntimeTraceLoggingProvider is enabled, ORT dynamically adds a new sink reflecting the severity level requested via ETW, e.g. anywhere from Critical to Verbose, per the need at runtime.
    • This allows existing printf-style LOGS() statements, for both CPU and NPU cases, to flow to ETW via a local trace (if enabled).
  • This DOES NOT add any new Telemetry that could optionally be sent to Microsoft.
    • Telemetry consists of ETW events marked with the Measure keyword that can be sampled if a machine opts in.
  • Existing Microsoft.ML.ONNXRuntime events have appropriate keywords and levels added where they were missing.
  • If Execution Providers (EPs) can provide more detailed insight into their hardware or component, this PR allows that insight to be logged dynamically instead of only at compile time.
    • In this PR, the QNN EP for Qualcomm NPUs can have basic or detailed profiling enabled to give some insight into how the NPU is performing.
    • When the Microsoft.ML.ONNXRuntime ETW provider is enabled with the Profiling keyword at level 5, QNN basic profiling info is output to ETW (see the example command after this list).
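As a rough illustration of enabling this dynamically, a local trace can be started with xperf. This is only a sketch: the GUID below is the Microsoft.ML.ONNXRuntime provider GUID used in the xperf examples later in this thread, and the keyword mask and level are illustrative placeholders rather than the exact Profiling keyword value.

```
:: Sketch: start a local ETW session for the Microsoft.ML.ONNXRuntime provider
:: (GUID taken from the xperf examples later in this thread).
:: Keyword mask 0xFFFFFFFFFFFFFFFF and level 5 (Verbose) are illustrative values.
xperf -start ort -on 3a26b1ff-7484-7484-7484-15261f42614d:0xFFFFFFFFFFFFFFFF:5

:: ...run the ORT workload while the session is active...

:: Stop the session and merge to an .etl file that can be opened in WPA.
xperf -stop ort -d ort.etl
```

The same kind of session can equally be captured with wpr and the ort.wprp profile referenced later in this thread.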

Motivation and Context

  • This makes ORT logging and diagnosability more performant (on Windows) and available in a wider variety of runtime environments.
    • Inference times were roughly 300x (or more) faster when these logs were output to ETW rather than stdout (Verbose severity).
  • This style of dynamic ETW tracing is the widely used standard for Windows components, and is even used by some 3rd-party software, since the ETW API is open and part of the Windows API.
  • These ETW-based logs can be seamlessly combined with other ETW logs, such as those from an AI component/feature using ORT, OS CPU profiling, scheduling, and more.
  • Before this PR, ORT logging was largely printf style and output to a sink (usually stdout) only if compiled with a certain log Severity. When enabled, that logging (to stdout) would vastly slow down inference times. Once compiled for release, the internal ORT logs were not accessible to anyone except the AI model developer in their dev inner loop. That meant logs could not be enabled on a lab machine, or on a production system where runtime behavior or performance might differ for various reasons on a wide variety of HW.
  • This change was tested with performance in mind, using a MobileNet small AI model with onnxruntime_perf_test (see the example invocation after this list).
    • CPU: There was no statistical difference in inference times between the baseline (main) and this PR, whether ETW was enabled or not (both ORT providers, all keywords, level 5).
    • NPU (QNN on SP9 or Dev Kit 2023 with the Qualcomm SQ3): There was no statistical difference in inference times with this PR whether ETW (both ORT providers, all keywords) was enabled or not at level 5 (Verbose), even with QNN basic profiling turned on and outputting NPU stats to ETW.
      • As expected and designed, there was a perf slowdown when the max level 255 is enabled, which translates to QNN detailed profiling. This mirrors the expected slowdown in the NPU when individual model operations are monitored & recorded, and is similar to the QNN SDK detailed-profiling performance separate from this PR. Detailed profiling is deliberately placed above level 5 (Verbose), since that is commonly the max level used in many trace profiles, so it won't affect inf performance.
  • Other OSes such as Linux & Android are left untouched for now.
    • Out of scope for this PR, but TraceLogging is available on Linux via LTTng tracing, so in the future this optional tracing could also be made available on other OSes where a TraceLogging API exists.
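For reference, a perf comparison of this shape can be approximated with onnxruntime_perf_test. This is a sketch under assumptions: the model file name is a placeholder, and the exact flags (execution provider selection, repeat count, QNN backend path) may differ per build and device.

```
:: Sketch of a CPU baseline run; model path and repeat count are placeholders.
onnxruntime_perf_test.exe -e cpu -m times -r 200 mobilenet_small.onnx

:: Sketch of a QNN (NPU) run; the backend_path provider option is assumed to be
:: how the QNN EP is pointed at the HTP backend for this tool.
onnxruntime_perf_test.exe -e qnn -i "backend_path|QnnHtp.dll" -m times -r 200 mobilenet_small.onnx
```

Runs with and without an active ETW session (as in the xperf sketch above) can then be compared for inference-time differences.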

snnn
snnn previously approved these changes Dec 21, 2023

@ivberg ivberg requested a review from skottmckay January 8, 2024 23:51

@ivberg ivberg requested a review from snnn January 9, 2024 05:02
@skottmckay
Contributor

Has this been merged with main any time recently? I don't see anything obvious in the commits that looks like a merge. The Web CI failure may go away with the latest main.

@ivberg
Contributor Author

ivberg commented Jan 9, 2024

> Has this been merged with main any time recently? I don't see anything obvious in the commits that looks like a merge. The Web CI failure may go away with the latest main.

Last rebase to main was on 12/13. I can rebase again, although at this time I don't think the test failures are related to this code. Rebased to main now (1/9).

@ivberg ivberg force-pushed the user/ivberg/ORTDynamicETW branch from 955cac4 to 7e387c6 Compare January 9, 2024 18:58
@ivberg
Contributor Author

ivberg commented Jan 9, 2024

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

Azure Pipelines successfully started running 9 pipeline(s).

Contributor

@HectorSVC HectorSVC left a comment

:shipit:

@jywu-msft jywu-msft merged commit 4d1243b into microsoft:main Jan 11, 2024
79 of 89 checks passed
@ivberg ivberg deleted the user/ivberg/ORTDynamicETW branch January 23, 2024 18:19
ivberg added a commit that referenced this pull request Feb 7, 2024
…ing (#19428)

### Description
Added docs for ONNX Runtime 1.17 covering logging, tracing, and QNN EP profiling.

### Motivation and Context
- ONNX Runtime logging has not been documented
- ONNX Runtime tracing on Windows has barely been documented
- ONNX Runtime 1.17 has new tracing and QNN EP profiling

PRs: #16259, #18201, #18882, #19397
ivberg added a commit that referenced this pull request Feb 8, 2024
…er options (#19397)

### Description
Add capturestate / rundown ETW support logging for session and provider
options.

### Motivation and Context
Follow-up to #16259 and #18882

This is very useful when you have longer-running ONNX Runtime sessions, which will be the case for a lot of AI workloads. That means ETW tracing may start minutes or hours after a process & session has been established. When a trace is captured, you want to know the state of ONNX Runtime at that time. That state is the session and config options, so this change makes them show up in the trace.

Tested with xperf and ORT:
xperf -start ort -on 3a26b1ff-7484-7484-7484-15261f42614d
xperf -capturestate ort 3a26b1ff-7484-7484-7484-15261f42614d   <-- run this after the session has been up for some time
xperf -stop ort -d .\ort.etl   <-- the trace will now also have rundown events

These events will also show up if you use WPR with [CaptureStateOnSave](https://learn.microsoft.com/en-us/windows-hardware/test/wpt/capturestateonsave).
YUNQIUGUO pushed a commit that referenced this pull request Feb 9, 2024
…er options (#19397)

ivberg added a commit that referenced this pull request Jun 7, 2024
### Description
Windows - Fully dynamic ETW controlled logging for ORT and QNN logs

The logging support is documented here 
-
https://onnxruntime.ai/docs/performance/tune-performance/logging_tracing.html#tracing---windows
-
https://onnxruntime.ai/docs/performance/tune-performance/profiling-tools.html#tracelogging-etw-windows-profiling

Also add support for logging ORT SessionCreation on ETW CaptureState

### Motivation and Context
The previous ETW support only worked if you enabled ETW before the
session started. There can commonly be long-lived AI inference processes
that need to be traced & debugged. This enables logging fully on the
fly.

Without this support, a dev would have to kill a process or stop a service in order to get tracing. We had to do this for a recent issue with QNN; it was painful to get the logs, and it ruined the repro.

### Testing
I tested with the following cases (a quick verification sketch follows the list):
- Leaving a default ORT run as-is
- Enabling ETW prior to start and leaving it running for the entire session + inferences, then stopping
- Starting an ORT session + inferences, then enabling and stopping ETW:
  - Start an ORT session with long-running inferences
  - wpr -start [ort.wprp](https://github.com/microsoft/onnxruntime/blob/e6228575e4d5866bdb831e76cc93e6c35af4de8b/ort.wprp#L4) -start [etw_provider.wprp](https://github.com/microsoft/onnxruntime/blob/e6228575e4d5866bdb831e76cc93e6c35af4de8b/onnxruntime/test/platform/windows/logging/etw_provider.wprp)
  - Wait a few seconds
  - wpr -stop ort.etl
  - Inferences are still running
  - Verify that ONNXRuntimeLogEvent provider events are present and that the new SessionCreation_CaptureState event appears under the Microsoft.ML.ONNXRuntime provider
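As a rough way to double-check that the expected events landed in the trace (outside of WPA), the merged .etl can be dumped with tracerpt and searched. This is only a sketch; the output file name is arbitrary.

```
:: Sketch: dump the merged trace to XML and look for the new capture-state event.
tracerpt ort.etl -o ort_events.xml -of XML -y
findstr /i "SessionCreation_CaptureState" ort_events.xml
```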

Related:
#18882
#19428