Bottleneck Characterization (Known Issues) #378
-
I'm aware of the ticket, but reproducers are needed; otherwise the only option is to speculate, and that is not a good approach. For example, based on the results in the image in the ticket, I could speculate that there is a bug in the Python interpreter instrumentation causing this weird behavior, but, as was mentioned, the problem has been seen in more than one app. If at least one of those apps is not Python-based, then any work on the Python interpreter instrumentation in pursuit of this problem is a waste of time.
roctracer has a poor and unwieldy interface for distinguishing kernels from memory copies. Furthermore, the absence of useful samples and documentation, API breakages, and very poor testing over the years have made it a nightmare internally; it's basically held together by duct tape. I'm almost finished with the new rocprofiler-sdk implementation (which completely removes everything related to roctracer and rocprofiler-v1). This particular issue no longer exists in that implementation: memory copies are placed on different Perfetto tracks than kernel dispatches.
-
This is a discussion to track known issues and workarounds for the new "Bottleneck Characterization" feature being added to Omniperf's analyze mode, e.g.:
omniperf analyze -p workloads/dummy/MI200 --bottleneck-trace omnitrace-dummy_app.proto --gui
CC: @dwchang79 (#242)
Omnitrace "Did not end" Bug
Sometimes Omnitrace will report that a slice in the trace did not end, and the workload-characterization.py script will log an incorrect total trace time of -1 (i.e., column B in the CSV file). The screengrab visualizes the entire workload in the Perfetto UI and highlights where the error occurs.
The semi-transparent color near the top right shows the trace not ending, and the information in the bottom left corner says "did not end." When a trace does not end, the total trace time is logged as -1. Since the total trace time is used to create the end-to-end analysis, a value of -1 causes the plot-characterization.py script to generate an inaccurate end-to-end analysis.
To fix this, the total trace time in the CSV file needs to be manually updated with the correct value. If possible, the correct value can be found by opening the Perfetto proto file generated by Omnitrace in the Perfetto UI (https://ui.perfetto.dev/).
Manually navigate the trace in the Perfetto UI, find a slice near the end of the trace that does complete, and note its end time. Similarly, find a slice at the beginning of the trace and note its start time. The difference between these two timestamps is the total trace time. An example of this workaround can be seen below.
This bug will be investigated further and hopefully solved in a future revision. Until then, unfortunately, the manual workaround can be tedious. The bug ticket has been submitted (Slice has duration of "Did not end." omnitrace#311).
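If hunting for those timestamps by hand is too tedious, the same first-start/last-end difference can be computed programmatically. The sketch below is only an illustration, assuming the perfetto Python package (pip install perfetto) and its trace_processor API; it queries the standard slice table, where unfinished ("did not end") slices carry a duration of -1 and are excluded.

```python
# total_trace_time.py -- illustrative sketch, not part of Omniperf.
# Assumes `pip install perfetto` and a Perfetto proto written by Omnitrace.
from perfetto.trace_processor import TraceProcessor

def total_trace_time_ns(trace_path: str) -> int:
    """Return (last completed slice end) - (first slice start), in nanoseconds."""
    tp = TraceProcessor(trace=trace_path)
    try:
        # Unfinished ("did not end") slices have dur = -1, so skip them.
        row = next(iter(tp.query(
            "SELECT MIN(ts) AS first_start, MAX(ts + dur) AS last_end "
            "FROM slice WHERE dur >= 0")))
        return row.last_end - row.first_start
    finally:
        tp.close()

if __name__ == "__main__":
    # File name taken from the example command above; substitute your own trace.
    print(total_trace_time_ns("omnitrace-dummy_app.proto"))
```

The printed value can then be copied into column B of the CSV in place of the -1.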
Total Trace Time does not add up Bug
This bug shows up as a total trace time that appears too small, where the summation of the GPU and communication timing is greater than the total trace time. The plot-characterization.py script will then create an end-to-end analysis with negative times, as shown below.
The root cause is double counting of the GPU time and the Device to Host communication. Adding these two values together creates a value greater than the total trace time. In reality, the Device to Host and GPU time overlap and should only be counted once, but our tool cannot show this overlapped time correctly. A truncated version of the Perfetto file is shown below. In the figure, the hipMemcpy operation completely overlaps with the previously invoked GPU kernel, bodyForce().
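To make the double counting concrete, the toy sketch below sums a kernel span and a memcpy span two ways: naively, by adding durations, and with the overlap subtracted. The timestamps are invented for illustration and are not taken from the actual trace.

```python
# overlap_demo.py -- toy illustration of why summed per-category times can
# exceed the wall-clock trace time when spans overlap. Timestamps are made up.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two [start, end) intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

# Hypothetical spans (microseconds): the Device to Host copy sits entirely
# inside the kernel's execution window, as in the figure above.
kernel = (0, 1000)    # e.g. the bodyForce() dispatch
memcpy = (200, 600)   # e.g. the hipMemcpy Device to Host

kernel_dur = kernel[1] - kernel[0]
memcpy_dur = memcpy[1] - memcpy[0]

naive_total = kernel_dur + memcpy_dur                              # 1400 us > 1000 us wall clock
corrected = kernel_dur + memcpy_dur - overlap(*kernel, *memcpy)    # 1000 us

print(f"naive sum: {naive_total} us, overlap-corrected: {corrected} us")
```

The naive sum corresponds to what the tool currently reports, which is why the categories can add up to more than the total trace time.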
Looking at the Perfetto file in the Perfetto UI shows that the total trace time recorded by our tool
is correct and that value should be left alone.
Although there is currently no way to completely fix this, the user can open the CSV file and manually zero out the Device to Host entry that is being double counted. Admittedly, this is not 100% correct, since Device to Host timing is then no longer tracked, but the CPU, GPU, and other remaining traffic should be correct.
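A throwaway sketch of that CSV edit is shown below, assuming a comma-delimited file with a header row and a "Device to Host" column; both the column name and the file names are assumptions, so check them against the CSV that workload-characterization.py actually writes.

```python
# zero_d2h.py -- illustrative sketch of the manual workaround, not part of Omniperf.
# The column name "Device to Host" and the file names are hypothetical.
import csv

def zero_device_to_host(csv_in: str, csv_out: str, column: str = "Device to Host") -> None:
    with open(csv_in, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    for row in rows:
        row[column] = "0"   # zero out the double-counted Device to Host time
    with open(csv_out, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

zero_device_to_host("workload-characterization.csv", "workload-characterization-fixed.csv")
```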
Re-run the plot-characterization.py script and it should generate a more accurate end-to-end
analysis. A corrected end-to-end analysis is shown below.
Although this is admittedly not a 100% accurate end-to-end analysis, we have confidence that the GPU bottleneck breakdown is correct. Therefore, the user can be confident in the conclusions drawn regarding the workload's bottlenecks on the GPU. Also, the Device to Host communication time in Figure 12 is the correct amount of time, but again, the communication time would actually overlap with the GPU, and we have no way of visualizing this correctly.
This bug will be investigated further and hopefully solved in a future revision.