Bottleneck Characterization (Known Issues) #378
-
I'm aware of the ticket, but reproducers are needed; otherwise the only option is to speculate, and that is not a good approach. For example, based on the results in the image in the ticket, I could speculate that there is a bug in the Python interpreter instrumentation causing this weird behavior, but, as was mentioned, the problem has been seen in more than one app. If at least one of those apps is not Python-based, then any work on the Python interpreter instrumentation in pursuit of this problem is a waste of time.
roctracer has a poor and unwieldy interface for distinguishing kernels from memory copies. Furthermore, the absence of useful samples and documentation, API breakages, and very poor testing over the years have made it a nightmare internally; it's basically held together by duct tape. I'm almost finished with the new rocprofiler-sdk implementation (which completely removes everything related to roctracer and rocprofiler-v1). This particular issue no longer exists in that implementation: memory copies are placed on different Perfetto tracks than kernel dispatches.
-
This is a discussion to track known issues and workarounds for the new "Bottleneck Characterization" feature being added to Omniperf's analyze mode, e.g.:
omniperf analyze -p workloads/dummy/MI200 --bottleneck-trace omnitrace-dummy_app.proto --gui
CC: @dwchang79 (#242)
Omnitrace "Did not end" Bug
Sometimes Omnitrace will report that a slice in the trace did not end, and the workload-characterization.py script will log an incorrect total trace time of -1 (i.e., column B in the CSV file). The screengrab visualizes the entire workload in the Perfetto UI and highlights where the error occurs.
The semi-transparent color near the top right shows the trace not ending, and the information in the bottom left corner says "did not end." When a trace does not end, the total trace time is logged as -1. Since the total trace time is used to create the end-to-end analysis, a value of -1 causes the plot-characterization.py script to generate an inaccurate end-to-end analysis.
To fix this, the total trace time in the CSV file needs to be manually updated with the correct value. If possible, the correct value can be found by opening the Perfetto proto file generated by Omnitrace in the Perfetto UI (https://ui.perfetto.dev/).
Manually navigate the trace in the Perfetto UI, find a slice near the end of the trace that does complete, and note its end time. Similarly, find a slice at the beginning of the trace and note its start time. The difference between these two timestamps is the total trace time. An example of this workaround can be seen below.
This bug will be investigated further and hopefully solved in a future revision. Until then, unfortunately, the manual workaround can be tedious. The bug ticket has been submitted (Slice has duration of "Did not end." omnitrace#311).
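If hunting for those timestamps by hand is too tedious, the same first-start/last-end difference can be computed programmatically. The sketch below is only an illustration, assuming the perfetto Python package (pip install perfetto) and its trace_processor API; it queries the standard slice table, where unfinished ("did not end") slices carry a duration of -1 and are excluded.

```python
# total_trace_time.py -- illustrative sketch, not part of Omniperf.
# Assumes `pip install perfetto` and a Perfetto proto written by Omnitrace.
from perfetto.trace_processor import TraceProcessor

def total_trace_time_ns(trace_path: str) -> int:
    """Return (last completed slice end) - (first slice start), in nanoseconds."""
    tp = TraceProcessor(trace=trace_path)
    try:
        # Unfinished ("did not end") slices have dur = -1, so skip them.
        row = next(iter(tp.query(
            "SELECT MIN(ts) AS first_start, MAX(ts + dur) AS last_end "
            "FROM slice WHERE dur >= 0")))
        return row.last_end - row.first_start
    finally:
        tp.close()

if __name__ == "__main__":
    # File name taken from the example command above; substitute your own trace.
    print(total_trace_time_ns("omnitrace-dummy_app.proto"))
```

The printed value can then be copied into column B of the CSV in place of the -1.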
Total Trace Time does not add up Bug
This bug shows up as a total trace time that appears too small, where the summation of the GPU and communication timing is greater than the total trace time. The plot-characterization.py script will then create an end-to-end analysis with negative times, as shown below.
The root cause is double counting of the GPU time and the Device to Host communication. Adding these two values together creates a value greater than the total trace time. In reality, the Device to Host and GPU time overlap and should only be counted once, but our tool cannot show this overlapped time correctly. A truncated version of the Perfetto file is shown below. In the figure, the hipMemcpy operation completely overlaps with the previously invoked GPU kernel, bodyForce().
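To make the double counting concrete, the toy sketch below sums a kernel span and a memcpy span two ways: naively, by adding durations, and with the overlap subtracted. The timestamps are invented for illustration and are not taken from the actual trace.

```python
# overlap_demo.py -- toy illustration of why summed per-category times can
# exceed the wall-clock trace time when spans overlap. Timestamps are made up.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two [start, end) intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

# Hypothetical spans (microseconds): the Device to Host copy sits entirely
# inside the kernel's execution window, as in the figure above.
kernel = (0, 1000)    # e.g. the bodyForce() dispatch
memcpy = (200, 600)   # e.g. the hipMemcpy Device to Host

kernel_dur = kernel[1] - kernel[0]
memcpy_dur = memcpy[1] - memcpy[0]

naive_total = kernel_dur + memcpy_dur                              # 1400 us > 1000 us wall clock
corrected = kernel_dur + memcpy_dur - overlap(*kernel, *memcpy)    # 1000 us

print(f"naive sum: {naive_total} us, overlap-corrected: {corrected} us")
```

The naive sum corresponds to what the tool currently reports, which is why the categories can add up to more than the total trace time.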
Looking at the Perfetto file in the Perfetto UI shows that the total trace time recorded by our tool
is correct and that value should be left alone.
Although there is currently no way to completely fix this, the user can open the CSV file and manually zero out the Device to Host entry that is being double counted. Admittedly, this is not 100% correct, since Device to Host timing is then no longer tracked, but the CPU, GPU, and other remaining traffic should be correct.
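A throwaway sketch of that CSV edit is shown below, assuming a comma-delimited file with a header row and a "Device to Host" column; both the column name and the file names are assumptions, so check them against the CSV that workload-characterization.py actually writes.

```python
# zero_d2h.py -- illustrative sketch of the manual workaround, not part of Omniperf.
# The column name "Device to Host" and the file names are hypothetical.
import csv

def zero_device_to_host(csv_in: str, csv_out: str, column: str = "Device to Host") -> None:
    with open(csv_in, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    for row in rows:
        row[column] = "0"   # zero out the double-counted Device to Host time
    with open(csv_out, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

zero_device_to_host("workload-characterization.csv", "workload-characterization-fixed.csv")
```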
Re-run the plot-characterization.py script and it should generate a more accurate end-to-end
analysis. A corrected end-to-end analysis is shown below.
Although this is admittedly not a 100% accurate end-to-end analysis, we have confidence that the GPU bottleneck breakdown is correct. Therefore, the user can be confident in the conclusions drawn regarding the workload's bottlenecks on the GPU. Also, the Device to Host communication time in Figure 12 is the correct amount of time, but again, the communication time would actually overlap with the GPU, and we have no way of visualizing this correctly.
This bug will be investigated further and hopefully solved in a future revision.