omnitrace-python errors with OMNITRACE_USE_ROCM_SMI = true #330
Comments
Based on the error message here, it looks like rocm-smi doesn’t support getting the temperature on MI300, so omnitrace disables rocm-smi sampling, which is why you don’t see any activity.
It looks like you won’t be able to collect ROCm-SMI data until either omnitrace is patched to collect only the queries that are supported or rocm-smi adds full support for MI300. |
I saw the error logs that you highlighted. I just wanted to confirm if it is expected behavior or not.
|
For my current use case, the HIP Activity Device track is much more important than the metrics provided by rocm-smi. I collected a trace of the following toy program, and it shows the device activity track. Toy HIP kernel
Command used to profile the above example: omnitrace config
Triton code
My setup: MI300X, omnitrace 1.11, rocm-6.0.0-91. Any idea why the HIP activity trace doesn't render with the same config when I profile the Triton kernel? |
Try:
|
That didn't help. Same result as before. |
There may be some issues regardless which require some detailed explanation. I’ve got a full docket today so I’ll try to provide that once I’ve got some time. |
But in the meantime, I’ll just let you know that you’ll probably want to try to play with LD_LIBRARY_PATH to get Omnitrace to use the same ROCm libraries as PyTorch, but it may not be possible if PyTorch doesn’t have/use ROCm libraries with SOVERSIONs (e.g. only libroctracer.so instead of libroctracer.so.4). It’s something we have a solution for in the new rocprofiler but until it’s released, there’s very little Omnitrace can do. |
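As a rough way to see whether a given library directory only ships unversioned ROCm libraries (e.g. only libroctracer.so and no libroctracer.so.4), something like the following sketch could help. The helper names are hypothetical, and the directory to inspect (for instance the `lib` folder of a PyTorch install) is an assumption that depends on your environment.

```python
import re

# Matches names such as "libroctracer.so.4" or "libamdhip64.so.6.0.60000".
_VERSIONED = re.compile(r"^(lib[^/]+\.so)\.(\d+(?:\.\d+)*)$")


def split_soversion(filename):
    """Return (base, version); version is None when there is no numeric suffix."""
    match = _VERSIONED.match(filename)
    if match:
        return match.group(1), match.group(2)
    return filename, None


def unversioned_only(filenames):
    """Return library base names that appear only without a version suffix."""
    versioned, plain = set(), set()
    for name in filenames:
        base, version = split_soversion(name)
        if not base.endswith(".so"):
            continue  # not a shared library name
        (plain if version is None else versioned).add(base)
    return sorted(plain - versioned)
```

Passing `os.listdir(some_lib_dir)` to `unversioned_only` would list the libraries in that directory that have no versioned counterpart next to them.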
Well actually, I’ve probably got enough time now. The fundamental problem I’ve seen with some PyTorch apps in the past is that PyTorch has an RPATH to the ROCm libraries it installs, and those libs do not have SOVERSIONs. Omnitrace sets the environment variable HSA_TOOLS_LIB, which causes the HSA runtime to call an OnLoad function when it initializes (which is triggered on the first HIP call). When that happens, Omnitrace makes the appropriate calls to roctracer to set up tracing. But roctracer is linked to the HSA and HIP runtimes with SOVERSIONs. My theory is that roctracer ends up communicating with different runtime libraries and effectively enables instrumenting a different HIP/HSA runtime than the one PyTorch uses. I haven’t fully confirmed this, but empirical evidence from experimenting with LD_PRELOAD and from making soft links in PyTorch installs to emulate SOVERSIONs does suggest it. Thus, from Omnitrace’s perspective, it enables tracing HIP but the application simply never called the HIP API or launched any kernels. |
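One way to look for the situation described above is to check whether a process has more than one copy of the same runtime library mapped. The sketch below assumes a Linux system, where `/proc/self/maps` lists the files mapped into the current process. The function names and the default keyword list are illustrative assumptions, not part of Omnitrace or ROCm.

```python
import os
from collections import defaultdict


def loaded_library_copies(maps_text, keywords=("amdhip64", "hsa-runtime64", "roctracer")):
    """Group mapped shared objects by library stem and collect their distinct paths."""
    copies = defaultdict(set)
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        name = os.path.basename(path)
        if ".so" not in name:
            continue
        stem = name.split(".so", 1)[0]
        if any(keyword in stem for keyword in keywords):
            copies[stem].add(path)
    return {stem: sorted(paths) for stem, paths in copies.items()}


def duplicated(copies):
    """Keep only the libraries that are mapped from more than one path."""
    return {stem: paths for stem, paths in copies.items() if len(paths) > 1}
```

Inside the application, shortly before it exits, `duplicated(loaded_library_copies(open("/proc/self/maps").read()))` would report any matching library that is loaded from two different locations.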
Could you do me a favor and run your app normally (without Omnitrace) and before it exits, print out |
And for the record, the way we are addressing this issue in the new rocprofiler (which combines the capabilities of roctracer and rocprofiler) is that rocprofiler doesn’t link to the runtimes. Instead, each runtime effectively passes a table of function pointers into rocprofiler when it initializes, which guarantees that the calls rocprofiler needs to make (via the function pointers in the table) to enable profiling capabilities are applied to that specific runtime instance. Once this is released and Omnitrace uses the new rocprofiler API, you could have 20 different HIP runtimes and Omnitrace would be able to trace any/all of them. |
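To illustrate the idea of a runtime handing its own function table to the tool, here is a toy Python model. It is not the rocprofiler API. Every class and function name in it is hypothetical, and it only shows why wrapping the table a runtime passes in ties the tracing to that exact runtime instance.

```python
class ToyProfiler:
    """Stand-in for a tool that links to no runtime and waits to be handed a table."""

    def __init__(self):
        self.events = []

    def register(self, runtime_name, table):
        # Wrap every entry of the table in place, so the wrappers apply to
        # exactly the runtime instance that passed the table in.
        for api_name, fn in list(table.items()):
            table[api_name] = self._wrap(runtime_name, api_name, fn)

    def _wrap(self, runtime_name, api_name, fn):
        def traced(*args, **kwargs):
            self.events.append((runtime_name, api_name))
            return fn(*args, **kwargs)
        return traced


class ToyRuntime:
    """Stand-in for a runtime that dispatches its API through a table."""

    def __init__(self, name, profiler):
        self.name = name
        self.table = {"launch_kernel": self._launch_kernel}
        profiler.register(name, self.table)  # runtime passes its own table

    def _launch_kernel(self, kernel):
        return f"{self.name} ran {kernel}"

    def launch_kernel(self, kernel):
        return self.table["launch_kernel"](kernel)
```

With two `ToyRuntime` instances registered against one `ToyProfiler`, a call on either instance is recorded under that instance's name, regardless of how many other instances exist.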
Thanks for the detailed answer. I'll try to get the maps and share them with you. Could you please join the |
Hi @anupambhatnagar. Has your issue been resolved? If so, please close the ticket. Thanks! |
Hi, I'm profiling a triton kernel on MI300 with rocm 6.0.0.
With OMNITRACE_USE_ROCM_SMI = true, the collected trace fails to collect events from ROCM_SMI. The backtrace is available here. The omnitrace config I use is here.
How can I enable the collection of events from rocm-smi and view the HIP Activity Device?
Thanks!
P.S. I installed omnitrace using
omnitrace-1.11.0-rhel-9.3-ROCm-60000-PAPI-OMPT-Python3.sh
from the releases page.