[Issue]: CLR asserts when a code object is loaded for which no implementation is available for the GPUs in the system. #102

Open
IMbackK opened this issue Oct 31, 2024 · 2 comments

IMbackK commented Oct 31, 2024

Problem Description

When assertions are enabled, loading a shared object containing GPU code that is not compiled for one of the GPUs in the system causes CLR to assert here:

assert(err == hipSuccess);

In release builds, the assertions in CLR are disabled. In that case:

When none of the GPUs in the system has an implementation available in the loaded code object, CLR fails silently and the application crashes if the code object is used.

When the system contains both GPUs for which an implementation is available and GPUs for which none is available, CLR will, depending on the order in which amdgpu.ko initialized the GPUs, return without having loaded any GPU code for the code objects in question, even for the GPUs that do have an implementation available.
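
The order dependence described above can be illustrated with a toy model. This is a minimal sketch and not CLR's actual code: `CodeObject`, `load_stop_on_first_failure` and `load_per_device` are hypothetical names, and the assumption that loading stops at the first device without a matching image is only one plausible reading of the observed behavior.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a fat binary: the set of GPU targets it has code for.
using CodeObject = std::set<std::string>;

// Toy model of the reported failure mode (an assumption, not CLR's real logic):
// the first device without a matching image ends the whole load, so every
// device enumerated after it is left without code.
std::vector<bool> load_stop_on_first_failure(const CodeObject& obj,
                                             const std::vector<std::string>& devices) {
    std::vector<bool> loaded(devices.size(), false);
    for (std::size_t i = 0; i < devices.size(); ++i) {
        if (obj.count(devices[i]) == 0) {
            return loaded;  // early return: remaining devices stay unloaded
        }
        loaded[i] = true;
    }
    return loaded;
}

// Alternative sketch: record a per-device result and keep going, so that
// supported devices are usable regardless of enumeration order.
std::vector<bool> load_per_device(const CodeObject& obj,
                                  const std::vector<std::string>& devices) {
    std::vector<bool> loaded(devices.size(), false);
    for (std::size_t i = 0; i < devices.size(); ++i) {
        loaded[i] = obj.count(devices[i]) > 0;
    }
    return loaded;
}
```

In this model, a code object built only for gfx908 on a system enumerated as gfx900, gfx908 ends up loaded nowhere with the first function, while the second still loads it for gfx908.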

I would like to note that I do NOT think this is a bug in CLR; rather, linking against a shared library that contains no GPU code for one of the GPUs in the system is a bug in the client. However, this practice has become extremely common in ROCm's libraries, mainly centered around hipBLASLt, and the client code owners have instructed me to file a bug against CLR.

hipBLASLt only supports, and thus is only compiled for, gfx908 (the documentation is incorrect here), gfx90a, gfx94x and gfx11x.
In ROCm, it has become common practice for projects such as PyTorch and MIOpen to link against hipBLASLt unconditionally. This causes CLR to assert when these projects are used on a system that contains any other GPU. Alternatively, if assertions are disabled, CLR leaves the code objects unloaded even for the supported devices when an unsupported GPU is present, causing a crash when they are used.
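
A client that wants to avoid unconditional use could compare the architectures present in the system against the targets a library was built for before relying on it. The following is a minimal sketch under stated assumptions: `base_arch` and `unsupported_devices` are hypothetical helpers, the device names are assumed to come from something like the `gcnArchName` field that `hipGetDeviceProperties` fills in, and the list of built targets is assumed to be known to the client.

```cpp
#include <set>
#include <string>
#include <vector>

// Strip target-feature suffixes such as ":sramecc+:xnack-" from an arch name,
// leaving only the base target (for example "gfx908").
std::string base_arch(const std::string& name) {
    // find() returns npos when there is no ':', and substr(0, npos) keeps the whole string.
    return name.substr(0, name.find(':'));
}

// Return the devices (by reported name) for which the library has no code.
// An empty result means every GPU in the system is covered.
std::vector<std::string> unsupported_devices(const std::set<std::string>& built_for,
                                             const std::vector<std::string>& device_archs) {
    std::vector<std::string> missing;
    for (const auto& name : device_archs) {
        if (built_for.count(base_arch(name)) == 0) {
            missing.push_back(name);
        }
    }
    return missing;
}
```

If the result is non-empty, the client could fall back to another code path instead of loading the library. Note that this only helps when the library is loaded lazily (for example via `dlopen`), since a library linked at build time is loaded before any such check can run.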

Operating System

Any Linux

CPU

Epyc 7552

GPU

GFX900, GFX906, GFX908, GFX1030

ROCm Version

ROCm 6.2.3

ROCm Component

No response

Steps to Reproduce

1. Compile CLR with assertions enabled.
2. Have a system with a GPU not supported by hipBLASLt.
3. Link any binary against hipblaslt.so (no need to use hipBLASLt for anything).
4. Observe the assert when the binary is run.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@ppanchad-amd

Hi @IMbackK. Internal ticket has been created to investigate your issue. Thanks!

@schung-amd

Hi @IMbackK, thanks for keeping up with this. We're working on replacing this assertion and others with proper error handling. I'll reach out to the internal team to see if we have a timeline for this, as well as what the expected behavior will be in this use case (i.e. heterogeneous systems with one or more unsupported GPUs). Let me know if you have any additional questions and I'll pass those on as well.
