Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suppress MLIR ubsan errors #3198

Closed
wants to merge 4 commits into from

Conversation

BrianHarrisonAMD
Copy link
Collaborator

Add suppressing ubsan errors for the file shared_ptr_base.h to workaround errors when upgrading to ROCm 6.2 in the base docker.

See relevant PR #3181, and Issue #3192 for additional details.

The issue can be tracked down to simply creating the MLIR handle after upgrading to the latest ROCm 6.2.
Since we froze MLIR to an older release, suppressing this warning is our only real option.

Note:

  • Params used when constructing the MLIR handle to replicate, "--x2 1 --operation conv2d --kernel_id 0 --num_cu 104 --arch amdgcn-amd-amdhsa:gfx90a:sramecc+:xnack- --groupsize 1 --fil_layout GNCHW --fil_type fp32 --in_layout NGCHW --out_layout NGCHW --in_type fp32 --out_type fp32 --batchsize 100 --in_channels 25 --out_channels 300 --in_h 32 --in_w 32 --out_h 30 --out_w 30 --fil_h 3 --fil_w 3 --dilation_h 1 --dilation_w 1 --conv_stride_h 1 --conv_stride_w 1 --padding_h 0 --padding_w 0 --kernel_name mlir_gen_igemm_conv2d_v4r4_fwd_xdlops0". Simply creating a handle with those command line arguments will result in the vptr issue when the test application is exiting after upgrading to ROCm 6.2.
  • ROCm 6.2 also updated the version of clang.

Copy link
Collaborator

@junliume junliume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep track of this WA and minimize the fix later:

shared_ptr_base.h

@atamazov
Copy link
Contributor

@junliume maybe we can simply disable MLIR in the failing stage as a W/A?

@BrianHarrisonAMD
Copy link
Collaborator Author

I think that could also be a reasonable workaround, but not sure which way we prefer.
Both aren't ideal hehe.

@amberhassaan
Copy link
Contributor

I slightly prefer @atamazov 's suggestion because currently we're saying all best are off with shared_ptr_base which could mean that certain memory problems related to shared_ptr sneak through, while disabling MLIR means that we aren't sure of MLIR's safety so that's more contained in a way to MLIR as opposed to every use of std::shared_ptr.

@BrianHarrisonAMD
Copy link
Collaborator Author

BrianHarrisonAMD commented Aug 14, 2024

I can update this to instead turn off MLIR for the debug steps.

@junliume that good with you?

@DrizztDoUrden
Copy link
Contributor

I think disabling MLIR for the sanitized steps as Artem suggested is less harmful because it is at least somewhat tested anyway and we would really not want to miss some shared_ptr misuse in the library if that would happen.

@junliume
Copy link
Collaborator

I can update this to instead turn off MLIR for the debug steps.

@junliume that good with you?

@BrianHarrisonAMD yes let's do it by only disabling checks when MLIR is in use. Thanks!

@BrianHarrisonAMD
Copy link
Collaborator Author

Sounds good!

Ill close and make a new PR with disabling MLIR for the sanitize steps.

Thanks for the feedback everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants