[ROCm] failed to legalize operation 'math.exp' for exponential op with bf16 dtype #19700

Open
hugomano opened this issue Nov 22, 2024 · 3 comments

Comments

@hugomano

hugomano commented Nov 22, 2024

The following MLIR code no longer compiles for the ROCm platform (6.2 used here) since this commit: 6e9eefe

module @main.Exponential.forward attributes {mhlo.num_partitions = 1 : i32, mhlo.num_replicas = 1 : i32} {
  func.func @main(%arg0: tensor<4096x4096xbf16>, %arg1: tensor<4096x4096xbf16>) -> tensor<4096x4096xbf16> {
    %0 = stablehlo.add %arg0, %arg1 : tensor<4096x4096xbf16>
    %1 = stablehlo.exponential %0 : tensor<4096x4096xbf16>
    return %1 : tensor<4096x4096xbf16>
  }
}

Error traceback:

error(pjrt): [PJRT_Client_Compile] <unknown>:0: error: loc(callsite("loop_exponential_fusion" at "loop_exponential_fusion")): failed to legalize operation 'math.exp'
<unknown>:0: note: loc("loop_exponential_fusion"): called from
<unknown>:0: note: loc(callsite("loop_exponential_fusion" at "loop_exponential_fusion")): see current operation: %43 = "math.exp"(%42) <{fastmath = #arith.fastmath<afn>}> : (bf16) -> bf16

HLO dump:

*** Begin module_0001.main.Exponential.forward.before_optimizations.txt ***
HloModule main.Exponential.forward, entry_computation_layout={(bf16[4096,4096]{1,0}, bf16[4096,4096]{1,0})->bf16[4096,4096]{1,0}}

ENTRY main.5 {
  Arg_0.1 = bf16[4096,4096]{1,0} parameter(0)
  Arg_1.2 = bf16[4096,4096]{1,0} parameter(1)
  add.3 = bf16[4096,4096]{1,0} add(Arg_0.1, Arg_1.2), metadata={source_file="external/zml~/zml/tensor.zig" source_line=3661}
  ROOT exponential.4 = bf16[4096,4096]{1,0} exponential(add.3), metadata={source_file="external/zml~/zml/tensor.zig" source_line=1686}
}


*** End module_0001.main.Exponential.forward.before_optimizations.txt ***
*** Begin module_0001.main.Exponential.forward.autotune_results.pbtxt ***
version: 3

*** End module_0001.main.Exponential.forward.autotune_results.pbtxt ***
*** Begin module_0001.main.Exponential.forward.gpu_target_config.pbtxt ***
gpu_device_info {
  threads_per_block_limit: 1024
  threads_per_warp: 32
  shared_memory_per_block: 65536
  shared_memory_per_core: 65536
  threads_per_core_limit: 2048
  core_count: 35
  fpus_per_core: 128
  block_dim_limit_x: 2147483647
  block_dim_limit_y: 65536
  block_dim_limit_z: 65536
  memory_bandwidth: 35968000000
  l2_cache_size: 6291456
  clock_rate_ghz: 1.895
  device_memory_size: 31658606592
  shared_memory_per_block_optin: -1
  rocm_compute_capability {
    gcn_arch_name: "gfx1100"
  }
  registers_per_core_limit: 65536
  registers_per_block_limit: 65536
}
platform_name: "ROCM"
dnn_version_info {
  major: 1
  minor: 3
}

*** End module_0001.main.Exponential.forward.gpu_target_config.pbtxt ***
*** Begin module_0001.main.Exponential.forward.gfx1100_gpu_after_optimizations.txt ***
HloModule main.Exponential.forward, is_scheduled=true, entry_computation_layout={(bf16[4096,4096]{1,0}, bf16[4096,4096]{1,0})->bf16[4096,4096]{1,0}}, frontend_attributes={fingerprint_before_lhs="189671c249ae20e507ab215f3337a72e"}

fused_exponential {
  param_1.6 = bf16[4096,4096]{1,0} parameter(1)
  convert.3.3 = f32[4096,4096]{1,0} convert(param_1.6)
  param_0.7 = bf16[4096,4096]{1,0} parameter(0)
  convert.4.3 = f32[4096,4096]{1,0} convert(param_0.7)
  add.2.3 = f32[4096,4096]{1,0} add(convert.3.3, convert.4.3), metadata={source_file="external/zml~/zml/tensor.zig" source_line=3661}
  convert.5.3 = bf16[4096,4096]{1,0} convert(add.2.3)
  ROOT exponential.2.1 = bf16[4096,4096]{1,0} exponential(convert.5.3), metadata={source_file="external/zml~/zml/tensor.zig" source_line=1686}
} // fused_exponential

ENTRY main.5 {
  Arg_1.2.0 = bf16[4096,4096]{1,0} parameter(1)
  Arg_0.1.0 = bf16[4096,4096]{1,0} parameter(0)
  ROOT loop_exponential_fusion = bf16[4096,4096]{1,0} fusion(Arg_1.2.0, Arg_0.1.0), kind=kLoop, calls=fused_exponential, metadata={source_file="external/zml~/zml/tensor.zig" source_line=1686}
}


*** End module_0001.main.Exponential.forward.gfx1100_gpu_after_optimizations.txt ***
*** Begin module_0001.main.Exponential.forward.gfx1100_gpu_after_optimizations-buffer-assignment.txt ***
BufferAssignment:
allocation 0: size 33554432, output shape is |bf16[4096,4096]|, maybe-live-out:
 value: <9 loop_exponential_fusion @0> (size=33554432,offset=0): bf16[4096,4096]{1,0}
allocation 1: size 33554432, parameter 1, shape |bf16[4096,4096]| at ShapeIndex {}:
 value: <7 Arg_1.2.0 @0> (size=33554432,offset=0): bf16[4096,4096]{1,0}
allocation 2: size 33554432, parameter 0, shape |bf16[4096,4096]| at ShapeIndex {}:
 value: <8 Arg_0.1.0 @0> (size=33554432,offset=0): bf16[4096,4096]{1,0}

Total bytes used: 100663296 (96.00MiB)

Used values:
<7 Arg_1.2.0 @0>
 positions:
  Arg_1.2.0
 uses:
  loop_exponential_fusion, operand 0
 from instruction: %Arg_1.2.0 = bf16[4096,4096]{1,0} parameter(1)
<8 Arg_0.1.0 @0>
 positions:
  Arg_0.1.0
 uses:
  loop_exponential_fusion, operand 1
 from instruction: %Arg_0.1.0 = bf16[4096,4096]{1,0} parameter(0)
<9 loop_exponential_fusion @0>
 positions:
  loop_exponential_fusion
 uses:
 from instruction: %loop_exponential_fusion = bf16[4096,4096]{1,0} fusion(bf16[4096,4096]{1,0} %Arg_1.2.0, bf16[4096,4096]{1,0} %Arg_0.1.0), kind=kLoop, calls=%fused_exponential, metadata={source_file="external/zml~/zml/tensor.zig" source_line=1686}


HloLiveRange (max 3):
  InstructionSequence:
    0:Arg_1.2.0
    1:Arg_0.1.0
    2:loop_exponential_fusion
  BufferLiveRange:
    Arg_1.2.0{}:0-3
    Arg_0.1.0{}:0-3
    loop_exponential_fusion{}:2-3
  Live ranges at 2 (peak):
    Arg_1.2.0: 33554432 bytes
    Arg_0.1.0: 33554432 bytes
    loop_exponential_fusion: 33554432 bytes

*** End module_0001.main.Exponential.forward.gfx1100_gpu_after_optimizations-buffer-assignment.txt ***
*** Begin module_0001.main.Exponential.forward.gfx1100_gpu_after_optimizations-memory-usage-report.txt ***
Total bytes used: 100663296 (96.00MiB)

Allocations sorted by size:

cumulative_size; total_size - cumulative_size; allocation
------------------------------------------------------------------------------
  32.00MiB( 33%);   64.00MiB; allocation 0: size 32.00MiB, output shape is |bf16[4096,4096]|, maybe-live-out:
  64.00MiB( 67%);   32.00MiB; allocation 1: size 32.00MiB, parameter 1, shape |bf16[4096,4096]| at ShapeIndex {}:
  96.00MiB(100%);         0B; allocation 2: size 32.00MiB, parameter 0, shape |bf16[4096,4096]| at ShapeIndex {}:


Allocations sorted by size with their values:
allocation 0: size 32.00MiB, output shape is |bf16[4096,4096]|, maybe-live-out:
allocation 1: size 32.00MiB, parameter 1, shape |bf16[4096,4096]| at ShapeIndex {}:
allocation 2: size 32.00MiB, parameter 0, shape |bf16[4096,4096]| at ShapeIndex {}:

*** End module_0001.main.Exponential.forward.gfx1100_gpu_after_optimizations-memory-usage-report.txt ***
error(pjrt): [PJRT_Client_Compile] <unknown>:0: error: loc(callsite("loop_exponential_fusion" at "loop_exponential_fusion")): failed to legalize operation 'math.exp'
<unknown>:0: note: loc("loop_exponential_fusion"): called from
<unknown>:0: note: loc(callsite("loop_exponential_fusion" at "loop_exponential_fusion")): see current operation: %43 = "math.exp"(%42) <{fastmath = #arith.fastmath<afn>}> : (bf16) -> bf16

error(zml/module): pjrt-rocm failed to compile following valid MLIR:
module @main.Exponential.forward attributes {mhlo.num_partitions = 1 : i32, mhlo.num_replicas = 1 : i32} {
  func.func @main(%arg0: tensor<4096x4096xbf16>, %arg1: tensor<4096x4096xbf16>) -> tensor<4096x4096xbf16> {
    %0 = stablehlo.add %arg0, %arg1 : tensor<4096x4096xbf16>
    %1 = stablehlo.exponential %0 : tensor<4096x4096xbf16>
    return %1 : tensor<4096x4096xbf16>
  }
}

Best,
Hugo

@akuegel
Member

akuegel commented Nov 25, 2024

I think I see the problem: the MathToROCDL pass in MLIR doesn't specify a lowering for F32, and the default for BF16 ops is to convert to F32 and use the F32 lowering. That doesn't work in this case. @draganmladjenovic, can you maybe take a look at this?
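
A minimal sketch of the op that fails, assuming it can be reproduced standalone with mlir-opt's --convert-math-to-rocdl pass (the trace above comes from XLA's fusion emitter, so this only approximates that pipeline):

// Hypothetical standalone reproducer; the fusion emitter produces an
// equivalent bf16 math.exp (see %43 in the error above).
func.func @exp_bf16(%arg0: bf16) -> bf16 {
  %0 = math.exp %arg0 fastmath<afn> : bf16
  return %0 : bf16
}
// mlir-opt --convert-math-to-rocdl repro.mlir
// Assumption: the bf16 op is left unlowered here, and the downstream
// pipeline then rejects it with "failed to legalize operation 'math.exp'".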

@akuegel
Member

akuegel commented Nov 25, 2024

This seems related to llvm/llvm-project#102971.
It should be verified whether that patch actually makes sense; I would have thought that if intrinsics exist, the op would eventually be lowered to them anyway.

@pifon2a
Contributor

pifon2a commented Nov 25, 2024

llvm/llvm-project#102971 does not have any tests for bf16. I think this can be fixed upstream in MLIR using a pattern that applies logic similar to maybeCast in https://source.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/mlir/lib/Conversion/GPUCommon/OpToFuncCallLowering.h;rcl=699896658;l=98
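
Roughly, such a maybeCast-style pattern would promote the bf16 operand to f32 so the existing f32 path applies, then truncate back. A sketch of the expected expansion (an assumption, not the actual upstream pattern):

func.func @exp_bf16_expanded(%arg0: bf16) -> bf16 {
  %0 = arith.extf %arg0 : bf16 to f32
  // The f32 exp is handled by the existing lowering (intrinsic or
  // __ocml_exp_f32, depending on the pipeline).
  %1 = math.exp %0 fastmath<afn> : f32
  %2 = arith.truncf %1 : f32 to bf16
  return %2 : bf16
}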
