[NNPA] Multiple zAIU support with ZHighForkOp #2681

Closed

Conversation

imaihal
Collaborator

@imaihal imaihal commented Jan 15, 2024

This PR replaces #2563

This PR makes it possible to create threads with the async dialect so that operations can run on multiple NNPA devices. ZHighForkOp and ZHighJoinOp are introduced as high-level IR ops, and they are lowered into AsyncExecuteOp and AsyncAwaitOp.

Currently, large MatMul ops are supported. Given A(N x K) * B(K x M), B is split along M for parallelization. MatMul ops whose M is greater than or equal to the threshold specified by the compiler option are parallelized. These MatMul ops are rewritten in the rewrite-onnx-for-zhigh pass using the Split op, the Concat op, and the ZHighForkOp and ZHighJoinOp newly introduced in this PR. ZHighForkOp creates a thread to compute a sub-matrix, and ZHighJoinOp waits for that thread to complete. They are lowered into AsyncExecuteOp and AsyncAwaitOp in the ZHighToZLow pass.
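The rewrite described above can be illustrated outside the compiler. The following is a minimal Python sketch, not the code in this PR: it splits B along M, computes each block in its own thread, and concatenates the partial results. The helper names (`matmul`, `split_columns`, `parallel_matmul`) are hypothetical, Python threads only stand in for the NNPA devices, and the sketch assumes M is evenly divisible by the number of devices.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Plain-Python product of a (N x K) and b (K x M), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(b, num_parts):
    """Split b (K x M) along M into num_parts equal column blocks."""
    m = len(b[0])
    # Assumption of this sketch: M divides evenly by the number of parts.
    assert m % num_parts == 0, "this sketch only handles an evenly divisible M"
    width = m // num_parts
    return [[row[i * width:(i + 1) * width] for row in b] for i in range(num_parts)]


def parallel_matmul(a, b, num_devices, threshold):
    """Fork one task per block of B when M >= threshold, join, then concatenate."""
    m = len(b[0])
    if m < threshold:
        return matmul(a, b)  # below the threshold: compute without splitting
    blocks = split_columns(b, num_devices)
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        futures = [pool.submit(matmul, a, blk) for blk in blocks]  # "fork"
        partials = [f.result() for f in futures]                   # "join"
    # Concatenate the partial results row by row along the M axis.
    return [sum((p[i] for p in partials), []) for i in range(len(a))]
```

Because each output column of A * B depends on only one column of B, the concatenated result equals the unsplit product, which is why splitting along M needs no reduction step.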

How to run

  • Set an option to specify the number of devices and the threshold.
      --nnpa-matmul-parallel=#device : threshold
                            - Enable parallelization with the number of devices and the threshold of dimension size.
                              "string" is in the format of "#DEVICES":"THRESHOLD".
  • Link and load the async runtime library (${LLVM_HOME}/build/lib/libmlir_async_runtime.so).
    Use -L${LLVM_PROJECT_HOME}/build/lib -lmlir_async_runtime for compilation, and add that directory to LD_LIBRARY_PATH at runtime.

Example:
Compile (4 NNPA devices with threshold 128):
$ onnx-mlir -O3 --mtriple=s390x-ibm-loz --mcpu=z16 --maccel=NNPA --nnpa-matmul-parallel=4:128 <onnx model> -L${LLVM_PROJECT_HOME}/build/lib -lmlir_async_runtime
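To make the option format concrete, here is a small Python sketch of how a value such as 4:128 could be interpreted. It is only an illustration under stated assumptions: `parse_matmul_parallel` and `should_parallelize` are hypothetical helpers, not functions from onnx-mlir, and the validation rules (both values positive) are my own assumption.

```python
def parse_matmul_parallel(value):
    """Parse a "DEVICES:THRESHOLD" string such as "4:128" into two ints."""
    devices_text, sep, threshold_text = value.partition(":")
    if not sep:
        raise ValueError(f"expected DEVICES:THRESHOLD, got {value!r}")
    devices, threshold = int(devices_text), int(threshold_text)
    # Assumption of this sketch: both values must be positive.
    if devices < 1 or threshold < 1:
        raise ValueError("both values must be positive")
    return devices, threshold


def should_parallelize(m, threshold):
    """A MatMul is parallelized when M is greater than or equal to the threshold."""
    return m >= threshold
```

With 4:128, a MatMul whose M is 128 or larger would be split across 4 devices, while one with M of 127 would be left as is.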

Summary of implementation

  1. ParallelMatMulPattern in rewrite-onnx-for-zhigh pass
    1.1) Split matrix B along the M dimension with a Split op.
    1.2) Insert ZHighForkOp and ZHighJoinOp to create threads.
    1.3) Use a Concat op to gather the results of each thread.
  2. ONNX to ZHigh
    Lower ONNX.MatMul into ZHigh ops as usual.
  3. ZHigh to ZLow
    3.1) Move the alloc op outside the ForkOp region so that it is deallocated correctly.
    3.2) Replace the result of ForkOp with the allocated value.
    3.3) Create an AsyncExecuteOp and copy the ForkOp region into it.
    3.4) Create an AsyncAwaitOp and replace ZHighJoinOp with it.

imaihal and others added 30 commits July 6, 2023 01:01
Signed-off-by: Haruki Imai <[email protected]>
Co-authored-by: Yasushi Negishi <[email protected]>
Set correct layout for bcast case.
Add code to profile each stick and unstick time

@imaihal imaihal marked this pull request as ready for review January 18, 2024 15:23
@chentong319
Collaborator

Could you add an example of the ZHigh dialect to the PR?

@imaihal
Collaborator Author

imaihal commented Jan 19, 2024

@chentong319

Here is an example of the rewriting when using --nnpa-matmul-parallel=2:256 (two devices with threshold 256). I also added the same example as a comment in the code.

  • Input
func.func @test_matmul_parallel(%arg0: tensor<1x64xf32>, %arg1: tensor<64x512xf32>) -> tensor<1x512xf32> {
  %0 = "onnx.MatMul"(%arg0, %arg1) : (tensor<1x64xf32>, tensor<64x512xf32>) -> tensor<1x512xf32>
  return %0 : tensor<1x512xf32>
}
  • Result of rewriting.
  func.func @test_matmul_parallel(%arg0: tensor<1x64xf32>, %arg1: tensor<64x512xf32>) -> tensor<1x512xf32> {
    %0 = onnx.Constant dense<256> : tensor<2xi64>
    %1:2 = "onnx.Split"(%arg1, %0) {axis = 1 : si64} : (tensor<64x512xf32>, tensor<2xi64>) -> (tensor<64x256xf32>, tensor<64x256xf32>)
    %2 = "zhigh.Fork"() ({
      %5 = "onnx.MatMul"(%arg0, %1#0) : (tensor<1x64xf32>, tensor<64x256xf32>) -> tensor<1x256xf32>
      onnx.Yield %5 : tensor<1x256xf32>
    }) {id = 0 : si64} : () -> tensor<1x256xf32>
    %3 = "zhigh.Fork"() ({
      %5 = "onnx.MatMul"(%arg0, %1#1) : (tensor<1x64xf32>, tensor<64x256xf32>) -> tensor<1x256xf32>
      onnx.Yield %5 : tensor<1x256xf32>
    }) {id = 1 : si64} : () -> tensor<1x256xf32>
    "zhigh.Join"(%2) : (tensor<1x256xf32>) -> ()
    "zhigh.Join"(%3) : (tensor<1x256xf32>) -> ()
    %4 = "onnx.Concat"(%2, %3) {axis = 1 : si64} : (tensor<1x256xf32>, tensor<1x256xf32>) -> tensor<1x512xf32>
    return %4 : tensor<1x512xf32>
  }
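The tensor shapes in the rewritten IR above can be traced with a short Python sketch. `rewritten_shapes` is a hypothetical helper written for this illustration. How the pass handles an M that does not divide evenly is not stated in this PR, so spreading the remainder over the first blocks is purely an assumption of the sketch.

```python
def rewritten_shapes(a_shape, b_shape, num_devices):
    """Trace the shapes produced by Split, each Fork region, and Concat."""
    n, k = a_shape
    k_b, m = b_shape
    assert k == k_b, "inner dimensions of A and B must match"
    base, rem = divmod(m, num_devices)
    # Assumption: any remainder is spread over the first `rem` blocks.
    sizes = [base + (1 if i < rem else 0) for i in range(num_devices)]
    split_out = [(k, s) for s in sizes]   # results of onnx.Split on B (axis = 1)
    fork_out = [(n, s) for s in sizes]    # result of each zhigh.Fork region
    concat_out = (n, sum(sizes))          # result of onnx.Concat (axis = 1)
    return sizes, split_out, fork_out, concat_out
```

For the example above, `rewritten_shapes((1, 64), (64, 512), 2)` gives split sizes of 256 and 256, two 64x256 blocks of B, two 1x256 partial results, and a 1x512 output, matching the types in the IR.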

@imaihal imaihal changed the title [NNPA] Multiple zAIU support for MatMulOp with ZHighForkOp [NNPA] Multiple zAIU support with ZHighForkOp Jan 25, 2024
@imaihal
Collaborator Author

imaihal commented Jan 25, 2024

@AlexandreEichenberger @tungld @chentong319 Any comments on this?

@chentong319
Collaborator

The purpose of ZHigh.join is to mark the place where the value returned from the fork should be ready. It would be better to use:
%4 = zhigh.join(%2) so that every use of the result comes after the join. Another issue, to which I do not know the answer, is how to tell the compiler to avoid moving the join up, if possible.
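As a loose analogy for the value-returning join suggested here, consider futures in Python. This is only a sketch: `fork`, `join`, and `run` are hypothetical names, and nothing here reflects how ZHigh ops are actually defined. When the join returns the value, every consumer has a data dependence on the join itself, so a consumer cannot be ordered before it.

```python
from concurrent.futures import ThreadPoolExecutor


def fork(pool, fn, *args):
    """Start fn(*args) on another thread and return a handle to the pending result."""
    return pool.submit(fn, *args)


def join(handle):
    """Wait for the forked work and return its value."""
    return handle.result()


def run():
    with ThreadPoolExecutor(max_workers=2) as pool:
        h0 = fork(pool, sum, [1, 2, 3])
        h1 = fork(pool, sum, [4, 5, 6])
        # Consumers read the values returned by join, not the fork handles,
        # so each use is necessarily ordered after the corresponding join.
        r0 = join(h0)
        r1 = join(h1)
        return [r0, r1]
```

In the IR posted earlier, by contrast, the Concat consumes the Fork results directly, and only the textual position of the Join ops keeps it after them.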

@imaihal
Collaborator Author

imaihal commented Mar 21, 2024

This was replaced with PR #2756 using OpenMP.
