[NNPA] Multiple zAIU support with ZHighForkOp #2681
Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Haruki Imai <[email protected]> Co-authored-by: Yasushi Negishi <[email protected]>
Set correct layout for bcast case. Add code to profile each stick and unstick time. Signed-off-by: Haruki Imai <[email protected]>
Signed-off-by: Yasushi Negishi <[email protected]>
Could you add the example for ZHigh dialect to the PR?
This is the example for rewriting when using --nnpa-matmul-parallel=2:256 (two devices with threshold 256). Also, I added the same example as comments in the code.
@AlexandreEichenberger @tungld @chentong319 Any comments on this?
The purpose of ZHigh.join is to mark the point where the value returned by the fork must be ready. Better to use:
This was replaced with PR #2756 using OpenMP.
This PR replaces #2563
This PR enables creating threads with the async dialect to run operations on multiple NNPA devices. ZHighForkOp and ZHighJoinOp are introduced as high-level IR, and they are lowered into AsyncExecuteOp and AsyncAwaitOp.
Currently, only large MatMul ops are supported. Given A(N x K) * B(K x M), the M dimension is split for parallelization. MatMul ops whose M is greater than or equal to the threshold specified by the compiler option are parallelized. These MatMul ops are rewritten in the rewrite-onnx-for-zhigh pass using the Split op, the Concat op, and the ZHighForkOp and ZHighJoinOp newly introduced in this PR. ZHighForkOp creates a thread to compute a sub-matrix, and ZHighJoinOp waits for that thread to complete. They are lowered into AsyncExecuteOp and AsyncAwaitOp in the ZHighToZLow pass.
How to run
--nnpa-matmul-parallel=#DEVICES:THRESHOLD
- Enable parallelization with the given number of devices and the threshold on the dimension size.
The value is a string in the format "#DEVICES":"THRESHOLD".
Use -L${LLVM_PROJECT_HOME}/build/lib -lmlir_async_runtime for compilation, and add that directory to LD_LIBRARY_PATH at runtime.
Example:
Compile (4 NNPA devices with threshold 128):
$ onnx-mlir -O3 --mtriple=s390x-ibm-loz --mcpu=z16 --maccel=NNPA --nnpa-matmul-parallel=4:128 <onnx model> -L${LLVM_PROJECT_HOME}/build/lib -lmlir_async_runtime
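To make the option format concrete, here is a minimal Python sketch of how a "#DEVICES:THRESHOLD" value could be interpreted. It is not the parsing code in onnx-mlir; the names MatMulParallelConfig, parse_matmul_parallel, and should_parallelize are hypothetical, and the rejection of non-positive values is an assumption. Only the format and the "M greater than or equal to threshold" rule come from the description above.

```python
from typing import NamedTuple


class MatMulParallelConfig(NamedTuple):
    # Hypothetical container for the two fields of the option value.
    devices: int    # number of NNPA devices to spread the work over
    threshold: int  # minimum size of dimension M that triggers the split


def parse_matmul_parallel(value: str) -> MatMulParallelConfig:
    """Parse a "#DEVICES:THRESHOLD" string such as "4:128"."""
    devices_text, sep, threshold_text = value.partition(":")
    if not sep:
        raise ValueError(f"expected '#DEVICES:THRESHOLD', got {value!r}")
    devices, threshold = int(devices_text), int(threshold_text)
    # Assumption: both values must be positive to be meaningful.
    if devices < 1 or threshold < 1:
        raise ValueError("both values must be positive integers")
    return MatMulParallelConfig(devices, threshold)


def should_parallelize(m: int, config: MatMulParallelConfig) -> bool:
    """A MatMul A(N x K) * B(K x M) is split when M >= threshold."""
    return m >= config.threshold
```

With the value from the compile example, parse_matmul_parallel("4:128") yields 4 devices and a threshold of 128, so a MatMul with M = 128 would be split while one with M = 127 would not.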
Summary of implementation
1) Split matrix B along the M dimension with the Split op
2) Insert ZHighForkOp and ZHighJoinOp to create threads
3) Use the Concat op to gather the results of each thread
Lower ONNX.MatMul into ZHigh ops as usual
3.1) Move the alloc op outside the ForkOp region so that it is deallocated correctly.
3.2) Replace the result of the ForkOp with the allocated value.
3.3) Create an AsyncExecuteOp and copy the ForkOp region into it.
3.4) Create an AsyncAwaitOp and replace the ZHighJoinOp with it.
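The split, fork, join, and concat sequence summarized above can be illustrated with a small Python sketch. This is only an analogy and not the code in this PR: Python threads stand in for AsyncExecuteOp and AsyncAwaitOp, the helper names (split_columns, matmul, parallel_matmul) are hypothetical, and the way an uneven M is divided among devices is an assumption, since the description does not specify it.

```python
from concurrent.futures import ThreadPoolExecutor


def split_columns(b, parts):
    """Split matrix b (K x M, a list of rows) into `parts` column blocks.

    Assumption: when M is not divisible by `parts`, the first blocks
    receive one extra column each.
    """
    m = len(b[0])
    base, extra = divmod(m, parts)
    blocks, start = [], 0
    for i in range(parts):
        width = base + (1 if i < extra else 0)
        blocks.append([row[start:start + width] for row in b])
        start += width
    return blocks


def matmul(a, b):
    """Plain product of a (N x K) and b (K x M'), both lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def parallel_matmul(a, b, devices):
    """Compute a * b by splitting b along M, one task per device."""
    blocks = split_columns(b, devices)  # analogous to the Split op
    with ThreadPoolExecutor(max_workers=devices) as pool:
        # "Fork": each task computes one sub-matrix of the result.
        futures = [pool.submit(matmul, a, block) for block in blocks]
        # "Join": wait until every sub-matrix is ready.
        partials = [future.result() for future in futures]
    # Analogous to the Concat op: glue the sub-matrices back along M.
    return [sum((part[i] for part in partials), []) for i in range(len(a))]
```

For example, with a = [[1, 2], [3, 4]] and b = [[1, 0, 2], [0, 1, 3]], parallel_matmul(a, b, 2) splits b into a two-column block and a one-column block and returns the same matrix as matmul(a, b).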