src: cpu: aarch64: injectors: eltwise_injector - improve gelu performance for block size 16 #2072
Conversation
Do you have any specific
Also, I'm not seeing the same failures on this patch (or before) as you. E.g., out of the CI failures, I can only see
Also, do you have any measurements of the speedup of this optimization?
Hi, please check out these logs; the machine is an A64FX and uses jit_sve_512. I used `./benchdnn --eltwise --batch=inputs/eltwise/test_eltwise_all | grep eltwise_gelu_erf` to extract them: A64_FX_benchdnn_eltwise_gelu.log. Here `export ONEDNN_VERBOSE=1` was set before running the command.
Thanks for the logs and numbers, they are really useful. I ran this on a Graviton 3, and there was no effect on performance. Just to check: is this what you expected? I'm guessing so, because the bulk of the added code is guarded by SVE length == 512.
But, given that you got a ~5x speedup, I was curious whether this optimization could also be applied to the Graviton 3, so I removed the check and measured the performance. Surprisingly, I got a ~1.5x slowdown on Graviton 3. I don't think this should block this PR because you have added the guard, but it is surprising. Could it be that `exp_compute_vector_fwd` is slower than it could be for some reason on SVE 512?
Anyway, in summary, I'm happy to approve once you've investigated the extra unit test failures.
Sorry, accidentally approved
It is true that we get a ~1.5x slowdown on G3 machines (SVE_256); that's why it is limited to SVE_512. The extra unit test cases are already failing on SVE_512 machines in the main branch.
I am also checking with the latest changes in main. Will update you soon.
@jondea After merging the latest changes from main, the errors are resolved, so I have also updated the description. Please consider approving. Thanks.
@nikhilfujitsu, merge commits are not allowed in production branches. Please rebase your changes. |
Hi, I am struggling to get this rebased again. Could you please help me here?
This operation can be done from the console:
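The commands themselves did not survive in this thread. A typical rebase workflow looks like the sketch below, demonstrated in a throwaway repository so it is self-contained; the `upstream`/`work`/`feature` names are illustrative, and on the real PR branch only the final fetch/rebase/push steps apply:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# Throwaway stand-in for the upstream repository, with a main branch.
git init -q upstream
git -C upstream checkout -q -b main
git -C upstream config user.email dev@example.com
git -C upstream config user.name dev
echo base > upstream/base.txt
git -C upstream add base.txt
git -C upstream commit -qm "main: base"

# Contributor clone with a feature branch carrying one commit.
git clone -q upstream work
cd work
git config user.email dev@example.com && git config user.name dev
git checkout -q -b feature
echo change > feature.txt && git add feature.txt && git commit -qm "feature: change"

# Meanwhile upstream gains a new commit (e.g. a new aarch64 CI pipeline).
echo ci > ../upstream/ci.txt
git -C ../upstream add ci.txt
git -C ../upstream commit -qm "main: new CI"

# The actual workflow: fetch the latest main, replay the branch on top of it,
# then force-push the rewritten branch (instead of merging main into it).
git fetch -q origin
git rebase -q origin/main
git push -q --force-with-lease origin feature
git log --oneline
```

After the rebase, the feature commit sits on top of the new upstream commit and the history contains no merge commit, which is what was requested above.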
Force-pushed from f61ac72 to 4f8e934
Thank you. Means a lot to me.
Changes reviewed, and the description has been updated accordingly.
@vpirogov Kindly let us know if any other change is required.
Can you please rebase your changes so that they trigger the aarch64 pipeline we have added since? Thanks.
Will approve after CI passes on rebase.
Force-pushed from 4f8e934 to 848844b
@theComputeKid Hi, thanks for the comment. I have rebased the branch.
@nikhilfujitsu Thanks. Will approve once CI passes. |
Thanks!
Description
Improvement: gelu performance for block size 16 in jit_uni_eltwise_injector
This commit improves the performance of the gelu function in jit_uni_eltwise_injector for block size 16.
Major Code changes:
• Added a new function `gelu_erf_minimax_approx_compute_vector_fwd(const TRegS &vmm_src)` for computing gelu_erf for block size 16.
• Added new gelu_minimax constants and a polynomial constants table.
Checklist
General
[✓] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit? Yes
The test output is the same with and without this commit.
make test summary:
95% tests passed, 11 tests failed out of 200
Total Test time (real) = 3750.50 sec
The following tests FAILED:
55 - test_convolution_backward_data_f32 (Subprocess aborted)
123 - test_graph_c_api_compile_parametrized_usm_cpu (Failed)
153 - test_graph_unit_dnnl_conv_usm_cpu (Failed)
157 - test_graph_unit_dnnl_group_norm_usm_cpu (Failed)
159 - test_graph_unit_dnnl_large_partition_usm_cpu (Failed)
160 - test_graph_unit_dnnl_layer_norm_usm_cpu (Failed)
161 - test_graph_unit_dnnl_matmul_usm_cpu (Failed)
162 - test_graph_unit_dnnl_mqa_decomp_usm_cpu (Failed)
163 - test_graph_unit_dnnl_pool_usm_cpu (Failed)
168 - test_graph_unit_dnnl_sdp_decomp_usm_cpu (Failed)
169 - test_graph_unit_dnnl_softmax_usm_cpu (Failed)
Errors while running CTest
Output from these tests are in: /home/nikhil/oneDNN/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
[✓] Have you formatted the code using clang-format? Yes