
src: cpu: aarch64: injectors: eltwise_injector - improve gelu performance for block size 16 #2072

Merged
merged 1 commit into oneapi-src:main on Oct 3, 2024

Conversation

@nikhilfujitsu (Contributor) commented Sep 3, 2024

Description

Improvement: GELU performance for block size 16 in jit_uni_eltwise_injector.
This commit improves the performance of the gelu_erf path in jit_uni_eltwise_injector for block size 16.

Major code changes:

• Added a new function gelu_erf_minimax_approx_compute_vector_fwd(const TRegS &vmm_src) for computing gelu_erf for block size 16.
• Added new gelu_minimax constants and a polynomial constants table (a minimal scalar sketch of the approach follows this list).
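
For context, gelu_erf(x) = 0.5 * x * (1 + erf(x / sqrt(2))), and the new path evaluates erf with a polynomial whose coefficients are read from the added table. Below is a minimal scalar sketch of that idea, using the classic Abramowitz & Stegun 7.1.26 erf fit rather than the commit's actual gelu_minimax constants; the real implementation emits SVE-512 JIT code through the injector, not scalar C++.

```cpp
#include <cmath>

// Scalar sketch only: gelu_erf(x) = 0.5 * x * (1 + erf(x / sqrt(2))).
// The erf coefficients below are the Abramowitz & Stegun 7.1.26 fit,
// standing in for the commit's actual minimax constants table.
static float erf_poly_approx(float x) {
    const float sign = x < 0.f ? -1.f : 1.f;
    x = std::fabs(x);
    const float t = 1.f / (1.f + 0.3275911f * x);
    // Horner evaluation of a1*t + a2*t^2 + a3*t^3 + a4*t^4 + a5*t^5.
    const float poly = ((((1.061405429f * t - 1.453152027f) * t
                                 + 1.421413741f) * t
                                - 0.284496736f) * t
                               + 0.254829592f) * t;
    return sign * (1.f - poly * std::exp(-x * x));
}

float gelu_erf_ref(float x) {
    return 0.5f * x * (1.f + erf_poly_approx(x * 0.70710678f)); // x / sqrt(2)
}
```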

Checklist

General

[✓] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit? Yes.
The test output is the same with and without this commit.
make test summary:

95% tests passed, 11 tests failed out of 200

Total Test time (real) = 3750.50 sec

The following tests FAILED:
55 - test_convolution_backward_data_f32 (Subprocess aborted)
123 - test_graph_c_api_compile_parametrized_usm_cpu (Failed)
153 - test_graph_unit_dnnl_conv_usm_cpu (Failed)
157 - test_graph_unit_dnnl_group_norm_usm_cpu (Failed)
159 - test_graph_unit_dnnl_large_partition_usm_cpu (Failed)
160 - test_graph_unit_dnnl_layer_norm_usm_cpu (Failed)
161 - test_graph_unit_dnnl_matmul_usm_cpu (Failed)
162 - test_graph_unit_dnnl_mqa_decomp_usm_cpu (Failed)
163 - test_graph_unit_dnnl_pool_usm_cpu (Failed)
168 - test_graph_unit_dnnl_sdp_decomp_usm_cpu (Failed)
169 - test_graph_unit_dnnl_softmax_usm_cpu (Failed)
Errors while running CTest
Output from these tests are in: /home/nikhil/oneDNN/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
[✓] Have you formatted the code using clang-format? Yes

@nikhilfujitsu requested a review from a team as a code owner on September 3, 2024 10:15
@vpirogov added the platform:cpu-aarch64 (Codeowner: @oneapi-src/onednn-cpu-aarch64) label on Sep 3, 2024
@vpirogov added this to the v3.6 milestone on Sep 3, 2024
@abhijain1204fujitsu

@vpirogov, @jondea, kindly help review this PR.

@jondea (Contributor) commented Sep 6, 2024

Do you have any specific benchdnn calls which will exercise this path?

@jondea (Contributor) commented Sep 6, 2024

Also, I'm not seeing the same failures on this patch (or before) as you. E.g., out of the CI failures, I can only see test_benchdnn_modeC_graph_ci_cpu. Would you be able to investigate this please?

Also, do you have any measurements of the speedup of this optimization?

@nikhilfujitsu (Contributor, Author) commented Sep 9, 2024

Do you have any specific benchdnn calls which will exercise this path?

Hi, please check out these logs; the machine is an A64FX and uses jit_sve_512.

I have used:

./benchdnn --eltwise --batch=inputs/eltwise/test_eltwise_all | grep eltwise_gelu_erf

to extract these logs.

A64_FX_benchdnn_eltwise_gelu.log

For the second log, export ONEDNN_VERBOSE=1 was set before running the command.
A64_FX_benchdnn_VERBOSE_eltwise_gelu.log

@nikhilfujitsu (Contributor, Author)

Also, do you have any measurements of the speedup of this optimization?
Eltwise_results.xlxs.pdf

@jondea (Contributor) left a comment

Thanks for the logs and numbers, they are really useful. I ran this on a Graviton 3, and there was no effect on performance. Just to check: this is what you expected? I'm guessing this because the bulk of the added code is guarded by SVE length == 512.

But, given that you got a ~5x speedup, I was curious if this optimization could be applied to the Graviton 3, so I removed the check and measured the performance. Surprisingly, I got ~1.5x slowdown on Graviton 3. I don't think this should block this PR because you have added the guard, but it is surprising. Could it be that exp_compute_vector_fwd is slower than it could be for some reason for SVE 512?

Anyways, in summary, I'm happy to approve once you've investigated the extra unit test failures.
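
For context on the guard mentioned above, here is a minimal standalone sketch of that kind of vector-length dispatch; the names are hypothetical, not the PR's actual code:

```cpp
#include <cstddef>

// Hypothetical illustration of the guard under discussion: the minimax
// path is taken only when the SVE vector length is 512 bits (64 bytes,
// i.e., 16 fp32 lanes per vector); narrower machines such as Graviton 3
// (SVE_256) keep the existing generic gelu_erf path, which measured
// faster there.
enum class gelu_path { minimax_sve512, generic };

gelu_path select_gelu_path(std::size_t sve_len_bytes) {
    return sve_len_bytes == 64 ? gelu_path::minimax_sve512
                               : gelu_path::generic;
}
```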

@jondea (Contributor) left a comment

Sorry, accidentally approved

@nikhilfujitsu (Contributor, Author)

It is true that we get a ~1.5x slowdown on G3 machines (SVE_256); that is why the optimization is limited to SVE_512.

The extra unit tests were already failing on SVE_512 machines in the main branch; adding my changes has no effect on them. The failing tests occur on SVE_512 machines, not on Graviton/SVE_256 machines, if that helps.

@nikhilfujitsu (Contributor, Author)

I am also checking against the latest changes in main and will update you soon.

@nikhilfujitsu (Contributor, Author)

@jondea After merging the latest changes from main, the errors are resolved, and I have updated the description accordingly. Please consider approving. Thanks.

@vpirogov (Member) commented Sep 9, 2024

@nikhilfujitsu, merge commits are not allowed in production branches. Please rebase your changes.

@nikhilfujitsu (Contributor, Author)

Hi, I am struggling to get this rebased; could you please help me here?
What I did: I pressed the "Sync fork" button, which merged the main branch changes into my branch.
How can I now rebase it and push it back here? Should I revert my merge first and then rebase?

@vpirogov (Member) commented Sep 9, 2024

This can be done from the console:

git checkout gelu_erf
git rebase main
git push --force

@nikhilfujitsu (Contributor, Author)

Thank you. Means a lot to me.

@nikhilfujitsu (Contributor, Author)

@jondea @vpirogov Please approve the changes. Thank you.

@nikhilfujitsu (Contributor, Author) left a comment

Changes reviewed, and the description updated accordingly.

@abhijain1204fujitsu

@jondea @vpirogov, could you please review the changes made in response to the feedback?

@abhijain1204fujitsu

@vpirogov, kindly let us know if any other changes are required; otherwise, please help us get this merged.

@nikhilfujitsu (Contributor, Author)

Dear @vpirogov, please approve the merge or let us know if any other changes are needed.

@theComputeKid (Contributor)

Can you please rebase your changes so they trigger the aarch64 pipeline we have added since? Thanks.

@theComputeKid (Contributor) left a comment

Will approve after CI passes on rebase.

@nikhilfujitsu (Contributor, Author)

@theComputeKid Hi, thanks for the comment. I have rebased the branch.

@theComputeKid (Contributor)

@nikhilfujitsu Thanks. Will approve once CI passes.

@theComputeKid (Contributor) left a comment

Thanks!

@spalicki merged commit b1b2758 into oneapi-src:main on Oct 3, 2024
22 of 23 checks passed