[mlas] Speed up tanhf activation function #20612

Open
wants to merge 13 commits into
base: main

Conversation

r-devulap

Description

A new, faster algorithm for the tanhf activation function, based on Intel SVML.

Motivation and Context

Improves tanhf performance by roughly 35% on average (up to about 38% in the benchmark below). The new algorithm also fixes a bug in the current tanhf implementation, whose result can fall outside [-1, 1]. Example: for x = +0x1.06417ep+003, the current tanhf returns +0x1.000002p+000.

Benchmark                                                 Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------
[BM_Tanh vs. BM_Tanh]/40000/real_time                  -0.3822         -0.3825         15059          9304         15035          9283
[BM_Tanh vs. BM_Tanh]/80000/real_time                  -0.3845         -0.3844         30055         18499         29998         18467
[BM_Tanh vs. BM_Tanh]/160000/real_time                 -0.3146         -0.3144         17803         12203         17762         12178
[BM_Tanh vs. BM_Tanh]/320000/real_time                 -0.3495         -0.3491         32840         21362         32724         21300
[BM_Tanh vs. BM_Tanh]/640000/real_time                 -0.3563         -0.3568         62902         40487         62754         40361
[BM_Tanh vs. BM_Tanh]/1280000/real_time                -0.3326         -0.3333        128536         85780        128102         85408
OVERALL_GEOMEAN                                        -0.3538         -0.3539             0             0             0             0
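
The out-of-range value quoted in the description, +0x1.000002p+000, is the float immediately above 1.0 (that is, 1 + 2^-23), so the existing result overshoots by one ULP. A minimal sketch of how such a result could be detected or guarded against follows; the helper names are hypothetical and are not part of MLAS or of this PR.

```cpp
#include <cmath>

// Hypothetical helper (not part of MLAS): true if y is a legal tanh output.
bool InTanhRange(float y) { return y >= -1.0f && y <= 1.0f; }

// The float immediately above 1.0f, i.e. 1 + 2^-23. This is the value
// written as +0x1.000002p+000 in the PR description.
float OneUlpAboveOne() { return std::nextafter(1.0f, 2.0f); }

// Hypothetical guard: clamp any result back into [-1, 1].
float ClampTanh(float y) { return std::fmax(-1.0f, std::fmin(1.0f, y)); }
```

A range check like `InTanhRange` could be run over the outputs of any tanhf implementation in a test; `ClampTanh` is only one possible guard and is not what the PR does.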

Use the Intel SVML tanhf function, which speeds up tanhf computation by up to ~38%.
The algorithm has a max ULP error of 1536. A benchmark comparison against the
main branch is provided below (generated on a Tiger Lake Dell XPS laptop using
https://github.com/google/benchmark/blob/main/tools/compare.py):

|-----------------+---------+---------+----------+----------+---------+---------|
| Benchmark       | Time    | CPU     | Time Old | Time New | CPU Old | CPU New |
|-----------------+---------+---------+----------+----------+---------+---------|
| BM_Tanh/40000   | -0.3822 | -0.3825 | 15059    | 9304     | 15035   | 9283    |
| BM_Tanh/80000   | -0.3845 | -0.3844 | 30055    | 18499    | 29998   | 18467   |
| BM_Tanh/160000  | -0.3146 | -0.3144 | 17803    | 12203    | 17762   | 12178   |
| BM_Tanh/320000  | -0.3495 | -0.3491 | 32840    | 21362    | 32724   | 21300   |
| BM_Tanh/640000  | -0.3563 | -0.3568 | 62902    | 40487    | 62754   | 40361   |
| BM_Tanh/1280000 | -0.3326 | -0.3333 | 128536   | 85780    | 128102  | 85408   |
|-----------------+---------+---------+----------+----------+---------+---------|
| OVERALL_GEOMEAN | -0.3538 | -0.3539 | 0        | 0        | 0       | 0       |
|-----------------+---------+---------+----------+----------+---------+---------|
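
For reading the table: the Time and CPU columns are relative changes, (new - old) / old, so -0.3822 means the new code takes about 38% less time. The OVERALL_GEOMEAN row appears consistent with the geometric mean of the new/old ratios minus 1. A small sketch under that assumption (function names are hypothetical, not taken from compare.py):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Relative change as shown in the "Time" column: (new - old) / old.
double RelativeChange(double old_time, double new_time) {
    assert(old_time > 0.0);
    return (new_time - old_time) / old_time;
}

// Geometric mean of the new/old ratios, minus 1 (assumed definition of
// the OVERALL_GEOMEAN row; it reproduces the reported value).
double GeomeanChange(const double* old_times, const double* new_times,
                     std::size_t n) {
    assert(n > 0);
    double log_sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        log_sum += std::log(new_times[i] / old_times[i]);
    }
    return std::exp(log_sum / static_cast<double>(n)) - 1.0;
}
```

Feeding in the six Time Old / Time New pairs from the table gives roughly -0.354, in line with the reported -0.3538.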
@r-devulap r-devulap requested a review from a team as a code owner May 8, 2024 20:39
@yufenglee
Member

/azp run ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux GPU CI Pipeline,orttraining-amd-gpu-ci-pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

@yufenglee
Member

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

Contributor

@yihonglyu yihonglyu left a comment


Please add a benchmark for the tanh activation function in onnxruntime/test/mlas/bench/. Once you've done that, make sure to record the performance number both with and without your patch in the commit message.

@r-devulap
Author

> Please add a benchmark for the tanh activation function in onnxruntime/test/mlas/bench/.

There is already a benchmark for tanhf, BM_Tanh. Is this not sufficient?

static void BM_Tanh(benchmark::State& state) {
  RunSingleNode<Tanh<float>>("Tanh", "", {}, state, -2.0f, 2.0f);
}

> Once you've done that, make sure to record the performance number both with and without your patch in the commit message.

The BM_Tanh performance numbers before and after the patch are already included in the commit message; see c6c9309.

@yufenglee
Member

/azp run ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux GPU CI Pipeline,orttraining-amd-gpu-ci-pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

@yufenglee
Member

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).


size_t count = 0;
while (count < N) {
    if (N - count >= 4) {
Member


With this change, a check is needed on every iteration. If N is large, the previous version saves a significant number of instructions.

Author

@r-devulap r-devulap May 13, 2024


For large values of N, the CPU branch predictor should be able to predict this branch easily. It will only miss at the very last iteration, for the tail, and when N is large a single branch miss should hardly matter for performance. In return, the entire array is processed in a single loop.
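
To make the two loop shapes under discussion concrete, here is a rough sketch. It is not the MLAS code: `Tanh4` and `TanhTail` are hypothetical scalar stand-ins for the 4-wide vector kernel and the tail handler, and only the control flow is meant to match the snippet being reviewed.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for a 4-wide SIMD kernel (scalar std::tanh here).
static void Tanh4(const float* in, float* out) {
    for (int i = 0; i < 4; ++i) out[i] = std::tanh(in[i]);
}

// Hypothetical stand-in for the tail kernel (fewer than 4 elements).
static void TanhTail(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = std::tanh(in[i]);
}

// Shape A: one loop; the full-vector vs. tail check runs every iteration.
void TanhSingleLoop(const float* in, float* out, std::size_t N) {
    std::size_t count = 0;
    while (count < N) {
        if (N - count >= 4) {
            Tanh4(in + count, out + count);
            count += 4;
        } else {
            TanhTail(in + count, out + count, N - count);
            count = N;
        }
    }
}

// Shape B: branch-free main loop over full vectors, then the tail once.
void TanhMainThenTail(const float* in, float* out, std::size_t N) {
    std::size_t count = 0;
    for (; N - count >= 4; count += 4) {
        Tanh4(in + count, out + count);
    }
    if (count < N) {
        TanhTail(in + count, out + count, N - count);
    }
}
```

Both shapes produce the same results; the difference is one extra, well-predicted comparison per iteration in shape A versus a second code path in shape B. Whether that comparison is measurable would have to be settled by the benchmark.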

@yufenglee
Member

/azp run ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux GPU CI Pipeline,orttraining-amd-gpu-ci-pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

@yufenglee
Member

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@yufenglee
Member

/azp run Linux Android Emulator QNN CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU TensorRT CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline , Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 8 pipeline(s).

yufenglee previously approved these changes May 17, 2024
@yufenglee
Member

You need to sign the license/CLA agreement to move on.


Azure Pipelines successfully started running 10 pipeline(s).

@r-devulap
Author

> You need to sign the license/CLA agreement to move on.

CLA shows up as signed now.

@snnn
Member

snnn commented May 30, 2024

/azp run ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux GPU CI Pipeline,orttraining-amd-gpu-ci-pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).

@snnn
Member

snnn commented May 30, 2024

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline

@snnn snnn requested review from yufenglee and yihonglyu May 30, 2024 05:12

Azure Pipelines successfully started running 10 pipeline(s).

@snnn
Member

snnn commented May 31, 2024

/azp run ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux GPU CI Pipeline,orttraining-amd-gpu-ci-pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

@snnn
Member

snnn commented May 31, 2024

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@snnn
Member

snnn commented Jun 1, 2024

@yufenglee , please help review

@snnn
Member

snnn commented Jun 1, 2024

/azp run Linux Android Emulator QNN CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU TensorRT CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 8 pipeline(s).

@yufenglee
Member

/azp run ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux GPU CI Pipeline,orttraining-amd-gpu-ci-pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

@yufenglee
Member

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@yufenglee
Member

/azp run Linux Android Emulator QNN CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU TensorRT CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline , Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 8 pipeline(s).

@yufenglee
Member

@r-devulap, thanks a lot for your contribution! Could you please also share an analysis on the accuracy loss?

@r-devulap
Author

@yufenglee I calculated the worst case ULP error for tanh alone and that turns out to be 1523 ULP (relative error of 9.0803982019e-05) occurring at 4.9999938011e+00. Between [-4.0, 4.0] the max ULP error is 872 (relative error of 5.2006424519e-05). Is there anything else you were interested in?
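
For context on the ULP figures: one common way to measure ULP distance between two floats is to map their bit patterns onto a monotonically ordered integer line and subtract. The sketch below shows that technique; it is an illustration with hypothetical names, not necessarily the method used to obtain the numbers above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Map a float's bit pattern to an integer that increases monotonically
// with the float's value (+0.0f and -0.0f both map to 0).
static std::int64_t OrderedBits(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float expected");
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    if (u & 0x80000000u) {
        // Negative values: larger magnitude means a smaller ordered value.
        return -static_cast<std::int64_t>(u & 0x7fffffffu);
    }
    return static_cast<std::int64_t>(u);
}

// Number of representable floats between a and b (0 if they are equal).
std::int64_t UlpDistance(float a, float b) {
    assert(!std::isnan(a) && !std::isnan(b));
    const std::int64_t d = OrderedBits(a) - OrderedBits(b);
    return d < 0 ? -d : d;
}
```

In use, `UlpDistance(candidate, reference)` would be evaluated over the input range with the reference rounded from a higher-precision tanh. For outputs in [0.5, 1), one ULP is 2^-24 (about 5.96e-08), so 1523 ULP corresponds to an absolute error of roughly 9.08e-05, which lines up with the quoted relative error because tanh is close to 1 there.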

@r-devulap
Author

ping @yufenglee Just a friendly reminder :)

@yufenglee
Member

> @yufenglee I calculated the worst case ULP error for tanh alone and that turns out to be 1523 ULP (relative error of 9.0803982019e-05) occurring at 4.9999938011e+00. Between [-4.0, 4.0] the max ULP error is 872 (relative error of 5.2006424519e-05). Is there anything else you were interested in?

An error of 1523 ULP is a little too large (a relative error on the order of 1e-04). I'm not sure whether it leads to end-to-end (E2E) differences for some models.

5 participants