[mlas] Speed up tanhf activation function #20612
base: main
Conversation
Use the Intel SVML tanhf algorithm, which speeds up tanhf computation by up to ~38%. The algorithm has a max ULP error of 1536. A benchmark comparison against the main branch is provided below (generated on a Tiger Lake Dell XPS laptop using https://github.com/google/benchmark/blob/main/tools/compare.py):

| Benchmark       | Time    | CPU     | Time Old | Time New | CPU Old | CPU New |
|-----------------|---------|---------|----------|----------|---------|---------|
| BM_Tanh/40000   | -0.3822 | -0.3825 | 15059    | 9304     | 15035   | 9283    |
| BM_Tanh/80000   | -0.3845 | -0.3844 | 30055    | 18499    | 29998   | 18467   |
| BM_Tanh/160000  | -0.3146 | -0.3144 | 17803    | 12203    | 17762   | 12178   |
| BM_Tanh/320000  | -0.3495 | -0.3491 | 32840    | 21362    | 32724   | 21300   |
| BM_Tanh/640000  | -0.3563 | -0.3568 | 62902    | 40487    | 62754   | 40361   |
| BM_Tanh/1280000 | -0.3326 | -0.3333 | 128536   | 85780    | 128102  | 85408   |
| OVERALL_GEOMEAN | -0.3538 | -0.3539 | 0        | 0        | 0       | 0       |
/azp run ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux GPU CI Pipeline,orttraining-amd-gpu-ci-pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline
/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline
Azure Pipelines successfully started running 7 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).
Please add a benchmark for the tanh activation function in onnxruntime/test/mlas/bench/. Once you've done that, make sure to record the performance numbers both with and without your patch in the commit message.
There is already a benchmark for tanhf: onnxruntime/onnxruntime/test/onnx/microbenchmark/activation.cc, lines 342 to 344 (at 69cfcba). The performance numbers of …
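For reference, a minimal sketch of what such a google/benchmark microbenchmark looks like. The real BM_Tanh in activation.cc exercises the MLAS tanh kernel; std::tanh stands in here so the sketch is self-contained, and the sizes are taken from the table above:

```cpp
#include <benchmark/benchmark.h>
#include <cmath>
#include <vector>

// Illustrative sketch only: std::tanh stands in for the MLAS tanh kernel
// that the real benchmark in activation.cc calls.
static void BM_Tanh(benchmark::State& state) {
  const size_t n = static_cast<size_t>(state.range(0));
  std::vector<float> input(n, 0.5f), output(n);
  for (auto _ : state) {
    for (size_t i = 0; i < n; ++i) {
      output[i] = std::tanh(input[i]);
    }
    benchmark::DoNotOptimize(output.data());
  }
}
// Sizes mirror the PR description's table.
BENCHMARK(BM_Tanh)->Arg(40000)->Arg(80000)->Arg(160000)->Arg(1280000);
BENCHMARK_MAIN();
```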
```cpp
size_t count = 0;
while (count < N) {
  if (N - count >= 4) {
```
With this change there is a check on every iteration. If N is large, the previous version can save a significant number of instructions.
For large values of N, the CPU branch predictor should be able to predict this branch easily. It will only miss on the very last iterations for the tail, and when N is large a single branch miss should hardly matter for performance. In exchange, the entire array is processed in a single loop; see the sketch below.
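For illustration, a hedged sketch of the two loop structures under discussion. Tanh4 and Tanh1 are hypothetical stand-ins for the vectorized and scalar MLAS kernels; this is not the PR's actual code:

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-ins: a 4-wide path (SIMD in the real code) and a
// scalar path for the tail.
static void Tanh4(const float* in, float* out) {
  for (int i = 0; i < 4; ++i) out[i] = std::tanh(in[i]);
}
static void Tanh1(const float* in, float* out) { out[0] = std::tanh(in[0]); }

// Previous structure: the hot loop has no per-iteration size check; the
// tail is handled by a separate loop after it.
void TanhTwoLoops(const float* in, float* out, size_t N) {
  size_t count = 0;
  for (; count + 4 <= N; count += 4) Tanh4(in + count, out + count);
  for (; count < N; ++count) Tanh1(in + count, out + count);
}

// This PR's structure: one loop with a branch per iteration. The branch
// resolves the same way on every iteration except at the tail, so the
// predictor should miss at most once when N is large.
void TanhOneLoop(const float* in, float* out, size_t N) {
  size_t count = 0;
  while (count < N) {
    if (N - count >= 4) {
      Tanh4(in + count, out + count);
      count += 4;
    } else {
      Tanh1(in + count, out + count);
      count += 1;
    }
  }
}
```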
/azp run Linux Android Emulator QNN CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU TensorRT CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline
Azure Pipelines successfully started running 8 pipeline(s).
You need to sign the license/CLA agreement to move on.
CLA shows up as signed now.
@yufenglee, please help review.
@r-devulap, thanks a lot for your contribution! Could you please also share an analysis of the accuracy loss?
@yufenglee I calculated the worst case ULP error for the new tanhf implementation: it comes out to 1523.
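A sketch of how such a worst-case ULP sweep can be done (this is an assumption about the method, not necessarily the author's actual script): map float bit patterns to a monotonic integer space, then compare the implementation under test against a double-precision reference over the input range. std::tanh stands in for the SVML-based implementation here:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Map a float's bit pattern to a monotonically increasing integer so that
// the difference between two mapped values is their distance in ULPs.
static int64_t ToOrdered(float f) {
  int32_t i;
  std::memcpy(&i, &f, sizeof(i));
  return i < 0 ? INT32_MIN - static_cast<int64_t>(i) : i;
}

int main() {
  int64_t worst = 0;
  // Exhaustive sweep over [-10, 10]; tanh saturates to +/-1 beyond that.
  // std::tanh(float) is the implementation under test; the double-precision
  // tanh, rounded to float, serves as the correctly rounded reference.
  for (float x = -10.0f; x <= 10.0f; x = std::nextafterf(x, 11.0f)) {
    float got = std::tanh(x);
    float want = static_cast<float>(std::tanh(static_cast<double>(x)));
    int64_t diff = ToOrdered(got) - ToOrdered(want);
    if (diff < 0) diff = -diff;
    if (diff > worst) worst = diff;
  }
  std::printf("worst-case ULP error: %lld\n", static_cast<long long>(worst));
  return 0;
}
```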
ping @yufenglee Just a friendly reminder :)
1523 ULP is a little too large (on the order of 1e-4). I'm not sure whether it leads to end-to-end differences for some models.
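For intuition, the arithmetic presumably behind the ~1e-4 figure, assuming the worst case occurs for results just below 1 in magnitude, where one float ULP is $2^{-24}$:

$$
1523 \times 2^{-24} \approx 1523 \times 5.96 \times 10^{-8} \approx 9.1 \times 10^{-5} \approx 10^{-4}.
$$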
Description

New, faster algorithm for the tanhf activation function using Intel SVML.

Motivation and Context

Improves the performance of tanhf by nearly 40%. The new algorithm also fixes a bug in the current tanhf implementation, whose output can go out of the bounds [-1, 1]. Example: for x = +0x1.06417ep+003, the current code returns tanhf = +0x1.000002p+000, which is greater than 1. A sketch of a regression check for this case follows.
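A minimal sketch of such a regression check, under the assumption that std::tanh stands in for the tanh implementation under test (a real test would call the MLAS entry point instead); the hex input literal is the one from the description:

```cpp
#include <cmath>
#include <cstdio>

// Verify tanh stays within [-1, 1] for the input from the bug report.
int main() {
  const float x = 0x1.06417ep+3f;  // +0x1.06417ep+003 from the description
  float y = std::tanh(x);          // stand-in for the implementation under test
  std::printf("tanhf(%a) = %a\n", x, y);
  if (y < -1.0f || y > 1.0f) {
    std::puts("BUG: tanhf result is outside [-1, 1]");
    return 1;
  }
  return 0;
}
```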