Remove excessive floating-point divides #4312
Conversation
Loft the loop-invariant divide outside the hot loop, and/or invert the variable to turn FDIV into FMUL.
Do you have timing values from your tests?
Co-authored-by: Stefan Weil <[email protected]>
It will be CPU-specific, but I see +10% on my Ampere Altra.
That's a very significant improvement! I wonder how this ARM64 CPU compares to Intel / AMD CPUs for Tesseract recognition and training.
If there are standard tests that you run, please do share the results. I was using
Does Ampere Altra offer additional opcodes which could be used to make Tesseract's neural network code faster? We currently use Neon code for ARM64 (see src/arch/*neon.cpp).
You can run Here are my results on a Mac mini M2 for running
|
Shaves off 25% runtime on Ampere Altra running OCR using the tessdata_orig Russian language model with --oem 2.
After some wrangling, I was able to get the unit tests running on my machine. Here is a rollup of the tests which run longer than 1ms total. I basically culled this out using
|
Conform to style.
With the latest changes, I get +25% on this cmdline. I have attached the input image here (you need to uncompress it).
|
What does When I run your test on
On another host (a virtual machine with Ampere Altra) I also see no clear winner when running 2 x 3 tests: without PR 221...229 s, with PR 214...234 s.
```diff
+  T inv_prob_total = 1 / prob_total;
   for (int i = 0; i < n; i++) {
-    inout[i] /= prob_total;
+    inout[i] *= inv_prob_total;
```
Isn't this kind of optimization something which a good compiler should do automatically?
Although the proposed changes replace FP divides by FP multiplications, I could not reproduce the reported positive effect. Maybe others are luckier and can confirm the results, or I can reproduce them when I have more information.
@heshpdx: Is this with or without SIMD usage?
Good question. This was using the generic path, without intrinsics. |
If |
I made a test on a RPi4 (armv7l) with 32-bit Debian, gcc (Debian 12.2.0-14) 12.2.0:

```
$ time ./tesseract.main -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out

real 6m26.522s
user 14m21.678s
sys  0m7.456s

$ time ./tesseract.4312 -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out

real 6m26.177s
user 14m21.324s
sys  0m7.456s
```
I made one more test on the "prehistoric" machine (CPU: Intel(R) Core(TM)2 CPU 4300 @ 1.80GHz; compiler: gcc (SUSE Linux) 11.3.0; OS: openSUSE Leap 15.5 64bit), i.e. without SIMD support:

```
tesseract 5.4.1-24-g027ad
leptonica-1.84.2
libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.13 : libwebp 1.0.3 : libopenjp2 2.3.0
Found OpenMP 201511
Found libarchive 3.5.1 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.0
Found libcurl/8.0.1 OpenSSL/1.1.1l-fips zlib/1.2.13 brotli/1.0.7 zstd/1.5.0 libidn2/2.2.0 libpsl/0.20.1 (+libidn2/2.2.0) libssh/0.9.8/openssl/zlib nghttp2/1.40.0

$ time ./tesseract.main -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out

real 15m44.186s
user 13m47.928s
sys  1m34.814s

$ time ./tesseract.4312 -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out

real 15m43.991s
user 13m47.887s
sys  1m36.630s
```

It seems that stweil's expectation is correct: a good compiler should automatically perform this kind of optimization for known architectures/CPUs. In conclusion, it neither benefits nor harms the average user, but it could be helpful in certain edge cases.
As long as we don't have a use case where we really get significantly better performance, I would prefer to keep our existing code because it is easier to read and simpler. @heshpdx, please provide all information which is necessary to reproduce your results (compiler, build environment and build process, hardware, timing results, maybe more).
Most CPUs are slower at FP division than at FP multiplication. This should provide some performance uplift. I was testing with the integer models.