Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove excessive floating-point divides #4312

Closed
wants to merge 5 commits into from

Conversation

heshpdx
Copy link
Contributor

@heshpdx heshpdx commented Sep 3, 2024

Loft the loop-invariant divide outside the hot loops, and/or invert the variable to turn FDIV into FMUL.

Most CPUs are slower at FP division compared to FP multiplication. This should provide some uplift in performance. I was testing with the integer models.

Loft the loop-invariant divide outside the hot loop, and/or
invert the variable to turn FDIV into FMUL.
src/textord/pithsync.cpp Outdated Show resolved Hide resolved
src/textord/pithsync.cpp Outdated Show resolved Hide resolved
src/textord/pithsync.cpp Outdated Show resolved Hide resolved
src/textord/pithsync.cpp Outdated Show resolved Hide resolved
src/lstm/networkio.cpp Outdated Show resolved Hide resolved
src/lstm/networkio.cpp Outdated Show resolved Hide resolved
@stweil
Copy link
Contributor

stweil commented Sep 3, 2024

Do you have timing values from your tests?

@heshpdx
Copy link
Contributor Author

heshpdx commented Sep 3, 2024

Do you have timing values from your tests?

It will be CPU specific, but I see +10% on my Ampere Altra.

@stweil
Copy link
Contributor

stweil commented Sep 3, 2024

It will be CPU specific, but I see +10% on my Ampere Altra.

That's a very significant improvement! I wonder how this ARM64 cpu compares to Intel / AMD cpus with Tesseract recognition and training.

@heshpdx
Copy link
Contributor Author

heshpdx commented Sep 3, 2024

If there are standard tests that you run, please do share the results.

I was using -l deu --tessdata-dir ./tessdata_orig --oem 0 and -l Arabic --tessdata-dir ./tessdata_fast. The -l rus --tessdata-dir ./tessdata_orig --oem 2 did not show much improvement.

@stweil
Copy link
Contributor

stweil commented Sep 3, 2024

Does Ampere Altra offer additional opcodes which could be used to make Tesseract's neural network code faster? We currently use Neon code for ARM64 (see src/arch/*neon.cpp).

@stweil
Copy link
Contributor

stweil commented Sep 3, 2024

If there are standard tests that you run, please do share the results.

You can run make check (after installing required packages, repositories and git submodules) and compare the times for the single tests. lstm_test is the test with the longest execution time.

Here are my results on a Mac mini M2 for running time ./lstm_test:

# git main branch, extract from lstm_test.log and log message from `time`.
[==========] 11 tests from 1 test suite ran. (278833 ms total)
./lstm_test  274,78s user 2,87s system 99% cpu 4:38,88 total

# git main branch with PR applied, extract from lstm_test.log and log message from `time`.
[==========] 11 tests from 1 test suite ran. (276981 ms total)
./lstm_test  273,60s user 2,50s system 99% cpu 4:37,03 total

Shaves off 25% runtime on Ampere Altra running OCR using
the tessdata_orig Russian language model with --oem 2.
@heshpdx
Copy link
Contributor Author

heshpdx commented Sep 23, 2024

After some wrangling, I was able to get the unit tests running on my machine. Here is a rollup of the tests which run longer than 1ms total. I basically culled this out using grep total *_test.log | grep ran | grep -v 0\ | tr '(' ' ' | sort. I hope this is sufficient!

test before after speedup
apiexample_test 454 448 1.013
applybox_test 809 799 1.013
baseapi_test 4582 4485 1.022
baseapi_thread_test 424 415 1.022
commandlineflags_test 39 39 1.000
imagedata_test 281 272 1.033
intfeaturemap_test 1021 1012 1.009
intsimdmatrix_test 825 826 0.999
lang_model_test 7829 7782 1.006
layout_test 7232 6665 1.085
ligature_table_test 4 4 1.000
loadlang_test 141 138 1.022
lstm_recode_test 65028 64371 1.010
lstm_squashed_test 28698 29057 0.988
lstm_test 252497 246625 1.024
normstrngs_test 14 14 1.000
pagesegmode_test 311 296 1.051
pango_font_info_test 13 13 1.000
paragraphs_test 54 54 1.000
progress_test 6 5 1.200
qrsequence_test 81 80 1.013
recodebeam_test 1128 1108 1.018
stringrenderer_test 177 177 1.000
tablefind_test 73 72 1.014
tatweel_test 16 17 0.941
textlineprojection_test 2269 2108 1.076
unicharcompress_test 214 208 1.029

src/lstm/functions.h Outdated Show resolved Hide resolved
Conform to style.
@heshpdx
Copy link
Contributor Author

heshpdx commented Sep 23, 2024

With the latest changes, I get +25% on this cmdline. I have attached the input image here (you need to uncompress it).
math-ru.bmp.gz

./tesseract -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out

Thanks @stweil @zdenop

@stweil
Copy link
Contributor

stweil commented Sep 23, 2024

What does time ./tesseract -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out report with/without your pull request? How did you build Tesseract (compiler, with or without OpenMP, configure options)?

When I run your test on tesseract built with configure --disable-openmp --disable-shared CXX=clang++ CXXFLAGS='-g -O2', I get no significant change on an M1 pro:

# without PR
82,71s user 0,30s system 99% cpu 1:23,47 total
82,96s user 0,40s system 99% cpu 1:23,40 total
# with PR
82,95s user 0,29s system 99% cpu 1:23,52 total
82,95s user 0,54s system 99% cpu 1:23,51 total

On another host (virtual machine with Ampere Altra) I also see no clear winner when running 2 x 3 tests: without PR 221...229 s, with PR 214...234 s.

Comment on lines +203 to +205
T inv_prob_total = 1 / prob_total;
for (int i = 0; i < n; i++) {
inout[i] /= prob_total;
inout[i] *= inv_prob_total;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this kind of optimization something which a good compiler should do automatically?

@stweil
Copy link
Contributor

stweil commented Sep 23, 2024

Although the proposed changes replace FP divides by FP multiplications, I could not reproduce the reported positive effect. Maybe others are more lucky and can confirm the results, or I can reproduce them when I have more information.

@zdenop
Copy link
Contributor

zdenop commented Sep 24, 2024

With the latest changes, I get +25% on this cmdline.

@heshpdx : Is it with or without simd usage?

@heshpdx
Copy link
Contributor Author

heshpdx commented Sep 24, 2024

Good question. This was using the generic path, without intrinsics.

@stweil
Copy link
Contributor

stweil commented Sep 24, 2024

If tesseract -v reports "Found NEON", it will use SIMD instructions for NEON by default.

@zdenop
Copy link
Contributor

zdenop commented Sep 24, 2024

I made a test on RPi4 (armv7l) with 32bit Debian, gcc (Debian 12.2.0-14) 12.2.0.
I made a static build for easier comparison (./autogen.sh && ./configure --prefix=/usr --enable-static --disable-shared). The results are like this:

$ time ./tesseract.main  -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out

real    6m26.522s
user    14m21.678s
sys     0m7.456s

$ time ./tesseract.4312 -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out

real    6m26.177s
user    14m21.324s
sys     0m7.456s

@zdenop
Copy link
Contributor

zdenop commented Sep 25, 2024

I made one more test on the "prehistoric" machine (CPU: Intel(R) Core(TM)2 CPU 4300 @ 1.80GHz; compiler: gcc (SUSE Linux) 11.3.0; OS: openSUSE Leap 15.5 64bit) (e.g. without simd support) :

tesseract 5.4.1-24-g027ad
 leptonica-1.84.2
  libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.13 : libwebp 1.0.3 : libopenjp2 2.3.0
 Found OpenMP 201511
 Found libarchive 3.5.1 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.0
 Found libcurl/8.0.1 OpenSSL/1.1.1l-fips zlib/1.2.13 brotli/1.0.7 zstd/1.5.0 libidn2/2.2.0 libpsl/0.20.1 (+libidn2/2.2.0) libssh/0.9.8/openssl/zlib nghttp2/1.40.0

time ./tesseract.main -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out

real    15m44.186s
user    13m47.928s
sys     1m34.814s


time ./tesseract.4312 -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out

real    15m43.991s
user    13m47.887s
sys     1m36.630s

It seems that stweil's expectation is correct—a good compiler should automatically perform this kind of optimization for known architectures/CPUs.

In conclusion, it neither benefits nor harms the average user, but it could be helpful in certain edge cases.

@stweil
Copy link
Contributor

stweil commented Sep 25, 2024

As long as we don't have a use case where we really get significant better performance, I would prefer to keep our existing code because it is easier to read and simpler. @heshpdx, please provide all information which is necessary to reproduce your results (compiler, build environment and build process, hardware, timing results, maybe more).

@amitdo amitdo closed this Oct 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants