Remove excessive floating-point divides #4312
Conversation
Loft the loop-invariant divide outside the hot loop, and/or invert the variable to turn FDIV into FMUL.
Do you have timing values from your tests?
Co-authored-by: Stefan Weil <[email protected]>
It will be CPU-specific, but I see +10% on my Ampere Altra.
That's a very significant improvement! I wonder how this ARM64 CPU compares to Intel / AMD CPUs for Tesseract recognition and training.
If there are standard tests that you run, please do share the results. I was using
Does Ampere Altra offer additional opcodes which could be used to make Tesseract's neural network code faster? We currently use Neon code for ARM64 (see src/arch/*neon.cpp).
You can run Here are my results on a Mac mini M2 for running
|
Shaves off 25% runtime on Ampere Altra running OCR using the tessdata_orig Russian language model with --oem 2.
After some wrangling, I was able to get the unit tests running on my machine. Here is a rollup of the tests which run longer than 1ms total. I basically culled this out using
|
Conform to style.
With the latest changes, I get +25% on this cmdline. I have attached the input image here (you need to uncompress it).
|
What does When I run your test on
On another host (a virtual machine with Ampere Altra) I also see no clear winner when running 2 x 3 tests: without PR 221...229 s, with PR 214...234 s.
```diff
+  T inv_prob_total = 1 / prob_total;
   for (int i = 0; i < n; i++) {
-    inout[i] /= prob_total;
+    inout[i] *= inv_prob_total;
```
Isn't this kind of optimization something which a good compiler should do automatically?
Although the proposed changes replace FP divides by FP multiplications, I could not reproduce the reported positive effect. Maybe others are luckier and can confirm the results, or I can reproduce them when I have more information.
@heshpdx: Is this with or without SIMD usage?
Good question. This was using the generic path, without intrinsics. |
If |
I made a test on a RPi4 (armv7l) with 32-bit Debian, gcc (Debian 12.2.0-14) 12.2.0:

```
$ time ./tesseract.main -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out

real 6m26.522s
user 14m21.678s
sys  0m7.456s

$ time ./tesseract.4312 -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out

real 6m26.177s
user 14m21.324s
sys  0m7.456s
```
I made one more test on the "prehistoric" machine (CPU: Intel(R) Core(TM)2 CPU 4300 @ 1.80GHz; compiler: gcc (SUSE Linux) 11.3.0; OS: openSUSE Leap 15.5 64bit), i.e. without SIMD support:

```
tesseract 5.4.1-24-g027ad
leptonica-1.84.2
libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.13 : libwebp 1.0.3 : libopenjp2 2.3.0
Found OpenMP 201511
Found libarchive 3.5.1 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.0
Found libcurl/8.0.1 OpenSSL/1.1.1l-fips zlib/1.2.13 brotli/1.0.7 zstd/1.5.0 libidn2/2.2.0 libpsl/0.20.1 (+libidn2/2.2.0) libssh/0.9.8/openssl/zlib nghttp2/1.40.0

$ time ./tesseract.main -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out

real 15m44.186s
user 13m47.928s
sys  1m34.814s

$ time ./tesseract.4312 -l rus --tessdata-dir ./tessdata_orig --oem 2 math-ru.bmp math_out

real 15m43.991s
user 13m47.887s
sys  1m36.630s
```

It seems that stweil's expectation is correct: a good compiler should automatically perform this kind of optimization for known architectures/CPUs. In conclusion, it neither benefits nor harms the average user, but it could be helpful in certain edge cases.
As long as we don't have a use case where we really get significantly better performance, I would prefer to keep our existing code because it is easier to read and simpler. @heshpdx, please provide all information which is necessary to reproduce your results (compiler, build environment and build process, hardware, timing results, maybe more).
Most CPUs are slower at FP division than at FP multiplication. This should provide some performance uplift. I was testing with the integer models.