Bug in foreground/background separator choosing massive block instead of character outline. #52

rmast · 2022-06-25T13:11:33Z

This is a partly anonymized reproduction of my earlier finding: compressing the bank statement with a downsampled foreground reveals a bug in the foreground binarizer/separator.

Add fg_downsample=12 in compress-pdf-images:

    mrc_gen = create_mrc_hocr_components(pil_image, hocr_word_data,
    #mrc_gen = create_mrc_hocr_components(pil_image, [],
            denoise_mask=DENOISE_FAST,
            bg_downsample=3,
            fg_downsample=12
            )

bankstatementgeknipt8noalphag.zip

    ocrmypdf --pdfa-image-compression lossless -O0 --image-dpi 600 bankstatementgeknipt8noalphag.tiff outgeknipt8g.pdf
    pdfcomp outgeknipt8g.pdf outgeknipt8-12g.pdf
outgeknipt8-12g.pdf
outgeknipt8g.pdf


rmast · 2022-06-25T17:12:58Z

I just looked up the issue myself. There's something wrong with the ratio-determination:

The number of 0's is derived from the size of the complete image instead of the size of the text box.
If you correct that, the issue is gone.
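
A minimal sketch of the difference, not the actual code in mrc.py: the helper names and the toy image are made up for illustration, and plain Python lists stand in for the NumPy arrays the real code uses. It counts the foreground pixels in one text box and derives the zero count either from the full image size (the behaviour described above) or from the box size.

```python
def count_ones(box):
    """Count foreground (non-zero) pixels in a 2D list of 0/1 values."""
    return sum(1 for row in box for px in row if px)


def crop(image, top, bottom, left, right):
    """Return the rows top:bottom and columns left:right of a 2D list."""
    return [row[left:right] for row in image[top:bottom]]


def ratio_full_image(image, box):
    """Problematic variant: zeros are derived from the whole image size."""
    ones = count_ones(box)
    zeros = len(image) * len(image[0]) - ones
    return ones / (zeros + ones)


def ratio_local(box):
    """Corrected variant: zeros are derived from the text box size."""
    ones = count_ones(box)
    zeros = len(box) * len(box[0]) - ones
    return ones / (zeros + ones)


# Hypothetical 100x100 page with one 10x10 text box that is half foreground.
image = [[0] * 100 for _ in range(100)]
for y in range(20, 30):
    for x in range(40, 45):
        image[y][x] = 1

box = crop(image, 20, 30, 40, 50)
print(ratio_full_image(image, box))  # 50 / 10000
print(ratio_local(box))              # 50 / 100
```

With the full image as denominator, every box looks almost empty, so a ratio test can no longer tell a solid block from a character outline.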

rmast · 2022-06-25T17:13:30Z

By the way, nice tool, PyCharm. It looks somewhat like IntelliJ, which I tried before.

rmast · 2022-06-25T17:48:15Z

I created a pull request with the fix. You might even be tempted to merge it into master.

MerlijnWajer · 2022-06-28T21:42:15Z

> I just looked up the issue myself. There's something wrong with the ratio-determination: [image]

Thanks for finding this; it is indeed a real problem. I will take a look and see if this fix is OK, but I will need to do some local testing on my test images to make sure everything looks right.

MerlijnWajer · 2022-06-28T22:07:11Z

So with the change from your pull request, there are some regressions for some of my tests, for example

Before:

After:

MerlijnWajer · 2022-06-28T22:21:05Z

Removing the *100 makes it better, but some other images still regress, so I will need to spend a bit more time on this later. Thanks for noticing. I also found another typo: the code writes "- ones" where it should write "- ones_i".
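
One plausible reading of why removing the *100 helps (an assumption based on the thresholds such as 0.3 used in create_hocr_mask, not a confirmed diagnosis): those thresholds look like fractions, while a ratio multiplied by 100 is a percentage, so the two sit on different scales. The function below is a hypothetical illustration of that mismatch only.

```python
def is_sparse(ones, zeros, threshold=0.3, percent=False):
    """Return True when the foreground share is below the threshold."""
    ratio = ones / (zeros + ones)
    if percent:
        ratio *= 100  # now a percentage in the range 0..100
    return ratio < threshold


# A box in which 10% of the pixels are foreground.
ones, zeros = 10, 90
print(is_sparse(ones, zeros, percent=False))  # True: 0.1 is below 0.3
print(is_sparse(ones, zeros, percent=True))   # False: about 10 is not below 0.3
```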

rmast · 2022-06-28T22:53:45Z

Yes, the inversion choice seems incomplete. Stefan is also working on inversion logic; we might learn from it. I assume that when you have a text box, the surrounding pixels will often contain the background color. The foreground color will touch all 4 borders from inside the text box.
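
A rough sketch of that border idea, under the assumption that the outer ring of a text box is mostly background. Both function names are hypothetical and the boxes are toy 0/1 lists rather than real thresholded image data.

```python
def border_pixels(box):
    """Collect the values on the outer ring of a 2D list."""
    top, bottom = box[0], box[-1]
    sides = [row[0] for row in box[1:-1]] + [row[-1] for row in box[1:-1]]
    return list(top) + list(bottom) + sides


def needs_inversion(thres_box):
    """Guess that the mask is inverted when most border pixels are set."""
    ring = border_pixels(thres_box)
    return sum(ring) > len(ring) / 2


# Dark text on a light background: the border is all background (0).
normal = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
# The same box with foreground and background swapped.
inverted = [[1 - px for px in row] for row in normal]

print(needs_inversion(normal))    # False
print(needs_inversion(inverted))  # True
```

This would misfire on tightly cropped boxes where glyphs touch the edge, so it is at best one signal next to the ratio test.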

MerlijnWajer · 2022-06-28T22:58:54Z

I don't remember exactly what I toyed with, but I definitely tried to do something like that: trying to rely on what makes a character vs noise in the bounding box. I think my idea was to use the "ratio": characters usually don't fill up most of the pixels in the bounding box, and if you apply that to an entire word or even a line, then any outliers (e.g. 'w') will be filtered out. That's what the original code was designed to do. On top of that, I then added some simple noise estimation to filter out noise.

I think I locally have some changes that improve somewhat over the current code in my test cases and don't have the bug you found, but I'll need to do further evaluation.

rmast · 2022-06-29T06:48:23Z

Continuing on your thought I would expect an if-construction like this:

diff --git a/internetarchivepdf/mrc.py b/internetarchivepdf/mrc.py
index f6290db..e2bb6c0 100644
--- a/internetarchivepdf/mrc.py
+++ b/internetarchivepdf/mrc.py
@@ -237,12 +237,11 @@ def create_hocr_mask(img, mask_arr, hocr_word_data, downsample=None, dpi=None, t
             zero_i = thres_invert[np.where(thres_invert == 0)].size
             inv_ratio = (ones_i/(zero_i+ones_i))*100

+
+
             if ratio < 0.3 or inv_ratio < 0.3:
                 th = None

-                perc_larger = 0.
-                if inv_ratio != 0.0:
-                    perc_larger = (ratio / inv_ratio) * 100

                 if inv_ratio > 0.2 and ratio < 0.2:
                     th = thres
@@ -261,9 +260,16 @@ def create_hocr_mask(img, mask_arr, hocr_word_data, downsample=None, dpi=None, t
                         th = thres_invert
                     elif ratio < 0.2:
                         th = thres
-
-                if th is not None:
-                    mask_arr[top:bottom, left:right] = th
+            else:
+                perc_larger = 0.
+                if inv_ratio != 0.0:
+                    perc_larger = (ratio / inv_ratio) * 100
+                if perc_larger < 50:
+                    th = thres
+                else:
+                    th = thres_invert
+            if th is not None:
+                mask_arr[top:bottom, left:right] = th


     if timing_data is not None:
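
Isolated from the diff above, the proposed else-branch can be read as the stand-alone function below. The function name and the string return values are illustrative only; the real code assigns the thres or thres_invert array to th.

```python
def pick_fallback_mask(ratio, inv_ratio):
    """Mirror the proposed else-branch: choose thres or thres_invert."""
    perc_larger = 0.0
    if inv_ratio != 0.0:
        perc_larger = (ratio / inv_ratio) * 100
    # Below 50 means ratio is less than half of inv_ratio.
    return "thres" if perc_larger < 50 else "thres_invert"


print(pick_fallback_mask(0.35, 0.9))  # thres
print(pick_fallback_mask(0.9, 0.35))  # thres_invert
print(pick_fallback_mask(0.5, 0.0))   # thres, because perc_larger stays 0
```

Note the last case: when inv_ratio is exactly 0, perc_larger keeps its default of 0 and thres is chosen regardless of ratio.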

rmast · 2022-06-29T13:41:03Z

By the way, DjVu has an expired patented algorithm for foreground/background separation: https://patents.google.com/patent/US6901169
However, it performs worse when there is noise in the scan that looks like holes in the mask:
jwilk/didjvu#21

MerlijnWajer · 2022-08-01T21:59:54Z

Just a heads up: I'm branching the current code to a 1.4.x branch so that I can build future archive.org releases based on that, which allows master to take more "compression"-breaking changes.

We performed a lot of QA on the output with the current parameters/code, so I don't feel confident just rolling out changes, however minor. This should set us up to make some more breaking changes in master.

rmast · 2022-08-02T04:16:01Z

At the moment I'm molding EasyOCR detection around Tesseract's PSM 7 to get better segmentation for binarisation. Warped or rotated content will benefit less from JBIG2, so automatic tilting to horizontal, as ScanTailor does, might be a useful preprocessing step. I'm currently looking for the best way to predict inversion by thresholding the surroundings of a 4-point freeform text box with the Otsu threshold of its contents. I'm considering CV2 warping, but I don't know the cost. I'm afraid what I do will end up as a concept with lots of library dependencies, so once it has proven to supersede the current InternetArchive compression, there would still be a step to translate it into the cheapest alternative with respect to performance and (licensing) inclusion.
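
A dependency-free sketch of the inversion prediction described above, assuming the grey values of the box contents and of its surroundings have already been extracted as flat lists of 0..255 integers. The function names are hypothetical, the freeform-box extraction and warping are left out, and the Otsu routine takes the middle of a tied plateau so that the toy data gets a sensible threshold.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold for 8-bit grey values (low class is <= t)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_var, first_t, last_t = -1.0, 0, 0
    w_low, sum_low = 0, 0
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue
        w_high = total - w_low
        if w_high == 0:
            break
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, first_t, last_t = var_between, t, t
        elif var_between == best_var:
            last_t = t  # extend the plateau of equally good thresholds
    return (first_t + last_t) // 2


def background_is_dark(contents, surround):
    """Apply the contents' Otsu threshold to the surrounding pixels."""
    t = otsu_threshold(contents)
    dark = sum(1 for p in surround if p <= t)
    return dark > len(surround) / 2


# Dark text (20) on light paper (220), light surroundings.
print(background_is_dark([20] * 30 + [220] * 70, [215] * 40))  # False
# Light text on a dark background, dark surroundings: inverted.
print(background_is_dark([220] * 30 + [20] * 70, [25] * 40))   # True
```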

MerlijnWajer · 2022-08-02T09:30:47Z

If you're looking to fix deskew issues mostly automatically, for text content, we use this: https://git.archive.org/archivecd/tesserotate/

It's only applied to our books and microfilm, but it works wonders, in my experience. It's based on Tesseract. It's combined with some heuristics, but by itself it works pretty decently. (Better than leptonica's deskew imho)

MerlijnWajer · 2022-08-02T09:31:31Z

In the past I decided not to include deskew and similar preprocessing in archive-pdf-tools, because it can be done by another tool prior to invoking recode_pdf.

rmast mentioned this issue Jun 25, 2022

correct ratio determination for noise estimation #53

Open

MerlijnWajer added a commit that referenced this issue Aug 1, 2022

mrc: calculate zero ratio on local area not full image

bef77b2

This was found by rmast in Github issue #52: #52

MerlijnWajer added a commit that referenced this issue Aug 1, 2022

mrc: calculate zero ratio on local area not full image

3c20a46

This was found by rmast in Github issue #52: #52
