Flexible word delimiters #2698

JorjMcKie · 2023-09-28T12:10:01Z

This change offers to define up to 64 additional characters for use as delimiters in page.get_text("words"). Delimiters can be set either temporarily in the method itself, or permanently for all subsequent word extractions.

Please note that in rebased, no respective changes were made in extra.i. Any speed boosting will need extra code there.
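To illustrate the intended effect of extra delimiters, here is a minimal pure-Python sketch. It is not the PyMuPDF implementation: `split_words` and its `extra_delimiters` parameter are hypothetical names used only for illustration, and the sketch works on a plain string rather than on a text page.

```python
def split_words(text, extra_delimiters=""):
    """Split text on whitespace plus any extra delimiter characters.

    Hypothetical helper: delimiters never become part of a word.
    """
    extra = set(extra_delimiters)
    words, current = [], []
    for ch in text:
        # Whitespace (this includes U+00A0) always delimits; extras are opt-in.
        if ch.isspace() or ch in extra:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


default_words = split_words("john.doe@example.com, hello")
custom_words = split_words("john.doe@example.com, hello", extra_delimiters=".@,")
```

With the default behaviour the e-mail address stays one word (including the trailing comma); with the extra delimiters it is broken into its components.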

Other changes:

* plain text output characters are now handled the same way as in all other extraction variants. This entails using fz_buffer as intermediate storage for extracted text, no longer fz_output.
* in rare cases, fz_stext_page returns invalid bboxes for images or text: not empty, not infinite, but still with one or more of its edges having coordinates in common with the infinite rectangle. We define a new function "JM_ignore_rect()" to allow filtering out (ignoring) them.
* finally fixed the typo reported in issue #2522 (Typo in set_layer() - NameError: name 'f' is not defined)
* added TOOLS.store_shrink() to Document method "reload_page()" to ensure the absence of cached, now possibly invalid objects pertaining to this page.
* When adding an annotation to a page, removed unnecessary (possibly even wrong) extra annotation flag settings, so now MuPDF's defaults will prevail.
* Added flexibility to the border_width parameter in Page methods "insert_text"/"insert_textbox". This is now interpreted as a fraction of font size, such that e.g. a value of 0.05 corresponds to a border width of 5% of the font size.
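The bbox filter described in the list above can be sketched roughly as follows. This is an assumption-laden illustration, not the actual "JM_ignore_rect()" code: the function name `should_ignore_rect` is hypothetical, and the two constants are assumed values for MuPDF's FZ_MIN_INF_RECT / FZ_MAX_INF_RECT that should be checked against the MuPDF headers in use.

```python
# Assumed values of MuPDF's infinite-rectangle bounds (verify in your headers).
FZ_MIN_INF_RECT = -0x80000000
FZ_MAX_INF_RECT = 0x7FFFFF80


def should_ignore_rect(rect):
    """Return True if any edge coincides with an infinite-rectangle bound.

    Hypothetical helper: such a bbox is neither empty nor fully infinite,
    but it is still not a usable position on the page.
    """
    bounds = (FZ_MIN_INF_RECT, FZ_MAX_INF_RECT)
    return any(coord in bounds for coord in rect)


suspicious = should_ignore_rect((0, 0, 100, FZ_MAX_INF_RECT))
normal = should_ignore_rect((10, 10, 20, 20))
```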

julian-smith-artifex-com · 2023-09-28T13:39:24Z

fitz/fitz.i

@@ -12185,6 +12190,8 @@ struct TextWriter
            opacity = self.opacity
        if color is None:
            color = self.color
+        if render_mode < 0:
+            render_mode = 0


Could this be assert render_mode >= 0 ?

julian-smith-artifex-com · 2023-09-28T13:41:32Z

fitz/fitz.i

+        try:
+            delims = set([ord(c) for c in delims if ord(c) > 32])
+        except:
+            print("bad delimiter value(s)")
+            raise


This ignores delimiters <= 32. Could we instead do assert c > 32?
[Or perhaps simply allow all delimiters?]

I don't want to allow characters that are already delimiters by their very nature. At the very least, 0x00 cannot be allowed, because it serves as the (premature) end of the additional delimiter array.
Also I want to allow strings and sequences of single characters - therefore I need the try-except.
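One way to accept both strings and sequences of single characters while rejecting anything else is sketched below. The name `normalize_delimiters` and the `max_count` parameter are hypothetical, and unlike the diff above the length check here is applied after deduplication; treat it as one possible reading of the requirements discussed, not as the code in this PR.

```python
def normalize_delimiters(delims, max_count=64):
    """Return a sorted tuple of code points for the extra delimiters.

    Hypothetical helper. Accepts a str or any iterable of 1-character
    strings; code points <= 32 are dropped because they already delimit.
    """
    if delims is None:
        return ()
    try:
        # ord() raises TypeError for non-strings and multi-character strings;
        # iterating a non-iterable raises TypeError as well.
        codes = {ord(c) for c in delims}
    except TypeError as exc:
        raise ValueError("bad delimiter value(s)") from exc
    codes = {c for c in codes if c > 32}
    if len(codes) > max_count:
        raise ValueError("too many delimiters")
    return tuple(sorted(codes))


from_string = normalize_delimiters(".,;")
from_list = normalize_delimiters([" ", "\t", "-"])
```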

julian-smith-artifex-com · 2023-09-28T13:42:22Z

fitz/fitz.i

+        if not hasattr(delims, "__getitem__") or len(delims) > 64:
+            raise ValueError("bad delimiter value(s)")


Seems a bit of a shame to have a fixed-size buffer for this.

The idea is to allow `string.punctuation` as delimiters (32 items) plus any equivalent of those among Chinese characters, like "。" (Unicode 0x3002). Nobody will ever be able to keep track of more than this number. We must not forget that none of the delimiters will ever be part of any extracted word ...

In rebased, i think it would actually be simpler to implement and document if we used an unlimited Python list.

julian-smith-artifex-com · 2023-09-28T13:43:34Z

fitz/fitz.i

+
+        try:
+            delims = set([ord(c) for c in delims if ord(c) > 32])
+        except:


Minor thing, but using except Exception: instead of a bare except: does not catch Ctrl-C (KeyboardInterrupt) etc., so the user can still interrupt.

ok, no problem
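The difference comes from Python's exception hierarchy: KeyboardInterrupt and SystemExit derive from BaseException, not from Exception. A small sketch (the helper name `classify` is made up for this illustration):

```python
def classify(exc_type):
    """Report which handler would catch an exception of the given type."""
    try:
        raise exc_type()
    except Exception:
        return "caught by 'except Exception'"
    except BaseException:
        # Only a bare 'except:' or 'except BaseException:' gets here.
        return "not caught by 'except Exception'"


type_error_result = classify(TypeError)
interrupt_result = classify(KeyboardInterrupt)
```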

julian-smith-artifex-com · 2023-09-28T13:44:27Z

fitz/fitz.i

+            fz_try(gctx) {
+                for (i = 0; i < len; i++) {
+                    word_delimiters[i] = (int) PyLong_AsLong(PyTuple_GET_ITEM(delims, (Py_ssize_t) i));
+                    word_delimiters[i+1] = 0;


Could we put word_delimiters[len] = 0 outside the loop?

julian-smith-artifex-com · 2023-09-28T13:46:51Z

fitz/helper-geo-c.i

+        if (f[i] <= FZ_MIN_INF_RECT) f[i] = FZ_MIN_INF_RECT;
+        if (f[i] >= FZ_MAX_INF_RECT) f[i] = FZ_MAX_INF_RECT;


I think these changes have no effect...?

In Python, larger/smaller integers than that are possible.

Yes, but in general if x <= y: x = y behaves identically to if x < y: x = y, so i don't think this diff has any effect.
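The equivalence is easy to check exhaustively on a small range. A minimal sketch, using hypothetical helper names and small stand-in bounds instead of the real FZ_MIN_INF_RECT / FZ_MAX_INF_RECT:

```python
def clamp_strict(value, lo, hi):
    """Clamp using strict comparisons (the code before the diff)."""
    if value < lo:
        value = lo
    if value > hi:
        value = hi
    return value


def clamp_inclusive(value, lo, hi):
    """Clamp using <= and >= (the code after the diff)."""
    if value <= lo:
        value = lo
    if value >= hi:
        value = hi
    return value


# When value == lo (or hi), the assignment writes the same number back,
# so both variants agree for every input.
LO, HI = -10, 10
identical = all(
    clamp_strict(v, LO, HI) == clamp_inclusive(v, LO, HI)
    for v in range(-15, 16)
)
```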

julian-smith-artifex-com · 2023-09-28T13:47:21Z

fitz/helper-geo-c.i

+        if (x[i] <= FZ_MIN_INF_RECT) x[i] = FZ_MIN_INF_RECT;
+        if (x[i] >= FZ_MAX_INF_RECT) x[i] = FZ_MAX_INF_RECT;


This change has no effect?

julian-smith-artifex-com · 2023-09-28T13:48:16Z

fitz/helper-stext.i

+    } else if (ch >= 0xd800 && ch <= 0xdfff) {
+        fz_append_string(ctx, buff, "\\ufffd");


Could do with a comment explaining what these values are.
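For reference, 0xD800 to 0xDFFF is the UTF-16 surrogate range: these code points are not valid Unicode scalar values and cannot be encoded as UTF-8, and U+FFFD is the Unicode REPLACEMENT CHARACTER. A sketch of the kind of commented code being asked for follows; `escape_code_point` is a hypothetical name, and the non-surrogate branches are only illustrative assumptions about how other characters might be escaped.

```python
SURROGATE_FIRST = 0xD800  # first UTF-16 high surrogate
SURROGATE_LAST = 0xDFFF   # last UTF-16 low surrogate
REPLACEMENT = "\\ufffd"   # escaped U+FFFD REPLACEMENT CHARACTER


def escape_code_point(ch):
    """Return an escaped representation of the integer code point ch."""
    if SURROGATE_FIRST <= ch <= SURROGATE_LAST:
        # Lone surrogates are not valid characters: substitute U+FFFD.
        return REPLACEMENT
    if ch <= 0xFFFF:
        return f"\\u{ch:04x}"  # illustrative 4-digit escape
    return f"\\U{ch:08x}"      # illustrative 8-digit escape


surrogate = escape_code_point(0xD83D)
ideographic_stop = escape_code_point(0x3002)
```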

julian-smith-artifex-com · 2023-09-28T13:50:00Z

fitz/utils.py

 ) -> list:
    """Return the text words as a list with the bbox for each word.

    Args:
        flags: (int) control the amount of data parsed into the textpage.
+        delimiters: (str,list) characters to use as word delimiters


delimiters are extra characters - i think we always treat 32 and 160 as delimiters?

Yes, read as extra delimiters

julian-smith-artifex-com · 2023-09-28T13:50:54Z

fitz/utils.py

+    if delimiters is not None:
+        old_delimiters = TOOLS.get_word_delimiters()
    tp = textpage
    if tp is None:
        tp = page.get_textpage(clip=clip, flags=flags)
    elif getattr(tp, "parent") != page:
        raise ValueError("not a textpage of this page")
+    if delimiters is not None:
+        TOOLS.set_word_delimiters(delimiters)
    words = tp.extractWORDS()
    if textpage is None:
        del tp
    if sort is True:
        words.sort(key=lambda w: (w[3], w[0]))
+    if delimiters is not None:
+        TOOLS.set_word_delimiters(old_delimiters)


Not thread-safe. Ok for classic, but we probably want to avoid in rebased.

The idea is to allow temporary (extra) delimiters - this call only.
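If global state is kept, one hedged way to make the "this call only" behaviour safer is a context manager that restores the old value even when extraction raises, and that holds a lock so that threads cannot see each other's temporary settings. Everything below is a hypothetical sketch: the module-level variable, the lock and both function names are assumptions, not PyMuPDF API. Passing the delimiters as a function parameter would avoid the shared state altogether.

```python
import contextlib
import threading

_state_lock = threading.RLock()  # re-entrant, so nested use in one thread works
_word_delimiters = ()            # hypothetical stand-in for the global setting


def current_delimiters():
    """Return the delimiters currently in effect."""
    return _word_delimiters


@contextlib.contextmanager
def temporary_delimiters(delims):
    """Set extra delimiters for the duration of a with-block only."""
    global _word_delimiters
    # Holding the lock for the whole block serializes extractions that use
    # temporary delimiters; that is the price of keeping a global.
    with _state_lock:
        saved = _word_delimiters
        _word_delimiters = tuple(delims)
        try:
            yield
        finally:
            _word_delimiters = saved  # restored even if the body raises


with temporary_delimiters(".,"):
    inside = current_delimiters()
after = current_delimiters()
```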

julian-smith-artifex-com · 2023-09-28T13:56:44Z

src/__init__.py

@@ -96,6 +96,8 @@ def get_env_bool( name, default):
 # Unset ascender / descender corrections
 g_skip_quad_corrections = 0

+# additional word delmiters
+g_word_delimiters = [0] * 65


Feels like we should do g_word_delimiters = list(). Or, better, pass around as a fn param so things are thread safe.

julian-smith-artifex-com · 2023-09-28T14:01:56Z

src/__init__.py

+    elif ch >= 0xd800 and ch <= 0xdfff:
+        mupdf.fz_append_string(buff, "\\ufffd")


As with classic, would like a comment saying what these character ranges are.

julian-smith-artifex-com · 2023-09-28T14:05:09Z

src/__init__.py

+    def get_word_delimiters():
+        """Return extra word delimiting characters."""
+        global g_word_delimiters
+        delims = [chr(c) for c in g_word_delimiters if c != 0]
+        return delims


Is this right if g_word_delimiters is set to 10 characters, then reset to 5 characters? In this case i think it could be ....0.....0, where . is any non-zero character. I.e. we probably want to terminate at the first 0.

Also, don't need global g_word_delimiters - only required if we are modifying it.
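The stale-entry problem can be shown with a zero-terminated buffer that first held nine delimiters and was then overwritten with five plus a terminator. A minimal sketch with hypothetical function names; it assumes the buffer is only ever overwritten from the start, as described in the comment above:

```python
from itertools import takewhile


def delimiters_skip_zeros(buffer):
    """Behaviour of the reviewed code: drop zeros wherever they occur."""
    return [chr(c) for c in buffer if c != 0]


def delimiters_until_zero(buffer):
    """Suggested behaviour: stop at the first 0 terminator."""
    return [chr(c) for c in takewhile(lambda c: c != 0, buffer)]


# Five current delimiters, a terminator, then four stale ones from earlier.
buffer = [ord(c) for c in ".,;:!"] + [0] + [ord(c) for c in "()[]"] + [0]

with_stale = delimiters_skip_zeros(buffer)
current_only = delimiters_until_zero(buffer)
```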

julian-smith-artifex-com · 2023-09-28T14:06:46Z

src/__init__.py

+        global g_word_delimiters
+        if delims is None:
+            delims = []
+        if not hasattr(delims, "__getitem__") or len(delims) > 64:
+            raise ValueError("bad delimiter value(s)")
+        try:
+            delims = set([ord(c) for c in delims if ord(c) > 32])
+        except:
+            print("bad delimiter value(s)")
+            raise
+        delims = tuple(delims)
+        if not delims:
+            g_word_delimiters = [0] * 65
+        else:
+            g_word_delimiters = delims
+        return # extra.set_word_delimiters(delims)


If we can set g_word_delimiters to a set here, we probably don't need it to be [0]*65 when we initialise?

julian-smith-artifex-com

Can we have a test for these changes?

Comment says no change to src/extra.i but there are changes...?

JorjMcKie · 2023-09-29T10:05:49Z

Thanks for the review! I will give the comments a closer look today or over the weekend. Based on them, I probably have to redesign things.

JorjMcKie · 2023-09-29T10:06:17Z

BTW did you understand the reason for the failed test?

julian-smith-artifex-com · 2023-09-29T10:17:51Z

BTW did you understand the reason for the failed test?

If you click on details on the right of the Test quick failure, this will bring up the page for the failing workflow.
Click on the 6-pointed wheel icon on the top right.
In the resulting menu, choose View raw logs.
This shows the output of the workflow (showing the PyMuPDF build and running of tests).

In this case, the problem is an error when compiling the swig-generated file fitz/fitz.i.c (this is classic implementation):

/project/fitz/fitz.i.c: In function ‘JM_is_word_delimiter’:
/project/fitz/fitz.i.c:8205:10: error: ‘i’ undeclared (first use in this function)
 8205 |     for (i = 0; i < (int) nelem(word_delimiters); i++) {

So this is a bug in fitz/fitz.i. The line number will be different (unfortunately swig doesn't emit #line markers), so you'll have to search for the for (i = 0; i < (int) nelem(word_delimiters); i++).

JorjMcKie requested a review from julian-smith-artifex-com September 28, 2023 12:10

julian-smith-artifex-com reviewed Sep 28, 2023


JorjMcKie closed this Oct 2, 2023

github-actions bot locked and limited conversation to collaborators Oct 2, 2023

JorjMcKie deleted the new-word-delimiters branch October 2, 2023 12:20
