Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flexible word delimiters #2698

Closed
wants to merge 1 commit into from
Closed

Flexible word delimiters #2698

wants to merge 1 commit into from

Conversation

JorjMcKie
Copy link
Collaborator

This change offers to define up to 64 additional characters for use as delimiters in page.get_text("words"). Delimiters can be set either temporarily in the method itself, or permanently for all subsequent word extractions.

Please note that in rebased, no respective changes were made in extra.i. Any speed boosting will need extra code there.

Other changes:

  • plain text output characters are now handled the same way as in all other extraction variants. This entails using fz_buffer as intermediate storage for extracted text - not any longer fz_output.

  • in rare cases, fz_stext_page returns invalid bboxes for images or text: not empty, not infinite, but still with one or more of its edges having coordinates in common with the infinite rectangle. We define a new function "JM_ignore_rect()" to allow filtering out (ignoring) them.

  • finally fixed typo in issue Typo in set_layer() - NameError: name 'f' is not defined #2522

  • added TOOLS.store_shrink() to Document method "reload_page()" to ensure the absence of cached, now possibly invalid objects pertaining to this page.

  • When adding annotation to a page, removed unnecessary (possibly even wrong) extra annotation flag settings, so now MuPDF's defaults will prevail.

  • Added flexibility to border_width parameter in Page methods "insert_text"/"insert_textbox". This is now interpreted as a fraction of font size, such that e.g. a value of 0.05 corresponds to a border width of 5% of the font size.

This change offers to define up to 64 additional characters for use as delimiters in page.get_text("words").
Delimiters can be set either temporarily in the method itself, or permanently for all subsequent word extractions.

Please note that in rebased, no respective changes were made in extra.i. Any speed boosting will need extra code there.

Other changes:

* plain text output characters are now handled the same way as in all other extraction variants. This entails using fz_buffer as intermediate storage for extracted text - not any longer fz_output.

* in rare cases, fz_stext_page returns invalid bboxes for images or text: not empty, not infinite, but still with one or more of its edges having coordinates in common with the infinite rectangle. We define a new function "JM_ignore_rect()" to allow filtering out (ignoring) them.

* finally fixed typo in issue #2522

* added TOOLS.store_shrink() to Document method "reload_page()" to ensure the absence of cached, now possibly invalid objects pertaining to this page.

* When adding annotation to a page, removed unnecessary (possibly even wrong) extra annotation flag settings, so now MuPDF's defaults will prevail.

* Added flexibility to border_width parameter in Page methods "insert_text"/"insert_textbox". This is now interpreted as a fraction of font size, such that e.g. a value of 0.05 corresponds to a border width of 5% of the font size.
@@ -12185,6 +12190,8 @@ struct TextWriter
opacity = self.opacity
if color is None:
color = self.color
if render_mode < 0:
render_mode = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be assert render_mode >= 0 ?

Comment on lines +14430 to +14434
try:
delims = set([ord(c) for c in delims if ord(c) > 32])
except:
print("bad delimiter value(s)")
raise
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ignores delimitors <= 32. Could we instead do assert c > 32?
[Or perhaps simply allow all delimiters?]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont wan't to allow delimiters that already are ones by their very nature. At the very least, 0x00 cannot be allowed, because it serves as the (premature) end of add'l delimiter array.
Also I want to allow strings and sequences of single characters - therefore I need the try-except.

Comment on lines +14427 to +14428
if not hasattr(delims, "__getitem__") or len(delims) > 64:
raise ValueError("bad delimiter value(s)")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a bit of a shame to have a fixed-size buffer for this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to allow 'string.punctuation` as delimiters (32 items) plus any equivalent of those crude Chinese characters like "。" (Unicode 0x3002). Nobody will ever be able to keep track of more than this number. We must not forget that none of the delimiters will ever be part of any extracted word ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In rebased, i think it would actually be simpler to implement and document if we used an unlimited Python list.


try:
delims = set([ord(c) for c in delims if ord(c) > 32])
except:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor thing, but doing except Exception: ignores Ctrl-C etc, allowing user to interrupt.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, no problem

fz_try(gctx) {
for (i = 0; i < len; i++) {
word_delimiters[i] = (int) PyLong_AsLong(PyTuple_GET_ITEM(delims, (Py_ssize_t) i));
word_delimiters[i+1] = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we put word_delimiters[len] = 0 outside the loop?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Comment on lines +79 to +80
if (f[i] <= FZ_MIN_INF_RECT) f[i] = FZ_MIN_INF_RECT;
if (f[i] >= FZ_MAX_INF_RECT) f[i] = FZ_MAX_INF_RECT;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these changes have no effect...?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Python, larger/smaller integers than that are possible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but in general if x <= y: x = y behaves identically to if x < y: x = y, so i don't think this diff has any effect.

Comment on lines +134 to +135
if (x[i] <= FZ_MIN_INF_RECT) x[i] = FZ_MIN_INF_RECT;
if (x[i] >= FZ_MAX_INF_RECT) x[i] = FZ_MAX_INF_RECT;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change has no effect?

Comment on lines +101 to +102
} else if (ch >= 0xd800 && ch <= 0xdfff) {
fz_append_string(ctx, buff, "\\ufffd");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do with a comment explaining what these values are.

) -> list:
"""Return the text words as a list with the bbox for each word.

Args:
flags: (int) control the amount of data parsed into the textpage.
delimiters: (str,list) characters to use as word delimiters
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delimiters are extra characters - i think we always treat 32 and 160 as delimiters?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, read as extra delimiters

Comment on lines +553 to +568
if delimiters is not None:
old_delimiters = TOOLS.get_word_delimiters()
tp = textpage
if tp is None:
tp = page.get_textpage(clip=clip, flags=flags)
elif getattr(tp, "parent") != page:
raise ValueError("not a textpage of this page")
if delimiters is not None:
TOOLS.set_word_delimiters(delimiters)
words = tp.extractWORDS()
if textpage is None:
del tp
if sort is True:
words.sort(key=lambda w: (w[3], w[0]))
if delimiters is not None:
TOOLS.set_word_delimiters(old_delimiters)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not thread-safe. Ok for classic, but we probably want to avoid in rebased.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to allow temporary (extra) delimiters - this call only.

@@ -96,6 +96,8 @@ def get_env_bool( name, default):
# Unset ascender / descender corrections
g_skip_quad_corrections = 0

# additional word delmiters
g_word_delimiters = [0] * 65
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like we should do g_word_delimiters = list(). Or, better, pass around as a fn param so things are thread safe.

Comment on lines +13917 to +13918
elif ch >= 0xd800 and ch <= 0xdfff:
mupdf.fz_append_string(buff, "\\ufffd")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with classic, would like a comment saying what these character ranges are.

Comment on lines +21045 to +21049
def get_word_delimiters():
"""Return extra word delimiting characters."""
global g_word_delimiters
delims = [chr(c) for c in g_word_delimiters if c != 0]
return delims
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right if g_word_delimiters is set to 10 characters, then reset to 5 characters? In this case i think it could be ....0.....0, where . is any non-zero character. I.e. we probably want to terminate at the first 0.

Also, don't need global g_word_delimiters - only required if we are modifying it.

Comment on lines +21055 to +21070
global g_word_delimiters
if delims is None:
delims = []
if not hasattr(delims, "__getitem__") or len(delims) > 64:
raise ValueError("bad delimiter value(s)")
try:
delims = set([ord(c) for c in delims if ord(c) > 32])
except:
print("bad delimiter value(s)")
raise
delims = tuple(delims)
if not delims:
g_word_delimiters = [0] * 65
else:
g_word_delimiters = delims
return # extra.set_word_delimiters(delims)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can set g_word_delimiters to a set here, we probably don't need it to be [0]*65 when we initiallise?

Copy link
Collaborator

@julian-smith-artifex-com julian-smith-artifex-com left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a test for these changes?

Comment says no change to src/extra.i but there are changes...?

@JorjMcKie
Copy link
Collaborator Author

Thanks for review! I will give them a closer look today or over the weekend. PBased on it, I probably have to redesign things.

@JorjMcKie
Copy link
Collaborator Author

BTW did you understand the reason for the failed test?

@julian-smith-artifex-com
Copy link
Collaborator

BTW did you understand the reason for the failed test?

  • If you click on details on the right of the Test quick failure, this will bring up the page for the failing workflow.
  • Click on the 6-pointed wheel icon on the top right.
  • In the resulting menu, choose View raw logs.
  • This shows the output of the workflow (showing the PyMuPDF build and running of tests).

In this case, the problem is an error when compiling the swig-generated file fitz/fitz.i.c (this is classic implementation):

/project/fitz/fitz.i.c: In function ‘JM_is_word_delimiter’:
/project/fitz/fitz.i.c:8205:10: error: ‘i’ undeclared (first use in this function)
 8205 |     for (i = 0; i < (int) nelem(word_delimiters); i++) {

So this is a bug in fitz/fitz.i. The line number will be different (unfortunately swig doesn't emit #line markers), so you'll have to search for the for (i = 0; i < (int) nelem(word_delimiters); i++).

@JorjMcKie JorjMcKie closed this Oct 2, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Oct 2, 2023
@JorjMcKie JorjMcKie deleted the new-word-delimiters branch October 2, 2023 12:20
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants