-
Notifications
You must be signed in to change notification settings - Fork 511
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flexible word delimiters #2698
Flexible word delimiters #2698
Conversation
This change offers to define up to 64 additional characters for use as delimiters in page.get_text("words"). Delimiters can be set either temporarily in the method itself, or permanently for all subsequent word extractions. Please note that in rebased, no respective changes were made in extra.i. Any speed boosting will need extra code there. Other changes: * plain text output characters are now handled the same way as in all other extraction variants. This entails using fz_buffer as intermediate storage for extracted text - not any longer fz_output. * in rare cases, fz_stext_page returns invalid bboxes for images or text: not empty, not infinite, but still with one or more of its edges having coordinates in common with the infinite rectangle. We define a new function "JM_ignore_rect()" to allow filtering out (ignoring) them. * finally fixed typo in issue #2522 * added TOOLS.store_shrink() to Document method "reload_page()" to ensure the absence of cached, now possibly invalid objects pertaining to this page. * When adding annotation to a page, removed unnecessary (possibly even wrong) extra annotation flag settings, so now MuPDF's defaults will prevail. * Added flexibility to border_width parameter in Page methods "insert_text"/"insert_textbox". This is now interpreted as a fraction of font size, such that e.g. a value of 0.05 corresponds to a border width of 5% of the font size.
@@ -12185,6 +12190,8 @@ struct TextWriter | |||
opacity = self.opacity | |||
if color is None: | |||
color = self.color | |||
if render_mode < 0: | |||
render_mode = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be assert render_mode >= 0
?
try: | ||
delims = set([ord(c) for c in delims if ord(c) > 32]) | ||
except: | ||
print("bad delimiter value(s)") | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ignores delimitors <= 32. Could we instead do assert c > 32
?
[Or perhaps simply allow all delimiters?]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont wan't to allow delimiters that already are ones by their very nature. At the very least, 0x00 cannot be allowed, because it serves as the (premature) end of add'l delimiter array.
Also I want to allow strings and sequences of single characters - therefore I need the try-except.
if not hasattr(delims, "__getitem__") or len(delims) > 64: | ||
raise ValueError("bad delimiter value(s)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems a bit of a shame to have a fixed-size buffer for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to allow 'string.punctuation` as delimiters (32 items) plus any equivalent of those crude Chinese characters like "。" (Unicode 0x3002). Nobody will ever be able to keep track of more than this number. We must not forget that none of the delimiters will ever be part of any extracted word ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In rebased, i think it would actually be simpler to implement and document if we used an unlimited Python list.
|
||
try: | ||
delims = set([ord(c) for c in delims if ord(c) > 32]) | ||
except: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor thing, but doing except Exception:
ignores Ctrl-C etc, allowing user to interrupt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, no problem
fz_try(gctx) { | ||
for (i = 0; i < len; i++) { | ||
word_delimiters[i] = (int) PyLong_AsLong(PyTuple_GET_ITEM(delims, (Py_ssize_t) i)); | ||
word_delimiters[i+1] = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we put word_delimiters[len] = 0
outside the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
if (f[i] <= FZ_MIN_INF_RECT) f[i] = FZ_MIN_INF_RECT; | ||
if (f[i] >= FZ_MAX_INF_RECT) f[i] = FZ_MAX_INF_RECT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these changes have no effect...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Python, larger/smaller integers than that are possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but in general if x <= y: x = y
behaves identically to if x < y: x = y
, so i don't think this diff has any effect.
if (x[i] <= FZ_MIN_INF_RECT) x[i] = FZ_MIN_INF_RECT; | ||
if (x[i] >= FZ_MAX_INF_RECT) x[i] = FZ_MAX_INF_RECT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change has no effect?
} else if (ch >= 0xd800 && ch <= 0xdfff) { | ||
fz_append_string(ctx, buff, "\\ufffd"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could do with a comment explaining what these values are.
) -> list: | ||
"""Return the text words as a list with the bbox for each word. | ||
|
||
Args: | ||
flags: (int) control the amount of data parsed into the textpage. | ||
delimiters: (str,list) characters to use as word delimiters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delimiters are extra characters - i think we always treat 32 and 160 as delimiters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, read as extra delimiters
if delimiters is not None: | ||
old_delimiters = TOOLS.get_word_delimiters() | ||
tp = textpage | ||
if tp is None: | ||
tp = page.get_textpage(clip=clip, flags=flags) | ||
elif getattr(tp, "parent") != page: | ||
raise ValueError("not a textpage of this page") | ||
if delimiters is not None: | ||
TOOLS.set_word_delimiters(delimiters) | ||
words = tp.extractWORDS() | ||
if textpage is None: | ||
del tp | ||
if sort is True: | ||
words.sort(key=lambda w: (w[3], w[0])) | ||
if delimiters is not None: | ||
TOOLS.set_word_delimiters(old_delimiters) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not thread-safe. Ok for classic, but we probably want to avoid in rebased.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to allow temporary (extra) delimiters - this call only.
@@ -96,6 +96,8 @@ def get_env_bool( name, default): | |||
# Unset ascender / descender corrections | |||
g_skip_quad_corrections = 0 | |||
|
|||
# additional word delmiters | |||
g_word_delimiters = [0] * 65 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like we should do g_word_delimiters = list()
. Or, better, pass around as a fn param so things are thread safe.
elif ch >= 0xd800 and ch <= 0xdfff: | ||
mupdf.fz_append_string(buff, "\\ufffd") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with classic, would like a comment saying what these character ranges are.
def get_word_delimiters(): | ||
"""Return extra word delimiting characters.""" | ||
global g_word_delimiters | ||
delims = [chr(c) for c in g_word_delimiters if c != 0] | ||
return delims |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this right if g_word_delimiters is set to 10 characters, then reset to 5 characters? In this case i think it could be ....0.....0, where .
is any non-zero character. I.e. we probably want to terminate at the first 0.
Also, don't need global g_word_delimiters
- only required if we are modifying it.
global g_word_delimiters | ||
if delims is None: | ||
delims = [] | ||
if not hasattr(delims, "__getitem__") or len(delims) > 64: | ||
raise ValueError("bad delimiter value(s)") | ||
try: | ||
delims = set([ord(c) for c in delims if ord(c) > 32]) | ||
except: | ||
print("bad delimiter value(s)") | ||
raise | ||
delims = tuple(delims) | ||
if not delims: | ||
g_word_delimiters = [0] * 65 | ||
else: | ||
g_word_delimiters = delims | ||
return # extra.set_word_delimiters(delims) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can set g_word_delimiters
to a set here, we probably don't need it to be [0]*65
when we initiallise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a test for these changes?
Comment says no change to src/extra.i
but there are changes...?
Thanks for review! I will give them a closer look today or over the weekend. PBased on it, I probably have to redesign things. |
BTW did you understand the reason for the failed test? |
In this case, the problem is an error when compiling the swig-generated file fitz/fitz.i.c (this is classic implementation):
So this is a bug in |
This change offers to define up to 64 additional characters for use as delimiters in page.get_text("words"). Delimiters can be set either temporarily in the method itself, or permanently for all subsequent word extractions.
Please note that in rebased, no respective changes were made in extra.i. Any speed boosting will need extra code there.
Other changes:
plain text output characters are now handled the same way as in all other extraction variants. This entails using fz_buffer as intermediate storage for extracted text - not any longer fz_output.
in rare cases, fz_stext_page returns invalid bboxes for images or text: not empty, not infinite, but still with one or more of its edges having coordinates in common with the infinite rectangle. We define a new function "JM_ignore_rect()" to allow filtering out (ignoring) them.
finally fixed typo in issue Typo in set_layer() - NameError: name 'f' is not defined #2522
added TOOLS.store_shrink() to Document method "reload_page()" to ensure the absence of cached, now possibly invalid objects pertaining to this page.
When adding annotation to a page, removed unnecessary (possibly even wrong) extra annotation flag settings, so now MuPDF's defaults will prevail.
Added flexibility to border_width parameter in Page methods "insert_text"/"insert_textbox". This is now interpreted as a fraction of font size, such that e.g. a value of 0.05 corresponds to a border width of 5% of the font size.