From 2779ca0f85a06f8e831a20b116a573866bc3e5ea Mon Sep 17 00:00:00 2001
From: "Jorj X. McKie"
Date: Mon, 11 Sep 2023 06:21:24 -0400
Subject: [PATCH 1/2] Document new word delimiter support

---
 docs/page.rst     | 31 +++++++++++++++++--------------
 docs/textpage.rst | 18 ++++++++++++++++--
 2 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/docs/page.rst b/docs/page.rst
index 350023eb2..3e4646c3c 100644
--- a/docs/page.rst
+++ b/docs/page.rst
@@ -1030,7 +1030,7 @@ In a nutshell, this is what you can do with PyMuPDF:

       * New in v1.21.0

-      Delete the image at xref. This is slightly misleading: actually the image is being replaced with a small transparent :ref:`Pixmap` using above :meth:`Page.replace_image`. The visible effect however is equivalent.
+      Delete the image at xref. This is slightly misleading: actually the image is being replaced with a small transparent :ref:`Pixmap` using above :meth:`Page.replace_image`. The visible effect however is equivalent to deleting the image.

       :arg int xref: the :data:`xref` of the image.

@@ -1058,24 +1058,25 @@ In a nutshell, this is what you can do with PyMuPDF:
       pair: textpage; Page.get_text
       pair: sort; Page.get_text

-   .. method:: get_text(opt,*, clip=None, flags=None, textpage=None, sort=False)
+   .. method:: get_text(opt,*, clip=None, flags=None, textpage=None, sort=False, delimiters=None)

       * Changed in v1.19.0: added `textpage` parameter
       * Changed in v1.19.1: added `sort` parameter
       * Changed in v1.19.6: added new constants for defining default flags per method.
+      * Changed in v1.23.4: added new parameter to set delimiters for "words" extractions.

-      Retrieves the content of a page in a variety of formats. This is a wrapper for :ref:`TextPage` methods by choosing the output option as follows:
+      Retrieves the content of a page in a variety of formats. This is a wrapper for :ref:`TextPage` methods by choosing the output option "opt" as follows:

-      * "text" -- :meth:`TextPage.extractTEXT`, default
-      * "blocks" -- :meth:`TextPage.extractBLOCKS`
-      * "words" -- :meth:`TextPage.extractWORDS`
-      * "html" -- :meth:`TextPage.extractHTML`
-      * "xhtml" -- :meth:`TextPage.extractXHTML`
-      * "xml" -- :meth:`TextPage.extractXML`
-      * "dict" -- :meth:`TextPage.extractDICT`
-      * "json" -- :meth:`TextPage.extractJSON`
-      * "rawdict" -- :meth:`TextPage.extractRAWDICT`
-      * "rawjson" -- :meth:`TextPage.extractRAWJSON`
+      * `opt="text"` -- :meth:`TextPage.extractTEXT`, default
+      * `opt="blocks"` -- :meth:`TextPage.extractBLOCKS`
+      * `opt="words"` -- :meth:`TextPage.extractWORDS`
+      * `opt="html"` -- :meth:`TextPage.extractHTML`
+      * `opt="xhtml"` -- :meth:`TextPage.extractXHTML`
+      * `opt="xml"` -- :meth:`TextPage.extractXML`
+      * `opt="dict"` -- :meth:`TextPage.extractDICT`
+      * `opt="json"` -- :meth:`TextPage.extractJSON`
+      * `opt="rawdict"` -- :meth:`TextPage.extractRAWDICT`
+      * `opt="rawjson"` -- :meth:`TextPage.extractRAWJSON`

       :arg str opt: A string indicating the requested format, one of the above. A mixture of upper and lower case is supported.

@@ -1089,12 +1090,14 @@ In a nutshell, this is what you can do with PyMuPDF:

       :arg bool sort: (new in v1.19.1) sort the output by vertical, then horizontal coordinates. In many cases, this should suffice to generate a "natural" reading order. Has no effect on (X)HTML and XML. Output option **"words"** sorts by `(y1, x0)` of the words' bboxes. Similar is true for "blocks", "dict", "json", "rawdict", "rawjson": they all are sorted by `(y1, x0)` of the resp. block bbox. If specified for "text", then internally "blocks" is used.

+      :arg str,list delimiters: temporarily set characters to function as word delimiters. For instance, `delimiters=",.@"` causes word breaks at these characters (in addition to white spaces), and returned word strings will never contain them. Valid for `opt="words"` only and only for this execution. To permanently set additional delimiters, use :meth:`Tools.set_word_delimiters`.
+
       :rtype: *str, list, dict*
       :returns: The page's content as a string, a list or a dictionary. Refer to the corresponding :ref:`TextPage` method for details.

       .. note::

-        1. You can use this method as a **document conversion tool** from :ref:`any supported document type` to one of TEXT, HTML, XHTML or XML documents.
+        1. You can use this method as a **document conversion tool** from :ref:`any supported document type` to TEXT, JSON, HTML, XHTML or XML documents.
         2. The inclusion of text via the *clip* parameter is decided on a by-character level: **(changed in v1.18.2)** a character becomes part of the output, if its bbox is contained in *clip*. This **deviates** from the algorithm used in redaction annotations: a character will be **removed if its bbox intersects** any redaction annotation.

    .. index::
diff --git a/docs/textpage.rst b/docs/textpage.rst
index 8b8ac3a2f..cbbf2c2c1 100644
--- a/docs/textpage.rst
+++ b/docs/textpage.rst
@@ -58,11 +58,25 @@ For a description of what this class is all about, see Appendix 2.

    .. method:: extractWORDS

-      Textpage content as a list of single words with bbox information. An item of this list looks like this::
+      * Changed in v1.23.4: Support arbitrary word delimiting characters.
+
+      Return the Textpage content as a list of single *words* with their bbox information. An item of this list looks like this::

          (x0, y0, x1, y1, "word", block_no, line_no, word_no)

-      Everything delimited by spaces is treated as a *"word"*. This is a high-speed method which e.g. allows extracting text from within given areas or recovering the text reading sequence.
+      This is a high-speed method, which extracts strings (called "words") **that do not contain** word delimiting characters. Standard word delimiters are all white space characters *(characters with a Unicode value <= 32 and the non-breaking space 0xA0 = 160)*. This means that the string "some.name@emailservice.com" will be returned as one "word" -- because it contains no spaces.
+
+      If you want to know details about the components of this e-mail address, set some **additional delimiters** "." and "@" and execute the extraction again::
+
+         fitz.TOOLS.set_word_delimiters(".@")  # sets a global value
+         words = page.get_text("words")
+         # additional delimiters remain active until changed again
+
+      The returned list will now contain the 4 separate words: "some", "name", "emailservice", "com" -- each with its bounding box. To revert to standard behavior, execute :meth:`Tools.set_word_delimiters` without a parameter.
+
+      To see the active delimiters, execute :meth:`Tools.get_word_delimiters`.
+
+      .. note:: In the above, `TOOLS` and `Tools()` are used interchangeably.

       :rtype: list

From 2a7d4d8d0fe92fc1f5bd982ce832166a6e0aa1c8 Mon Sep 17 00:00:00 2001
From: "Jorj X. McKie"
Date: Mon, 11 Sep 2023 06:24:45 -0400
Subject: [PATCH 2/2] Update tools.rst

---
 docs/tools.rst | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/docs/tools.rst b/docs/tools.rst
index 0d6f72330..623257564 100644
--- a/docs/tools.rst
+++ b/docs/tools.rst
@@ -17,6 +17,8 @@ This class is a collection of utility methods and attributes, mainly around memo
 :meth:`Tools.reset_mupdf_warnings`     empty MuPDF messages on STDOUT
 :meth:`Tools.set_aa_level`             set the anti-aliasing values
 :meth:`Tools.set_annot_stem`           set the prefix of new annotation / link ids
+:meth:`Tools.get_word_delimiters`      inquire additional word delimiters
+:meth:`Tools.set_word_delimiters`      set / unset additional word delimiters
 :meth:`Tools.set_small_glyph_heights`  search and extract using small bbox heights
 :meth:`Tools.set_subset_fontnames`     control suppression of subset fontname tags
 :meth:`Tools.show_aa_level`            return the anti-aliasing values
@@ -55,6 +57,36 @@ This class is a collection of utility methods and attributes, mainly around memo

       :returns: the current value.

+   .. method:: set_word_delimiters(delims=None)
+
+      * New in v1.23.4
+
+      Set or unset additional word delimiters. These are characters to be used with :meth:`Page.get_text` variant "words". Every character specified in `delims` causes a word break -- in addition to the standard behavior, which only breaks words at (white) spaces.
+
+      For example `delims=".@"` will return the 4 words "some", "name", "emailservice", "com" for the string "some.name@emailservice.com", instead of just one.
+
+      The method sets a global variable, which remains in effect until changed. To fall back to the standard, execute the method without parameters.
+
+      :arg str,list delims: a string of characters or a list of single (!) characters. For instance `delims=".@"` **or** `delims=[".", "@"]`. To skip all punctuation characters, you could use `delims=string.punctuation`.
+
+   .. method:: get_word_delimiters()
+
+      * New in v1.23.4
+
+      Inquire additional word delimiting characters. Example session::
+
+         In [1]: import fitz
+         In [2]: fitz.TOOLS.set_word_delimiters(".@")
+         Out[2]: True
+         In [3]: fitz.TOOLS.get_word_delimiters()
+         Out[3]: ['.', '@']
+         In [4]: fitz.TOOLS.set_word_delimiters()
+         Out[4]: False
+         In [5]: fitz.TOOLS.get_word_delimiters()
+         Out[5]: []
+
+
+
    .. method:: set_small_glyph_heights(on=None)

       * New in v1.18.5
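
The word-splitting rule documented in this series can be sketched in plain Python without PyMuPDF. This is only an illustration of the documented behavior (break at characters with a code point <= 32, at the non-breaking space 0xA0, and at any additional delimiters), not the library's implementation. The helper name `split_words` and the constant `STANDARD_DELIMITERS` are hypothetical and do not exist in PyMuPDF.

```python
from typing import Iterable, List

# Standard delimiters as described in the documentation text:
# all code points <= 32 plus the non-breaking space 0xA0.
STANDARD_DELIMITERS = {chr(i) for i in range(33)} | {"\xa0"}


def split_words(text: str, extra_delimiters: Iterable[str] = ()) -> List[str]:
    """Hypothetical helper: split text at standard plus extra delimiters.

    `extra_delimiters` may be a string of characters or a list of single
    characters, mirroring the documented `delims` argument.
    """
    delimiters = STANDARD_DELIMITERS | set(extra_delimiters)
    words: List[str] = []
    current: List[str] = []
    for char in text:
        if char in delimiters:
            # A delimiter ends the current word and is never part of a word.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(char)
    if current:
        words.append("".join(current))
    return words


address = "some.name@emailservice.com"
print(split_words(address))        # ['some.name@emailservice.com']
print(split_words(address, ".@"))  # ['some', 'name', 'emailservice', 'com']
```

Unlike `extractWORDS`, this sketch returns only the strings; the real method also returns each word's bbox and its block, line and word numbers.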