pymupdf · JorjMcKie · Sep 11, 2023 · Sep 11, 2023
diff --git a/docs/page.rst b/docs/page.rst
@@ -1030,7 +1030,7 @@ In a nutshell, this is what you can do with PyMuPDF:
 
       * New in v1.21.0
 
-      Delete the image at xref. This is slightly misleading: actually the image is being replaced with a small transparent :ref:`Pixmap` using above :meth:`Page.replace_image`. The visible effect however is equivalent.
+      Delete the image at xref. This is slightly misleading: the image is actually replaced with a small transparent :ref:`Pixmap` using :meth:`Page.replace_image` (see above). The visible effect, however, is equivalent to deleting the image.
 
       :arg int xref: the :data:`xref` of the image.
 
@@ -1058,24 +1058,25 @@ In a nutshell, this is what you can do with PyMuPDF:
       pair: textpage; Page.get_text
       pair: sort; Page.get_text
 
-   .. method:: get_text(opt,*, clip=None, flags=None, textpage=None, sort=False)
+   .. method:: get_text(opt,*, clip=None, flags=None, textpage=None, sort=False, delimiters=None)
 
       * Changed in v1.19.0: added `textpage` parameter
       * Changed in v1.19.1: added `sort` parameter
       * Changed in v1.19.6: added new constants for defining default flags per method.
+      * Changed in v1.23.4: added `delimiters` parameter to set word delimiters for "words" extractions.
 
-      Retrieves the content of a page in a variety of formats. This is a wrapper for :ref:`TextPage` methods by choosing the output option as follows:
+      Retrieves the content of a page in a variety of formats. This is a wrapper for :ref:`TextPage` methods: the output option `opt` selects the method as follows:
 
-      * "text" -- :meth:`TextPage.extractTEXT`, default
-      * "blocks" -- :meth:`TextPage.extractBLOCKS`
-      * "words" -- :meth:`TextPage.extractWORDS`
-      * "html" -- :meth:`TextPage.extractHTML`
-      * "xhtml" -- :meth:`TextPage.extractXHTML`
-      * "xml" -- :meth:`TextPage.extractXML`
-      * "dict" -- :meth:`TextPage.extractDICT`
-      * "json" -- :meth:`TextPage.extractJSON`
-      * "rawdict" -- :meth:`TextPage.extractRAWDICT`
-      * "rawjson" -- :meth:`TextPage.extractRAWJSON`
+      * `opt="text"` -- :meth:`TextPage.extractTEXT`, default
+      * `opt="blocks"` -- :meth:`TextPage.extractBLOCKS`
+      * `opt="words"` -- :meth:`TextPage.extractWORDS`
+      * `opt="html"` -- :meth:`TextPage.extractHTML`
+      * `opt="xhtml"` -- :meth:`TextPage.extractXHTML`
+      * `opt="xml"` -- :meth:`TextPage.extractXML`
+      * `opt="dict"` -- :meth:`TextPage.extractDICT`
+      * `opt="json"` -- :meth:`TextPage.extractJSON`
+      * `opt="rawdict"` -- :meth:`TextPage.extractRAWDICT`
+      * `opt="rawjson"` -- :meth:`TextPage.extractRAWJSON`
 
       :arg str opt: A string indicating the requested format, one of the above. A mixture of upper and lower case is supported.
 
@@ -1089,12 +1090,14 @@ In a nutshell, this is what you can do with PyMuPDF:
 
       :arg bool sort: (new in v1.19.1) sort the output by vertical, then horizontal coordinates. In many cases, this should suffice to generate a "natural" reading order. Has no effect on (X)HTML and XML. Output option **"words"** sorts by `(y1, x0)` of the words' bboxes. Similar is true for "blocks", "dict", "json", "rawdict", "rawjson": they all are sorted by `(y1, x0)` of the resp. block bbox. If specified for "text", then internally "blocks" is used.
 
+      :arg str,list delimiters: characters that temporarily act as additional word delimiters. For instance, `delimiters=",.@"` causes word breaks at these characters (in addition to white space), and returned word strings will never contain them. Only valid for `opt="words"`, and only for this call. To set additional delimiters permanently, use :meth:`Tools.set_word_delimiters`.
+
       :rtype: *str, list, dict*
       :returns: The page's content as a string, a list or a dictionary. Refer to the corresponding :ref:`TextPage` method for details.
 
       .. note::
 
-        1. You can use this method as a **document conversion tool** from :ref:`any supported document type<Supported_File_Types>` to one of TEXT, HTML, XHTML or XML documents.
+        1. You can use this method as a **document conversion tool** from :ref:`any supported document type<Supported_File_Types>` to TEXT, JSON, HTML, XHTML or XML documents.
         2. The inclusion of text via the *clip* parameter is decided on a by-character level: **(changed in v1.18.2)** a character becomes part of the output, if its bbox is contained in *clip*. This **deviates** from the algorithm used in redaction annotations: a character will be **removed if its bbox intersects** any redaction annotation.
 
    .. index::

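For readers of this hunk, the `(y1, x0)` ordering that the `sort` parameter describes for "words" output can be sketched in plain Python. This is an illustration only, not the library's code: the sample tuples and the helper name `sort_words` are made up, and only the tuple layout is taken from the `extractWORDS` documentation.

```python
# Hypothetical sample data in the documented "words" item layout:
# (x0, y0, x1, y1, "word", block_no, line_no, word_no)
words = [
    (120.0, 100.0, 160.0, 112.0, "world", 0, 0, 1),
    (72.0, 130.0, 110.0, 142.0, "Next", 1, 0, 0),
    (72.0, 100.0, 115.0, 112.0, "Hello", 0, 0, 0),
]


def sort_words(items):
    """Order word tuples by bottom edge (y1), then by left edge (x0)."""
    return sorted(items, key=lambda w: (w[3], w[0]))


# Words on the same baseline come out left to right, lower lines come later.
reading_order = [w[4] for w in sort_words(words)]
```

Sorting on the bottom edge rather than the top edge keeps words of different heights on the same line together, as long as they share a baseline.
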
diff --git a/docs/textpage.rst b/docs/textpage.rst
@@ -58,11 +58,25 @@ For a description of what this class is all about, see Appendix 2.
 
    .. method:: extractWORDS
 
-      Textpage content as a list of single words with bbox information. An item of this list looks like this::
+      * Changed in v1.23.4: Support arbitrary word delimiting characters.
+
+      Return the Textpage content as a list of single *words* with their bbox information. An item of this list looks like this::
 
          (x0, y0, x1, y1, "word", block_no, line_no, word_no)
 
-      Everything delimited by spaces is treated as a *"word"*. This is a high-speed method which e.g. allows extracting text from within given areas or recovering the text reading sequence.
+      This is a high-speed method which extracts strings (called "words") **that do not contain** word delimiting characters. Standard word delimiters are all white space characters *(characters with a Unicode value <= 32 and the non-breaking space 0xA0 = 160)*. This means that the string "some.name@emailservice.com" will be returned as one "word" -- because it contains no spaces.
+
+      To split this e-mail address into its components, set the **additional delimiters** "." and "@" and execute the extraction again::
+
+          fitz.TOOLS.set_word_delimiters(".@")  # sets a global value
+          words = page.get_text("words")
+          # additional delimiters remain active until changed again
+
+      The returned list will now contain 4 separate words: "some", "name", "emailservice", "com" -- each with its bounding box. To revert to standard behavior, execute :meth:`Tools.set_word_delimiters` without a parameter.
+
+      To see the active delimiters, execute :meth:`Tools.get_word_delimiters`.
+
+      .. note:: In the above, `TOOLS` and `Tools()` can be used interchangeably.
 
       :rtype: list
 

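The word-splitting rule described in this hunk can be modelled without PyMuPDF. The sketch below is an assumption-laden illustration of the documented semantics (white space means code point <= 32 or 0xA0, plus optional extra delimiters). The function `split_words` is hypothetical and is not the library's implementation.

```python
def split_words(text, extra_delims=""):
    """Split text at white space and at any character in extra_delims."""
    # Accepts a string of characters or a list of single characters.
    extra = set(extra_delims)

    def is_delimiter(ch):
        # Standard delimiters: code point <= 32, or the non-breaking space 0xA0.
        return ord(ch) <= 32 or ch == "\xa0" or ch in extra

    words, current = [], []
    for ch in text:
        if is_delimiter(ch):
            if current:  # close the word in progress and drop the delimiter
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


address = "some.name@emailservice.com"
default_words = split_words(address)        # one "word": no white space inside
split_address = split_words(address, ".@")  # breaks at "." and "@" as well
```

Delimiters are dropped rather than returned, which matches the statement that word strings never contain them.
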
diff --git a/docs/tools.rst b/docs/tools.rst
@@ -17,6 +17,8 @@ This class is a collection of utility methods and attributes, mainly around memo
 :meth:`Tools.reset_mupdf_warnings`     empty MuPDF messages on STDOUT
 :meth:`Tools.set_aa_level`             set the anti-aliasing values
 :meth:`Tools.set_annot_stem`           set the prefix of new annotation / link ids
+:meth:`Tools.get_word_delimiters`      inquire additional word delimiters
+:meth:`Tools.set_word_delimiters`      set / unset additional word delimiters
 :meth:`Tools.set_small_glyph_heights`  search and extract using small bbox heights
 :meth:`Tools.set_subset_fontnames`     control suppression of subset fontname tags
 :meth:`Tools.show_aa_level`            return the anti-aliasing values
@@ -55,6 +57,36 @@ This class is a collection of utility methods and attributes, mainly around memo
       :returns: the current value.
 
 
+   .. method:: set_word_delimiters(delims=None)
+
+      * New in v1.23.4
+
+      Set or unset additional word delimiters. These are characters to be used with :meth:`Page.get_text` variant "words". Every character specified in `delims` causes a word break, in addition to the standard behavior, which only breaks words at white space characters.
+
+      For example, `delims=".@"` will return the 4 words "some", "name", "emailservice", "com" for the string "some.name@emailservice.com", instead of just one.
+
+      The method sets a global variable, which remains in effect until changed. To fall back to the standard, execute the method without parameters.
+
+      :arg str,list delims: a string of characters or a list of single (!) characters. For instance `delims=".@"` **or** `delims=[".", "@"]`. To break words at all punctuation characters, you could use `delims=string.punctuation`.
+
+   .. method:: get_word_delimiters()
+
+      * New in v1.23.4
+
+      Return the additional word delimiting characters that are currently active. Example session::
+
+         In [1]: import fitz
+         In [2]: fitz.TOOLS.set_word_delimiters(".@")
+         Out[2]: True
+         In [3]: fitz.TOOLS.get_word_delimiters()
+         Out[3]: ['.', '@']
+         In [4]: fitz.TOOLS.set_word_delimiters()
+         Out[4]: False
+         In [5]: fitz.TOOLS.get_word_delimiters()
+         Out[5]: []
+
+
+
    .. method:: set_small_glyph_heights(on=None)
 
       * New in v1.18.5
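
The set / get behavior shown in the `get_word_delimiters` example session can be mimicked with a small stand-in class. Everything here is hypothetical: the class `WordDelimiters`, its method names, and the reading of the boolean return value ("True if additional delimiters are active afterwards") are assumptions drawn from the session output, not from the PyMuPDF source.

```python
import string


class WordDelimiters:
    """Hypothetical stand-in for a global 'additional delimiters' setting."""

    def __init__(self):
        self._delims = []

    def set(self, delims=None):
        """Store the delimiters; return True if any are active afterwards."""
        if not delims:  # called without argument: fall back to the standard
            self._delims = []
            return False
        chars = list(delims)  # a string or a list of single characters
        if any(not isinstance(c, str) or len(c) != 1 for c in chars):
            raise ValueError("each delimiter must be a single character")
        self._delims = chars
        return True

    def get(self):
        """Return a copy of the active additional delimiters."""
        return list(self._delims)


settings = WordDelimiters()
ok_set = settings.set(".@")       # mirrors In [2] of the session
after_set = settings.get()        # mirrors In [3]
ok_unset = settings.set()         # mirrors In [4]
after_unset = settings.get()      # mirrors In [5]

settings.set(string.punctuation)  # break words at every ASCII punctuation mark
punct_count = len(settings.get())
```

Because the setting is global and stays active until changed, a caller who only needs it once would have to remember to reset it, which is the case the per-call `delimiters` argument of `get_text` covers.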