Word delimiters documentation #2662

Closed
wants to merge 2 commits into from
31 changes: 17 additions & 14 deletions docs/page.rst
@@ -1030,7 +1030,7 @@ In a nutshell, this is what you can do with PyMuPDF:

* New in v1.21.0

Delete the image at xref. This is slightly misleading: actually the image is being replaced with a small transparent :ref:`Pixmap` using above :meth:`Page.replace_image`. The visible effect however is equivalent.
Delete the image at xref. This is slightly misleading: actually the image is being replaced with a small transparent :ref:`Pixmap` using above :meth:`Page.replace_image`. The visible effect however is equivalent to deleting the image.

:arg int xref: the :data:`xref` of the image.

@@ -1058,24 +1058,25 @@ In a nutshell, this is what you can do with PyMuPDF:
pair: textpage; Page.get_text
pair: sort; Page.get_text

.. method:: get_text(opt,*, clip=None, flags=None, textpage=None, sort=False)
.. method:: get_text(opt,*, clip=None, flags=None, textpage=None, sort=False, delimiters=None)

* Changed in v1.19.0: added `textpage` parameter
* Changed in v1.19.1: added `sort` parameter
* Changed in v1.19.6: added new constants for defining default flags per method.
* Changed in v1.23.4: added the `delimiters` parameter to set word delimiters for "words" extractions.

Retrieves the content of a page in a variety of formats. This is a wrapper for :ref:`TextPage` methods by choosing the output option as follows:
Retrieves the content of a page in a variety of formats. This is a wrapper for :ref:`TextPage` methods by choosing the output option "opt" as follows:

* "text" -- :meth:`TextPage.extractTEXT`, default
* "blocks" -- :meth:`TextPage.extractBLOCKS`
* "words" -- :meth:`TextPage.extractWORDS`
* "html" -- :meth:`TextPage.extractHTML`
* "xhtml" -- :meth:`TextPage.extractXHTML`
* "xml" -- :meth:`TextPage.extractXML`
* "dict" -- :meth:`TextPage.extractDICT`
* "json" -- :meth:`TextPage.extractJSON`
* "rawdict" -- :meth:`TextPage.extractRAWDICT`
* "rawjson" -- :meth:`TextPage.extractRAWJSON`
* `opt="text"` -- :meth:`TextPage.extractTEXT`, default
* `opt="blocks"` -- :meth:`TextPage.extractBLOCKS`
* `opt="words"` -- :meth:`TextPage.extractWORDS`
* `opt="html"` -- :meth:`TextPage.extractHTML`
* `opt="xhtml"` -- :meth:`TextPage.extractXHTML`
* `opt="xml"` -- :meth:`TextPage.extractXML`
* `opt="dict"` -- :meth:`TextPage.extractDICT`
* `opt="json"` -- :meth:`TextPage.extractJSON`
* `opt="rawdict"` -- :meth:`TextPage.extractRAWDICT`
* `opt="rawjson"` -- :meth:`TextPage.extractRAWJSON`

:arg str opt: A string indicating the requested format, one of the above. A mixture of upper and lower case is supported.

@@ -1089,12 +1090,14 @@ In a nutshell, this is what you can do with PyMuPDF:

:arg bool sort: (new in v1.19.1) sort the output by vertical, then horizontal coordinates. In many cases, this should suffice to generate a "natural" reading order. Has no effect on (X)HTML and XML. Output option **"words"** sorts by `(y1, x0)` of the words' bboxes. The same holds for "blocks", "dict", "json", "rawdict" and "rawjson": they are all sorted by `(y1, x0)` of the respective block bbox. If specified for "text", then "blocks" is used internally.

:arg str,list delimiters: characters that temporarily act as word delimiters. For instance, `delimiters=",.@"` causes word breaks at these characters (in addition to white space), and returned word strings will never contain them. Only valid for `opt="words"`, and only for the current call. To permanently set additional delimiters, use :meth:`Tools.set_word_delimiters`.

:rtype: *str, list, dict*
:returns: The page's content as a string, a list or a dictionary. Refer to the corresponding :ref:`TextPage` method for details.

.. note::

1. You can use this method as a **document conversion tool** from :ref:`any supported document type<Supported_File_Types>` to one of TEXT, HTML, XHTML or XML documents.
1. You can use this method as a **document conversion tool** from :ref:`any supported document type<Supported_File_Types>` to TEXT, JSON, HTML, XHTML or XML documents.
2. The inclusion of text via the *clip* parameter is decided character by character: **(changed in v1.18.2)** a character becomes part of the output if its bbox is contained in *clip*. This **deviates** from the algorithm used in redaction annotations: a character will be **removed if its bbox intersects** any redaction annotation.
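
The `(y1, x0)` ordering described for `sort=True` can be illustrated without a PDF. The following is only a sketch of that ordering applied to tuples shaped like "words" items; the coordinates and the helper name `sort_words` are made up for illustration and are not part of PyMuPDF.

```python
# Hypothetical sample data in the shape (x0, y0, x1, y1, "word", block_no, line_no, word_no).
words = [
    (72.0, 100.0, 110.0, 112.0, "world", 0, 0, 1),
    (30.0, 100.0, 68.0, 112.0, "Hello", 0, 0, 0),
    (30.0, 80.0, 60.0, 92.0, "Title", 1, 0, 0),
]


def sort_words(items):
    """Order items by the bottom edge y1 first, then by the left edge x0."""
    return sorted(items, key=lambda w: (w[3], w[0]))


reading_order = [w[4] for w in sort_words(words)]
print(reading_order)  # ['Title', 'Hello', 'world']
```

Words on the same line share `y1`, so `x0` decides their order; a line further up the page has a smaller `y1` and therefore comes first.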

.. index::
18 changes: 16 additions & 2 deletions docs/textpage.rst
@@ -58,11 +58,25 @@ For a description of what this class is all about, see Appendix 2.

.. method:: extractWORDS

Textpage content as a list of single words with bbox information. An item of this list looks like this::
* Changed in v1.23.4: Support arbitrary word delimiting characters.

Return the Textpage content as a list of single *words* with their bbox information. An item of this list looks like this::

(x0, y0, x1, y1, "word", block_no, line_no, word_no)

Everything delimited by spaces is treated as a *"word"*. This is a high-speed method which e.g. allows extracting text from within given areas or recovering the text reading sequence.
This is a high-speed method which extracts strings (called "words") **that do not contain** word delimiting characters. Standard word delimiters are all white space characters *(characters with a unicode value <= 32 and the non-breaking space 0xA0 = 160)*. This means that the string "some.name@emailservice.com" will be returned as one "word", because it contains no spaces.

If you want to access the components of this e-mail address, set the **additional delimiters** "." and "@" and execute the extraction again::

fitz.TOOLS.set_word_delimiters(".@") # sets a global value
words = page.get_text("words")
# additional delimiters remain active until changed again

The returned list will now contain the 4 separate words "some", "name", "emailservice", "com" -- each with its bounding box. To revert to standard behavior, execute :meth:`Tools.set_word_delimiters` without arguments.

To see the active delimiters, execute :meth:`Tools.get_word_delimiters`.

.. note:: In the above, `TOOLS` and `Tools()` are used interchangeably.

:rtype: list
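
The splitting rule described above can be emulated in plain Python. This is a minimal sketch of the documented behavior only, not the library's implementation: the function name `split_words` is hypothetical, bbox information is omitted, and the sample string is assumed to be an address of the form "some.name@emailservice.com", consistent with the four words listed above.

```python
# Standard delimiters as described: code points <= 32 plus the non-breaking space 0xA0.
WHITESPACE = {chr(c) for c in range(33)} | {"\u00a0"}


def split_words(text, extra_delimiters=""):
    """Split text at white space and at every extra delimiter character."""
    delimiters = WHITESPACE | set(extra_delimiters)
    words, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:  # close the word collected so far
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:  # flush the last word
        words.append("".join(current))
    return words


print(split_words("some.name@emailservice.com"))
print(split_words("some.name@emailservice.com", ".@"))
```

With no extra delimiters the address stays one word; with `".@"` it is split into its four components, and none of the returned strings contains a delimiter.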

32 changes: 32 additions & 0 deletions docs/tools.rst
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,8 @@ This class is a collection of utility methods and attributes, mainly around memo
:meth:`Tools.reset_mupdf_warnings` empty MuPDF messages on STDOUT
:meth:`Tools.set_aa_level` set the anti-aliasing values
:meth:`Tools.set_annot_stem` set the prefix of new annotation / link ids
:meth:`Tools.get_word_delimiters` inquire additional word delimiters
:meth:`Tools.set_word_delimiters` set / unset additional word delimiters
:meth:`Tools.set_small_glyph_heights` search and extract using small bbox heights
:meth:`Tools.set_subset_fontnames` control suppression of subset fontname tags
:meth:`Tools.show_aa_level` return the anti-aliasing values
@@ -55,6 +57,36 @@ This class is a collection of utility methods and attributes, mainly around memo
:returns: the current value.


.. method:: set_word_delimiters(delims=None)

* New in v1.23.4

Set or unset additional word delimiters. These characters are used by the "words" variant of :meth:`Page.get_text`. Every character specified in `delims` causes a word break, in addition to the standard behavior, which only breaks words at white space.

For example, `delims=".@"` will return the 4 words "some", "name", "emailservice", "com" for the string "some.name@emailservice.com", instead of just one.

The method sets a global variable, which remains in effect until changed. To fall back to the standard, execute the method without parameters.

:arg str,list delims: a string of characters or a list of single (!) characters. For instance `delims=".@"` **or** `delims=[".", "@"]`. To break words at all punctuation characters, you could use `delims=string.punctuation` (from Python's `string` module).

.. method:: get_word_delimiters()

* New in v1.23.4

Inquire additional word delimiting characters. Example session::

In [1]: import fitz
In [2]: fitz.TOOLS.set_word_delimiters(".@")
Out[2]: True
In [3]: fitz.TOOLS.get_word_delimiters()
Out[3]: ['.', '@']
In [4]: fitz.TOOLS.set_word_delimiters()
Out[4]: False
In [5]: fitz.TOOLS.get_word_delimiters()
Out[5]: []
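
The two accepted forms of `delims` (a string, or a list of single characters) can be sketched as a small normalization step. This is an illustration under assumptions, not the code of this pull request: the helper name `normalize_delims` and the choice to raise `ValueError` for multi-character list items are hypothetical.

```python
import string


def normalize_delims(delims=None):
    """Return the delimiters as a list of single characters (hypothetical helper)."""
    if delims is None:  # no argument: fall back to the standard, no extra delimiters
        return []
    if isinstance(delims, str):  # ".@" -> [".", "@"]
        return list(delims)
    chars = list(delims)
    for ch in chars:
        if not isinstance(ch, str) or len(ch) != 1:
            raise ValueError(f"not a single character: {ch!r}")
    return chars


print(normalize_delims(".@"))
print(normalize_delims([".", "@"]))
print(len(normalize_delims(string.punctuation)))
```

Both spellings yield the same list, which mirrors the shape of the value shown by `get_word_delimiters()` in the session above.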



.. method:: set_small_glyph_heights(on=None)

* New in v1.18.5