diff --git a/main/404.html b/main/404.html new file mode 100644 index 00000000..efac6378 --- /dev/null +++ b/main/404.html @@ -0,0 +1,2292 @@ + + + +
+EDS-PDF was developed to propose a more modular and extensible approach to PDF extraction than PDFBox, the legacy implementation at AP-HP's clinical data warehouse.
+EDS-PDF takes inspiration from Explosion's spaCy pipelining system and closely follows its API. Therefore, the core object within EDS-PDF is the Pipeline, which organises the processing of PDF documents into multiple components. However, unlike spaCy, the library is built around a single deep learning framework, PyTorch, which makes model development easier.
+huggingface-embedding
) with windowing optionsrender_page
option to pdfminer
extractor, for multi-modal PDF featuresaccelerators
), with simple mono process support and multi gpu / cpu supportpipeline.package(...)
) to make a pip installable package from a pipelineconfit
to 0.4.2 (better errors) and foldedtensor
to 0.3.0 (better multiprocess support)pipeline.score
. You should use pipeline.pipe
, a custom scorer and pipeline.select_pipes
instead.hatch
instead of setuptools
to build the package / docs and run the testsattrs
dependency only being installed in dev modeMajor refactoring of the library:
+PDFDoc
, Page
, Box
, ...) for representing PDF documentsCast bytes-like extractor inputs as bytes
Performance and CUDA-related fixes.
+Many, many changes: +- added torch as the main deep learning framework instead of spaCy and thinc +- added poppler and mupdf as alternatives to pdfminer +- new pipeline / config / registry system to facilitate consistency between training and inference +- standardization of the exchange format between components with dataclass models (attrs more specifically) instead of pandas dataframes
+title
and body
)pdf2image
dependency, replacing it with pypdfium2
(easier installation)aggregation
) to plural names (aggregators
).package-resource.v1
in the misc registryimportlib.metadata
dependency, which led to issues with Python 3.7sklearn-pipeline.v1
compare_results
in visualisationstyled.v1
aggregator, to handle stylesrescale.v1
transform, to go back to the original height and widthline
object is not carried around any moreparams
entry in the EDS-PDF registry.merge_lines
bug when lines were emptyaggregation
submodule to handle the specifics of aggregating text blocslabels
typer
legacy dependencymodels
submodule, which handled the configurations for Spark distribution (deferred to another package)orbis
context, which was APHP-specificInception !
+dummy.v1
, that classifies everything to body
mask.v1
, for simple rule-based classificationsklearn.v1
, that uses a Scikit-Learn pipelinerandom.v1
, to better sow chaosEDS-PDF is built on top of the confit
configuration system.
The following catalogue registries are included within EDS-PDF:
Section | Description |
---|---|
`factory` | Components factories (most often classes) |
`adapter` | Raw data preprocessing functions |
EDS-PDF pipelines are meant to be reproducible and serializable, such that you can always define a pipeline through the configuration system.
+To wit, compare the API-based approach to the configuration-based approach (the two are strictly equivalent):
+import edspdf
+from pathlib import Path
+
+model = edspdf.Pipeline()
+model.add_pipe("pdfminer-extractor", name="extractor")
+model.add_pipe("mask-classifier", name="classifier", config=dict(
+ x0=0.2,
+ x1=0.9,
+ y0=0.3,
+ y1=0.6,
+ threshold=0.1,
+))
+model.add_pipe("simple-aggregator", name="aggregator")
+
+# Get a PDF
+pdf = Path("letter.pdf").read_bytes()
+
+pdf = model(pdf)
+
+str(pdf.aggregated_texts["body"])
+# Out: Cher Pr ABC, Cher DEF,\n...
+
[pipeline]
+pipeline = ["extractor", "classifier", "aggregator"]
+
+[components.extractor]
+@factory = "pdfminer-extractor"
+
+[components.classifier]
+@factory = "mask-classifier"
+x0 = 0.2
+x1 = 0.9
+y0 = 0.3
+y1 = 0.6
+threshold = 0.1
+
+[components.aggregator]
+@factory = "simple-aggregator"
+
import edspdf
+from pathlib import Path
+
+pipeline = edspdf.load("config.cfg")
+
+# Get a PDF
+pdf = Path("letter.pdf").read_bytes()
+
+pdf = pipeline(pdf)
+
+str(pdf.aggregated_texts["body"])
+# Out: Cher Pr ABC, Cher DEF,\n...
+
The configuration-based approach strictly separates the definition of the pipeline from its application and avoids tucking away important configuration details. Changes to the pipeline are transparent as there is a single source of truth: the configuration file.
We welcome contributions! There are many ways to help. For example, you can:
+To be able to run the test suite and develop your own pipeline, you should clone the repo and install it locally. We use the hatch
package manager to manage the project.
# Clone the repository and change directory
+$ git clone ssh://git@github.com/aphp/edspdf.git
+
+# Ensure hatch is installed, preferably via pipx
+$ pipx install hatch
+
+$ cd edspdf
+
+# Enter a shell to develop / test the project. This will install everything
+# required in a virtual environment. You can also `source` the path shown by hatch.
+$ hatch shell
+$ ...
+$ exit # when you're done
+
To make sure the pipeline will not fail because of formatting errors, we added pre-commit hooks using the pre-commit
Python library. To use it, simply install it:
$ pre-commit install
+
The pre-commit hooks defined in the configuration will automatically run when you commit your changes, letting you know if something went wrong.
The hooks only run on staged changes. To force-run them on all files, run:
+$ pre-commit run --all-files
+All good!
+
At the very least, your changes should:
+We use the Pytest test suite.
The following command will run the test suite. Writing your own tests is encouraged!
+pytest
+
Should your contribution propose a bug fix, we require the bug be thoroughly tested.
We use Black to reformat the code. While other formatters only enforce PEP8 compliance, Black also makes the code uniform. In short:
+++Black reformats entire files in place. It is not configurable.
+
Moreover, the CI/CD pipeline enforces a number of checks on the "quality" of the code. To wit, non black-formatted code will make the test pipeline fail. We use pre-commit
to keep our codebase clean.
Refer to the development install tutorial for tips on how to format your files automatically. +Most modern editors propose extensions that will format files on save.
+Make sure to document your improvements, both within the code with comprehensive docstrings, +as well as in the documentation itself if need be.
+We use MkDocs
for EDS-PDF's documentation. You can view your changes with
# Run the documentation
+$ hatch run docs:serve
+
Go to localhost:8000
to see your changes. MkDocs watches for changes in the documentation folder
+and automatically reloads the page.
EDS-PDF stores PDFs and their annotations in custom data structures that are designed to be easy to use and manipulate. We must distinguish between:
A PDF is first converted to a PDFDoc object, which contains the raw PDF content. This task is usually performed by a PDF extractor component. Once the PDF is converted, the same object will be used and updated by the different components, and returned at the end of the pipeline.
When running a trainable component, the PDFDoc is preprocessed and converted to tensors containing relevant features for the task. This is done in the preprocess method of the component. The resulting tensors are then collated together to form a batch, in the collate method of the component. After running the forward method of the component, the tensor predictions are finally assigned as annotations to the original PDFDoc objects in the postprocess method.
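To make the four steps concrete, here is a rough, purely illustrative skeleton (this is not the real TrainablePipe base class, and the exact signatures of the actual components may differ):

```python
import torch

from edspdf.structures import PDFDoc


class TrainablePipeSketch:
    """Illustrative skeleton of the four steps described above."""

    def preprocess(self, doc: PDFDoc) -> dict:
        # Extract the task-relevant features of a single PDFDoc
        return {"n_boxes": len(doc.text_boxes)}

    def collate(self, batch: list) -> dict:
        # Merge the preprocessed samples of several documents into tensors
        return {"n_boxes": torch.as_tensor([sample["n_boxes"] for sample in batch])}

    def forward(self, batch: dict) -> dict:
        # Run the deep-learning computation on the collated batch
        return {"scores": batch["n_boxes"].float()}

    def postprocess(self, docs: list, output: dict) -> list:
        # Write the tensor predictions back onto the original PDFDoc objects
        return docs
```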
The main data structure is the [PDFDoc][edspdf.structures.PDFDoc], which represents a full PDF document. It contains the raw PDF content and annotations for the full document, regardless of pages. A PDF is split into Page objects that store their number, dimensions and optionally an image of the rendered page.

The PDF annotations are stored in Box objects, which represent rectangular regions of the PDF. At the moment, a box can only be specialized into a TextBox to represent text regions, such as lines extracted by a PDF extractor. Aggregated texts are stored in Text objects, which are not associated with a specific box.

A TextBox contains a list of TextProperties objects to store the style properties of the styled spans of its text.
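As a small sketch of how these structures can be navigated once a pipeline has been applied (the attribute names follow the reference tables below; the file names are illustrative):

```python
from pathlib import Path

import edspdf

pipeline = edspdf.load("config.cfg")
doc = pipeline(Path("letter.pdf").read_bytes())

for page in doc.pages:
    print(page.page_num, page.width, page.height)

for box in doc.text_boxes:
    # Each TextBox carries its page, coordinates, label and text content
    print(box.page_num, (box.x0, box.y0, box.x1, box.y1), box.label, box.text)

for label, text in doc.aggregated_texts.items():
    # Aggregated Text objects expose the text and its style properties
    print(label, text.text, text.properties)
```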
PDFDoc
+
+
+
+ Bases: BaseModel
This is the main data structure of the library to hold PDFs. +It contains the content of the PDF, as well as box annotations and text outputs.
ATTRIBUTE | DESCRIPTION |
---|---|
`content` | The content of the PDF document. |
`id` | The ID of the PDF document. |
`pages` | The pages of the PDF document. |
`error` | Whether there was an error when processing this PDF document. |
`content_boxes` | The content boxes/annotations of the PDF document. |
`aggregated_texts` | The aggregated text outputs of the PDF document. |
`text_boxes` | The text boxes of the PDF document. |
Page
+
+
+
+ Bases: BaseModel
The Page
class represents a page of a PDF document.
ATTRIBUTE | DESCRIPTION |
---|---|
`page_num` | The page number of the page. |
`width` | The width of the page. |
`height` | The height of the page. |
`doc` | The PDF document that this page belongs to. |
`image` | The rendered image of the page, stored as a NumPy array. |
`text_boxes` | The text boxes of the page. |
TextProperties
+
+
+
+ Bases: BaseModel
The TextProperties
class represents the style properties of a span of text in a
+TextBox.
ATTRIBUTE | DESCRIPTION |
---|---|
`italic` | Whether the text is italic. |
`bold` | Whether the text is bold. |
`begin` | The beginning index of the span of text. |
`end` | The ending index of the span of text. |
`fontname` | The font name of the span of text. |
Box
+
+
+
+ Bases: BaseModel
The Box
class represents a box annotation in a PDF document. It is the base class
+of TextBox.
ATTRIBUTE | DESCRIPTION |
---|---|
`doc` | The PDF document that this box belongs to. |
`page_num` | The page number of the box. |
`x0` | The left x-coordinate of the box. |
`x1` | The right x-coordinate of the box. |
`y0` | The top y-coordinate of the box. |
`y1` | The bottom y-coordinate of the box. |
`label` | The label of the box. |
`page` | The page object that this box belongs to. |
Text
+
+
+
+ Bases: BaseModel
The TextBox
class represents text object, not bound to any box.
It can be used to store aggregated text from multiple boxes for example.
ATTRIBUTE | DESCRIPTION |
---|---|
`text` | The text content. |
`properties` | The style properties of the text. |
TextBox
+
+
+
+ Bases: Box
The TextBox
class represents a text box annotation in a PDF document.
ATTRIBUTE | DESCRIPTION |
---|---|
`text` | The text content of the text box. |
`props` | The style properties of the text box. |
The tensors used to process PDFs with deep learning models usually contain 4 main dimensions, in addition to the standard embedding dimensions:
+samples
: one entry per PDF in the batchpages
: one entry per page in a PDFboxes
: one entry per box in a pagetoken
: one entry per token in a box (only for text boxes)These tensors use a special FoldedTensor format to store the data in a compact way and reshape the data depending on the requirements of a layer.
+EDS-PDF provides modular framework to extract text information from PDF documents.
+You can use it out-of-the-box, or extend it to fit your use-case.
+Install the library with pip:
+$ pip install edspdf
+Installation successful
+
Let's build a simple PDF extractor that uses a rule-based classifier. There are two +ways to do this, either by using the configuration system or by using +the pipeline API.
+Create a configuration file:
+[pipeline]
+pipeline = ["extractor", "classifier", "aggregator"]
+
+[components.extractor]
+@factory = "pdfminer-extractor"
+
+[components.classifier]
+@factory = "mask-classifier"
+x0 = 0.2
+x1 = 0.9
+y0 = 0.3
+y1 = 0.6
+threshold = 0.1
+
+[components.aggregator]
+@factory = "simple-aggregator"
+
and load it from Python:
+import edspdf
+from pathlib import Path
+
+model = edspdf.load("config.cfg") # (1)
+
Or create a pipeline directly from Python:
+from edspdf import Pipeline
+
+model = Pipeline()
+model.add_pipe("pdfminer-extractor")
+model.add_pipe(
+ "mask-classifier",
+ config=dict(
+ x0=0.2,
+ x1=0.9,
+ y0=0.3,
+ y1=0.6,
+ threshold=0.1,
+ ),
+)
+model.add_pipe("simple-aggregator")
+
This pipeline can then be applied (for instance with this PDF):
+# Get a PDF
+pdf = Path("/Users/perceval/Development/edspdf/tests/resources/letter.pdf").read_bytes()
+pdf = model(pdf)
+
+body = pdf.aggregated_texts["body"]
+
+text, style = body.text, body.properties
+
See the rule-based recipe for a step-by-step explanation of what is happening.
+If you use EDS-PDF, please cite us as below.
+@software{edspdf,
+ author = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
+ doi = {10.5281/zenodo.6902977},
+ license = {BSD-3-Clause},
+ title = {{EDS-PDF: Smart text extraction from PDF documents}},
+ url = {https://github.com/aphp/edspdf}
+}
+
We would like to thank Assistance Publique – Hôpitaux de Paris and +AP-HP Foundation for funding this project.
+Once you have obtained a pipeline, either by composing rule-based components, training a model or loading a model from the disk, you can use it to make predictions on documents. This is referred to as inference.
In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:

- the raw content (bytes) of a PDF document
- a PDFDoc object directly
+from pathlib import Path
+
+pipeline = ...
+content = Path("path/to/.pdf").read_bytes()
+doc = pipeline(content)
+
If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline. To leverage multiple GPUs, refer to the multiprocessing accelerator description below.
+pipeline.to("cuda") # same semantics as pytorch
+doc = pipeline(content)
+
When processing multiple documents, it is usually more efficient to use the pipeline.pipe(...)
method, especially when using deep learning components, since this allows matrix multiplications to be batched together. Depending on your computational resources and requirements, EDS-PDF comes with various "accelerators" to speed up inference (see the Accelerators section for more details). By default, the .pipe()
method uses the simple
accelerator but you can switch to a different one by passing the accelerator
argument.
pipeline = ...
+docs = pipeline.pipe(
+ [content1, content2, ...],
+ batch_size=16, # optional, default to the one defined in the pipeline
+ accelerator=my_accelerator,
+)
+
The pipe
method supports the following arguments:
PARAMETER | DESCRIPTION |
---|---|
`inputs` | The inputs to create the PDFDocs from, or the PDFDocs directly. |
`batch_size` | The batch size to use. If not provided, the batch size of the pipeline object will be used. |
`accelerator` | The accelerator to use for processing the documents. If not provided, the default accelerator will be used. |
`to_doc` | The function to use to convert the inputs to PDFDoc objects. |
`from_doc` | The function to use to convert the PDFDoc objects to outputs. By default, the PDFDoc objects will be returned directly. |
This is the simplest accelerator, which batches the documents and processes each batch
+on the main process (the one calling .pipe()
).
docs = list(pipeline.pipe([content1, content2, ...]))
+
or, if you want to override the model defined batch size
+docs = list(pipeline.pipe([content1, content2, ...], batch_size=8))
+
which is equivalent to passing a confit dict
+docs = list(
+ pipeline.pipe(
+ [content1, content2, ...],
+ accelerator={
+ "@accelerator": "simple",
+ "batch_size": 8,
+ },
+ )
+)
+
or the instantiated accelerator directly
+from edspdf.accelerators.simple import SimpleAccelerator
+
+accelerator = SimpleAccelerator(batch_size=8)
+docs = list(pipeline.pipe([content1, content2, ...], accelerator=accelerator))
+
If you have a GPU, make sure to move the model to the appropriate device before
+calling .pipe()
. If you have multiple GPUs, use the
+multiprocessing
+accelerator instead.
pipeline.to("cuda")
+docs = list(pipeline.pipe([content1, content2, ...]))
+
PARAMETER | DESCRIPTION |
---|---|
`batch_size` | The number of documents to process in each batch. |
If you have multiple CPU cores, and optionally multiple GPUs, we provide a multiprocessing accelerator that allows you to run inference on multiple processes.
This accelerator dispatches the batches between multiple workers (data-parallelism), and distributes the computation of a given batch on one or two workers (model-parallelism). This is done by creating two types of workers:
+CPUWorker
which handles the non deep-learning components and the
+ preprocessing, collating and postprocessing of deep-learning componentsGPUWorker
which handles the forward call of the deep-learning componentsThe advantage of dedicating a worker to the deep-learning components is that it
+allows to prepare multiple batches in parallel in multiple CPUWorker
, and ensure
+that the GPUWorker
never wait for a batch to be ready.
The overall architecture described in the following figure, for 3 CPU workers and 2 +GPU workers.
+ + +Here is how a small pipeline with rule-based components and deep-learning components +is distributed between the workers:
+ +docs = list(
+ pipeline.pipe(
+ [content1, content2, ...],
+ accelerator={
+ "@accelerator": "multiprocessing",
+ "num_cpu_workers": 3,
+ "num_gpu_workers": 2,
+ "batch_size": 8,
+ },
+ )
+)
+
PARAMETER | DESCRIPTION |
---|---|
`batch_size` | Number of documents to process at a time in a CPU/GPU worker |
`num_cpu_workers` | Number of CPU workers. A CPU worker handles the non deep-learning components and the preprocessing, collating and postprocessing of deep-learning components. |
`num_gpu_workers` | Number of GPU workers. A GPU worker handles the forward call of the deep-learning components. |
`gpu_pipe_names` | List of pipe names to accelerate on a GPUWorker, defaults to all pipes that inherit from TrainablePipe |
BoxTransformerLayer combining a self attention layer and a +linear->activation->linear transformation. This layer is used in the +BoxTransformerModule module.
+ + + +PARAMETER | +DESCRIPTION | +
---|---|
input_size |
+
+ Input embedding size +
+
+ TYPE:
+ |
+
num_heads |
+
+ Number of attention heads in the attention layer +
+
+ TYPE:
+ |
+
dropout_p |
+
+ Dropout probability both for the attention layer and embedding projections +
+
+ TYPE:
+ |
+
head_size |
+
+ Head sizes of the attention layer +
+
+ TYPE:
+ |
+
activation |
+
+ Activation function used in the linear->activation->linear transformation +
+
+ TYPE:
+ |
+
init_resweight |
+
+ Initial weight of the residual gates. +At 0, the layer acts (initially) as an identity function, and at 1 as +a standard Transformer layer. +Initializing with a value close to 0 can help the training converge. +
+
+ TYPE:
+ |
+
attention_mode |
+
+ Mode of relative position infused attention layer. +See the +relative attention +documentation for more information. +
+
+ TYPE:
+ |
+
position_embedding |
+
+ Position embedding to use as key/query position embedding in the attention +computation. +
+
+ TYPE:
+ |
+
forward
+
+Forward pass of the BoxTransformerLayer
+ +PARAMETER | +DESCRIPTION | +
---|---|
embeds |
+
+ Embeddings to contextualize
+Shape:
+
+ TYPE:
+ |
+
mask |
+
+ Mask of the embeddings. 0 means padding element.
+Shape:
+
+ TYPE:
+ |
+
relative_positions |
+
+ Position of the keys relatively to the query elements
+Shape:
+
+ TYPE:
+ |
+
no_position_mask |
+
+ Key / query pairs for which the position attention terms should
+be disabled.
+Shape:
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Tuple[FloatTensor, FloatTensor]
+
+ |
+
+
+
+
|
+
Box Transformer architecture combining a multiple +BoxTransformerLayer +modules. It is mainly used in +BoxTransformer.
+ +PARAMETER | +DESCRIPTION | +
---|---|
input_size |
+
+ Input embedding size +
+
+ TYPE:
+ |
+
num_heads |
+
+ Number of attention heads in the attention layers +
+
+ TYPE:
+ |
+
n_relative_positions |
+
+ Maximum range of embeddable relative positions between boxes (further +distances are capped to ±n_relative_positions // 2) +
+
+ TYPE:
+ |
+
dropout_p |
+
+ Dropout probability both for the attention layers and embedding projections +
+
+ TYPE:
+ |
+
head_size |
+
+ Head sizes of the attention layers +
+
+ TYPE:
+ |
+
activation |
+
+ Activation function used in the linear->activation->linear transformations +
+
+ TYPE:
+ |
+
init_resweight |
+
+ Initial weight of the residual gates. +At 0, the layer acts (initially) as an identity function, and at 1 as +a standard Transformer layer. +Initializing with a value close to 0 can help the training converge. +
+
+ TYPE:
+ |
+
attention_mode |
+
+ Mode of relative position infused attention layer. +See the +relative attention +documentation for more information. +
+
+ TYPE:
+ |
+
n_layers |
+
+ Number of layers in the Transformer +
+
+ TYPE:
+ |
+
forward
+
+Forward pass of the BoxTransformer
+ +PARAMETER | +DESCRIPTION | +
---|---|
embeds |
+
+ Embeddings to contextualize
+Shape:
+
+ TYPE:
+ |
+
boxes |
+
+ Layout features of the input elements +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Tuple[FloatTensor, List[FloatTensor]]
+
+ |
+
+
+
+
|
+
EDS-PDF provides a set of specialized deep learning layers that can be used to build trainable +components. These layers are built on top of the PyTorch framework and can be used in +any PyTorch model.
Layer | Description |
---|---|
`BoxTransformerModule` | Contextualize box embeddings with a 2d Transformer with relative position representations |
`BoxTransformerLayer` | A single layer of the above `BoxTransformerModule` layer |
`RelativeAttention` | A 2d attention layer that optionally uses relative position to compute its attention scores |
`SinusoidalEmbedding` | A position embedding that uses trigonometric functions to encode positions |
`Vocabulary` | A non deep learning layer to encode / decode vocabularies |
A self/cross-attention layer that takes relative position of elements into +account to compute the attention weights. +When running a relative attention layer, key and queries are represented using +content and position embeddings, where position embeddings are retrieved using +the relative position of keys relative to queries
+ + + +PARAMETER | +DESCRIPTION | +
---|---|
size |
+
+ The size of the output embeddings +Also serves as default if query_size, pos_size, or key_size is None +
+
+ TYPE:
+ |
+
n_heads |
+
+ The number of attention heads +
+
+ TYPE:
+ |
+
query_size |
+
+ The size of the query embeddings. +
+
+ TYPE:
+ |
+
key_size |
+
+ The size of the key embeddings. +
+
+ TYPE:
+ |
+
value_size |
+
+ The size of the value embeddings +
+
+ TYPE:
+ |
+
head_size |
+
+ The size of each query / key / value chunk used in the attention dot product
+Default:
+
+ TYPE:
+ |
+
position_embedding |
+
+ The position embedding used as key and query embeddings +
+
+ TYPE:
+ |
+
dropout_p |
+
+ Dropout probability applied on the attention weights +Default: 0.1 +
+
+ TYPE:
+ |
+
same_key_query_proj |
+
+ Whether to use the same projection operator for content key and queries +when computing the pre-attention key and query embedding chunks +Default: False +
+
+ TYPE:
+ |
+
same_positional_key_query_proj |
+
+ Whether to use the same projection operator for content key and queries +when computing the pre-attention key and query embedding chunks +Default: False +
+
+ TYPE:
+ |
+
n_coordinates |
+
+ The number of positional coordinates +For instance, text is 1D so 1 coordinate, images are 2D so 2 coordinates ... +Default: 1 +
+
+ TYPE:
+ |
+
head_bias |
+
+ Whether to learn a bias term to add to the attention logits +This is only useful if you plan to use the attention logits for subsequent +operations, since attention weights are unaffected by bias terms. +
+
+ TYPE:
+ |
+
do_pooling |
+
+ Whether to compute the output embedding. +If you only plan to use attention logits, you should disable this parameter. +Default: True +
+
+ TYPE:
+ |
+
mode |
+
+ Whether to compute content to content (c2c), content to position (c2p)
+or position to content (p2c) attention terms.
+Setting
+
+ TYPE:
+ |
+
n_additional_heads |
+
+ The number of additional head logits to compute. +Those are not used to compute output embeddings, but may be useful in +subsequent operation. +Default: 0 +
+
+ TYPE:
+ |
+
forward
+
+Forward pass of the RelativeAttention layer.
+ +PARAMETER | +DESCRIPTION | +
---|---|
content_queries |
+
+ The content query embedding to use in the attention computation
+Shape:
+
+ TYPE:
+ |
+
content_keys |
+
+ The content key embedding to use in the attention computation.
+If None, defaults to the
+
+ TYPE:
+ |
+
content_values |
+
+ The content values embedding to use in the final pooling computation.
+If None, pooling won't be performed.
+Shape:
+
+ TYPE:
+ |
+
mask |
+
+ The content key embedding to use in the attention computation.
+If None, defaults to the
+
+ TYPE:
+ |
+
relative_positions |
+
+ The relative position of keys relative to queries
+If None, positional attention terms won't be computed.
+Shape:
+
+ TYPE:
+ |
+
no_position_mask |
+
+ Key / query pairs for which the position attention terms should
+be disabled.
+Shape:
+
+ TYPE:
+ |
+
base_attn |
+
+ Attention logits to add to the computed attention logits
+Shape:
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Union[Tuple[FloatTensor, FloatTensor], FloatTensor]
+
+ |
+
+
+
+
|
+
A position embedding lookup table that stores embeddings for a fixed number
+of positions.
+The value of each of the embedding_dim
channels of the generated embedding
+is generated according to a trigonometric function (sin for even channels,
+cos for odd channels).
+The frequency of the signal in each pair of channels varies according to the
+temperature parameter.
Any input position above the maximum value num_embeddings
will be capped to
+num_embeddings - 1
PARAMETER | +DESCRIPTION | +
---|---|
num_embeddings |
+
+ The maximum number of position embeddings store in this table +
+
+ TYPE:
+ |
+
embedding_dim |
+
+ The embedding size +
+
+ TYPE:
+ |
+
temperature |
+
+ The temperature controls the range of frequencies used by each +channel of the embedding +
+
+ TYPE:
+ |
+
forward
+
+Forward pass of the SinusoidalEmbedding module
+ +PARAMETER | +DESCRIPTION | +
---|---|
indices |
+
+ Shape: any +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ FloatTensor
+
+ |
+
+
+
+ Shape: |
+
Vocabulary layer.
+This is not meant to be used as a torch.nn.Module
but subclassing
+torch.nn.Module
makes the instances appear when printing a model, which is nice.
PARAMETER | +DESCRIPTION | +
---|---|
items |
+
+ Initial vocabulary elements if any. +Specific elements such as padding and unk can be set here to enforce their +index in the vocabulary. +
+
+ TYPE:
+ |
+
default |
+
+ Default index to use for out of vocabulary elements +Defaults to -100 +
+
+ TYPE:
+ |
+
initialization
+
+Enters the initialization mode. +Out of vocabulary elements will be assigned an index.
+ +encode
+
+Converts an element into its vocabulary index
+If the layer is in its initialization mode (with vocab.initialization(): ...
),
+and the element is out of vocabulary, a new index will be created and returned.
+Otherwise, any oov element will be encoded with the default
index.
PARAMETER | +DESCRIPTION | +
---|---|
item |
+
+
+ + + |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ int
+
+ |
+
+
+
+
+ |
+
decode
+
+Converts an index into its original value
+ +PARAMETER | +DESCRIPTION | +
---|---|
idx |
+
+
+ + + |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ InputT
+
+ |
+
+
+
+
+ |
+
The goal of EDS-PDF is to provide a framework for processing PDF documents, along with some utilities and a few components, stitched together by a robust pipeline and configuration system.
+Processing PDFs usually involves many steps such as extracting lines, running OCR models, detecting and classifying boxes, filtering and aggregating parts of the extracted texts, etc. Organising these steps together, combining static and deep learning components, while remaining modular and efficient is a challenge. This is why EDS-PDF is built on top of a new pipelining system.
+Deep learning frameworks
+The EDS-PDF trainable components are built around the PyTorch framework. While you +can use any technology in static components, we do not provide tools to train +components built with other deep learning frameworks.
+A pipe is a processing block (like a function) that applies a transformation on its input and returns a modified object.
+At the moment, four types of pipes are implemented in the library:
+PDFDoc
object filled with these text boxes.body
, header
, footer
...To create your first pipeline, execute the following code:
+from edspdf import Pipeline
+
+model = Pipeline()
+# will extract text lines from a document
+model.add_pipe(
+ "pdfminer-extractor",
+ config=dict(
+ extract_style=False,
+ ),
+)
+# classify everything inside the `body` bounding box as `body`
+model.add_pipe(
+ "mask-classifier", config=dict(body={"x0": 0.1, "y0": 0.1, "x1": 0.9, "y1": 0.9})
+)
+# aggregates the lines together to re-create the original text
+model.add_pipe("simple-aggregator")
+
This pipeline can then be run on one or more PDF documents. +As the pipeline process documents, components will be called in the order +they were added to the pipeline.
+from pathlib import Path
+
+pdf_bytes = Path("path/to/your/pdf").read_bytes()
+
+# Processing one document
+model(pdf_bytes)
+
+# Processing multiple documents
+model.pipe([pdf_bytes, ...])
+
For more information on how to use the pipeline, refer to the Inference page.
+EDS-PDF was designed to facilitate the training and inference of hybrid models that +arbitrarily chain static components or trained deep learning components. Static components are callable objects that take a PDFDoc object as input, perform arbitrary transformations over the input, and return the modified object. Trainable pipes, on the other hand, allow for deep learning operations to be performed on the PDFDoc object and must be trained to be used.
+Pipelines can be saved and loaded using the save
and load
methods. The saved pipeline is not a pickled objet but a folder containing the config file, the weights and extra resources for each pipeline. This allows for easy inspection and modification of the pipeline, and avoids the execution of arbitrary code when loading a pipeline.
model.save("path/to/your/model")
+model = edspdf.load("path/to/your/model")
+
To share the pipeline and turn it into a pip installable package, you can use the package
method, which will use or create a pyproject.toml file, fill it accordingly, and create a wheel file. At the moment, we only support the poetry package manager.
model.package(
+ name="your-package-name", # leave None to reuse name in pyproject.toml
+ version="0.0.1",
+ root_dir="path/to/project/root", # optional, to retrieve an existing pyproject.toml file
+ # if you don't have a pyproject.toml, you can provide the metadata here instead
+ metadata=dict(
+ authors="Firstname Lastname <your.email@domain.fr>",
+ description="A short description of your package",
+ ),
+)
+
This will create a wheel file in the root_dir/dist folder, which you can share and install with pip
+The aggregation step compiles extracted text blocs together according to their detected class.
+ + +Factory name | +Description | +
---|---|
simple-aggregator |
+Returns a dictionary with one key for each detected class | +
SimpleAggregator
+
+Aggregator that returns texts and styles. It groups all text boxes with the same
+label under the aggregated_text
, and additionally aggregates the
+styles of the text boxes.
Create a pipeline
+pipeline = ...
+pipeline.add_pipe(
+ "simple-aggregator",
+ name="aggregator",
+ config={
+ "new_line_threshold": 0.2,
+ "new_paragraph_threshold": 1.5,
+ "label_map": {
+ "body": "text",
+ "table": "text",
+ },
+ },
+)
+
...
+
+[components.aggregator]
+@factory = "simple-aggregator"
+new_line_threshold = 0.2
+new_paragraph_threshold = 1.5
+label_map = { body = "text", table = "text" }
+
+...
+
and run it on a document:
+doc = pipeline(doc)
+print(doc.aggregated_texts)
+# {
+# "text": "This is the body of the document, followed by a table | A | B |"
+# }
+
PARAMETER | DESCRIPTION |
---|---|
`pipeline` | The pipeline object |
`name` | The name of the component |
`sort` | Whether to sort text boxes inside each label group by (page, y, x) position before merging them. |
`new_line_threshold` | Minimum ratio of the distance between two lines to the median height of lines to consider them as being on separate lines |
`new_paragraph_threshold` | Minimum ratio of the distance between two lines to the median height of lines to consider them as being on separate paragraphs and thus add a newline character between them. |
`label_map` | A dictionary mapping labels to new labels. This is useful to group labels together, for instance, to output both "body" and "table" as "text". |
Dummy classifier, for chaos purposes. Classifies each line to a random element.
+ +PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ The pipeline object. +
+
+ TYPE:
+ |
+
name |
+
+ The name of the component. +
+
+ TYPE:
+ |
+
label |
+
+ The label to assign to each line. +
+
+ TYPE:
+ |
+
We developed EDS-PDF with modularity in mind. To that end, you can choose between multiple classification methods.
+ + +Factory name | +Description | +
---|---|
mask-classifier |
+Simple rule-based classification | +
multi-mask-classifier |
+Simple rule-based classification | +
dummy-classifier |
+Dummy classifier, for testing purposes. | +
random-classifier |
+To sow chaos | +
trainable-classifier |
+Trainable box classification model | +
We developed a simple classifier that roughly uses the same strategy as PDFBox, namely:
+Two factories are available in the classifiers
registry: mask-classifier
and multi-mask-classifier
.
mask-classifier
The simplest form of mask classification. You define the mask, everything else +is tagged as pollution.
+ +PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ The pipeline object +
+
+ TYPE:
+ |
+
name |
+
+ The name of the component +
+
+ TYPE:
+ |
+
x0 |
+
+ The x0 coordinate of the mask +
+
+ TYPE:
+ |
+
y0 |
+
+ The y0 coordinate of the mask +
+
+ TYPE:
+ |
+
x1 |
+
+ The x1 coordinate of the mask +
+
+ TYPE:
+ |
+
y1 |
+
+ The y1 coordinate of the mask +
+
+ TYPE:
+ |
+
threshold |
+
+ The threshold for the alignment +
+
+ TYPE:
+ |
+
pipeline.add_pipe(
+ "mask-classifier",
+ name="classifier",
+ config={
+ "threshold": 0.9,
+ "x0": 0.1,
+ "y0": 0.1,
+ "x1": 0.9,
+ "y1": 0.9,
+ },
+)
+
[components.classifier]
+@classifiers = "mask-classifier"
+x0 = 0.1
+y0 = 0.1
+x1 = 0.9
+y1 = 0.9
+threshold = 0.9
+
multi-mask-classifier
A generalisation, wherein the user defines a number of regions.
+The following configuration produces exactly the same classifier as mask.v1
+example above.
Any bloc that is not part of a mask is tagged as pollution
.
PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ The pipeline object +
+
+ TYPE:
+ |
+
name |
+
+
+
+
+ TYPE:
+ |
+
threshold |
+
+ The threshold for the alignment +
+
+ TYPE:
+ |
+
masks |
+
+ The masks +
+
+ TYPE:
+ |
+
pipeline.add_pipe(
+ "multi-mask-classifier",
+ name="classifier",
+ config={
+ "threshold": 0.9,
+ "mymask": {"x0": 0.1, "y0": 0.1, "x1": 0.9, "y1": 0.3, "label": "body"},
+ },
+)
+
[components.classifier]
+@factory = "multi-mask-classifier"
+threshold = 0.9
+
+[components.classifier.mymask]
+label = "body"
+x0 = 0.1
+y0 = 0.1
+x1 = 0.9
+y1 = 0.9
+
The following configuration defines a header
region.
pipeline.add_pipe(
+ "multi-mask-classifier",
+ name="classifier",
+ config={
+ "threshold": 0.9,
+ "body": {"x0": 0.1, "y0": 0.1, "x1": 0.9, "y1": 0.3, "label": "header"},
+ "header": {"x0": 0.1, "y0": 0.3, "x1": 0.9, "y1": 0.9, "label": "body"},
+ },
+)
+
[components.classifier]
+@factory = "multi-mask-classifier"
+threshold = 0.9
+
+[components.classifier.header]
+label = "header"
+x0 = 0.1
+y0 = 0.1
+x1 = 0.9
+y1 = 0.3
+
+[components.classifier.body]
+label = "body"
+x0 = 0.1
+y0 = 0.3
+x1 = 0.9
+y1 = 0.9
+
Random classifier, for chaos purposes. Classifies each box to a random element.
+ +PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ The pipeline object. +
+
+ TYPE:
+ |
+
name |
+
+ The name of the component. +
+
+ TYPE:
+ |
+
labels |
+
+ The labels to assign to each line. If a list is passed, each label is assigned +with equal probability. If a dict is passed, the keys are the labels and the +values are the probabilities. +
+
+ TYPE:
+ |
+
This component predicts a label for each box over the whole document using machine +learning.
+Note
+You must train the model your model to use this classifier. +See Model training for more information
+The classifier is composed of the following blocks:
+In this example, we use a box-embedding
layer to generate the embeddings
+of the boxes. It is composed of a text encoder that embeds the text features of the
+boxes and a layout encoder that embeds the layout features of the boxes.
+These two embeddings are summed and passed through an optional contextualizer
,
+here a box-transformer
.
pipeline.add_pipe(
+ "trainable-classifier",
+ name="classifier",
+ config={
+ # simple embedding computed by pooling embeddings of words in each box
+ "embedding": {
+ "@factory": "sub-box-cnn-pooler",
+ "out_channels": 64,
+ "kernel_sizes": (3, 4, 5),
+ "embedding": {
+ "@factory": "simple-text-embedding",
+ "size": 72,
+ },
+ },
+ "labels": ["body", "pollution"],
+ },
+)
+
[components.classifier]
+@factory = "trainable-classifier"
+labels = ["body", "pollution"]
+
+[components.classifier.embedding]
+@factory = "sub-box-cnn-pooler"
+out_channels = 64
+kernel_sizes = (3, 4, 5)
+
+[components.classifier.embedding.embedding]
+@factory = "simple-text-embedding"
+size = 72
+
PARAMETER | DESCRIPTION |
---|---|
`labels` | Initial labels of the classifier (will be completed during initialization) |
`embedding` | Embedding module to encode the PDF boxes |
This component encodes the geometrical features of a box, as extracted by the +BoxLayoutPreprocessor module, into an embedding. For position modes, use:
+"sin"
to embed positions with a fixed
+ SinusoidalEmbedding"learned"
to embed positions using a learned standard pytorch embedding layerEach produces embedding is the concatenation of the box width, height and the top,
+left, bottom and right coordinates, each embedded depending on the *_mode
param.
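For instance, a configuration for this component could look like the following (a sketch; the values are illustrative and the parameters are documented below):

```python
# Illustrative sub-configuration for a box layout embedding, e.g. used as the
# `embedding` of a trainable component
layout_embedding = {
    "@factory": "box-layout-embedding",
    "size": 72,
    "n_positions": 64,
    "x_mode": "sin",
    "y_mode": "sin",
    "w_mode": "learned",
    "h_mode": "learned",
}
```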
PARAMETER | DESCRIPTION |
---|---|
`size` | Size of the output box embedding |
`n_positions` | Number of position embeddings stored in the PositionEmbedding module |
`x_mode` | Position embedding mode of the x coordinates |
`y_mode` | Position embedding mode of the y coordinates |
`w_mode` | Position embedding mode of the width features |
`h_mode` | Position embedding mode of the height features |
BoxTransformer using +BoxTransformerModule +under the hood.
+Note
+This module is a TrainablePipe +and can be used in a Pipeline, while +BoxTransformerModule +is a standard PyTorch module, which does not take care of the +preprocessing, collating, etc. of the input documents.
+PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ Pipeline instance +
+
+ TYPE:
+ |
+
name |
+
+ Name of the component +
+
+ TYPE:
+ |
+
num_heads |
+
+ Number of attention heads in the attention layers +
+
+ TYPE:
+ |
+
n_relative_positions |
+
+ Maximum range of embeddable relative positions between boxes (further +distances are capped to ±n_relative_positions // 2) +
+
+ TYPE:
+ |
+
dropout_p |
+
+ Dropout probability both for the attention layers and embedding projections +
+
+ TYPE:
+ |
+
head_size |
+
+ Head sizes of the attention layers +
+
+ TYPE:
+ |
+
activation |
+
+ Activation function used in the linear->activation->linear transformations +
+
+ TYPE:
+ |
+
init_resweight |
+
+ Initial weight of the residual gates. +At 0, the layer acts (initially) as an identity function, and at 1 as +a standard Transformer layer. +Initializing with a value close to 0 can help the training converge. +
+
+ TYPE:
+ |
+
attention_mode |
+
+ Mode of relative position infused attention layer. +See the relative attention +documentation for more information. +
+
+ TYPE:
+ |
+
n_layers |
+
+ Number of layers in the Transformer +
+
+ TYPE:
+ |
+
Encodes boxes using a combination of multiple encoders
+ +PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ The pipeline object +
+
+ TYPE:
+ |
+
name |
+
+ The name of the pipe +
+
+ TYPE:
+ |
+
mode |
+
+ The mode to use to combine the encoders: +
+
+ TYPE:
+ |
+
dropout_p |
+
+ Dropout probability used on the output of the box and textual encoders +
+
+ TYPE:
+ |
+
encoders |
+
+ The encoders to use. The keys are the names of the encoders and the values +are the encoders themselves. +
+
+ TYPE:
+ |
+
The HuggingfaceEmbeddings component is a wrapper around the Huggingface multi-modal +models. Such pre-trained models should offer better results than a model trained +from scratch. Compared to using the raw Huggingface model, we offer a simple +mechanism to split long documents into strided windows before feeding them to the +model.
+The HuggingfaceEmbedding component splits long documents into smaller windows before
+feeding them to the model. This is done to avoid hitting the maximum number of
+tokens that can be processed by the model on a single device. The window size and
+stride can be configured using the window
and stride
parameters. The default
+values are 510 and 255 respectively, which means that the model will process windows
+of 510 tokens, each separated by 255 tokens. Whenever a token appears in multiple
+windows, the embedding of the "most contextualized" occurrence is used, i.e. the
+occurrence that is the closest to the center of its window.
Here is an overview how this works in a classifier model : +
+Here is an example of how to define a pipeline with the HuggingfaceEmbedding +component:
+from edspdf import Pipeline
+
+model = Pipeline()
+model.add_pipe(
+ "pdfminer-extractor",
+ name="extractor",
+ config={
+ "render_pages": True,
+ },
+)
+model.add_pipe(
+ "huggingface-embedding",
+ name="embedding",
+ config={
+ "model": "microsoft/layoutlmv3-base",
+ "use_image": False,
+ "window": 128,
+ "stride": 64,
+ "line_pooling": "mean",
+ },
+)
+model.add_pipe(
+ "trainable-classifier",
+ name="classifier",
+ config={
+ "embedding": model.get_pipe("embedding"),
+ "labels": [],
+ },
+)
+
This model can then be trained following the +training recipe.
PARAMETER | DESCRIPTION |
---|---|
`pipeline` | The pipeline instance |
`name` | The component name |
`model` | The Huggingface model name or path |
`use_image` | Whether to use the image or not in the model |
`window` | The window size to use when splitting long documents into smaller windows before feeding them to the Transformer model (default: 510 = 512 - 2) |
`stride` | The stride (distance between windows) to use when splitting long documents into smaller windows (default: 510 / 2 = 255) |
`line_pooling` | The pooling strategy to use when combining the embeddings of the tokens in a line into a single line embedding |
`max_tokens_per_device` | The maximum number of tokens that can be processed by the model on a single device. This does not affect the results but can be used to reduce the memory usage of the model, at the cost of a longer processing time. |
We offer multiple embedding methods to encode the text and layout information of the PDFs. The following components can be added to a pipeline or composed together, and contain preprocessing and postprocessing logic to convert and batch documents.
+ + + + +Factory name | +Description | +
---|---|
simple-text-embedding |
+A module that embeds the textual features of the blocks. | +
embedding-combiner |
+Encodes boxes using a combination of multiple encoders | +
sub-box-cnn-pooler |
+Pools the output of a CNN over the elements of a box (like words) | +
box-layout-embedding |
+Encodes the layout of the boxes | +
box-transformer |
+Contextualizes box representations using a transformer | +
huggingface-embedding |
+Box representations using a Huggingface multi-modal model. | +
Layers
+These components are not to be confused with layers
, which are standard
+PyTorch modules that can be used to build trainable components, such as the ones
+described here.
A module that embeds the textual features of the blocks
+ + + +PARAMETER | +DESCRIPTION | +
---|---|
size |
+
+ Size of the output box embedding +
+
+ TYPE:
+ |
+
pipeline |
+
+ The pipeline object +
+
+ TYPE:
+ |
+
name |
+
+ Name of the component +
+
+ TYPE:
+ |
+
One dimension CNN encoding multi-kernel layer.
+Input embeddings are convoluted using linear kernels each parametrized with
+a (window) size of kernel_size[kernel_i]
+The output of the kernels are concatenated together, max-pooled and finally
+projected to a size of output_size
.
PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ Pipeline instance +
+
+ TYPE:
+ |
+
name |
+
+ Name of the component +
+
+ TYPE:
+ |
+
output_size |
+
+ Size of the output embeddings
+Defaults to the
+
+ TYPE:
+ |
+
out_channels |
+
+ Number of channels +
+
+ TYPE:
+ |
+
kernel_sizes |
+
+ Window size of each kernel +
+
+ TYPE:
+ |
+
activation |
+
+ Activation function to use +
+
+ TYPE:
+ |
+
The extraction phase consists of reading the PDF document and gathering text blocs, along with their dimensions and position within the document. Said blocs will go on to the classification phase to separate the body from the rest.
We provide multiple extractor architectures for text-based PDFs:
+ + +Factory name | +Description | +
---|---|
pdfminer-extractor |
+Extracts text lines with the pdfminer library |
+
mupdf-extractor |
+Extracts text lines with the pymupdf library |
+
poppler-extractor |
+Extracts text lines with the poppler library |
+
Image-based PDF documents require an OCR1 step, which is not natively supported by EDS-PDF. +However, you can easily extend EDS-PDF by adding such a method to the registry.
+We plan on adding such an OCR extractor component in the future.
+Optical Character Recognition, or OCR, is the process of extracting characters and words from an image. ↩
+We provide a PDF line extractor built on top of +PdfMiner.
+This is the most portable extractor, since it is pure-python and can therefore +be run on any platform. Be sure to have a look at their documentation, +especially the part providing a bird's eye view of the PDF extraction process.
+pipeline.add_pipe(
+ "pdfminer-extractor",
+ config=dict(
+ extract_style=False,
+ ),
+)
+
[components.extractor]
+@factory = "pdfminer-extractor"
+extract_style = false
+
And use the pipeline on a PDF document:
+from pathlib import Path
+
+# Apply on a new document
+pipeline(Path("path/to/your/pdf/document").read_bytes())
+
PARAMETER | DESCRIPTION |
---|---|
`line_overlap` | See PDFMiner documentation |
`char_margin` | See PDFMiner documentation |
`line_margin` | See PDFMiner documentation |
`word_margin` | See PDFMiner documentation |
`boxes_flow` | See PDFMiner documentation |
`detect_vertical` | See PDFMiner documentation |
`all_texts` | See PDFMiner documentation |
`extract_style` | Whether to extract style (font, size, ...) information for each line of the document. Default: False |
`render_pages` | Whether to extract the rendered page as a numpy array |
`render_dpi` | DPI to use when rendering the page (defaults to 200) |
`raise_on_error` | Whether to raise an error if the PDF cannot be parsed. Default: False |
EDS-PDF provides easy-to-use components for defining PDF processing pipelines.
+Factory name | +Description | +
---|---|
pdfminer-extractor |
+Extracts text lines with the pdfminer library |
+
mupdf-extractor |
+Extracts text lines with the pymupdf library |
+
poppler-extractor |
+Extracts text lines with the poppler library |
+
Factory name | +Description | +
---|---|
mask-classifier |
+Simple rule-based classification | +
multi-mask-classifier |
+Simple rule-based classification | +
dummy-classifier |
+Dummy classifier, for testing purposes. | +
random-classifier |
+To sow chaos | +
trainable-classifier |
+Trainable box classification model | +
Factory name | +Description | +
---|---|
simple-aggregator |
+Returns a dictionary with one key for each detected class | +
Factory name | +Description | +
---|---|
simple-text-embedding |
+A module that embeds the textual features of the blocks. | +
embedding-combiner |
+Encodes boxes using a combination of multiple encoders | +
sub-box-cnn-pooler |
+Pools the output of a CNN over the elements of a box (like words) | +
box-layout-embedding |
+Encodes the layout of the boxes | +
box-transformer |
+Contextualizes box representations using a transformer | +
huggingface-embedding |
+Box representations using a Huggingface multi-modal model. | +
You can add them to your EDS-PDF pipeline by simply calling add_pipe
, for instance:
# ↑ Omitted code that defines the pipeline object ↑
+pipeline.add_pipe("pdfminer-extractor", name="component-name", config=...)
+
In this section, we will cover one methodology to annotate PDF documents.
+Data annotation at AP-HP's CDW
+At AP-HP's CDW1, we recently moved away from a rule- and Java-based PDF extraction pipeline +(using PDFBox) to one using EDS-PDF. Hence, EDS-PDF is used in production, helping +extract text from around 100k PDF documents every day.
+To train our pipeline presently in production, we annotated around 270 documents, and reached +a f1-score of 0.98 on the body classification.
+We will frame the annotation phase as an image segmentation task,
+where annotators are asked to draw bounding boxes around the different sections.
+Hence, the very first step is to convert PDF documents to images. We suggest using the
+library pdf2image
for that step.
The following script will convert the PDF documents located in a data/pdfs
directory
+to PNG images inside the data/images
folder.
import pdf2image
+from pathlib import Path
+
+DATA_DIR = Path("data")
+PDF_DIR = DATA_DIR / "pdfs"
+IMAGE_DIR = DATA_DIR / "images"
+
+for pdf in PDF_DIR.glob("*.pdf"):
+ imgs = pdf2image.convert_from_path(pdf)
+
+ for page, img in enumerate(imgs):
+ path = IMAGE_DIR / f"{pdf.stem}_{page}.png"
+ img.save(path)
+
You can use any annotation tool to annotate the images. If you're looking for a simple +way to annotate from within a Jupyter Notebook, +ipyannotations +might be a good fit.
+You will need to post-process the output +to convert the annotations to the following format:
+Key | +Description | +
---|---|
page |
+Page within the PDF (0-indexed) | +
x0 |
+Horizontal position of the top-left corner of the bounding box | +
x1 |
+Horizontal position of the bottom-right corner of the bounding box | +
y0 |
+Vertical position of the top-left corner of the bounding box | +
y1 |
+Vertical position of the bottom-right corner of the bounding box | +
label |
+Class of the bounding box (eg body , header ...) |
+
All dimensions should be normalised by the height and width of the page.
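If your annotation tool returns pixel coordinates, they can be normalised with a small helper such as this one (illustrative):

```python
def normalise_box(box: dict, page_width: float, page_height: float) -> dict:
    """Convert pixel coordinates to page-relative coordinates in [0, 1]."""
    return {
        **box,
        "x0": box["x0"] / page_width,
        "x1": box["x1"] / page_width,
        "y0": box["y0"] / page_height,
        "y1": box["y1"] / page_height,
    }


normalise_box(
    {"page": 0, "label": "body", "x0": 85, "x1": 510, "y0": 200, "y1": 430},
    page_width=595,
    page_height=842,
)
```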
+Once the annotation phase is complete, make sure the train/test split is performed +once and for all when you create the dataset.
+We suggest the following structure:
+dataset/
+├── train/
+│ ├── <note_id_1>.pdf
+│ ├── <note_id_1>.json
+│ ├── <note_id_2>.pdf
+│ ├── <note_id_2>.json
+│ └── ...
+└── test/
+ ├── <note_id_n>.pdf
+ ├── <note_id_n>.json
+ └── ...
+
Where the normalised annotation resides in a JSON file living next to the related PDF, +and uses the following schema:
+Key | +Description | +
---|---|
note_id |
+Reference to the document | +
<properties> |
+Optional property of the document itself | +
annotations |
+List of annotations, following the schema above | +
This structure presents the advantage of being machine- and human-friendly. +The JSON file contains annotated regions as well as any document property that +could be useful to adapt the pipeline (typically for the classification step).
+The following snippet extracts the annotations into a workable format:
+from pathlib import Path
+import json
+
+import pandas as pd
+from tqdm import tqdm
+
+
+def get_annotations(
+ directory: Path,
+) -> pd.DataFrame:
+ """
+ Read annotations from the dataset directory.
+
+ Parameters
+ ----------
+ directory : Path
+ Dataset directory
+
+ Returns
+ -------
+ pd.DataFrame
+ Pandas DataFrame containing the annotations.
+ """
+ dfs = []
+
+ iterator = tqdm(list(directory.glob("*.json")))
+
+ for path in iterator:
+ meta = json.loads(path.read_text())
+ df = pd.DataFrame.from_records(meta.pop("annotations"))
+
+ for k, v in meta.items(): # (1)
+ df[k] = v
+
+ dfs.append(df)
+
+ return pd.concat(dfs)
+
+
+train_path = Path("dataset/train")
+
+annotations = get_annotations(train_path)
+
The annotations compiled this way can be used to train a pipeline. +See the trained pipeline recipe for more detail.
+Greater Paris University Hospital's Clinical Data Warehouse ↩
+EDS-PDF is organised around a function registry powered by catalogue and a custom configuration system. The result is a powerful framework that is easy to extend - and we'll see how in this section.
+For this recipe, let's imagine we're not entirely satisfied with the aggregation +proposed by EDS-PDF. For instance, we might want an aggregator that outputs the +text in Markdown format.
+Note
+Properly converting to markdown is no easy task. For this example, +we will limit ourselves to detecting bold and italics sections.
+Our aggregator will inherit from the SimpleAggregator
,
+and use the style to detect italics and bold sections.
from edspdf import registry
+from edspdf.pipes.aggregators.simple import SimpleAggregator
+from edspdf.structures import PDFDoc, Text
+
+
+@registry.factory.register("markdown-aggregator") # (1)
+class MarkdownAggregator(SimpleAggregator):
+ def __call__(self, doc: PDFDoc) -> PDFDoc:
+ doc = super().__call__(doc)
+
+ for label in doc.aggregated_texts.keys():
+ text = doc.aggregated_texts[label].text
+
+ fragments = []
+
+ offset = 0
+ for s in doc.aggregated_texts[label].properties:
+ if s.begin >= s.end:
+ continue
+ if offset < s.begin:
+ fragments.append(text[offset : s.begin])
+
+ offset = s.end
+ snippet = text[s.begin : s.end]
+ if s.bold:
+ snippet = f"**{snippet}**"
+ if s.italic:
+ snippet = f"_{snippet}_"
+ fragments.append(snippet)
+
+ if offset < len(text):
+ fragments.append(text[offset:])
+
+ doc.aggregated_texts[label] = Text(text="".join(fragments))
+
+ return doc
+
__call__
method.
+ It will output a single string, corresponding to the markdown-formatted output.That's it! You can use this new aggregator with the API:
+from edspdf import Pipeline
+from markdown_aggregator import MarkdownAggregator # (1)
+
+model = Pipeline()
+# will extract text lines from a document
+model.add_pipe(
+ "pdfminer-extractor",
+ config=dict(
+ extract_style=False,
+ ),
+)
+# classify everything inside the `body` bounding box as `body`
+model.add_pipe("mask-classifier", config={"x0": 0.1, "y0": 0.1, "x1": 0.9, "y1": 0.9})
+# aggregates the lines together to generate the markdown formatted text
+model.add_pipe("markdown-aggregator")
+
It all works relatively smoothly!
+Now, how can we instantiate the pipeline using the configuration system?
+The registry needs to be aware of the new function, but we shouldn't have to
+import markdown_aggregator.py
just so that the module is registered as a side-effect...
Catalogue solves this problem by using Python entry points.
+[project.entry-points."edspdf_factories"]
+"markdown-aggregator" = "markdown_aggregator:MarkdownAggregator"
+
from setuptools import setup
+
+setup(
+ name="edspdf-markdown-aggregator",
+ entry_points={
+ "edspdf_factories": [
+ "markdown-aggregator = markdown_aggregator:MarkdownAggregator"
+ ]
+ },
+)
+
By declaring the new aggregator as an entrypoint, it will become discoverable by EDS-PDF +as long as it is installed in your environment!
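Once the package is installed, the new aggregator can be referenced from a configuration file exactly like a built-in component. A minimal sketch (only the aggregator section is shown; the rest of the pipeline config is unchanged):

[components.aggregator]
@factory = "markdown-aggregator"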
+This section goes over a few use-cases for PDF extraction. +It is meant as a more hands-on tutorial to get a grip on the library.
+Let's create a rule-based extractor for PDF documents.
+Note
+This pipeline will likely perform poorly as soon as your PDF documents +come in varied forms. In that case, even a very simple trained pipeline +may give you a substantial performance boost (see next section).
+First, download this example PDF.
+We will use the following configuration:
+[pipeline]
+components = ["extractor", "classifier", "aggregator"]
+components_config = ${components}
+
+[components.extractor]
+@factory = "pdfminer-extractor" # (2)
+extract_style = true
+
+[components.classifier]
+@factory = "mask-classifier" # (3)
+x0 = 0.2
+x1 = 0.9
+y0 = 0.3
+y1 = 0.6
+threshold = 0.1
+
+[components.aggregator]
+@factory = "styled-aggregator" # (4)
+
The mask classifier assigns the body label; everything else will be tagged as pollution.

Save the configuration as config.cfg and run the following snippet:
import edspdf
+import pandas as pd
+from pathlib import Path
+
+model = edspdf.load("config.cfg") # (1)
+
+# Get a PDF
+pdf = Path("/Users/perceval/Development/edspdf/tests/resources/letter.pdf").read_bytes()
+pdf = model(pdf)
+
+body = pdf.aggregated_texts["body"]
+
+text, style = body.text, body.properties
+print(text)
+print(pd.DataFrame(style))
+
This code will output the following results:
+Cher Pr ABC, Cher DEF,
+
+Nous souhaitons remercier le CSE pour son avis favorable quant à l’accès aux données de
+l’Entrepôt de Données de Santé du projet n° XXXX.
+
+Nous avons bien pris connaissance des conditions requises pour cet avis favorable, c’est
+pourquoi nous nous engageons par la présente à :
+
+• Informer individuellement les patients concernés par la recherche, admis à l'AP-HP
+avant juillet 2017, sortis vivants, et non réadmis depuis.
+
+• Effectuer une demande d'autorisation à la CNIL en cas d'appariement avec d’autres
+cohortes.
+
+Bien cordialement,
+
The start
and end
columns refer to the character indices within the extracted text.
italic | +bold | +fontname | +start | +end | +
---|---|---|---|---|
False | +False | +BCDFEE+Calibri | +0 | +22 | +
False | +False | +BCDFEE+Calibri | +24 | +90 | +
False | +False | +BCDHEE+Calibri | +90 | +91 | +
False | +False | +BCDFEE+Calibri | +91 | +111 | +
False | +False | +BCDFEE+Calibri | +112 | +113 | +
False | +False | +BCDHEE+Calibri | +113 | +114 | +
False | +False | +BCDFEE+Calibri | +114 | +161 | +
False | +False | +BCDFEE+Calibri | +163 | +247 | +
False | +False | +BCDHEE+Calibri | +247 | +248 | +
False | +False | +BCDFEE+Calibri | +248 | +251 | +
False | +False | +BCDFEE+Calibri | +252 | +300 | +
False | +False | +SymbolMT | +302 | +303 | +
False | +False | +BCDFEE+Calibri | +304 | +386 | +
False | +False | +BCDFEE+Calibri | +387 | +445 | +
False | +False | +SymbolMT | +447 | +448 | +
False | +False | +BCDFEE+Calibri | +449 | +523 | +
False | +False | +BCDHEE+Calibri | +523 | +524 | +
False | +False | +BCDFEE+Calibri | +524 | +530 | +
False | +False | +BCDFEE+Calibri | +531 | +540 | +
False | +False | +BCDFEE+Calibri | +542 | +560 | +
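As a quick check, slicing the extracted text with the first style row should give back the salutation line (reusing the text and style variables from the snippet above):

first = pd.DataFrame(style).iloc[0]
print(text[first["start"] : first["end"]])
# Cher Pr ABC, Cher DEF,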
In this chapter, we'll see how to train a deep-learning based classifier to better classify the lines of the document and extract text from it.
+Training supervised models consists in feeding batches of samples taken from a training corpus +to a model instantiated from a given architecture and optimizing the learnable weights of the +model to decrease a given loss. The process of training a pipeline with EDS-PDF is as follows:
+We first start by seeding the random states and instantiating a new trainable pipeline. Here we show two examples of pipeline, the first one based on a custom embedding architecture and the second one based on a pre-trained HuggingFace transformer model.
+The architecture of the trainable classifier of this recipe is described in the following figure: +
+from edspdf import Pipeline
+from edspdf.utils.random import set_seed
+
+set_seed(42)
+
+model = Pipeline()
+model.add_pipe("pdfminer-extractor", name="extractor") # (1)
+model.add_pipe(
+ "box-transformer",
+ name="embedding",
+ config={
+ "num_heads": 4,
+ "dropout_p": 0.1,
+ "activation": "gelu",
+ "init_resweight": 0.01,
+ "head_size": 16,
+ "attention_mode": ["c2c", "c2p", "p2c"],
+ "n_layers": 1,
+ "n_relative_positions": 64,
+ "embedding": {
+ "@factory": "embedding-combiner",
+ "dropout_p": 0.1,
+ "text_encoder": {
+ "@factory": "sub-box-cnn-pooler",
+ "out_channels": 64,
+ "kernel_sizes": (3, 4, 5),
+ "embedding": {
+ "@factory": "simple-text-embedding",
+ "size": 72,
+ },
+ },
+ "layout_encoder": {
+ "@factory": "box-layout-embedding",
+ "n_positions": 64,
+ "x_mode": "learned",
+ "y_mode": "learned",
+ "w_mode": "learned",
+ "h_mode": "learned",
+ "size": 72,
+ },
+ },
+ },
+)
+model.add_pipe(
+ "trainable-classifier",
+ name="classifier",
+ config={
+ "embedding": model.get_pipe("embedding"),
+ "labels": [],
+ },
+)
+
model = Pipeline()
+model.add_pipe(
+ "mupdf-extractor",
+ name="extractor",
+ config={
+ "render_pages": True,
+ },
+) # (1)
+model.add_pipe(
+ "huggingface-embedding",
+ name="embedding",
+ config={
+ "model": "microsoft/layoutlmv3-base",
+ "use_image": False,
+ "window": 128,
+ "stride": 64,
+ "line_pooling": "mean",
+ },
+)
+model.add_pipe(
+ "trainable-classifier",
+ name="classifier",
+ config={
+ "embedding": model.get_pipe("embedding"),
+ "labels": [],
+ },
+)
+
We then load and adapt (i.e., convert into PDFDoc) the training and validation datasets, which are often a combination of JSON and PDF files. The recommended way of doing this is to write a Python generator of PDFDoc objects.
train_docs = list(segmentation_adapter(train_path)(model))
+val_docs = list(segmentation_adapter(val_path)(model))
+
We initialize the missing or incomplete component attributes (such as vocabularies) with the training dataset:
model.post_init(train_docs)
+
The training dataset is then preprocessed into features. The resulting preprocessed dataset is wrapped into a pytorch DataLoader and fed to the model during the training loop, using the model's own collate method.
from torch.utils.data import DataLoader
+
+preprocessed = list(model.preprocess_many(train_docs, supervision=True))
+dataloader = DataLoader(
+ preprocessed,
+ batch_size=batch_size,
+ collate_fn=model.collate,
+ shuffle=True,
+)
+
We instantiate an optimizer and start the training loop +
import torch
+from itertools import chain, repeat
+from tqdm import tqdm
+
+optimizer = torch.optim.AdamW(
+ params=model.parameters(),
+ lr=lr,
+)
+
+# We will loop over the dataloader
+iterator = chain.from_iterable(repeat(dataloader))
+
+for step in tqdm(range(max_steps), "Training model", leave=True):
+ batch = next(iterator)
+ optimizer.zero_grad()
+
The trainable components are fed the collated batches from the dataloader with the TrainablePipe.module_forward
method to compute the losses. Since outputs of shared subcomponents are reused between components, we enable caching by wrapping this step in a cache context. The training loop is otherwise carried out in a similar fashion to a standard pytorch training loop
+
with model.cache():
+ loss = torch.zeros((), device="cpu")
+ for name, component in model.trainable_pipes():
+ output = component.module_forward(batch[component.name])
+ if "loss" in output:
+ loss += output["loss"]
+
+ loss.backward()
+
+ optimizer.step()
+
Finally, the model is evaluated on the validation dataset at regular intervals and saved at the end of the training. To score the model, we only want to run the "classifier" component and not the extractor, otherwise we would overwrite annotated text boxes on documents in the val_docs dataset and end up with mismatching text boxes between the gold and predicted documents. To save the model, although you could use torch.save, we provide a safer method that avoids the security pitfalls of pickled models.
+
from edspdf import Pipeline
+from sklearn.metrics import classification_report
+from copy import deepcopy
+
+
+def score(golds, preds):
+ return classification_report(
+ [b.label for gold in golds for b in gold.text_boxes if b.text != ""],
+ [b.label for pred in preds for b in pred.text_boxes if b.text != ""],
+ output_dict=True,
+ zero_division=0,
+ )
+
+
+...
+
+if (step % 100) == 0:
+ # we only want to run "classifier" component, not overwrite the text boxes
+ with model.select_pipes(enable=["classifier"]):
+ print(score(val_docs, model.pipe(deepcopy(val_docs))))
+
+# torch.save(model, "model.pt")
+model.save("model")
+
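To reuse the trained weights later, the saved directory can be restored into an identically configured pipeline with load_state_from_disk, which is documented in the Pipeline API further below (a minimal sketch):

model.load_state_from_disk("model")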
The first step of training a pipeline is to adapt the dataset to the pipeline. This is done by converting the dataset into a list of PDFDoc objects, using an extractor. The following function loads a dataset of .pdf and .json files, where each .json file contains box annotations represented with page, x0, x1, y0, y1 and label.
from edspdf.utils.alignment import align_box_labels
+from pathlib import Path
+from pydantic import DirectoryPath
+from edspdf.registry import registry
+from edspdf.structures import Box
+import json
+
+
+@registry.adapter.register("my-segmentation-adapter")
+def segmentation_adapter(
+ path: DirectoryPath,
+):
+ def adapt_to(model):
+ for anns_filepath in sorted(Path(path).glob("*.json")):
+ pdf_filepath = str(anns_filepath).replace(".json", ".pdf")
+ with open(anns_filepath) as f:
+ sample = json.load(f)
+ pdf = Path(pdf_filepath).read_bytes()
+
+ if len(sample["annotations"]) == 0:
+ continue
+
+ doc = model.components.extractor(pdf)
+ doc.id = pdf_filepath.split(".")[0].split("/")[-1]
+ doc.lines = [
+ line
+ for page in sorted(set(b.page for b in doc.lines))
+ for line in align_box_labels(
+ src_boxes=[
+ Box(
+ page_num=b["page"],
+ x0=b["x0"],
+ x1=b["x1"],
+ y0=b["y0"],
+ y1=b["y1"],
+ label=b["label"],
+ )
+ for b in sample["annotations"]
+ if b["page"] == page
+ ],
+ dst_boxes=doc.lines,
+ pollution_label=None,
+ )
+ if line.text == "" or line.label is not None
+ ]
+ yield doc
+
+ return adapt_to
+
Let's wrap the training code in a function, and make it callable from the command line using confit!
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 + 13 + 14 + 15 + 16 + 17 + 18 + 19 + 20 + 21 + 22 + 23 + 24 + 25 + 26 + 27 + 28 + 29 + 30 + 31 + 32 + 33 + 34 + 35 + 36 + 37 + 38 + 39 + 40 + 41 + 42 + 43 + 44 + 45 + 46 + 47 + 48 + 49 + 50 + 51 + 52 + 53 + 54 + 55 + 56 + 57 + 58 + 59 + 60 + 61 + 62 + 63 + 64 + 65 + 66 + 67 + 68 + 69 + 70 + 71 + 72 + 73 + 74 + 75 + 76 + 77 + 78 + 79 + 80 + 81 + 82 + 83 + 84 + 85 + 86 + 87 + 88 + 89 + 90 + 91 + 92 + 93 + 94 + 95 + 96 + 97 + 98 + 99 +100 +101 +102 +103 +104 +105 +106 +107 +108 +109 +110 +111 +112 +113 +114 +115 +116 +117 +118 +119 +120 +121 +122 +123 +124 +125 +126 +127 +128 +129 +130 +131 +132 +133 +134 +135 +136 +137 +138 +139 +140 +141 +142 +143 +144 +145 +146 +147 +148 +149 +150 +151 +152 +153 +154 +155 +156 +157 +158 +159 +160 +161 +162 +163 +164 +165 +166 +167 +168 +169 +170 +171 +172 +173 +174 +175 +176 +177 |
|
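The full train.py listing is omitted here. A condensed, illustrative sketch of its structure, reusing the steps shown above (the confit Cli wrapper, paths and default values are assumptions, not the exact script), could look like this:

import torch
from confit import Cli
from torch.utils.data import DataLoader
from tqdm import tqdm

from edspdf import Pipeline
from edspdf.utils.random import set_seed

app = Cli()

# `segmentation_adapter` is the registered data adapter defined in the next section


@app.command(name="train")
def train_my_model(
    train_path: str = "data/train",
    val_path: str = "data/val",
    seed: int = 42,
    max_steps: int = 1000,
    batch_size: int = 4,
    lr: float = 3e-4,
):
    set_seed(seed)

    # Build the pipeline exactly as in the first step of this recipe
    model = Pipeline()
    ...  # model.add_pipe(...) calls for the extractor, embedding and classifier

    # Adapt, initialize and preprocess the training data
    train_docs = list(segmentation_adapter(train_path)(model))
    val_docs = list(segmentation_adapter(val_path)(model))
    model.post_init(train_docs)

    preprocessed = list(model.preprocess_many(train_docs, supervision=True))
    dataloader = DataLoader(
        preprocessed, batch_size=batch_size, collate_fn=model.collate, shuffle=True
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    for step in tqdm(range(max_steps), "Training model"):
        ...  # optimization step and periodic scoring on val_docs, as shown above

    model.save("model")


if __name__ == "__main__":
    app()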
python train.py --seed 42
+
At the end of the training, the pipeline is ready to use (with the .pipe
method) since every trained component of the pipeline is self-sufficient, i.e., it contains the preprocessing, inference and postprocessing code required to run it.
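For instance, a minimal sketch of applying the trained pipeline to a folder of raw PDFs (the path is illustrative):

from pathlib import Path

pdfs = [p.read_bytes() for p in Path("data/test").glob("*.pdf")]
predicted = list(model.pipe(pdfs))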
To decouple the configuration and the code of our training script, let's define a configuration file where we will describe both our training parameters and the pipeline. You can either write the config of the pipeline by hand, or generate it from an instantiated pipeline by running:
+print(pipeline.config.to_str())
+
# This is the equivalent of the API-based declaration at the beginning of the tutorial
+[pipeline]
+pipeline = ["extractor", "embedding", "classifier"]
+disabled = []
+components = ${components}
+
+[components]
+
+[components.extractor]
+@factory = "pdfminer-extractor"
+
+[components.embedding]
+@factory = "box-transformer"
+num_heads = 4
+dropout_p = 0.1
+activation = "gelu"
+init_resweight = 0.01
+head_size = 16
+attention_mode = ["c2c", "c2p", "p2c"]
+n_layers = 1
+n_relative_positions = 64
+
+[components.embedding.embedding]
+@factory = "embedding-combiner"
+dropout_p = 0.1
+
+[components.embedding.embedding.text_encoder]
+@factory = "sub-box-cnn-pooler"
+out_channels = 64
+kernel_sizes = (3, 4, 5)
+
+[components.embedding.embedding.text_encoder.embedding]
+@factory = "simple-text-embedding"
+size = 72
+
+[components.embedding.embedding.layout_encoder]
+@factory = "box-layout-embedding"
+n_positions = 64
+x_mode = "learned"
+y_mode = "learned"
+w_mode = "learned"
+h_mode = "learned"
+size = 72
+
+[components.classifier]
+@factory = "trainable-classifier"
+embedding = ${components.embedding}
+labels = []
+
+# This is where we define the training script parameters
+# the "train" section refers to the name of the command in the training script
+[train]
+model = ${pipeline}
+train_data = {"@adapter": "my-segmentation-adapter", "path": "data/train"}
+val_data = {"@adapter": "my-segmentation-adapter", "path": "data/val"}
+max_steps = 1000
+seed = 42
+lr = 3e-4
+batch_size = 4
+
[pipeline]
+pipeline = ["extractor", "embedding", "classifier"]
+disabled = []
+components = ${components}
+
+[components]
+
+[components.extractor]
+@factory = "mupdf-extractor"
+render_pages = true
+
+[components.embedding]
+@factory = "huggingface-embedding"
+model = "microsoft/layoutlmv3-base"
+use_image = false
+window = 128
+stride = 64
+line_pooling = "mean"
+
+[components.classifier]
+@factory = "trainable-classifier"
+embedding = ${components.embedding}
+labels = []
+
+[train]
+model = ${pipeline}
+max_steps = 1000
+lr = 5e-5
+seed = 42
+train_data = {"@adapter": "my-segmentation-adapter", "path": "data/train"}
+val_data = {"@adapter": "my-segmentation-adapter", "path": "data/val"}
+batch_size = 8
+
and update our training script to use the pipeline and the data adapters defined in the configuration file instead of the Python declaration:
+@app.command(name="train")
+def train_my_model(
++ model: Pipeline,
++ train_path: DirectoryPath = "data/train",
+- train_data: Callable = segmentation_adapter("data/train"),
++ val_path: DirectoryPath = "data/val",
+- val_data: Callable = segmentation_adapter("data/val"),
+ seed: int = 42,
+ max_steps: int = 1000,
+ batch_size: int = 4,
+ lr: float = 3e-4,
+):
+ # Seed will be set by the CLI util, before `model` is instantiated
+- set_seed(seed)
+
+ # Model will be defined from the config file using registries
+- model = Pipeline()
+- model.add_pipe("mupdf-extractor", name="extractor")
+- model.add_pipe(
+- "box-transformer",
+- name="embedding",
+- config={
+- "num_heads": 4,
+- "dropout_p": 0.1,
+- "activation": "gelu",
+- "init_resweight": 0.01,
+- "head_size": 16,
+- "attention_mode": ["c2c", "c2p", "p2c"],
+- "n_layers": 1,
+- "n_relative_positions": 64,
+- "embedding": {
+- "@factory": "embedding-combiner",
+- "dropout_p": 0.1,
+- "text_encoder": {
+- "@factory": "sub-box-cnn-pooler",
+- "out_channels": 64,
+- "kernel_sizes": (3, 4, 5),
+- "embedding": {
+- "@factory": "simple-text-embedding",
+- "size": 72,
+- },
+- },
+- "layout_encoder": {
+- "@factory": "box-layout-embedding",
+- "n_positions": 64,
+- "x_mode": "learned",
+- "y_mode": "learned",
+- "w_mode": "learned",
+- "h_mode": "learned",
+- "size": 72,
+- },
+- },
+- },
+- )
+- model.add_pipe(
+- "trainable-classifier",
+- name="classifier",
+- config={
+- "embedding": model.get_pipe("embedding"),
+- "labels": [],
+- },
+- )
+
+ # Loading and adapting the training and validation data
+- train_docs = list(segmentation_adapter(train_path)(model))
++ train_docs = list(train_data(model))
+- val_docs = list(segmentation_adapter(val_path)(model))
++ val_docs = list(val_data(model))
+
+ # Taking the first `initialization_subset` samples to initialize the model
+ ...
+
That's it! We can now call the training script with the configuration file as a parameter, and override some of its default values:
+python train.py --config config.cfg --components.extractor.extract_styles=true --seed 43
+
edspdf.accelerators.base
FromDoc
+
+
+A FromDoc converter (from a PDFDoc to an arbitrary type) can be either:
+edspdf.accelerators
edspdf.accelerators.multiprocessing
MultiprocessingAccelerator
+
+
+ Bases: Accelerator
If you have multiple CPU cores, and optionally multiple GPUs, we provide a
+multiprocessing
accelerator that allows you to run inference on multiple
+processes.
This accelerator dispatches the batches between multiple workers (data-parallelism), and distributes the computation of a given batch on one or two workers (model-parallelism). This is done by creating two types of workers:
+CPUWorker
which handles the non deep-learning components and the
+ preprocessing, collating and postprocessing of deep-learning componentsGPUWorker
which handles the forward call of the deep-learning componentsThe advantage of dedicating a worker to the deep-learning components is that it
+allows to prepare multiple batches in parallel in multiple CPUWorker
, and ensure
+that the GPUWorker
never wait for a batch to be ready.
The overall architecture described in the following figure, for 3 CPU workers and 2 +GPU workers.
+ + +Here is how a small pipeline with rule-based components and deep-learning components +is distributed between the workers:
+ +docs = list(
+ pipeline.pipe(
+ [content1, content2, ...],
+ accelerator={
+ "@accelerator": "multiprocessing",
+ "num_cpu_workers": 3,
+ "num_gpu_workers": 2,
+ "batch_size": 8,
+ },
+ )
+)
+
PARAMETER | +DESCRIPTION | +
---|---|
batch_size |
+
+ Number of documents to process at a time in a CPU/GPU worker +
+
+ TYPE:
+ |
+
num_cpu_workers |
+
+ Number of CPU workers. A CPU worker handles the non deep-learning components +and the preprocessing, collating and postprocessing of deep-learning components. +
+
+ TYPE:
+ |
+
num_gpu_workers |
+
+ Number of GPU workers. A GPU worker handles the forward call of the +deep-learning components. +
+
+ TYPE:
+ |
+
gpu_pipe_names |
+
+ List of pipe names to accelerate on a GPUWorker, defaults to all pipes +that inherit from TrainablePipe +
+
+ TYPE:
+ |
+
__call__
+
+Stream of documents to process. Each document can be a string or a tuple
+ +PARAMETER | +DESCRIPTION | +
---|---|
inputs |
+
+
+
+
+ TYPE:
+ |
+
model |
+
+
+
+
+ TYPE:
+ |
+
YIELDS | +DESCRIPTION | +
---|---|
+
+ Any
+
+ |
+
+
+
+ Processed outputs of the pipeline + |
+
edspdf.accelerators.simple
SimpleAccelerator
+
+
+ Bases: Accelerator
This is the simplest accelerator which batches the documents and process each batch
+on the main process (the one calling .pipe()
).
docs = list(pipeline.pipe([content1, content2, ...]))
+
or, if you want to override the model defined batch size
+docs = list(pipeline.pipe([content1, content2, ...], batch_size=8))
+
which is equivalent to passing a confit dict
+docs = list(
+ pipeline.pipe(
+ [content1, content2, ...],
+ accelerator={
+ "@accelerator": "simple",
+ "batch_size": 8,
+ },
+ )
+)
+
or the instantiated accelerator directly
+from edspdf.accelerators.simple import SimpleAccelerator
+
+accelerator = SimpleAccelerator(batch_size=8)
+docs = list(pipeline.pipe([content1, content2, ...], accelerator=accelerator))
+
If you have a GPU, make sure to move the model to the appropriate device before
+calling .pipe()
. If you have multiple GPUs, use the
+multiprocessing
+accelerator instead.
pipeline.to("cuda")
+docs = list(pipeline.pipe([content1, content2, ...]))
+
PARAMETER | +DESCRIPTION | +
---|---|
batch_size |
+
+ The number of documents to process in each batch. +
+
+ TYPE:
+ |
+
edspdf
edspdf.layers.box_transformer
BoxTransformerLayer
+
+
+ Bases: Module
BoxTransformerLayer combining a self attention layer and a +linear->activation->linear transformation. This layer is used in the +BoxTransformerModule module.
+ + + +PARAMETER | +DESCRIPTION | +
---|---|
input_size |
+
+ Input embedding size +
+
+ TYPE:
+ |
+
num_heads |
+
+ Number of attention heads in the attention layer +
+
+ TYPE:
+ |
+
dropout_p |
+
+ Dropout probability both for the attention layer and embedding projections +
+
+ TYPE:
+ |
+
head_size |
+
+ Head sizes of the attention layer +
+
+ TYPE:
+ |
+
activation |
+
+ Activation function used in the linear->activation->linear transformation +
+
+ TYPE:
+ |
+
init_resweight |
+
+ Initial weight of the residual gates. +At 0, the layer acts (initially) as an identity function, and at 1 as +a standard Transformer layer. +Initializing with a value close to 0 can help the training converge. +
+
+ TYPE:
+ |
+
attention_mode |
+
+ Mode of relative position infused attention layer. +See the +relative attention +documentation for more information. +
+
+ TYPE:
+ |
+
position_embedding |
+
+ Position embedding to use as key/query position embedding in the attention +computation. +
+
+ TYPE:
+ |
+
forward
+
+Forward pass of the BoxTransformerLayer
+ +PARAMETER | +DESCRIPTION | +
---|---|
embeds |
+
+ Embeddings to contextualize
+Shape:
+
+ TYPE:
+ |
+
mask |
+
+ Mask of the embeddings. 0 means padding element.
+Shape:
+
+ TYPE:
+ |
+
relative_positions |
+
+ Position of the keys relatively to the query elements
+Shape:
+
+ TYPE:
+ |
+
no_position_mask |
+
+ Key / query pairs for which the position attention terms should
+be disabled.
+Shape:
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Tuple[FloatTensor, FloatTensor]
+
+ |
+
+
+
+
|
+
BoxTransformerModule
+
+
+ Bases: Module
Box Transformer architecture combining a multiple +BoxTransformerLayer +modules. It is mainly used in +BoxTransformer.
+ +PARAMETER | +DESCRIPTION | +
---|---|
input_size |
+
+ Input embedding size +
+
+ TYPE:
+ |
+
num_heads |
+
+ Number of attention heads in the attention layers +
+
+ TYPE:
+ |
+
n_relative_positions |
+
+ Maximum range of embeddable relative positions between boxes (further +distances are capped to ±n_relative_positions // 2) +
+
+ TYPE:
+ |
+
dropout_p |
+
+ Dropout probability both for the attention layers and embedding projections +
+
+ TYPE:
+ |
+
head_size |
+
+ Head sizes of the attention layers +
+
+ TYPE:
+ |
+
activation |
+
+ Activation function used in the linear->activation->linear transformations +
+
+ TYPE:
+ |
+
init_resweight |
+
+ Initial weight of the residual gates. +At 0, the layer acts (initially) as an identity function, and at 1 as +a standard Transformer layer. +Initializing with a value close to 0 can help the training converge. +
+
+ TYPE:
+ |
+
attention_mode |
+
+ Mode of relative position infused attention layer. +See the +relative attention +documentation for more information. +
+
+ TYPE:
+ |
+
n_layers |
+
+ Number of layers in the Transformer +
+
+ TYPE:
+ |
+
forward
+
+Forward pass of the BoxTransformer
+ +PARAMETER | +DESCRIPTION | +
---|---|
embeds |
+
+ Embeddings to contextualize
+Shape:
+
+ TYPE:
+ |
+
boxes |
+
+ Layout features of the input elements +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Tuple[FloatTensor, List[FloatTensor]]
+
+ |
+
+
+
+
|
+
edspdf.layers
edspdf.layers.relative_attention
RelativeAttention
+
+
+ Bases: Module
A self/cross-attention layer that takes relative position of elements into +account to compute the attention weights. +When running a relative attention layer, key and queries are represented using +content and position embeddings, where position embeddings are retrieved using +the relative position of keys relative to queries
+ + + +PARAMETER | +DESCRIPTION | +
---|---|
size |
+
+ The size of the output embeddings +Also serves as default if query_size, pos_size, or key_size is None +
+
+ TYPE:
+ |
+
n_heads |
+
+ The number of attention heads +
+
+ TYPE:
+ |
+
query_size |
+
+ The size of the query embeddings. +
+
+ TYPE:
+ |
+
key_size |
+
+ The size of the key embeddings. +
+
+ TYPE:
+ |
+
value_size |
+
+ The size of the value embeddings +
+
+ TYPE:
+ |
+
head_size |
+
+ The size of each query / key / value chunk used in the attention dot product
+Default:
+
+ TYPE:
+ |
+
position_embedding |
+
+ The position embedding used as key and query embeddings +
+
+ TYPE:
+ |
+
dropout_p |
+
+ Dropout probability applied on the attention weights +Default: 0.1 +
+
+ TYPE:
+ |
+
same_key_query_proj |
+
+ Whether to use the same projection operator for content key and queries +when computing the pre-attention key and query embedding chunks +Default: False +
+
+ TYPE:
+ |
+
same_positional_key_query_proj |
+
+ Whether to use the same projection operator for content key and queries +when computing the pre-attention key and query embedding chunks +Default: False +
+
+ TYPE:
+ |
+
n_coordinates |
+
+ The number of positional coordinates +For instance, text is 1D so 1 coordinate, images are 2D so 2 coordinates ... +Default: 1 +
+
+ TYPE:
+ |
+
head_bias |
+
+ Whether to learn a bias term to add to the attention logits +This is only useful if you plan to use the attention logits for subsequent +operations, since attention weights are unaffected by bias terms. +
+
+ TYPE:
+ |
+
do_pooling |
+
+ Whether to compute the output embedding. +If you only plan to use attention logits, you should disable this parameter. +Default: True +
+
+ TYPE:
+ |
+
mode |
+
+ Whether to compute content to content (c2c), content to position (c2p)
+or position to content (p2c) attention terms.
+Setting
+
+ TYPE:
+ |
+
n_additional_heads |
+
+ The number of additional head logits to compute. +Those are not used to compute output embeddings, but may be useful in +subsequent operation. +Default: 0 +
+
+ TYPE:
+ |
+
forward
+
+Forward pass of the RelativeAttention layer.
+ +PARAMETER | +DESCRIPTION | +
---|---|
content_queries |
+
+ The content query embedding to use in the attention computation
+Shape:
+
+ TYPE:
+ |
+
content_keys |
+
+ The content key embedding to use in the attention computation.
+If None, defaults to the
+
+ TYPE:
+ |
+
content_values |
+
+ The content values embedding to use in the final pooling computation.
+If None, pooling won't be performed.
+Shape:
+
+ TYPE:
+ |
+
mask |
+
+ The content key embedding to use in the attention computation.
+If None, defaults to the
+
+ TYPE:
+ |
+
relative_positions |
+
+ The relative position of keys relative to queries
+If None, positional attention terms won't be computed.
+Shape:
+
+ TYPE:
+ |
+
no_position_mask |
+
+ Key / query pairs for which the position attention terms should
+be disabled.
+Shape:
+
+ TYPE:
+ |
+
base_attn |
+
+ Attention logits to add to the computed attention logits
+Shape:
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Union[Tuple[FloatTensor, FloatTensor], FloatTensor]
+
+ |
+
+
+
+
|
+
edspdf.layers.sinusoidal_embedding
SinusoidalEmbedding
+
+
+ Bases: Module
A position embedding lookup table that stores embeddings for a fixed number
+of positions.
+The value of each of the embedding_dim
channels of the generated embedding
+is generated according to a trigonometric function (sin for even channels,
+cos for odd channels).
+The frequency of the signal in each pair of channels varies according to the
+temperature parameter.
Any input position above the maximum value num_embeddings
will be capped to
+num_embeddings - 1
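As a rough sketch of the idea (this assumes the standard transformer-style sinusoidal formulation with the temperature acting as the frequency base, and is not necessarily the exact implementation), such a table can be built as follows:

import torch

def sinusoidal_table(num_embeddings, embedding_dim, temperature=10000.0):
    # assumes an even embedding_dim
    positions = torch.arange(num_embeddings, dtype=torch.float32)[:, None]
    inv_freq = temperature ** (
        -torch.arange(0, embedding_dim, 2, dtype=torch.float32) / embedding_dim
    )
    angles = positions * inv_freq[None, :]
    table = torch.zeros(num_embeddings, embedding_dim)
    table[:, 0::2] = torch.sin(angles)  # even channels
    table[:, 1::2] = torch.cos(angles)  # odd channels
    return table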
PARAMETER | +DESCRIPTION | +
---|---|
num_embeddings |
+
+ The maximum number of position embeddings store in this table +
+
+ TYPE:
+ |
+
embedding_dim |
+
+ The embedding size +
+
+ TYPE:
+ |
+
temperature |
+
+ The temperature controls the range of frequencies used by each +channel of the embedding +
+
+ TYPE:
+ |
+
forward
+
+Forward pass of the SinusoidalEmbedding module
+ +PARAMETER | +DESCRIPTION | +
---|---|
indices |
+
+ Shape: any +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ FloatTensor
+
+ |
+
+
+
+ Shape: |
+
edspdf.layers.vocabulary
Vocabulary
+
+
+ Bases: Module
, Generic[T]
Vocabulary layer.
+This is not meant to be used as a torch.nn.Module
but subclassing
+torch.nn.Module
makes the instances appear when printing a model, which is nice.
PARAMETER | +DESCRIPTION | +
---|---|
items |
+
+ Initial vocabulary elements if any. +Specific elements such as padding and unk can be set here to enforce their +index in the vocabulary. +
+
+ TYPE:
+ |
+
default |
+
+ Default index to use for out of vocabulary elements +Defaults to -100 +
+
+ TYPE:
+ |
+
initialization
+
+Enters the initialization mode. +Out of vocabulary elements will be assigned an index.
+ +encode
+
+Converts an element into its vocabulary index
+If the layer is in its initialization mode (with vocab.initialization(): ...
),
+and the element is out of vocabulary, a new index will be created and returned.
+Otherwise, any oov element will be encoded with the default
index.
PARAMETER | +DESCRIPTION | +
---|---|
item |
+
+
+ + + |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ int
+
+ |
+
+
+
+
+ |
+
decode
+
+Converts an index into its original value
+ +PARAMETER | +DESCRIPTION | +
---|---|
idx |
+
+
+ + + |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ InputT
+
+ |
+
+
+
+
+ |
+
edspdf.pipeline
Pipeline
+
+Pipeline to build hybrid and multitask PDF processing pipelines. It uses PyTorch as the deep-learning backend and allows components to share subcomponents.
+See the documentation for more details.
+ + + +PARAMETER | +DESCRIPTION | +
---|---|
batch_size |
+
+ Batch size to use in the
+
+ TYPE:
+ |
+
meta |
+
+ Meta information about the pipeline +
+
+ TYPE:
+ |
+
disabled
+
+
+ property
+
+
+The names of the disabled components
+cfg: Config
+
+
+ property
+
+
+Returns the config of the pipeline, including the config of all components. +Updated from spacy to allow references between components.
+get_pipe
+
+Get a component by its name.
+ +PARAMETER | +DESCRIPTION | +
---|---|
name |
+
+ The name of the component to get. +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Pipe
+
+ |
+
+
+
+
+ |
+
has_pipe
+
+Check if a component exists in the pipeline.
+ +PARAMETER | +DESCRIPTION | +
---|---|
name |
+
+ The name of the component to check. +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ bool
+
+ |
+
+
+
+
+ |
+
create_pipe
+
+Create a component from a factory name.
+ +PARAMETER | +DESCRIPTION | +
---|---|
factory |
+
+ The name of the factory to use +
+
+ TYPE:
+ |
+
name |
+
+ The name of the component +
+
+ TYPE:
+ |
+
config |
+
+ The config to pass to the factory +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Pipe
+
+ |
+
+
+
+
+ |
+
add_pipe
+
+Add a component to the pipeline.
+ +PARAMETER | +DESCRIPTION | +
---|---|
factory |
+
+ The name of the component to add or the component itself +
+
+ TYPE:
+ |
+
name |
+
+ The name of the component. If not provided, the name of the component +will be used if it has one (.name), otherwise the factory name will be used. +
+
+ TYPE:
+ |
+
first |
+
+ Whether to add the component to the beginning of the pipeline. This argument
+is mutually exclusive with
+
+ TYPE:
+ |
+
before |
+
+ The name of the component to add the new component before. This argument is
+mutually exclusive with
+
+ TYPE:
+ |
+
after |
+
+ The name of the component to add the new component after. This argument is
+mutually exclusive with
+
+ TYPE:
+ |
+
config |
+
+ The arguments to pass to the component factory. +Note that instead of replacing arguments with the same keys, the config +will be merged with the default config of the component. This means that +you can override specific nested arguments without having to specify the +entire config. +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Pipe
+
+ |
+
+
+
+ The component that was added to the pipeline. + |
+
__call__
+
+Apply each component successively on a document.
+ +PARAMETER | +DESCRIPTION | +
---|---|
doc |
+
+ The doc to create the PDFDoc from, or a PDFDoc. +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ PDFDoc
+
+ |
+
+
+
+
+ |
+
pipe
+
+Process a stream of documents by applying each component successively on +batches of documents.
+ +PARAMETER | +DESCRIPTION | +
---|---|
inputs |
+
+ The inputs to create the PDFDocs from, or the PDFDocs directly. +
+
+ TYPE:
+ |
+
batch_size |
+
+ The batch size to use. If not provided, the batch size of the pipeline +object will be used. +
+
+ TYPE:
+ |
+
accelerator |
+
+ The accelerator to use for processing the documents. If not provided, +the default accelerator will be used. +
+
+ TYPE:
+ |
+
to_doc |
+
+ The function to use to convert the inputs to PDFDoc objects. By default,
+the
+
+ TYPE:
+ |
+
from_doc |
+
+ The function to use to convert the PDFDoc objects to outputs. By default, +the PDFDoc objects will be returned directly. +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Iterable[PDFDoc]
+
+ |
+
+
+
+
+ |
+
cache
+
+Enable caching for all (trainable) components in the pipeline
+ +trainable_pipes
+
+Yields components that are PyTorch modules.
+ +PARAMETER | +DESCRIPTION | +
---|---|
disable |
+
+ The names of disabled components, which will be skipped. +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Iterable[Tuple[str, TrainablePipe]]
+
+ |
+
+
+
+
+ |
+
post_init
+
+Completes the initialization of the pipeline by calling the post_init +method of all components that have one. +This is useful for components that need to see some data to build +their vocabulary, for instance.
+ +PARAMETER | +DESCRIPTION | +
---|---|
gold_data |
+
+ The documents to use for initialization. +Each component will not necessarily see all the data. +
+
+ TYPE:
+ |
+
exclude |
+
+ The names of components to exclude from initialization. +This argument will be gradually updated with the names of initialized +components +
+
+ TYPE:
+ |
+
from_config
+
+
+ classmethod
+
+
+Create a pipeline from a config object
+ +PARAMETER | +DESCRIPTION | +
---|---|
config |
+
+ The config to use +
+
+ TYPE:
+ |
+
disable |
+
+ Components to disable +
+
+ TYPE:
+ |
+
enable |
+
+ Components to enable +
+
+ TYPE:
+ |
+
exclude |
+
+ Components to exclude +
+
+ TYPE:
+ |
+
meta |
+
+ Metadata to add to the pipeline +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Pipeline
+
+ |
+
+
+
+
+ |
+
__get_validators__
+
+
+ classmethod
+
+
+Pydantic validators generator
+ +validate
+
+
+ classmethod
+
+
+Pydantic validator, used in the validate_arguments
decorated functions
preprocess
+
+Run the preprocessing methods of each component in the pipeline +on a document and returns a dictionary containing the results, with the +component names as keys.
+ +PARAMETER | +DESCRIPTION | +
---|---|
doc |
+
+ The document to preprocess +
+
+ TYPE:
+ |
+
supervision |
+
+ Whether to include supervision information in the preprocessing +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Dict[str, Any]
+
+ |
+
+
+
+
+ |
+
preprocess_many
+
+Runs the preprocessing methods of each component in the pipeline on +a collection of documents and returns an iterable of dictionaries containing +the results, with the component names as keys.
+ +PARAMETER | +DESCRIPTION | +
---|---|
docs |
+
+
+
+
+ TYPE:
+ |
+
compress |
+
+ Whether to deduplicate identical preprocessing outputs of the results +if multiple documents share identical subcomponents. This step is required +to enable the cache mechanism when training or running the pipeline over a +tabular datasets such as pyarrow tables that do not store referential +equality information. +
+
+ DEFAULT:
+ |
+
supervision |
+
+ Whether to include supervision information in the preprocessing +
+
+ DEFAULT:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Iterable[OutputT]
+
+ |
+
+
+
+
+ |
+
collate
+
+Collates a batch of preprocessed samples into a single (maybe nested) +dictionary of tensors by calling the collate method of each component.
+ +PARAMETER | +DESCRIPTION | +
---|---|
batch |
+
+ The batch of preprocessed samples +
+
+ TYPE:
+ |
+
device |
+
+ The device to move the tensors to before returning them +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Dict[str, Any]
+
+ |
+
+
+
+ The collated batch + |
+
parameters
+
+Returns an iterator over the Pytorch parameters of the components in the +pipeline
+ +named_parameters
+
+Returns an iterator over the Pytorch parameters of the components in the +pipeline
+ +to
+
+Moves the pipeline to a given device
+ +train
+
+Enables training mode on pytorch modules
+ +PARAMETER | +DESCRIPTION | +
---|---|
mode |
+
+ Whether to enable training or not +
+
+ DEFAULT:
+ |
+
save
+
+Save the pipeline to a directory.
+ +PARAMETER | +DESCRIPTION | +
---|---|
path |
+
+ The path to the directory to save the pipeline to. Every component will be +saved to separated subdirectories of this directory, except for tensors +that will be saved to a shared files depending on the references between +the components. +
+
+ TYPE:
+ |
+
exclude |
+
+ The names of the components, or attributes to exclude from the saving +process. This list will be gradually filled in place as components are +saved +
+
+ TYPE:
+ |
+
load_state_from_disk
+
+Load the pipeline from a directory. Components will be updated in-place.
+ +PARAMETER | +DESCRIPTION | +
---|---|
path |
+
+ The path to the directory to load the pipeline from +
+
+ TYPE:
+ |
+
exclude |
+
+ The names of the components, or attributes to exclude from the loading +process. This list will be gradually filled in place as components are +loaded +
+
+ TYPE:
+ |
+
select_pipes
+
+Temporarily disable and enable components in the pipeline.
+ +PARAMETER | +DESCRIPTION | +
---|---|
disable |
+
+ The name of the component to disable, or a list of names. +
+
+ TYPE:
+ |
+
enable |
+
+ The name of the component to enable, or a list of names. +
+
+ TYPE:
+ |
+
edspdf.pipes.aggregators
edspdf.pipes.aggregators.simple
SimpleAggregator
+
+Aggregator that returns texts and styles. It groups all text boxes with the same
+label under the aggregated_text
, and additionally aggregates the
+styles of the text boxes.
Create a pipeline
+pipeline = ...
+pipeline.add_pipe(
+ "simple-aggregator",
+ name="aggregator",
+ config={
+ "new_line_threshold": 0.2,
+ "new_paragraph_threshold": 1.5,
+ "label_map": {
+ "body": "text",
+ "table": "text",
+ },
+ },
+)
+
...
+
+[components.aggregator]
+@factory = "simple-aggregator"
+new_line_threshold = 0.2
+new_paragraph_threshold = 1.5
+label_map = { body = "text", table = "text" }
+
+...
+
and run it on a document:
+doc = pipeline(doc)
+print(doc.aggregated_texts)
+# {
+# "text": "This is the body of the document, followed by a table | A | B |"
+# }
+
PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ The pipeline object +
+
+ TYPE:
+ |
+
name |
+
+ The name of the component +
+
+ TYPE:
+ |
+
sort |
+
+ Whether to sort text boxes inside each label group by (page, y, x) position +before merging them. +
+
+ TYPE:
+ |
+
new_line_threshold |
+
+ Minimum ratio of the distance between two lines to the median height of +lines to consider them as being on separate lines +
+
+ TYPE:
+ |
+
new_paragraph_threshold |
+
+ Minimum ratio of the distance between two lines to the median height of +lines to consider them as being on separate paragraphs and thus add a +newline character between them. +
+
+ TYPE:
+ |
+
label_map |
+
+ A dictionary mapping labels to new labels. This is useful to group labels +together, for instance, to output both "body" and "table" as "text". +
+
+ TYPE:
+ |
+
edspdf.pipes.classifiers.dummy
DummyClassifier
+
+Dummy classifier, for chaos purposes. Classifies each line to a random element.
+ +PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ The pipeline object. +
+
+ TYPE:
+ |
+
name |
+
+ The name of the component. +
+
+ TYPE:
+ |
+
label |
+
+ The label to assign to each line. +
+
+ TYPE:
+ |
+
edspdf.pipes.classifiers
edspdf.pipes.classifiers.mask
MaskClassifier
+
+Simple mask classifier, that labels every box inside one of the masks +with its label.
+ + + + + +simple_mask_classifier_factory
+
+The simplest form of mask classification. You define the mask, everything else +is tagged as pollution.
+ +PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ The pipeline object +
+
+ TYPE:
+ |
+
name |
+
+ The name of the component +
+
+ TYPE:
+ |
+
x0 |
+
+ The x0 coordinate of the mask +
+
+ TYPE:
+ |
+
y0 |
+
+ The y0 coordinate of the mask +
+
+ TYPE:
+ |
+
x1 |
+
+ The x1 coordinate of the mask +
+
+ TYPE:
+ |
+
y1 |
+
+ The y1 coordinate of the mask +
+
+ TYPE:
+ |
+
threshold |
+
+ The threshold for the alignment +
+
+ TYPE:
+ |
+
pipeline.add_pipe(
+ "mask-classifier",
+ name="classifier",
+ config={
+ "threshold": 0.9,
+ "x0": 0.1,
+ "y0": 0.1,
+ "x1": 0.9,
+ "y1": 0.9,
+ },
+)
+
[components.classifier]
+@classifiers = "mask-classifier"
+x0 = 0.1
+y0 = 0.1
+x1 = 0.9
+y1 = 0.9
+threshold = 0.9
+
mask_classifier_factory
+
+A generalisation, wherein the user defines a number of regions.
+The following configuration produces exactly the same classifier as mask.v1
+example above.
Any bloc that is not part of a mask is tagged as pollution
.
PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ The pipeline object +
+
+ TYPE:
+ |
+
name |
+
+
+
+
+ TYPE:
+ |
+
threshold |
+
+ The threshold for the alignment +
+
+ TYPE:
+ |
+
masks |
+
+ The masks +
+
+ TYPE:
+ |
+
pipeline.add_pipe(
+ "multi-mask-classifier",
+ name="classifier",
+ config={
+ "threshold": 0.9,
+ "mymask": {"x0": 0.1, "y0": 0.1, "x1": 0.9, "y1": 0.3, "label": "body"},
+ },
+)
+
[components.classifier]
+@factory = "multi-mask-classifier"
+threshold = 0.9
+
+[components.classifier.mymask]
+label = "body"
+x0 = 0.1
+y0 = 0.1
+x1 = 0.9
+y1 = 0.9
+
The following configuration defines a header region and a body region.
pipeline.add_pipe(
+ "multi-mask-classifier",
+ name="classifier",
+ config={
+ "threshold": 0.9,
+ "body": {"x0": 0.1, "y0": 0.1, "x1": 0.9, "y1": 0.3, "label": "header"},
+ "header": {"x0": 0.1, "y0": 0.3, "x1": 0.9, "y1": 0.9, "label": "body"},
+ },
+)
+
[components.classifier]
+@factory = "multi-mask-classifier"
+threshold = 0.9
+
+[components.classifier.header]
+label = "header"
+x0 = 0.1
+y0 = 0.1
+x1 = 0.9
+y1 = 0.3
+
+[components.classifier.body]
+label = "body"
+x0 = 0.1
+y0 = 0.3
+x1 = 0.9
+y1 = 0.9
+
edspdf.pipes.classifiers.random
RandomClassifier
+
+Random classifier, for chaos purposes. Classifies each box to a random element.
+ +PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ The pipeline object. +
+
+ TYPE:
+ |
+
name |
+
+ The name of the component. +
+
+ TYPE:
+ |
+
labels |
+
+ The labels to assign to each line. If a list is passed, each label is assigned +with equal probability. If a dict is passed, the keys are the labels and the +values are the probabilities. +
+
+ TYPE:
+ |
+
edspdf.pipes.classifiers.trainable
TrainableClassifier
+
+
+ Bases: TrainablePipe[Dict[str, Any]]
This component predicts a label for each box over the whole document using machine +learning.
+Note
+You must train the model your model to use this classifier. +See Model training for more information
+The classifier is composed of the following blocks:
+In this example, we use a box-embedding
layer to generate the embeddings
+of the boxes. It is composed of a text encoder that embeds the text features of the
+boxes and a layout encoder that embeds the layout features of the boxes.
+These two embeddings are summed and passed through an optional contextualizer
,
+here a box-transformer
.
pipeline.add_pipe(
+ "trainable-classifier",
+ name="classifier",
+ config={
+ # simple embedding computed by pooling embeddings of words in each box
+ "embedding": {
+ "@factory": "sub-box-cnn-pooler",
+ "out_channels": 64,
+ "kernel_sizes": (3, 4, 5),
+ "embedding": {
+ "@factory": "simple-text-embedding",
+ "size": 72,
+ },
+ },
+ "labels": ["body", "pollution"],
+ },
+)
+
[components.classifier]
+@factory = "trainable-classifier"
+labels = ["body", "pollution"]
+
+[components.classifier.embedding]
+@factory = "sub-box-cnn-pooler"
+out_channels = 64
+kernel_sizes = (3, 4, 5)
+
+[components.classifier.embedding.embedding]
+@factory = "simple-text-embedding"
+size = 72
+
PARAMETER | +DESCRIPTION | +
---|---|
labels |
+
+ Initial labels of the classifier (will be completed during initialization) +
+
+ TYPE:
+ |
+
embedding |
+
+ Embedding module to encode the PDF boxes +
+
+ TYPE:
+ |
+
edspdf.pipes.embeddings.box_layout_embedding
BoxLayoutEmbedding
+
+
+ Bases: TrainablePipe[EmbeddingOutput]
This component encodes the geometrical features of a box, as extracted by the +BoxLayoutPreprocessor module, into an embedding. For position modes, use:
+"sin"
to embed positions with a fixed
+ SinusoidalEmbedding"learned"
to embed positions using a learned standard pytorch embedding layerEach produces embedding is the concatenation of the box width, height and the top,
+left, bottom and right coordinates, each embedded depending on the *_mode
param.
PARAMETER | +DESCRIPTION | +
---|---|
size |
+
+ Size of the output box embedding +
+
+ TYPE:
+ |
+
n_positions |
+
+ Number of position embeddings stored in the PositionEmbedding module +
+
+ TYPE:
+ |
+
x_mode |
+
+ Position embedding mode of the x coordinates +
+
+ TYPE:
+ |
+
y_mode |
+
+ Position embedding mode of the x coordinates +
+
+ TYPE:
+ |
+
w_mode |
+
+ Position embedding mode of the width features +
+
+ TYPE:
+ |
+
h_mode |
+
+ Position embedding mode of the height features +
+
+ TYPE:
+ |
+
edspdf.pipes.embeddings.box_layout_preprocessor
BoxLayoutPreprocessor
+
+
+ Bases: TrainablePipe[BoxLayoutBatch]
The box preprocessor is singleton since its is not configurable. +The following features of each box of an input PDFDoc document are encoded +as 1D tensors:
+boxes_page
: page index of the boxboxes_first_page
: is the box on the first pageboxes_last_page
: is the box on the last pageboxes_xmin
: left position of the boxboxes_ymin
: bottom position of the boxboxes_xmax
: right position of the boxboxes_ymax
: top position of the boxboxes_w
: width position of the boxboxes_h
: height position of the boxThe preprocessor also returns an additional tensors:
+page_boxes_id
: box indices per page to index the
+ above 1D tensors (LongTensor: n_pages * n_boxes)edspdf.pipes.embeddings.box_transformer
BoxTransformer
+
+
+ Bases: TrainablePipe[EmbeddingOutput]
BoxTransformer using +BoxTransformerModule +under the hood.
+Note
+This module is a TrainablePipe +and can be used in a Pipeline, while +BoxTransformerModule +is a standard PyTorch module, which does not take care of the +preprocessing, collating, etc. of the input documents.
+PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ Pipeline instance +
+
+ TYPE:
+ |
+
name |
+
+ Name of the component +
+
+ TYPE:
+ |
+
num_heads |
+
+ Number of attention heads in the attention layers +
+
+ TYPE:
+ |
+
n_relative_positions |
+
+ Maximum range of embeddable relative positions between boxes (further +distances are capped to ±n_relative_positions // 2) +
+
+ TYPE:
+ |
+
dropout_p |
+
+ Dropout probability both for the attention layers and embedding projections +
+
+ TYPE:
+ |
+
head_size |
+
+ Head sizes of the attention layers +
+
+ TYPE:
+ |
+
activation |
+
+ Activation function used in the linear->activation->linear transformations +
+
+ TYPE:
+ |
+
init_resweight |
+
+ Initial weight of the residual gates. +At 0, the layer acts (initially) as an identity function, and at 1 as +a standard Transformer layer. +Initializing with a value close to 0 can help the training converge. +
+
+ TYPE:
+ |
+
attention_mode |
+
+ Mode of relative position infused attention layer. +See the relative attention +documentation for more information. +
+
+ TYPE:
+ |
+
n_layers |
+
+ Number of layers in the Transformer +
+
+ TYPE:
+ |
+
edspdf.pipes.embeddings.embedding_combiner
EmbeddingCombiner
+
+
+ Bases: TrainablePipe[EmbeddingOutput]
Encodes boxes using a combination of multiple encoders
+ +PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ The pipeline object +
+
+ TYPE:
+ |
+
name |
+
+ The name of the pipe +
+
+ TYPE:
+ |
+
mode |
+
+ The mode to use to combine the encoders: +
+
+ TYPE:
+ |
+
dropout_p |
+
+ Dropout probability used on the output of the box and textual encoders +
+
+ TYPE:
+ |
+
encoders |
+
+ The encoders to use. The keys are the names of the encoders and the values +are the encoders themselves. +
+
+ TYPE:
+ |
+
edspdf.pipes.embeddings.huggingface_embedding
HuggingfaceEmbedding
+
+
+ Bases: TrainablePipe[EmbeddingOutput]
The HuggingfaceEmbeddings component is a wrapper around the Huggingface multi-modal +models. Such pre-trained models should offer better results than a model trained +from scratch. Compared to using the raw Huggingface model, we offer a simple +mechanism to split long documents into strided windows before feeding them to the +model.
+The HuggingfaceEmbedding component splits long documents into smaller windows before
+feeding them to the model. This is done to avoid hitting the maximum number of
+tokens that can be processed by the model on a single device. The window size and
+stride can be configured using the window
and stride
parameters. The default
+values are 510 and 255 respectively, which means that the model will process windows
+of 510 tokens, each separated by 255 tokens. Whenever a token appears in multiple
+windows, the embedding of the "most contextualized" occurrence is used, i.e. the
+occurrence that is the closest to the center of its window.
Here is an overview how this works in a classifier model : +
+Here is an example of how to define a pipeline with the HuggingfaceEmbedding +component:
+from edspdf import Pipeline
+
+model = Pipeline()
+model.add_pipe(
+ "pdfminer-extractor",
+ name="extractor",
+ config={
+ "render_pages": True,
+ },
+)
+model.add_pipe(
+ "huggingface-embedding",
+ name="embedding",
+ config={
+ "model": "microsoft/layoutlmv3-base",
+ "use_image": False,
+ "window": 128,
+ "stride": 64,
+ "line_pooling": "mean",
+ },
+)
+model.add_pipe(
+ "trainable-classifier",
+ name="classifier",
+ config={
+ "embedding": model.get_pipe("embedding"),
+ "labels": [],
+ },
+)
+
This model can then be trained following the +training recipe.
+ +PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ The pipeline instance +
+
+ TYPE:
+ |
+
name |
+
+ The component name +
+
+ TYPE:
+ |
+
model |
+
+ The Huggingface model name or path +
+
+ TYPE:
+ |
+
use_image |
+
+ Whether to use the image or not in the model +
+
+ TYPE:
+ |
+
window |
+
+ The window size to use when splitting long documents into smaller windows +before feeding them to the Transformer model (default: 510 = 512 - 2) +
+
+ TYPE:
+ |
+
stride |
+
+ The stride (distance between windows) to use when splitting long documents into +smaller windows: (default: 510 / 2 = 255) +
+
+ TYPE:
+ |
+
line_pooling |
+
+ The pooling strategy to use when combining the embeddings of the tokens in a +line into a single line embedding +
+
+ TYPE:
+ |
+
max_tokens_per_device |
+
+ The maximum number of tokens that can be processed by the model on a single +device. This does not affect the results but can be used to reduce the memory +usage of the model, at the cost of a longer processing time. +
+
+ TYPE:
+ |
+
edspdf.pipes.embeddings
edspdf.pipes.embeddings.simple_text_embedding
SimpleTextEmbedding
+
+
+ Bases: TrainablePipe[EmbeddingOutput]
A module that embeds the textual features of the blocks
+ + + +PARAMETER | +DESCRIPTION | +
---|---|
size |
+
+ Size of the output box embedding +
+
+ TYPE:
+ |
+
pipeline |
+
+ The pipeline object +
+
+ TYPE:
+ |
+
name |
+
+ Name of the component +
+
+ TYPE:
+ |
+
word_shape
+
+Converts a word into its shape following the algorithm used in the +spaCy library.
+https://github.com/explosion/spaCy/blob/b69d249a/spacy/lang/lex_attrs.py#L118
+ +PARAMETER | +DESCRIPTION | +
---|---|
text |
+
+
+
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ str
+
+ |
+
+
+
+
+ |
+
+
+ The word shape
+
+ |
+
+
+
+
+ |
+
edspdf.pipes.embeddings.sub_box_cnn_pooler
SubBoxCNNPooler
+
+
+ Bases: TrainablePipe[EmbeddingOutput]
One dimension CNN encoding multi-kernel layer.
+Input embeddings are convoluted using linear kernels each parametrized with
+a (window) size of kernel_size[kernel_i]
+The output of the kernels are concatenated together, max-pooled and finally
+projected to a size of output_size
.
PARAMETER | +DESCRIPTION | +
---|---|
pipeline |
+
+ Pipeline instance +
+
+ TYPE:
+ |
+
name |
+
+ Name of the component +
+
+ TYPE:
+ |
+
output_size |
+
+ Size of the output embeddings
+Defaults to the
+
+ TYPE:
+ |
+
out_channels |
+
+ Number of channels +
+
+ TYPE:
+ |
+
kernel_sizes |
+
+ Window size of each kernel +
+
+ TYPE:
+ |
+
activation |
+
+ Activation function to use +
+
+ TYPE:
+ |
+
edspdf.pipes.extractors
edspdf.pipes.extractors.pdfminer
PdfMinerExtractor
+
+We provide a PDF line extractor built on top of +PdfMiner.
+This is the most portable extractor, since it is pure-python and can therefore +be run on any platform. Be sure to have a look at their documentation, +especially the part providing a bird's eye view of the PDF extraction process.
+pipeline.add_pipe(
+ "pdfminer-extractor",
+ config=dict(
+ extract_style=False,
+ ),
+)
+
[components.extractor]
+@factory = "pdfminer-extractor"
+extract_style = false
+
And use the pipeline on a PDF document:
+from pathlib import Path
+
+# Apply on a new document
+pipeline(Path("path/to/your/pdf/document").read_bytes())
+
PARAMETER | +DESCRIPTION | +
---|---|
line_overlap |
+
+ See PDFMiner documentation +
+
+ TYPE:
+ |
+
char_margin |
+
+ See PDFMiner documentation +
+
+ TYPE:
+ |
+
line_margin |
+
+ See PDFMiner documentation +
+
+ TYPE:
+ |
+
word_margin |
+
+ See PDFMiner documentation +
+
+ TYPE:
+ |
+
boxes_flow |
+
+ See PDFMiner documentation +
+
+ TYPE:
+ |
+
detect_vertical |
+
+ See PDFMiner documentation +
+
+ TYPE:
+ |
+
all_texts |
+
+ See PDFMiner documentation +
+
+ TYPE:
+ |
+
extract_style |
+
+ Whether to extract style (font, size, ...) information for each line of +the document. +Default: False +
+
+ TYPE:
+ |
+
render_pages |
+
+ Whether to extract the rendered page as a numpy array in the
+
+ TYPE:
+ |
+
render_dpi |
+
+ DPI to use when rendering the page (defaults to 200) +
+
+ TYPE:
+ |
+
raise_on_error |
+
+ Whether to raise an error if the PDF cannot be parsed. +Default: False +
+
+ TYPE:
+ |
+
edspdf.pipes
edspdf.registry
CurriedFactory
+
+instantiate
+
+We need to support passing in the pipeline object and name to factories from +a config file. Since components can be nested, we need to add them to every +factory in the config.
+ +FactoryRegistry
+
+
+
+ Bases: Registry
A registry that validates the input arguments of the registered functions.
+ + + + + +get
+
+Get the registered function for a given name.
+name (str): The name. +RETURNS (Any): The registered function.
+ +register
+
+This is a convenience wrapper around confit.Registry.register
, that
+curries the function to be registered, allowing to instantiate the class
+later once pipeline
and name
are known.
PARAMETER | +DESCRIPTION | +
---|---|
name |
+
+
+
+
+ TYPE:
+ |
+
func |
+
+
+
+
+ TYPE:
+ |
+
default_config |
+
+
+
+
+ TYPE:
+ |
+
assigns |
+
+
+
+
+ TYPE:
+ |
+
requires |
+
+
+
+
+ TYPE:
+ |
+
retokenizes |
+
+
+
+
+ TYPE:
+ |
+
default_score_weights |
+
+
+
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Callable[[InFunc], InFunc]
+
+ |
+
+
+
+
+ |
+
accepted_arguments
+
+Checks that a function accepts a list of keyword arguments
+ +PARAMETER | +DESCRIPTION | +
---|---|
func |
+
+ Function to check +
+
+ TYPE:
+ |
+
args |
+
+ Argument or list of arguments to check +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ List[str]
+
+ |
+
+
+
+
+ |
+
edspdf.structures
PDFDoc
+
+
+
+ Bases: BaseModel
This is the main data structure of the library to hold PDFs. +It contains the content of the PDF, as well as box annotations and text outputs.
ATTRIBUTE | DESCRIPTION
---|---
`content` | The content of the PDF document. TYPE: `bytes`
`id` | The ID of the PDF document. TYPE: `str`, optional
`pages` | The pages of the PDF document. TYPE: `List[Page]`
`error` | Whether there was an error when processing this PDF document. TYPE: `bool`, optional
`content_boxes` | The content boxes/annotations of the PDF document. TYPE: `List[Union[TextBox, ImageBox]]`
`aggregated_texts` | The aggregated text outputs of the PDF document. TYPE: `Dict[str, Text]`
`text_boxes` | The text boxes of the PDF document. TYPE: `List[TextBox]`
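A short sketch of how these attributes are typically read after running a pipeline; it assumes `pipeline` is a loaded EDS-PDF pipeline and the file path is a placeholder:

```python
from pathlib import Path

doc = pipeline(Path("letter.pdf").read_bytes())

if doc.error:
    print(f"Could not process {doc.id}")
else:
    for page in doc.pages:
        print(page.page_num, page.width, page.height)
    for box in doc.text_boxes:
        print(box.label, box.text)
    # Aggregated outputs, filled by an aggregator component
    body = doc.aggregated_texts["body"]
    print(body.text)
```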
Page
Bases: BaseModel

The `Page` class represents a page of a PDF document.
ATTRIBUTE | DESCRIPTION
---|---
`page_num` | The page number of the page. TYPE: `int`
`width` | The width of the page. TYPE: `float`
`height` | The height of the page. TYPE: `float`
`doc` | The PDF document that this page belongs to. TYPE: `PDFDoc`
`image` | The rendered image of the page, stored as a NumPy array. TYPE: `Optional[ndarray]`
`text_boxes` | The text boxes of the page. TYPE: `List[TextBox]`
TextProperties
Bases: BaseModel

The `TextProperties` class represents the style properties of a span of text in a TextBox.
ATTRIBUTE | DESCRIPTION
---|---
`italic` | Whether the text is italic. TYPE: `bool`
`bold` | Whether the text is bold. TYPE: `bool`
`begin` | The beginning index of the span of text. TYPE: `int`
`end` | The ending index of the span of text. TYPE: `int`
`fontname` | The font name of the span of text. TYPE: `Optional[str]`
Box
Bases: BaseModel

The `Box` class represents a box annotation in a PDF document. It is the base class of TextBox.
ATTRIBUTE | DESCRIPTION
---|---
`doc` | The PDF document that this box belongs to. TYPE: `PDFDoc`
`page_num` | The page number of the box. TYPE: `Optional[int]`
`x0` | The left x-coordinate of the box. TYPE: `float`
`x1` | The right x-coordinate of the box. TYPE: `float`
`y0` | The top y-coordinate of the box. TYPE: `float`
`y1` | The bottom y-coordinate of the box. TYPE: `float`
`label` | The label of the box. TYPE: `Optional[str]`
`page` | The page object that this box belongs to. TYPE: `Page`
Text
Bases: BaseModel

The `Text` class represents a text object that is not bound to any box. It can be used, for example, to store aggregated text from multiple boxes.
ATTRIBUTE | DESCRIPTION
---|---
`text` | The text content. TYPE: `str`
`properties` | The style properties of the text. TYPE: `List[TextProperties]`
TextBox
Bases: Box

The `TextBox` class represents a text box annotation in a PDF document.
ATTRIBUTE | DESCRIPTION
---|---
`text` | The text content of the text box. TYPE: `str`
`props` | The style properties of the text box. TYPE: `List[TextProperties]`
edspdf.trainable_pipe
TrainablePipe
Bases: Module, Generic[OutputBatch]

A TrainablePipe is a component that can be trained and inherits `torch.nn.Module`. You can use it either as a torch module inside a more complex neural network, or as a standalone component in a Pipeline.

In addition to the methods of a torch module, a TrainablePipe adds a few methods to handle preprocessing and collating features, as well as caching intermediate results for components that share a common subcomponent.
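The following is a minimal sketch of how these methods fit together in a custom trainable pipe; the component name, the single feature and the toy linear head are illustrative only and not part of the library:

```python
from typing import Any, Dict, Sequence

import torch

from edspdf import Pipeline, TrainablePipe, registry
from edspdf.structures import PDFDoc


@registry.factory.register("toy-line-counter")
class ToyLineCounter(TrainablePipe):
    def __init__(self, pipeline: Pipeline, name: str):
        super().__init__(pipeline=pipeline, name=name)
        self.linear = torch.nn.Linear(1, 1)

    def preprocess(self, doc: PDFDoc) -> Dict[str, Any]:
        # One feature per document: its number of text lines
        return {"n_lines": len(doc.text_boxes)}

    def collate(self, batch, device: torch.device) -> Dict[str, torch.Tensor]:
        # Stack the per-document features into a single tensor
        n_lines = torch.as_tensor(batch["n_lines"], dtype=torch.float, device=device)
        return {"n_lines": n_lines.unsqueeze(-1)}

    def forward(self, batch: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        # Apply the (toy) neural network on the collated features
        return {"scores": self.linear(batch["n_lines"])}

    def postprocess(self, docs: Sequence[PDFDoc], batch) -> Sequence[PDFDoc]:
        # Write predictions back onto the documents (a no-op in this sketch)
        return docs
```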
save_extra_data

Dumps vocabularies indices to json files.

PARAMETER | DESCRIPTION
---|---
`path` | Path to the directory where the files will be saved
`exclude` | The set of component names to exclude from saving. This is useful when components are repeated in the pipeline.
load_extra_data
Loads vocabularies indices from json files.

PARAMETER | DESCRIPTION
---|---
`path` | Path to the directory where the files will be loaded
`exclude` | The set of component names to exclude from loading. This is useful when components are repeated in the pipeline.
post_init
This method completes the attributes of the component, by looking at some documents. It is especially useful to build vocabularies or detect the labels of a classification task.

PARAMETER | DESCRIPTION
---|---
`gold_data` | The documents to use for initialization. TYPE: `Iterable[PDFDoc]`
`exclude` | The names of components to exclude from initialization. This argument will be gradually updated with the names of initialized components. TYPE: `set`
preprocess
Preprocess the document to extract features that will be used by the neural network to perform its predictions.

PARAMETER | DESCRIPTION
---|---
`doc` | PDFDocument to preprocess. TYPE: `PDFDoc`

RETURNS | DESCRIPTION
---|---
`Dict[str, Any]` | Dictionary (optionally nested) containing the features extracted from the document.
collate
Collate the batch of features into a single batch of tensors that can be used by the forward method of the component.

PARAMETER | DESCRIPTION
---|---
`batch` | Batch of features. TYPE: `NestedSequences`
`device` | Device on which the tensors should be moved. TYPE: `device`

RETURNS | DESCRIPTION
---|---
`InputBatch` | Dictionary (optionally nested) containing the collated tensors
forward
Perform the forward pass of the neural network, i.e. apply transformations over the collated features to compute new embeddings, probabilities, losses, etc.

PARAMETER | DESCRIPTION
---|---
`batch` | Batch of tensors (nested dictionary) computed by the collate method. TYPE: `InputBatch`

RETURNS | DESCRIPTION
---|---
`OutputBatch` |
module_forward
This is a wrapper around `torch.nn.Module.__call__` to avoid conflict with the `TrainablePipe.__call__` method.
make_batch
Convenience method to preprocess a batch of documents and collate them. Features corresponding to the same path are grouped together in a list, under the same key.

PARAMETER | DESCRIPTION
---|---
`docs` | Batch of documents
`supervision` | Whether to extract supervision features or not

RETURNS | DESCRIPTION
---|---
`Dict[str, Sequence[Any]]` |
batch_process
Process a batch of documents using the neural network. This differs from the `pipe` method in that it does not return an iterator, but executes the component on the whole batch at once.

PARAMETER | DESCRIPTION
---|---
`docs` | Batch of documents

RETURNS | DESCRIPTION
---|---
`Sequence[PDFDoc]` | Batch of updated documents
postprocess
Update the documents with the predictions of the neural network, for instance converting label probabilities into label attributes on the document lines.

By default, this is a no-op.

PARAMETER | DESCRIPTION
---|---
`docs` | Batch of documents. TYPE: `Sequence[PDFDoc]`
`batch` | Batch of predictions, as returned by the forward method. TYPE: `OutputBatch`

RETURNS | DESCRIPTION
---|---
`Sequence[PDFDoc]` |
preprocess_supervised
Preprocess the document to extract features that will be used by the neural network to perform its training. By default, this returns the same features as the `preprocess` method.

PARAMETER | DESCRIPTION
---|---
`doc` | PDFDocument to preprocess

RETURNS | DESCRIPTION
---|---
`Dict[str, Any]` | Dictionary (optionally nested) containing the features extracted from the document.
__call__
Applies the component on a single doc. For multiple documents, prefer batch processing via the batch_process method. In general, prefer the Pipeline methods.

PARAMETER | DESCRIPTION
---|---
`doc` |

RETURNS | DESCRIPTION
---|---
`PDFDoc` |
edspdf.utils.alignment
align_box_labels
Align lines with possibly overlapping (and non-exhaustive) labels.

Possible matches are sorted by covered area. Lines with no overlap at all are assigned the `pollution_label`.

PARAMETER | DESCRIPTION
---|---
`src_boxes` | The labelled boxes that will be used to determine the label of the dst_boxes
`dst_boxes` | The non-labelled boxes that will be assigned a label
`threshold` | Threshold to use for discounting a label
`pollution_label` | The label to use for boxes that are not covered by any of the source boxes

RETURNS | DESCRIPTION
---|---
`List[Box]` | A copy of the boxes, with the labels mapped from the source boxes
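A sketch of a typical call, assuming `annotated_boxes` holds labelled boxes (e.g. human annotations) and `doc` was produced by an extractor; the pollution label value is a placeholder:

```python
from edspdf.utils.alignment import align_box_labels

labelled_lines = align_box_labels(
    src_boxes=annotated_boxes,    # boxes that already carry a label
    dst_boxes=doc.text_boxes,     # extracted lines to be labelled
    pollution_label="pollution",  # label for lines covered by no source box
)
```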
edspdf.utils.collections
multi_tee
Makes copies of an iterable such that every iteration over it starts from 0. If the iterable is a sequence (list, tuple), just returns it, since every iter() over the object restarts from the beginning.
FrozenDict

Bases: dict

Copied from `spacy.util.SimpleFrozenDict` to ensure compatibility.

Initialize the frozen dict. Can be initialized with pre-defined values.

error (str): The error message when the user tries to assign to the dict.
FrozenList

Bases: list

Copied from `spacy.util.SimpleFrozenList` to ensure compatibility.

Initialize the frozen list.

error (str): The error message when the user tries to mutate the list.
+ + + + +edspdf.utils
edspdf.utils.optimization
edspdf.utils.package
PoetryPackager
ensure_pyproject

Generates a Poetry based pyproject.toml.
+ +edspdf.utils.random
set_seed
Set seed values for random generators. If used as a context manager, restores the random state used before entering the context.

PARAMETER | DESCRIPTION
---|---
`seed` | Value used as a seed
`cuda` | Saves the cuda random states too
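A sketch of both usages, as a direct call and as a context manager:

```python
from edspdf.utils.random import set_seed

# Seed torch / numpy / random globally
set_seed(42)

# Or temporarily: the previous random state is restored when the block exits
with set_seed(42):
    ...  # any code that consumes randomness
```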
get_random_generator_state
Get the `torch`, `numpy` and `random` random generator state.

PARAMETER | DESCRIPTION
---|---
`cuda` | Saves the cuda random states too

RETURNS | DESCRIPTION
---|---
`RandomGeneratorState` |
set_random_generator_state
Set the `torch`, `numpy` and `random` random generator state.

PARAMETER | DESCRIPTION
---|---
`state` |
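These two functions are typically used together to snapshot and restore randomness around a non-deterministic block (a sketch):

```python
from edspdf.utils.random import (
    get_random_generator_state,
    set_random_generator_state,
)

state = get_random_generator_state(cuda=False)
...  # code that consumes randomness
set_random_generator_state(state)  # back to the saved state
```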
edspdf.utils.torch
compute_pdf_relative_positions
Compute relative positions between boxes. Input boxes must be split between pages with the shape n_pages * n_boxes.

PARAMETER | DESCRIPTION
---|---
`x0` |
`y0` |
`x1` |
`y1` |
`width` |
`height` |
`n_relative_positions` | Maximum range of embeddable relative positions between boxes (further distances will be capped to ±n_relative_positions // 2)

RETURNS | DESCRIPTION
---|---
`LongTensor` | Shape: n_pages * n_boxes * n_boxes * 2
edspdf.visualization.annotations
show_annotations
Show Box annotations on a PDF document.

PARAMETER | DESCRIPTION
---|---
`pdf` | Bytes content of the PDF document
`annotations` | List of Box annotations to show
`colors` | Colors to use for each label. If a list is provided, it will be used to color the first labels.

RETURNS | DESCRIPTION
---|---
`List[PpmImageFile]` | List of PIL images with the annotations. You can display them in a notebook.
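A sketch of a typical call in a notebook; `pipeline` is assumed to be a loaded EDS-PDF pipeline and the file path is a placeholder:

```python
from pathlib import Path

from edspdf.visualization.annotations import show_annotations

pdf_bytes = Path("letter.pdf").read_bytes()
doc = pipeline(pdf_bytes)

images = show_annotations(pdf=pdf_bytes, annotations=doc.text_boxes)
images[0]  # display the first page
```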
compare_results
Compare two sets of annotations on a PDF document.

PARAMETER | DESCRIPTION
---|---
`pdf` | Bytes content of the PDF document
`pred` | List of Box annotations to show on the left side
`gold` | List of Box annotations to show on the right side
`colors` | Colors to use for each label. If a list is provided, it will be used to color the first labels.

RETURNS | DESCRIPTION
---|---
`List[PpmImageFile]` | List of PIL images with the annotations. You can display them in a notebook.
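And similarly for comparing predicted annotations against gold ones (a sketch; `pred_doc` and `gold_boxes` are assumed to exist, and `pdf_bytes` holds the document content):

```python
from edspdf.visualization.annotations import compare_results

images = compare_results(pdf=pdf_bytes, pred=pred_doc.text_boxes, gold=gold_boxes)
images[0]
```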
edspdf.visualization
edspdf.visualization.merge
merge_boxes
Recursively merge boxes that have the same label to form larger non-overlapping boxes.

PARAMETER | DESCRIPTION
---|---
`boxes` | List of boxes to merge

RETURNS | DESCRIPTION
---|---
`List[Box]` | List of merged boxes
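A sketch of merging the labelled lines of a document into larger blocks, assuming `doc` is an already classified PDFDoc:

```python
from edspdf.visualization.merge import merge_boxes

merged = merge_boxes(doc.text_boxes)
print(len(doc.text_boxes), "->", len(merged))
```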
TrainableClassifier
...save
and load
methods to every pipeline component{authors}, {year}. {title}." + if journal: + reference += f" {journal}." + if volume: + reference += f" {volume}" + if issue: + reference += f"({issue})" + if pages: + reference += f", pp.{pages}" + reference += "." + if doi: + reference += ( + f' {doi}' + ) + reference += "
" + + return etree.fromstring(reference) + + def formatCitation(self, ref): + author_list = list(map(self.formatAuthorSurname, ref.persons["author"])) + year = ref.fields.get("year") + + if len(author_list) == 1: + citation = f"{author_list[0]}" + elif len(author_list) == 2: + citation = f"{author_list[0]} and {author_list[1]}" + else: + citation = f"{author_list[0]} et al." + + citation += f", {year}" + + return citation + + def make_bibliography(self): + if self.order == "alphabetical": + raise (NotImplementedError) + + div = etree.Element("div") + div.set("class", "footnote") + div.append(etree.Element("hr")) + ol = etree.SubElement(div, "ol") + + if not self.citations: + return div + + # table = etree.SubElement(div, "table") + # table.set("class", "references") + # tbody = etree.SubElement(table, "tbody") + etree.SubElement(div, "div") + for id in self.citations: + li = etree.SubElement(ol, "li") + li.set("id", self.referenceID(id)) + # ref_id = etree.SubElement(li, "td") + ref_txt = etree.SubElement(li, "p") + if id in self.references: + self.extension.parser.parseChunk(ref_txt, self.references[id]) + elif id in self.bibsource: + ref_txt.append(self.formatReference(self.bibsource[id])) + else: + ref_txt.text = "Missing citation" + + return div + + def clear_citations(self): + self.citations = OrderedDict() + + +class CitationsPreprocessor(Preprocessor): + """Gather reference definitions and citation keys""" + + def __init__(self, bibliography): + self.bib = bibliography + + def subsequentIndents(self, lines, i): + """Concatenate consecutive indented lines""" + linesOut = [] + while i < len(lines): + m = INDENT_RE.match(lines[i]) + if m: + linesOut.append(m.group(1)) + i += 1 + else: + break + return " ".join(linesOut), i + + def run(self, lines): + linesOut = [] + i = 0 + + while i < len(lines): + # Check to see if the line starts a reference definition + m = DEF_RE.match(lines[i]) + if m: + key = m.group(1) + reference = m.group(2) + indents, i = self.subsequentIndents(lines, i + 1) + reference += " " + indents + + self.bib.setReference(key, reference) + continue + + # Look for all @citekey patterns inside hard brackets + for bracket in BRACKET_RE.findall(lines[i]): + for c in CITE_RE.findall(bracket): + self.bib.addCitation(c) + linesOut.append(lines[i]) + i += 1 + + return linesOut + + +class CitationsPattern(Pattern): + """Handles converting citations keys into links""" + + def __init__(self, pattern, bibliography): + super(CitationsPattern, self).__init__(pattern) + self.bib = bibliography + + def handleMatch(self, m): + span = etree.Element("span") + for cite_match in CITE_RE.finditer(m.group(2)): + id = cite_match.group(1) + if id in self.bib.bibsource: + a = etree.Element("a") + a.set("id", self.bib.citationID(id)) + a.set("href", "./#" + self.bib.referenceID(id)) + a.set("class", "citation") + a.text = self.bib.labels[id] + span.append(a) + else: + continue + if len(span) == 0: + return None + return span + + +context_citations = None + + +class CitationsExtension(Extension): + def __init__(self): + super(CitationsExtension, self).__init__() + self.bib = None + + def extendMarkdown(self, md): + md.registerExtension(self) + self.parser = md.parser + self.md = md + + md.preprocessors.register(CitationsPreprocessor(self.bib), "mdx_bib", 15) + md.inlinePatterns.register( + CitationsPattern(CITATION_RE, self.bib), "mdx_bib", 175 + ) + + +def makeExtension(*args, **kwargs): + return CitationsExtension(*args, **kwargs) + + +class BibTexPlugin(BasePlugin): + config_scheme: 
Tuple[Tuple[str, MkType]] = ( + ("bibtex_file", MkType(str)), # type: ignore[assignment] + ("order", MkType(str, default="unsorted")), # type: ignore[assignment] + ) + + def __init__(self): + self.citations = None + + def on_config(self, config, **kwargs): + extension = CitationsExtension() + self.bib = Bibliography( + extension, + self, + self.config["bibtex_file"], + self.config["order"], + ) + extension.bib = self.bib + config["markdown_extensions"].append(extension) + + def on_page_content(self, html, page, config, files): + html += "\n" + etree_to_string(self.bib.make_bibliography()).decode() + self.bib.clear_citations() + return html diff --git a/main/scripts/plugin.py b/main/scripts/plugin.py new file mode 100644 index 00000000..4c5afb7d --- /dev/null +++ b/main/scripts/plugin.py @@ -0,0 +1,92 @@ +import os +import shutil +from pathlib import Path + +import mkdocs + +# Add the files from the project root + +# Generate the code reference pages and navigation. +doc_reference = Path("docs/reference") +shutil.rmtree(doc_reference, ignore_errors=True) +os.makedirs(doc_reference, exist_ok=True) +root = Path("edspdf") +for path in sorted(root.rglob("*.py")): + if "poppler_src" in str(path): + continue + module_path = path.relative_to(root.parent).with_suffix("") + doc_path = path.relative_to(root.parent).with_suffix(".md") + full_doc_path = doc_reference / doc_path + parts = list(module_path.parts) + if parts[-1] == "__init__": + parts = parts[:-1] + doc_path = doc_path.with_name("index.md") + full_doc_path = full_doc_path.with_name("index.md") + elif parts[-1] == "__main__": + continue + ident = ".".join(parts) + os.makedirs(full_doc_path.parent, exist_ok=True) + with open(full_doc_path, "w") as fd: + print(f"# `{ident}`\n", file=fd) + print("::: " + ident, file=fd) + if root != "edspdf": + print(" options:", file=fd) + print(" show_source: false", file=fd) + + +def on_files(files: mkdocs.structure.files.Files, config: mkdocs.config.Config): + """ + Recursively the navigation of the mkdocs config + and recursively content of directories of page that point + to directories. 
+ + Parameters + ---------- + config: mkdocs.config.Config + The configuration object + kwargs: dict + Additional arguments + """ + + def get_nested_files(path): + files = [] + for file in path.iterdir(): + if file.is_dir(): + index = file / "index.md" + if index.exists(): + # Get name from h1 heading in index + name = index.read_text().split("\n")[0].strip("# ") + if name.startswith("`edspdf"): + name = name[1:-1].split(".")[-1] + files.append({name: get_nested_files(file)}) + else: + title = file.name.replace("_", " ").replace("-", " ").title() + files.append({title: get_nested_files(file)}) + else: + name = file.read_text().split("\n")[0].strip("# ") + if name.startswith("`edspdf"): + name = name[1:-1].split(".")[-1] + files.append({name: str(file.relative_to(config["docs_dir"]))}) + else: + files.append(str(file.relative_to(config["docs_dir"]))) + return files + + def rec(tree): + if isinstance(tree, list): + return [rec(item) for item in tree] + elif isinstance(tree, dict): + return {k: rec(item) for k, item in tree.items()} + elif isinstance(tree, str): + if tree.endswith("/"): + # We have a directory + path = Path(config["docs_dir"]) / tree + if path.is_dir(): + return get_nested_files(path) + else: + return tree + else: + return tree + else: + return tree + + config["nav"] = rec(config["nav"]) diff --git a/main/search/search_index.json b/main/search/search_index.json new file mode 100644 index 00000000..3bb1dbad --- /dev/null +++ b/main/search/search_index.json @@ -0,0 +1 @@ +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Overview","text":"EDS-PDF provides modular framework to extract text information from PDF documents.
You can use it out-of-the-box, or extend it to fit your use-case.
"},{"location":"#getting-started","title":"Getting started","text":""},{"location":"#installation","title":"Installation","text":"Install the library with pip:
$ pip install edspdf\n---> 100%\ncolor:green Installation successful\n
"},{"location":"#extracting-text","title":"Extracting text","text":"Let's build a simple PDF extractor that uses a rule-based classifier. There are two ways to do this, either by using the configuration system or by using the pipeline API.
Configuration based pipelineAPI based pipelineCreate a configuration file:
config.cfg[pipeline]\npipeline = [\"extractor\", \"classifier\", \"aggregator\"]\n\n[components.extractor]\n@factory = \"pdfminer-extractor\"\n\n[components.classifier]\n@factory = \"mask-classifier\"\nx0 = 0.2\nx1 = 0.9\ny0 = 0.3\ny1 = 0.6\nthreshold = 0.1\n\n[components.aggregator]\n@factory = \"simple-aggregator\"\n
and load it from Python:
import edspdf\nfrom pathlib import Path\n\nmodel = edspdf.load(\"config.cfg\") # (1)\n
Or create a pipeline directly from Python:
from edspdf import Pipeline\n\nmodel = Pipeline()\nmodel.add_pipe(\"pdfminer-extractor\")\nmodel.add_pipe(\n \"mask-classifier\",\n config=dict(\n x0=0.2,\n x1=0.9,\n y0=0.3,\n y1=0.6,\n threshold=0.1,\n ),\n)\nmodel.add_pipe(\"simple-aggregator\")\n
This pipeline can then be applied (for instance with this PDF):
# Get a PDF\npdf = Path(\"/Users/perceval/Development/edspdf/tests/resources/letter.pdf\").read_bytes()\npdf = model(pdf)\n\nbody = pdf.aggregated_texts[\"body\"]\n\ntext, style = body.text, body.properties\n
See the rule-based recipe for a step-by-step explanation of what is happening.
"},{"location":"#citation","title":"Citation","text":"If you use EDS-PDF, please cite us as below.
@software{edspdf,\nauthor = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and G\u00e9rardin, Christel and Bey, Romain},\ndoi = {10.5281/zenodo.6902977},\nlicense = {BSD-3-Clause},\ntitle = {{EDS-PDF: Smart text extraction from PDF documents}},\nurl = {https://github.com/aphp/edspdf}\n}\n
"},{"location":"#acknowledgement","title":"Acknowledgement","text":"We would like to thank Assistance Publique \u2013 H\u00f4pitaux de Paris and AP-HP Foundation for funding this project.
"},{"location":"alternatives/","title":"Alternatives & Comparison","text":"EDS-PDF was developed to propose a more modular and extendable approach to PDF extraction than PDFBox, the legacy implementation at APHP's clinical data warehouse.
EDS-PDF takes inspiration from Explosion's spaCy pipelining system and closely follows its API. Therefore, the core object within EDS-PDF is the Pipeline, which organises the processing of PDF documents into multiple components. However, unlike spaCy, the library is built around a single deep learning framework, pytorch, which makes model development easier.
"},{"location":"changelog/","title":"Changelog","text":""},{"location":"changelog/#v080","title":"v0.8.0","text":""},{"location":"changelog/#added","title":"Added","text":"huggingface-embedding
) with windowing optionsrender_page
option to pdfminer
extractor, for multi-modal PDF featuresaccelerators
), with simple mono process support and multi gpu / cpu supportpipeline.package(...)
) to make a pip installable package from a pipelineconfit
to 0.4.2 (better errors) and foldedtensor
to 0.3.0 (better multiprocess support)pipeline.score
. You should use pipeline.pipe
, a custom scorer and pipeline.select_pipes
instead.hatch
instead of setuptools
to build the package / docs and run the testsattrs
dependency only being installed in dev modeMajor refactoring of the library:
"},{"location":"changelog/#core-features","title":"Core features","text":"PDFDoc
, Page
, Box
, ...) for representing PDF documentsCast bytes-like extractor inputs as bytes
"},{"location":"changelog/#v061-2022-12-07","title":"v0.6.1 - 2022-12-07","text":"Performance and cuda related fixes.
"},{"location":"changelog/#v060-2022-12-05","title":"v0.6.0 - 2022-12-05","text":"Many, many changes: - added torch as the main deep learning framework instead of spaCy and thinc - added poppler and mupdf as alternatives to pdfminer - new pipeline / config / registry system to facilitate consistency between training and inference - standardization of the exchange format between components with dataclass models (attrs more specifically) instead of pandas dataframes
"},{"location":"changelog/#v053-2022-08-31","title":"v0.5.3 - 2022-08-31","text":""},{"location":"changelog/#added_1","title":"Added","text":"title
and body
)pdf2image
dependency, replacing it with pypdfium2
(easier installation)aggregation
) to plural names (aggregators
).package-resource.v1
in the misc registryimportlib.metadata
dependency, which led to issues with Python 3.7sklearn-pipeline.v1
compare_results
in visualisationstyled.v1
aggregator, to handle stylesrescale.v1
transform, to go back to the original height and widthline
object is not carried around any moreparams
entry in the EDS-PDF registry.merge_lines
bug when lines were emptyaggregation
submodule to handle the specifics of aggregating text blocslabels
typer
legacy dependencymodels
submodule, which handled the configurations for Spark distribution (deferred to another package)orbis
context, which was APHP-specificInception !
"},{"location":"changelog/#features","title":"Features","text":"dummy.v1
, that classifies everything to body
mask.v1
, for simple rule-based classificationsklearn.v1
, that uses a Scikit-Learn pipelinerandom.v1
, to better sow chaosEDS-PDF is built on top of the confit
configuration system.
The following catalogue registries are included within EDS-PDF:
Section Descriptionfactory
Components factories (most often classes) adapter
Raw data preprocessing functions EDS-PDF pipelines are meant to be reproducible and serializable, such that you can always define a pipeline through the configuration system.
To wit, compare the API-based approach to the configuration-based approach (the two are strictly equivalent):
API-basedConfiguration-basedimport edspdf\nfrom pathlib import Path\n\nmodel = edspdf.Pipeline()\nmodel.add_pipe(\"pdfminer-extractor\", name=\"extractor\")\nmodel.add_pipe(\"mask-classifier\", name=\"classifier\", config=dict(\nx0=0.2,\nx1=0.9,\ny0=0.3,\ny1=0.6,\nthreshold=0.1,\n)\nmodel.add_pipe(\"simple-aggregator\", name=\"aggregator\")\n# Get a PDF\npdf = Path(\"letter.pdf\").read_bytes()\n\npdf = model(pdf)\n\nstr(pdf.aggregated_texts[\"body\"])\n# Out: Cher Pr ABC, Cher DEF,\\n...\n
config.cfg[pipeline]\npipeline = [\"extractor\", \"classifier\", \"aggregator\"]\n\n[components.extractor]\n@factory = \"pdfminer-extractor\"\n\n[components.classifier]\n@factory = \"mask-classifier\"\nx0 = 0.2\nx1 = 0.9\ny0 = 0.3\ny1 = 0.6\nthreshold = 0.1\n\n[components.aggregator]\n@factory = \"simple-aggregator\"\n
import edspdf\nfrom pathlib import Path\n\npipeline = edspdf.load(\"config.cfg\")\n# Get a PDF\npdf = Path(\"letter.pdf\").read_bytes()\n\npdf = pipeline(pdf)\n\nstr(pdf.aggregated_texts[\"body\"])\n# Out: Cher Pr ABC, Cher DEF,\\n...\n
The configuration-based approach strictly separates the definition of the pipeline to its application and avoids tucking away important configuration details. Changes to the pipeline are transparent as there is a single source of truth: the configuration file.
"},{"location":"contributing/","title":"Contributing to EDS-PDF","text":"We welcome contributions ! There are many ways to help. For example, you can:
To be able to run the test suite and develop your own pipeline, you should clone the repo and install it locally. We use the hatch
package manager to manage the project.
color:gray # Clone the repository and change directory\n$ git clone ssh://git@github.com/aphp/edspdf.git\n---> 100%\n\ncolor:gray # Ensure hatch is installed, preferably via pipx\n$ pipx install hatch\n\n$ cd edspdf\n\ncolor:gray # Enter a shell to develop / test the project. This will install everything required in a virtual environment. You can also `source` the path shown by hatch.\n$ hatch shell\n$ ...\n$ exit # when you're done\n
To make sure the pipeline will not fail because of formatting errors, we added pre-commit hooks using the pre-commit
Python library. To use it, simply install it:
$ pre-commit install\n
The pre-commit hooks defined in the configuration will automatically run when you commit your changes, letting you know if something went wrong.
The hooks only run on staged changes. To force-run it on all files, run:
$ pre-commit run --all-files\n---> 100%\ncolor:green All good !\n
"},{"location":"contributing/#proposing-a-merge-request","title":"Proposing a merge request","text":"At the very least, your changes should :
We use the Pytest test suite.
The following command will run the test suite. Writing your own tests is encouraged !
pytest\n
Should your contribution propose a bug fix, we require the bug be thoroughly tested.
"},{"location":"contributing/#style-guide","title":"Style Guide","text":"We use Black to reformat the code. While other formatter only enforce PEP8 compliance, Black also makes the code uniform. In short :
Black reformats entire files in place. It is not configurable.
Moreover, the CI/CD pipeline enforces a number of checks on the \"quality\" of the code. To wit, non black-formatted code will make the test pipeline fail. We use pre-commit
to keep our codebase clean.
Refer to the development install tutorial for tips on how to format your files automatically. Most modern editors propose extensions that will format files on save.
"},{"location":"contributing/#documentation","title":"Documentation","text":"Make sure to document your improvements, both within the code with comprehensive docstrings, as well as in the documentation itself if need be.
We use MkDocs
for EDS-PDF's documentation. You can view your changes with
color:gray # Run the documentation\n$ hatch run docs:serve\n
Go to localhost:8000
to see your changes. MkDocs watches for changes in the documentation folder and automatically reloads the page.
EDS-PDF stores PDFs and their annotation in a custom data structures that are designed to be easy to use and manipulate. We must distinguish between:
A PDF is first converted to a PDFDoc object, which contains the raw PDF content. This task is usually performed a PDF extractor component. Once the PDF is converted, the same object will be used and updated by the different components, and returned at the end of the pipeline.
When running a trainable component, the PDFDoc is preprocessed and converted to tensors containing relevant features for the task. This task is performed in the preprocess
method of the component. The resulting tensors are then collated together to form a batch, in the collate
method of the component. After running the forward
method of the component, the tensor predictions are finally assigned as annotations to original PDFDoc objects in the postprocess
method.
The main data structure is the [PDFDoc][edspdf.structures.PDFDoc], which represents full a PDF document. It contains the raw PDF content, annotations for the full document, regardless of pages. A PDF is split into Page
objects that stores their number, dimension and optionally an image of the rendered page.
The PDF annotations are stored in Box
objects, which represent a rectangular region of the PDF. At the moment, box can only be specialized into TextBox
to represent text regions, such as lines extracted by a PDF extractor. Aggregated texts are stored in Text
objects, that are not associated with a specific box.
A TextBox
contains a list of TextProperties
objects to store the style properties of a styled spans of the text.
PDFDoc
","text":" Bases: BaseModel
This is the main data structure of the library to hold PDFs. It contains the content of the PDF, as well as box annotations and text outputs.
ATTRIBUTE DESCRIPTIONcontent
The content of the PDF document.
TYPE: bytes
id
The ID of the PDF document.
TYPE: (str, optional)
pages
The pages of the PDF document.
TYPE: List[Page]
error
Whether there was an error when processing this PDF document.
TYPE: (bool, optional)
content_boxes
The content boxes/annotations of the PDF document.
TYPE: List[Union[TextBox, ImageBox]]
aggregated_texts
The aggregated text outputs of the PDF document.
TYPE: Dict[str, Text]
text_boxes
The text boxes of the PDF document.
TYPE: List[TextBox]
Page
","text":" Bases: BaseModel
The Page
class represents a page of a PDF document.
page_num
The page number of the page.
TYPE: int
width
The width of the page.
TYPE: float
height
The height of the page.
TYPE: float
doc
The PDF document that this page belongs to.
TYPE: PDFDoc
image
The rendered image of the page, stored as a NumPy array.
TYPE: Optional[ndarray]
text_boxes
The text boxes of the page.
TYPE: List[TextBox]
TextProperties
","text":" Bases: BaseModel
The TextProperties
class represents the style properties of a span of text in a TextBox.
italic
Whether the text is italic.
TYPE: bool
bold
Whether the text is bold.
TYPE: bool
begin
The beginning index of the span of text.
TYPE: int
end
The ending index of the span of text.
TYPE: int
fontname
The font name of the span of text.
TYPE: Optional[str]
Box
","text":" Bases: BaseModel
The Box
class represents a box annotation in a PDF document. It is the base class of TextBox.
doc
The PDF document that this box belongs to.
TYPE: PDFDoc
page_num
The page number of the box.
TYPE: Optional[int]
x0
The left x-coordinate of the box.
TYPE: float
x1
The right x-coordinate of the box.
TYPE: float
y0
The top y-coordinate of the box.
TYPE: float
y1
The bottom y-coordinate of the box.
TYPE: float
label
The label of the box.
TYPE: Optional[str]
page
The page object that this box belongs to.
TYPE: Page
Text
","text":" Bases: BaseModel
The TextBox
class represents text object, not bound to any box.
It can be used to store aggregated text from multiple boxes for example.
ATTRIBUTE DESCRIPTIONtext
The text content.
TYPE: str
properties
The style properties of the text.
TYPE: List[TextProperties]
TextBox
","text":" Bases: Box
The TextBox
class represents a text box annotation in a PDF document.
text
The text content of the text box.
TYPE: str
props
The style properties of the text box.
TYPE: List[TextProperties]
The tensors used to process PDFs with deep learning models usually contain 4 main dimensions, in addition to the standard embedding dimensions:
samples
: one entry per PDF in the batchpages
: one entry per page in a PDFboxes
: one entry per box in a pagetoken
: one entry per token in a box (only for text boxes)These tensors use a special FoldedTensor format to store the data in a compact way and reshape the data depending on the requirements of a layer.
"},{"location":"inference/","title":"Inference","text":"Once you have obtained a pipeline, either by composing rule-based components, training a model or loading a model from the disk, you can use it to make predictions on documents. This is referred to as inference.
"},{"location":"inference/#inference-on-a-single-document","title":"Inference on a single document","text":"In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:
from pathlib import Path\n\npipeline = ...\ncontent = Path(\"path/to/.pdf\").read_bytes()\ndoc = pipeline(content)\n
If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline. To leverage multiple GPUs, refer to the multiprocessing accelerator description below.
pipeline.to(\"cuda\") # same semantics as pytorch\ndoc = pipeline(content)\n
"},{"location":"inference/#inference-on-multiple-documents","title":"Inference on multiple documents","text":"When processing multiple documents, it is usually more efficient to use the pipeline.pipe(...)
method, especially when using deep learning components, since this allow matrix multiplications to be batched together. Depending on your computational resources and requirements, EDS-PDF comes with various \"accelerators\" to speed up inference (see the Accelerators section for more details). By default, the .pipe()
method uses the simple
accelerator but you can switch to a different one by passing the accelerator
argument.
pipeline = ...\ndocs = pipeline.pipe(\n [content1, content2, ...],\n batch_size=16, # optional, default to the one defined in the pipeline\n accelerator=my_accelerator,\n)\n
The pipe
method supports the following arguments :
inputs
The inputs to create the PDFDocs from, or the PDFDocs directly.
TYPE: Any
batch_size
The batch size to use. If not provided, the batch size of the pipeline object will be used.
TYPE: Optional[int]
DEFAULT: None
accelerator
The accelerator to use for processing the documents. If not provided, the default accelerator will be used.
TYPE: Optional[Union[str, Accelerator]]
DEFAULT: None
to_doc
The function to use to convert the inputs to PDFDoc objects. By default, the content
field of the inputs will be used if dict-like objects are provided, otherwise the inputs will be passed directly to the pipeline.
TYPE: Optional[ToDoc]
DEFAULT: None
from_doc
The function to use to convert the PDFDoc objects to outputs. By default, the PDFDoc objects will be returned directly.
TYPE: FromDoc
DEFAULT: lambda : doc
This is the simplest accelerator which batches the documents and process each batch on the main process (the one calling .pipe()
).
docs = list(pipeline.pipe([content1, content2, ...]))\n
or, if you want to override the model defined batch size
docs = list(pipeline.pipe([content1, content2, ...], batch_size=8))\n
which is equivalent to passing a confit dict
docs = list(\n pipeline.pipe(\n [content1, content2, ...],\n accelerator={\n \"@accelerator\": \"simple\",\n \"batch_size\": 8,\n },\n )\n)\n
or the instantiated accelerator directly
from edspdf.accelerators.simple import SimpleAccelerator\n\naccelerator = SimpleAccelerator(batch_size=8)\ndocs = list(pipeline.pipe([content1, content2, ...], accelerator=accelerator))\n
If you have a GPU, make sure to move the model to the appropriate device before calling .pipe()
. If you have multiple GPUs, use the multiprocessing accelerator instead.
pipeline.to(\"cuda\")\ndocs = list(pipeline.pipe([content1, content2, ...]))\n
PARAMETER DESCRIPTION batch_size
The number of documents to process in each batch.
TYPE: int
DEFAULT: 32
If you have multiple CPU cores, and optionally multiple GPUs, we provide a multiprocessing
accelerator that allows to run the inference on multiple processes.
This accelerator dispatches the batches between multiple workers (data-parallelism), and distribute the computation of a given batch on one or two workers (model-parallelism). This is done by creating two types of workers:
CPUWorker
which handles the non deep-learning components and the preprocessing, collating and postprocessing of deep-learning componentsGPUWorker
which handles the forward call of the deep-learning componentsThe advantage of dedicating a worker to the deep-learning components is that it allows to prepare multiple batches in parallel in multiple CPUWorker
, and ensure that the GPUWorker
never wait for a batch to be ready.
The overall architecture described in the following figure, for 3 CPU workers and 2 GPU workers.
Here is how a small pipeline with rule-based components and deep-learning components is distributed between the workers:
"},{"location":"inference/#edspdf.accelerators.multiprocessing.MultiprocessingAccelerator--examples","title":"Examples","text":"docs = list(\n pipeline.pipe(\n [content1, content2, ...],\n accelerator={\n \"@accelerator\": \"multiprocessing\",\n \"num_cpu_workers\": 3,\n \"num_gpu_workers\": 2,\n \"batch_size\": 8,\n },\n )\n)\n
PARAMETER DESCRIPTION batch_size
Number of documents to process at a time in a CPU/GPU worker
TYPE: int
num_cpu_workers
Number of CPU workers. A CPU worker handles the non deep-learning components and the preprocessing, collating and postprocessing of deep-learning components.
TYPE: Optional[int]
DEFAULT: None
num_gpu_workers
Number of GPU workers. A GPU worker handles the forward call of the deep-learning components.
TYPE: Optional[int]
DEFAULT: None
gpu_pipe_names
List of pipe names to accelerate on a GPUWorker, defaults to all pipes that inherit from TrainablePipe
TYPE: Optional[List[str]]
DEFAULT: None
The goal of EDS-PDF is to provide a framework for processing PDF documents, along with some utilities and a few components, stitched together by a robust pipeline and configuration system.
Processing PDFs usually involves many steps such as extracting lines, running OCR models, detecting and classifying boxes, filtering and aggregating parts of the extracted texts, etc. Organising these steps together, combining static and deep learning components, while remaining modular and efficient is a challenge. This is why EDS-PDF is built on top of a new pipelining system.
Deep learning frameworks
The EDS-PDF trainable components are built around the PyTorch framework. While you can use any technology in static components, we do not provide tools to train components built with other deep learning frameworks.
"},{"location":"pipeline/#creating-a-pipeline","title":"Creating a pipeline","text":"A pipe is a processing block (like a function) that applies a transformation on its input and returns a modified object.
At the moment, four types of pipes are implemented in the library:
PDFDoc
object filled with these text boxes.body
, header
, footer
...To create your first pipeline, execute the following code:
from edspdf import Pipeline\n\nmodel = Pipeline()\n# will extract text lines from a document\nmodel.add_pipe(\n \"pdfminer-extractor\",\n config=dict(\n extract_style=False,\n ),\n)\n# classify everything inside the `body` bounding box as `body`\nmodel.add_pipe(\n \"mask-classifier\", config=dict(body={\"x0\": 0.1, \"y0\": 0.1, \"x1\": 0.9, \"y1\": 0.9})\n)\n# aggregates the lines together to re-create the original text\nmodel.add_pipe(\"simple-aggregator\")\n
This pipeline can then be run on one or more PDF documents. As the pipeline process documents, components will be called in the order they were added to the pipeline.
from pathlib import Path\n\npdf_bytes = Path(\"path/to/your/pdf\").read_bytes()\n\n# Processing one document\nmodel(pdf_bytes)\n\n# Processing multiple documents\nmodel.pipe([pdf_bytes, ...])\n
For more information on how to use the pipeline, refer to the Inference page.
"},{"location":"pipeline/#hybrid-models","title":"Hybrid models","text":"EDS-PDF was designed to facilitate the training and inference of hybrid models that arbitrarily chain static components or trained deep learning components. Static components are callable objects that take a PDFDoc object as input, perform arbitrary transformations over the input, and return the modified object. Trainable pipes, on the other hand, allow for deep learning operations to be performed on the PDFDoc object and must be trained to be used.
"},{"location":"pipeline/#saving-and-loading-a-pipeline","title":"Saving and loading a pipeline","text":"Pipelines can be saved and loaded using the save
and load
methods. The saved pipeline is not a pickled objet but a folder containing the config file, the weights and extra resources for each pipeline. This allows for easy inspection and modification of the pipeline, and avoids the execution of arbitrary code when loading a pipeline.
model.save(\"path/to/your/model\")\nmodel = edspdf.load(\"path/to/your/model\")\n
To share the pipeline and turn it into a pip installable package, you can use the package
method, which will use or create a pyproject.toml file, fill it accordingly, and create a wheel file. At the moment, we only support the poetry package manager.
model.package(\n name=\"your-package-name\", # leave None to reuse name in pyproject.toml\n version=\"0.0.1\",\n root_dir=\"path/to/project/root\", # optional, to retrieve an existing pyproject.toml file\n # if you don't have a pyproject.toml, you can provide the metadata here instead\n metadata=dict(\n authors=\"Firstname Lastname <your.email@domain.fr>\",\n description=\"A short description of your package\",\n ),\n)\n
This will create a wheel file in the root_dir/dist folder, which you can share and install with pip
"},{"location":"roadmap/","title":"Roadmap","text":"TrainableClassifier
...save
and load
methods to every pipeline componentTrainable pipes allow for deep learning operations to be performed on the PDFDoc object and must be trained to be used. Such pipes can be used to train a model to predict the label of the lines extracted from a PDF document.
"},{"location":"trainable-pipes/#anatomy-of-a-trainable-pipe","title":"Anatomy of a trainable pipe","text":"Building and running deep learning models usually requires preprocessing the input sample into features, batching or \"collating\" these features together to process multiple samples at once, running deep learning operations over these features (in Pytorch, this step is done in the forward
method) and postprocessing the outputs of these operation to complete the original sample.
In the trainable pipes of EDS-PDF, preprocessing and postprocessing are decoupled from the deep learning code but collocated with the forward method. This is achieved by splitting the class of a trainable component into four methods, which allows us to keep the development of new deep-learning components simple while ensuring efficient models both during training and inference.
"},{"location":"trainable-pipes/#edspdf.trainable_pipe.TrainablePipe.preprocess","title":"preprocess
","text":"Preprocess the document to extract features that will be used by the neural network to perform its predictions.
PARAMETER DESCRIPTIONdoc
PDFDocument to preprocess
TYPE: PDFDoc
Dict[str, Any]
Dictionary (optionally nested) containing the features extracted from the document.
"},{"location":"trainable-pipes/#edspdf.trainable_pipe.TrainablePipe.collate","title":"collate
","text":"Collate the batch of features into a single batch of tensors that can be used by the forward method of the component.
PARAMETER DESCRIPTIONbatch
Batch of features
TYPE: NestedSequences
device
Device on which the tensors should be moved
TYPE: device
InputBatch
Dictionary (optionally nested) containing the collated tensors
"},{"location":"trainable-pipes/#edspdf.trainable_pipe.TrainablePipe.forward","title":"forward
","text":"Perform the forward pass of the neural network, i.e, apply transformations over the collated features to compute new embeddings, probabilities, losses, etc
PARAMETER DESCRIPTIONbatch
Batch of tensors (nested dictionary) computed by the collate method
TYPE: InputBatch
OutputBatch
"},{"location":"trainable-pipes/#edspdf.trainable_pipe.TrainablePipe.postprocess","title":"postprocess
","text":"Update the documents with the predictions of the neural network, for instance converting label probabilities into label attributes on the document lines.
By default, this is a no-op.
PARAMETER DESCRIPTIONdocs
Batch of documents
TYPE: Sequence[PDFDoc]
batch
Batch of predictions, as returned by the forward method
TYPE: OutputBatch
Sequence[PDFDoc]
Additionally, there is a fifth method:
"},{"location":"trainable-pipes/#edspdf.trainable_pipe.TrainablePipe.post_init","title":"post_init
","text":"This method completes the attributes of the component, by looking at some documents. It is especially useful to build vocabularies or detect the labels of a classification task.
PARAMETER DESCRIPTIONgold_data
The documents to use for initialization.
TYPE: Iterable[PDFDoc]
exclude
The names of components to exclude from initialization. This argument will be gradually updated with the names of initialized components
TYPE: set
Here is an example of a trainable component:
from typing import Any, Dict, Iterable, Sequence\n\nimport torch\nfrom tqdm import tqdm\n\nfrom edspdf import Pipeline, TrainablePipe, registry\nfrom edspdf.structures import PDFDoc\n\n\n@registry.factory.register(\"my-component\")\nclass MyComponent(TrainablePipe):\n def __init__(\n self,\n # A subcomponent\n pipeline: Pipeline,\n name: str,\n embedding: TrainablePipe,\n ):\n super().__init__(pipeline=pipeline, name=name)\n self.embedding = embedding\n\n def post_init(self, gold_data: Iterable[PDFDoc], exclude: set):\n # Initialize the component with the gold documents\n with self.label_vocabulary.initialization():\n for doc in tqdm(gold_data, desc=\"Initializing the component\"):\n # Do something like learning a vocabulary over the initialization\n # documents\n ...\n\n # And post_init the subcomponent\n exclude.add(self.name)\n self.embedding.post_init(gold_data, exclude)\n\n # Initialize any layer that might be missing from the module\n self.classifier = torch.nn.Linear(...)\n\n def preprocess(self, doc: PDFDoc, supervision: bool = False) -> Dict[str, Any]:\n # Preprocess the doc to extract features required to run the embedding\n # subcomponent, and this component\n return {\n \"embedding\": self.embedding.preprocess_supervised(doc),\n \"my-feature\": ...(doc),\n }\n\n def collate(self, batch, device: torch.device) -> Dict:\n # Collate the features of the \"embedding\" subcomponent\n # and the features of this component as well\n return {\n \"embedding\": self.embedding.collate(batch[\"embedding\"], device),\n \"my-feature\": torch.as_tensor(batch[\"my-feature\"], device=device),\n }\n\n def forward(self, batch: Dict, supervision=False) -> Dict:\n # Call the embedding subcomponent\n embeds = self.embedding(batch[\"embedding\"])\n\n # Do something with the embedding tensors\n output = ...(embeds)\n\n return output\n\n def postprocess(self, docs: Sequence[PDFDoc], output: Dict) -> Sequence[PDFDoc]:\n # Annotate the docs with the outputs of the forward method\n ...\n return docs\n
"},{"location":"trainable-pipes/#nesting-trainable-pipes","title":"Nesting trainable pipes","text":"Like pytorch modules, you can compose trainable pipes together to build complex architectures. For instance, a trainable classifier component may delegate some of its logic to an embedding component, which will only be responsible for converting PDF lines into multidimensional arrays of numbers.
Nesting pipes allows switching parts of the neural networks to test various architectures and keeping the modelling logic modular.
"},{"location":"trainable-pipes/#sharing-subcomponents","title":"Sharing subcomponents","text":"Sharing parts of a neural network while training on different tasks can be an effective way to improve the network efficiency. For instance, it is common to share an embedding layer between multiple tasks that require embedding the same inputs.
In EDS-PDF, sharing a subcomponent is simply done by sharing the object between the multiple pipes. You can either refer to an existing subcomponent when configuring a new component in Python, or use the interpolation mechanism of our configuration system.
API-basedConfiguration-basedpipeline.add_pipe(\n \"my-component-1\",\n name=\"first\",\n config={\n \"embedding\": {\n \"@factory\": \"box-embedding\",\n # ...\n }\n },\n)\npipeline.add_pipe(\n \"my-component-2\",\n name=\"second\",\n config={\n \"embedding\": pipeline.components.first.embedding,\n },\n)\n
[components.first]\n@factory = \"my-component-1\"\n\n[components.first.embedding]\n@factory = \"box-embedding\"\n...\n\n[components.second]\n@factory = \"my-component-2\"\nembedding = ${components.first.embedding}\n
To avoid recomputing the preprocess
/ forward
and collate
in the multiple components that use it, we rely on a light cache system.
During the training loop, when computing the loss for each component, the forward calls must be wrapped by the pipeline.cache()
context to enable this caching mechanism between components.
EDS-PDF provides a set of specialized deep learning layers that can be used to build trainable components. These layers are built on top of the PyTorch framework and can be used in any PyTorch model.
Layer DescriptionBoxTransformerModule
Contextualize box embeddings with a 2d Transformer with relative position representations BoxTransformerLayer
A single layer of the above BoxTransformerModule
layer RelativeAttention
A 2d attention layer that optionally uses relative position to compute its attention scores SinusoidalEmbedding
A position embedding that uses trigonometric functions to encode positions Vocabulary
A non deep learning layer to encodes / decode vocabularies"},{"location":"layers/box-transformer-layer/","title":"BoxTransformerLayer","text":"BoxTransformerLayer combining a self attention layer and a linear->activation->linear transformation. This layer is used in the BoxTransformerModule module.
"},{"location":"layers/box-transformer-layer/#edspdf.layers.box_transformer.BoxTransformerLayer--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONinput_size
Input embedding size
TYPE: int
num_heads
Number of attention heads in the attention layer
TYPE: int
DEFAULT: 2
dropout_p
Dropout probability both for the attention layer and embedding projections
TYPE: float
DEFAULT: 0.0
head_size
Head sizes of the attention layer
TYPE: Optional[int]
DEFAULT: None
activation
Activation function used in the linear->activation->linear transformation
TYPE: ActivationFunction
DEFAULT: 'gelu'
init_resweight
Initial weight of the residual gates. At 0, the layer acts (initially) as an identity function, and at 1 as a standard Transformer layer. Initializing with a value close to 0 can help the training converge.
TYPE: float
DEFAULT: 0.0
attention_mode
Mode of relative position infused attention layer. See the relative attention documentation for more information.
TYPE: Sequence[Literal['c2c', 'c2p', 'p2c']]
DEFAULT: ('c2c', 'c2p', 'p2c')
position_embedding
Position embedding to use as key/query position embedding in the attention computation.
TYPE: Optional[Union[FloatTensor, Parameter]]
DEFAULT: None
forward
","text":"Forward pass of the BoxTransformerLayer
PARAMETER DESCRIPTIONembeds
Embeddings to contextualize Shape: n_samples * n_keys * input_size
TYPE: FloatTensor
mask
Mask of the embeddings. 0 means padding element. Shape: n_samples * n_keys
TYPE: BoolTensor
relative_positions
Position of the keys relatively to the query elements Shape: n_samples * n_queries * n_keys * n_coordinates (2 for x/y)
TYPE: LongTensor
no_position_mask
Key / query pairs for which the position attention terms should be disabled. Shape: n_samples * n_queries * n_keys
TYPE: Optional[BoolTensor]
DEFAULT: None
Tuple[FloatTensor, FloatTensor]
n_samples * n_queries * n_keys
n_samples * n_queries * n_keys * n_heads
Box Transformer architecture combining a multiple BoxTransformerLayer modules. It is mainly used in BoxTransformer.
"},{"location":"layers/box-transformer/#edspdf.layers.box_transformer.BoxTransformerModule--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONinput_size
Input embedding size
TYPE: Optional[int]
DEFAULT: None
num_heads
Number of attention heads in the attention layers
TYPE: int
DEFAULT: 2
n_relative_positions
Maximum range of embeddable relative positions between boxes (further distances are capped to \u00b1n_relative_positions // 2)
TYPE: Optional[int]
DEFAULT: None
dropout_p
Dropout probability both for the attention layers and embedding projections
TYPE: float
DEFAULT: 0.0
head_size
Head sizes of the attention layers
TYPE: Optional[int]
DEFAULT: None
activation
Activation function used in the linear->activation->linear transformations
TYPE: ActivationFunction
DEFAULT: 'gelu'
init_resweight
Initial weight of the residual gates. At 0, the layer acts (initially) as an identity function, and at 1 as a standard Transformer layer. Initializing with a value close to 0 can help the training converge.
TYPE: float
DEFAULT: 0.0
attention_mode
Mode of relative position infused attention layer. See the relative attention documentation for more information.
TYPE: Sequence[Literal['c2c', 'c2p', 'p2c']]
DEFAULT: ('c2c', 'c2p', 'p2c')
n_layers
Number of layers in the Transformer
TYPE: int
DEFAULT: 2
forward
","text":"Forward pass of the BoxTransformer
PARAMETER DESCRIPTIONembeds
Embeddings to contextualize Shape: n_samples * n_keys * input_size
TYPE: FoldedTensor
boxes
Layout features of the input elements
TYPE: Dict
Tuple[FloatTensor, List[FloatTensor]]
n_samples * n_queries * n_keys
n_samples * n_queries * n_keys * n_heads
A self/cross-attention layer that takes relative position of elements into account to compute the attention weights. When running a relative attention layer, key and queries are represented using content and position embeddings, where position embeddings are retrieved using the relative position of keys relative to queries
"},{"location":"layers/relative-attention/#edspdf.layers.relative_attention.RelativeAttention--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONsize
The size of the output embeddings Also serves as default if query_size, pos_size, or key_size is None
TYPE: int
n_heads
The number of attention heads
TYPE: int
query_size
The size of the query embeddings.
TYPE: Optional[int]
DEFAULT: None
key_size
The size of the key embeddings.
TYPE: Optional[int]
DEFAULT: None
value_size
The size of the value embeddings
TYPE: Optional[int]
DEFAULT: None
head_size
The size of each query / key / value chunk used in the attention dot product. Default: key_size / n_heads
TYPE: Optional[int]
DEFAULT: None
position_embedding
The position embedding used as key and query embeddings
TYPE: Optional[Union[FloatTensor, Parameter]]
DEFAULT: None
dropout_p
Dropout probability applied to the attention weights
TYPE: float
DEFAULT: 0.0
same_key_query_proj
Whether to use the same projection operator for content keys and queries when computing the pre-attention key and query embedding chunks. Default: False
TYPE: bool
DEFAULT: False
same_positional_key_query_proj
Whether to use the same projection operator for positional keys and queries when computing the pre-attention key and query embedding chunks. Default: False
TYPE: bool
DEFAULT: False
n_coordinates
The number of positional coordinates. For instance, text is 1D so 1 coordinate, images are 2D so 2 coordinates. Default: 1
TYPE: int
DEFAULT: 1
head_bias
Whether to learn a bias term to add to the attention logits. This is only useful if you plan to use the attention logits for subsequent operations, since attention weights are unaffected by bias terms.
TYPE: bool
DEFAULT: True
do_pooling
Whether to compute the output embedding. If you only plan to use attention logits, you should disable this parameter. Default: True
TYPE: bool
DEFAULT: True
mode
Whether to compute content-to-content (c2c), content-to-position (c2p) or position-to-content (p2c) attention terms. Setting mode=('c2c',)
disables the relative position attention terms: this is the standard attention layer. To get a better intuition about these different types of attention, think of them as fictitious search queries issued by a word in a (1D) text: c2c matches the word's content against the content of the other words, c2p matches its content against their relative positions, and p2c matches its relative position against their content.
TYPE: Sequence[Literal['c2c', 'c2p', 'p2c']]
DEFAULT: ('c2c', 'p2c', 'c2p')
n_additional_heads
The number of additional head logits to compute. Those are not used to compute output embeddings, but may be useful in subsequent operation. Default: 0
TYPE: int
DEFAULT: 0
forward
","text":"Forward pass of the RelativeAttention layer.
PARAMETER DESCRIPTIONcontent_queries
The content query embedding to use in the attention computation Shape: n_samples * n_queries * query_size
TYPE: FloatTensor
content_keys
The content key embedding to use in the attention computation. If None, defaults to the content_queries
Shape: n_samples * n_keys * query_size
TYPE: Optional[FloatTensor]
DEFAULT: None
content_values
The content values embedding to use in the final pooling computation. If None, pooling won't be performed. Shape: n_samples * n_keys * query_size
TYPE: Optional[FloatTensor]
DEFAULT: None
mask
The attention mask: 0 (or False) marks padding elements that should be excluded from the attention computation.
Shape: either - n_samples * n_keys
- n_samples * n_queries * n_keys
- n_samples * n_queries * n_keys * n_heads
TYPE: Optional[BoolTensor]
DEFAULT: None
relative_positions
The relative positions of the keys with respect to the queries. If None, positional attention terms won't be computed. Shape: n_samples * n_queries * n_keys * n_coordinates
TYPE: Optional[LongTensor]
DEFAULT: None
no_position_mask
Key / query pairs for which the position attention terms should be disabled. Shape: n_samples * n_queries * n_keys
TYPE: Optional[BoolTensor]
DEFAULT: None
base_attn
Attention logits to add to the computed attention logits. Shape: n_samples * n_queries * n_keys * n_heads
TYPE: Optional[FloatTensor]
DEFAULT: None
Union[Tuple[FloatTensor, FloatTensor], FloatTensor]
The output embedding (if the do_pooling attribute is set to True). Shape: n_samples * n_keys * size
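A minimal construction sketch for this layer (the sizes are illustrative and only parameters documented above are used; 2 coordinates correspond to x/y positions for 2D box layouts):
from edspdf.layers.relative_attention import RelativeAttention\n\n# Illustrative sizes; mode enables all three attention terms described above\nattention = RelativeAttention(\n    size=64,\n    n_heads=4,\n    n_coordinates=2,\n    mode=(\"c2c\", \"c2p\", \"p2c\"),\n)\n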
A position embedding lookup table that stores embeddings for a fixed number of positions. The value of each of the embedding_dim
channels of the generated embedding follows a trigonometric function (sin for even channels, cos for odd channels). The frequency of the signal in each pair of channels varies according to the temperature parameter.
Any input position above the maximum value num_embeddings
will be capped to num_embeddings - 1
num_embeddings
The maximum number of position embeddings stored in this table
TYPE: int
embedding_dim
The embedding size
TYPE: int
temperature
The temperature controls the range of frequencies used by each channel of the embedding
TYPE: float
DEFAULT: 10000.0
forward
","text":"Forward pass of the SinusoidalEmbedding module
PARAMETER DESCRIPTIONindices
Shape: any
TYPE: LongTensor
FloatTensor
Shape: (*input_shape, embedding_dim)
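A short usage sketch (assuming the class is importable from edspdf.layers.sinusoidal_embedding, which is not stated on this page; sizes are illustrative):
import torch\nfrom edspdf.layers.sinusoidal_embedding import SinusoidalEmbedding  # assumed import path\n\nembedding = SinusoidalEmbedding(num_embeddings=128, embedding_dim=16, temperature=10000.0)\npositions = torch.arange(10)\nout = embedding(positions)  # shape: (10, 16), i.e. (*input_shape, embedding_dim)\n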
Vocabulary layer. This is not meant to be used as a torch.nn.Module
but subclassing torch.nn.Module
makes the instances appear when printing a model, which is nice.
items
Initial vocabulary elements if any. Specific elements such as padding and unk can be set here to enforce their index in the vocabulary.
TYPE: Sequence[T]
DEFAULT: None
default
Default index to use for out-of-vocabulary elements. Defaults to -100
TYPE: int
DEFAULT: -100
initialization
","text":"Enters the initialization mode. Out of vocabulary elements will be assigned an index.
"},{"location":"layers/vocabulary/#edspdf.layers.vocabulary.Vocabulary.encode","title":"encode
","text":"Converts an element into its vocabulary index If the layer is in its initialization mode (with vocab.initialization(): ...
), and the element is out of vocabulary, a new index will be created and returned. Otherwise, any oov element will be encoded with the default
index.
item
RETURNS DESCRIPTION
int
"},{"location":"layers/vocabulary/#edspdf.layers.vocabulary.Vocabulary.decode","title":"decode
","text":"Converts an index into its original value
PARAMETER DESCRIPTIONidx
RETURNS DESCRIPTION
InputT
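A short sketch of how the initialization mode and encode/decode interact (the label strings and initial items are illustrative):
from edspdf.layers.vocabulary import Vocabulary\n\nvocab = Vocabulary(items=[\"__pad__\"], default=-100)\n\nwith vocab.initialization():\n    body_idx = vocab.encode(\"body\")  # new elements receive a fresh index in this mode\n\nvocab.encode(\"unseen-label\")  # -> -100, the default out-of-vocabulary index\nvocab.decode(body_idx)  # -> \"body\"\n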
"},{"location":"pipes/","title":"Components overview","text":"EDS-PDF provides easy-to-use components for defining PDF processing pipelines.
Box extractorsBox classifiersAggregatorsEmbeddings Factory name Descriptionpdfminer-extractor
Extracts text lines with the pdfminer
library mupdf-extractor
Extracts text lines with the pymupdf
library poppler-extractor
Extracts text lines with the poppler
library Factory name Description mask-classifier
Simple rule-based classification multi-mask-classifier
Simple rule-based classification dummy-classifier
Dummy classifier, for testing purposes. random-classifier
To sow chaos trainable-classifier
Trainable box classification model Factory name Description simple-aggregator
Returns a dictionary with one key for each detected class Factory name Description simple-text-embedding
A module that embeds the textual features of the blocks. embedding-combiner
Encodes boxes using a combination of multiple encoders sub-box-cnn-pooler
Pools the output of a CNN over the elements of a box (like words) box-layout-embedding
Encodes the layout of the boxes box-transformer
Contextualizes box representations using a transformer huggingface-embedding
Box representations using a Huggingface multi-modal model. You can add them to your EDS-PDF pipeline by simply calling add_pipe
, for instance:
# \u2191 Omitted code that defines the pipeline object \u2191\npipeline.add_pipe(\"pdfminer-extractor\", name=\"component-name\", config=...)\n
"},{"location":"pipes/aggregators/","title":"Aggregation","text":"The aggregation step compiles extracted text blocs together according to their detected class.
Factory name Descriptionsimple-aggregator
Returns a dictionary with one key for each detected class"},{"location":"pipes/aggregators/simple-aggregator/","title":"Simple aggregator","text":""},{"location":"pipes/aggregators/simple-aggregator/#edspdf.pipes.aggregators.simple.SimpleAggregator","title":"SimpleAggregator
","text":"Aggregator that returns texts and styles. It groups all text boxes with the same label under the aggregated_text
, and additionally aggregates the styles of the text boxes.
Create a pipeline
API-basedConfiguration-basedpipeline = ...\npipeline.add_pipe(\n \"simple-aggregator\",\n name=\"aggregator\",\n config={\n \"new_line_threshold\": 0.2,\n \"new_paragraph_threshold\": 1.5,\n \"label_map\": {\n \"body\": \"text\",\n \"table\": \"text\",\n },\n },\n)\n
...\n\n[components.aggregator]\n@factory = \"simple-aggregator\"\nnew_line_threshold = 0.2\nnew_paragraph_threshold = 1.5\nlabel_map = { body = \"text\", table = \"text\" }\n\n...\n
and run it on a document:
doc = pipeline(doc)\nprint(doc.aggregated_texts)\n# {\n# \"text\": \"This is the body of the document, followed by a table | A | B |\"\n# }\n
"},{"location":"pipes/aggregators/simple-aggregator/#edspdf.pipes.aggregators.simple.SimpleAggregator--parameters","title":"Parameters","text":"PARAMETER DESCRIPTION pipeline
The pipeline object
TYPE: Pipeline
DEFAULT: None
name
The name of the component
TYPE: str
DEFAULT: 'simple-aggregator'
sort
Whether to sort text boxes inside each label group by (page, y, x) position before merging them.
TYPE: bool
DEFAULT: False
new_line_threshold
Minimum ratio of the distance between two lines to the median height of lines to consider them as being on separate lines
TYPE: float
DEFAULT: 0.2
new_paragraph_threshold
Minimum ratio of the distance between two lines to the median height of lines to consider them as being on separate paragraphs and thus add a newline character between them.
TYPE: float
DEFAULT: 1.5
label_map
A dictionary mapping labels to new labels. This is useful to group labels together, for instance, to output both \"body\" and \"table\" as \"text\".
TYPE: Dict
DEFAULT: {}
edspdf/pipes/aggregators/simple.py
def __init__(\n self,\n pipeline: Pipeline = None,\n name: str = \"simple-aggregator\",\n sort: bool = False,\n new_line_threshold: float = 0.2,\n new_paragraph_threshold: float = 1.5,\n label_map: Dict = {},\n) -> None:\n self.name = name\n self.sort = sort\n self.label_map = dict(label_map)\n self.new_line_threshold = new_line_threshold\n self.new_paragraph_threshold = new_paragraph_threshold\n
"},{"location":"pipes/box-classifiers/","title":"Box classifiers","text":"We developed EDS-PDF with modularity in mind. To that end, you can choose between multiple classification methods.
Factory name Descriptionmask-classifier
Simple rule-based classification multi-mask-classifier
Simple rule-based classification dummy-classifier
Dummy classifier, for testing purposes. random-classifier
To sow chaos trainable-classifier
Trainable box classification model"},{"location":"pipes/box-classifiers/dummy/","title":"Dummy classifier","text":"Dummy classifier, for testing purposes. Classifies every line with the same fixed label.
"},{"location":"pipes/box-classifiers/dummy/#edspdf.pipes.classifiers.dummy.DummyClassifier--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONpipeline
The pipeline object.
TYPE: Pipeline
DEFAULT: None
name
The name of the component.
TYPE: str
DEFAULT: 'dummy-classifier'
label
The label to assign to each line.
TYPE: str
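For instance, the following sketch (with an illustrative label) tags every line as body:
# \u2191 Omitted code that defines the pipeline object \u2191\npipeline.add_pipe(\"dummy-classifier\", name=\"classifier\", config={\"label\": \"body\"})\n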
We developed a simple classifier that roughly uses the same strategy as PDFBox: a rectangular mask delimits the zone of interest, and every line falling outside of it is tagged as pollution.
Two factories are available in the classifiers
registry: mask-classifier
and multi-mask-classifier
.
mask-classifier
","text":"The simplest form of mask classification. You define the mask, everything else is tagged as pollution.
PARAMETER DESCRIPTIONpipeline
The pipeline object
TYPE: Pipeline
DEFAULT: None
name
The name of the component
TYPE: str
DEFAULT: 'mask-classifier'
x0
The x0 coordinate of the mask
TYPE: float
y0
The y0 coordinate of the mask
TYPE: float
x1
The x1 coordinate of the mask
TYPE: float
y1
The y1 coordinate of the mask
TYPE: float
threshold
The threshold for the alignment
TYPE: float
DEFAULT: 1.0
pipeline.add_pipe(\n \"mask-classifier\",\n name=\"classifier\",\n config={\n \"threshold\": 0.9,\n \"x0\": 0.1,\n \"y0\": 0.1,\n \"x1\": 0.9,\n \"y1\": 0.9,\n },\n)\n
[components.classifier]\n@classifiers = \"mask-classifier\"\nx0 = 0.1\ny0 = 0.1\nx1 = 0.9\ny1 = 0.9\nthreshold = 0.9\n
"},{"location":"pipes/box-classifiers/mask/#edspdf.pipes.classifiers.mask.mask_classifier_factory","title":"multi-mask-classifier
","text":"A generalisation, wherein the user defines a number of regions.
The following configuration produces exactly the same classifier as mask.v1
example above.
Any bloc that is not part of a mask is tagged as pollution
.
pipeline
The pipeline object
TYPE: Pipeline
DEFAULT: None
name
TYPE: str
DEFAULT: 'multi-mask-classifier'
threshold
The threshold for the alignment
TYPE: float
DEFAULT: 1.0
masks
The masks
TYPE: Box
DEFAULT: {}
pipeline.add_pipe(\n \"multi-mask-classifier\",\n name=\"classifier\",\n config={\n \"threshold\": 0.9,\n \"mymask\": {\"x0\": 0.1, \"y0\": 0.1, \"x1\": 0.9, \"y1\": 0.3, \"label\": \"body\"},\n },\n)\n
[components.classifier]\n@factory = \"multi-mask-classifier\"\nthreshold = 0.9\n\n[components.classifier.mymask]\nlabel = \"body\"\nx0 = 0.1\ny0 = 0.1\nx1 = 0.9\ny1 = 0.9\n
The following configuration defines a header
region.
pipeline.add_pipe(\n \"multi-mask-classifier\",\n name=\"classifier\",\n config={\n \"threshold\": 0.9,\n \"body\": {\"x0\": 0.1, \"y0\": 0.1, \"x1\": 0.9, \"y1\": 0.3, \"label\": \"header\"},\n \"header\": {\"x0\": 0.1, \"y0\": 0.3, \"x1\": 0.9, \"y1\": 0.9, \"label\": \"body\"},\n },\n)\n
[components.classifier]\n@factory = \"multi-mask-classifier\"\nthreshold = 0.9\n\n[components.classifier.header]\nlabel = \"header\"\nx0 = 0.1\ny0 = 0.1\nx1 = 0.9\ny1 = 0.3\n\n[components.classifier.body]\nlabel = \"body\"\nx0 = 0.1\ny0 = 0.3\nx1 = 0.9\ny1 = 0.9\n
"},{"location":"pipes/box-classifiers/random/","title":"Random classifier","text":"Random classifier, for chaos purposes. Classifies each box to a random element.
"},{"location":"pipes/box-classifiers/random/#edspdf.pipes.classifiers.random.RandomClassifier--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONpipeline
The pipeline object.
TYPE: Pipeline
name
The name of the component.
TYPE: str
DEFAULT: 'random-classifier'
labels
The labels to assign to each line. If a list is passed, each label is assigned with equal probability. If a dict is passed, the keys are the labels and the values are the probabilities.
TYPE: Union[List[str], Dict[str, float]]
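For instance, the following sketch assigns the body label to roughly 80% of the lines and pollution to the rest (the probabilities are illustrative):
# \u2191 Omitted code that defines the pipeline object \u2191\npipeline.add_pipe(\n    \"random-classifier\",\n    name=\"classifier\",\n    config={\"labels\": {\"body\": 0.8, \"pollution\": 0.2}},\n)\n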
This component predicts a label for each box over the whole document using machine learning.
Note
You must train the model your model to use this classifier. See Model training for more information
"},{"location":"pipes/box-classifiers/trainable/#edspdf.pipes.classifiers.trainable.TrainableClassifier--examples","title":"Examples","text":"The classifier is composed of the following blocks:
In this example, we use a box-embedding
layer to generate the embeddings of the boxes. It is composed of a text encoder that embeds the text features of the boxes and a layout encoder that embeds the layout features of the boxes. These two embeddings are summed and passed through an optional contextualizer
, here a box-transformer
.
pipeline.add_pipe(\n \"trainable-classifier\",\n name=\"classifier\",\n config={\n # simple embedding computed by pooling embeddings of words in each box\n \"embedding\": {\n \"@factory\": \"sub-box-cnn-pooler\",\n \"out_channels\": 64,\n \"kernel_sizes\": (3, 4, 5),\n \"embedding\": {\n \"@factory\": \"simple-text-embedding\",\n \"size\": 72,\n },\n },\n \"labels\": [\"body\", \"pollution\"],\n },\n)\n
[components.classifier]\n@factory = \"trainable-classifier\"\nlabels = [\"body\", \"pollution\"]\n\n[components.classifier.embedding]\n@factory = \"sub-box-cnn-pooler\"\nout_channels = 64\nkernel_sizes = (3, 4, 5)\n\n[components.classifier.embedding.embedding]\n@factory = \"simple-text-embedding\"\nsize = 72\n
"},{"location":"pipes/box-classifiers/trainable/#edspdf.pipes.classifiers.trainable.TrainableClassifier--parameters","title":"Parameters","text":"PARAMETER DESCRIPTION labels
Initial labels of the classifier (will be completed during initialization)
TYPE: Sequence[str]
DEFAULT: ('pollution',)
embedding
Embedding module to encode the PDF boxes
TYPE: TrainablePipe[EmbeddingOutput]
We offer multiple embedding methods to encode the text and layout information of the PDFs. The following components can be added to a pipeline or composed together, and contain preprocessing and postprocessing logic to convert and batch documents.
Factory name Descriptionsimple-text-embedding
A module that embeds the textual features of the blocks. embedding-combiner
Encodes boxes using a combination of multiple encoders sub-box-cnn-pooler
Pools the output of a CNN over the elements of a box (like words) box-layout-embedding
Encodes the layout of the boxes box-transformer
Contextualizes box representations using a transformer huggingface-embedding
Box representations using a Huggingface multi-modal model. Layers
These components are not to be confused with layers
, which are standard PyTorch modules that can be used to build trainable components, such as the ones described here.
This component encodes the geometrical features of a box, as extracted by the BoxLayoutPreprocessor module, into an embedding. For position modes, use:
\"sin\"
to embed positions with a fixed SinusoidalEmbedding\"learned\"
to embed positions using a learned standard pytorch embedding layerEach produces embedding is the concatenation of the box width, height and the top, left, bottom and right coordinates, each embedded depending on the *_mode
param.
size
Size of the output box embedding
TYPE: int
n_positions
Number of position embeddings stored in the PositionEmbedding module
TYPE: int
x_mode
Position embedding mode of the x coordinates
TYPE: Literal['sin', 'learned']
DEFAULT: 'sin'
y_mode
Position embedding mode of the y coordinates
TYPE: Literal['sin', 'learned']
DEFAULT: 'sin'
w_mode
Position embedding mode of the width features
TYPE: Literal['sin', 'learned']
DEFAULT: 'sin'
h_mode
Position embedding mode of the height features
TYPE: Literal['sin', 'learned']
DEFAULT: 'sin'
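As an illustrative sketch, this embedding can be declared as the embedding sub-component of a trainable pipe; pairing it alone with a classifier is shown here for brevity rather than as a recommended setup, and the sizes are arbitrary:
# \u2191 Omitted code that defines the pipeline object \u2191\npipeline.add_pipe(\n    \"trainable-classifier\",\n    name=\"classifier\",\n    config={\n        \"labels\": [],\n        \"embedding\": {\n            \"@factory\": \"box-layout-embedding\",\n            \"size\": 72,\n            \"n_positions\": 64,\n            \"x_mode\": \"sin\",\n            \"y_mode\": \"sin\",\n        },\n    },\n)\n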
BoxTransformer using BoxTransformerModule under the hood.
Note
This module is a TrainablePipe and can be used in a Pipeline, while BoxTransformerModule is a standard PyTorch module, which does not take care of the preprocessing, collating, etc. of the input documents.
"},{"location":"pipes/embeddings/box-transformer/#edspdf.pipes.embeddings.box_transformer.BoxTransformer--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONpipeline
Pipeline instance
TYPE: Pipeline
DEFAULT: None
name
Name of the component
TYPE: str
DEFAULT: 'box-transformer'
num_heads
Number of attention heads in the attention layers
TYPE: int
DEFAULT: 2
n_relative_positions
Maximum range of embeddable relative positions between boxes (further distances are capped to \u00b1n_relative_positions // 2)
TYPE: Optional[int]
DEFAULT: None
dropout_p
Dropout probability both for the attention layers and embedding projections
TYPE: float
DEFAULT: 0.0
head_size
Head sizes of the attention layers
TYPE: Optional[int]
DEFAULT: None
activation
Activation function used in the linear->activation->linear transformations
TYPE: ActivationFunction
DEFAULT: 'gelu'
init_resweight
Initial weight of the residual gates. At 0, the layer acts (initially) as an identity function, and at 1 as a standard Transformer layer. Initializing with a value close to 0 can help the training converge.
TYPE: float
DEFAULT: 0.0
attention_mode
Mode of relative position infused attention layer. See the relative attention documentation for more information.
TYPE: Sequence[Literal['c2c', 'c2p', 'p2c']]
DEFAULT: ('c2c', 'c2p', 'p2c')
n_layers
Number of layers in the Transformer
TYPE: int
DEFAULT: 2
Encodes boxes using a combination of multiple encoders
"},{"location":"pipes/embeddings/embedding-combiner/#edspdf.pipes.embeddings.embedding_combiner.EmbeddingCombiner--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONpipeline
The pipeline object
TYPE: Pipeline
DEFAULT: None
name
The name of the pipe
TYPE: str
DEFAULT: 'embedding-combiner'
mode
The mode to use to combine the encoders:
sum
: Sum the outputs of the encoderscat
: Concatenate the outputs of the encoders TYPE: Literal['sum', 'cat']
DEFAULT: 'sum'
dropout_p
Dropout probability used on the output of the box and textual encoders
TYPE: float
DEFAULT: 0.0
encoders
The encoders to use. The keys are the names of the encoders and the values are the encoders themselves.
TYPE: TrainablePipe[EmbeddingOutput]
DEFAULT: {}
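For instance, the training recipe further below sums a text encoder and a layout encoder; here is the same idea as a standalone configuration sketch, where text_encoder and layout_encoder are the keys of the encoders argument and the sizes are illustrative:
embedding_config = {\n    \"@factory\": \"embedding-combiner\",\n    \"mode\": \"sum\",  # summing requires both encoders to output embeddings of the same size (72 here)\n    \"dropout_p\": 0.1,\n    \"text_encoder\": {\"@factory\": \"simple-text-embedding\", \"size\": 72},\n    \"layout_encoder\": {\"@factory\": \"box-layout-embedding\", \"size\": 72, \"n_positions\": 64},\n}\n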
The HuggingfaceEmbedding component is a wrapper around Huggingface multi-modal models. Such pre-trained models should offer better results than a model trained from scratch. Compared to using the raw Huggingface model, we offer a simple mechanism to split long documents into strided windows before feeding them to the model.
"},{"location":"pipes/embeddings/huggingface-embedding/#edspdf.pipes.embeddings.huggingface_embedding.HuggingfaceEmbedding--windowing","title":"Windowing","text":"The HuggingfaceEmbedding component splits long documents into smaller windows before feeding them to the model. This is done to avoid hitting the maximum number of tokens that can be processed by the model on a single device. The window size and stride can be configured using the window
and stride
parameters. The default values are 510 and 255 respectively, which means that the model will process windows of 510 tokens, each separated by 255 tokens. Whenever a token appears in multiple windows, the embedding of the \"most contextualized\" occurrence is used, i.e. the occurrence that is the closest to the center of its window.
Here is an overview how this works in a classifier model :
"},{"location":"pipes/embeddings/huggingface-embedding/#edspdf.pipes.embeddings.huggingface_embedding.HuggingfaceEmbedding--examples","title":"Examples","text":"Here is an example of how to define a pipeline with the HuggingfaceEmbedding component:
from edspdf import Pipeline\n\nmodel = Pipeline()\nmodel.add_pipe(\n \"pdfminer-extractor\",\n name=\"extractor\",\n config={\n \"render_pages\": True,\n },\n)\nmodel.add_pipe(\n \"huggingface-embedding\",\n name=\"embedding\",\n config={\n \"model\": \"microsoft/layoutlmv3-base\",\n \"use_image\": False,\n \"window\": 128,\n \"stride\": 64,\n \"line_pooling\": \"mean\",\n },\n)\nmodel.add_pipe(\n \"trainable-classifier\",\n name=\"classifier\",\n config={\n \"embedding\": model.get_pipe(\"embedding\"),\n \"labels\": [],\n },\n)\n
This model can then be trained following the training recipe.
"},{"location":"pipes/embeddings/huggingface-embedding/#edspdf.pipes.embeddings.huggingface_embedding.HuggingfaceEmbedding--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONpipeline
The pipeline instance
TYPE: Pipeline
DEFAULT: None
name
The component name
TYPE: str
DEFAULT: 'huggingface-embedding'
model
The Huggingface model name or path
TYPE: str
DEFAULT: None
use_image
Whether to use the image or not in the model
TYPE: bool
DEFAULT: True
window
The window size to use when splitting long documents into smaller windows before feeding them to the Transformer model (default: 510 = 512 - 2)
TYPE: int
DEFAULT: 510
stride
The stride (distance between windows) to use when splitting long documents into smaller windows (default: 510 / 2 = 255)
TYPE: int
DEFAULT: 255
line_pooling
The pooling strategy to use when combining the embeddings of the tokens in a line into a single line embedding
TYPE: Literal['mean', 'max', 'sum']
DEFAULT: 'mean'
max_tokens_per_device
The maximum number of tokens that can be processed by the model on a single device. This does not affect the results but can be used to reduce the memory usage of the model, at the cost of a longer processing time.
TYPE: int
DEFAULT: 128 * 128
A module that embeds the textual features of the blocks
"},{"location":"pipes/embeddings/simple-text-embedding/#edspdf.pipes.embeddings.simple_text_embedding.SimpleTextEmbedding--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONsize
Size of the output box embedding
TYPE: int
pipeline
The pipeline object
TYPE: Pipeline
DEFAULT: None
name
Name of the component
TYPE: str
DEFAULT: 'simple-text-embedding'
One-dimensional multi-kernel CNN encoding layer. Input embeddings are convolved using linear kernels, each parametrized with a (window) size of kernel_size[kernel_i].
The outputs of the kernels are concatenated together, max-pooled and finally projected to a size of output_size
.
pipeline
Pipeline instance
TYPE: Pipeline
DEFAULT: None
name
Name of the component
TYPE: str
DEFAULT: 'sub-box-cnn-pooler'
output_size
Size of the output embeddings Defaults to the input_size
TYPE: Optional[int]
DEFAULT: None
out_channels
Number of channels
TYPE: Optional[int]
DEFAULT: None
kernel_sizes
Window size of each kernel
TYPE: Sequence[int]
DEFAULT: (3, 4, 5)
activation
Activation function to use
TYPE: ActivationFunction
DEFAULT: 'relu'
The extraction phase consists of reading the PDF document and gathering text blocs, along with their dimensions and positions within the document. These blocs then go through the classification phase, which separates the body from the rest.
"},{"location":"pipes/extractors/#text-based-pdf","title":"Text-based PDF","text":"We provide a multiple extractor architectures for text-based PDFs :
Factory name Descriptionpdfminer-extractor
Extracts text lines with the pdfminer
library mupdf-extractor
Extracts text lines with the pymupdf
library poppler-extractor
Extracts text lines with the poppler
library"},{"location":"pipes/extractors/#image-based-pdf","title":"Image-based PDF","text":"Image-based PDF documents require an OCR1 step, which is not natively supported by EDS-PDF. However, you can easily extend EDS-PDF by adding such a method to the registry.
We plan on adding such an OCR extractor component in the future.
Optical Character Recognition, or OCR, is the process of extracting characters and words from an image.\u00a0\u21a9
We provide a PDF line extractor built on top of PdfMiner.
This is the most portable extractor, since it is pure-python and can therefore be run on any platform. Be sure to have a look at their documentation, especially the part providing a bird's eye view of the PDF extraction process.
"},{"location":"pipes/extractors/pdfminer/#edspdf.pipes.extractors.pdfminer.PdfMinerExtractor--examples","title":"Examples","text":"API-basedConfiguration-basedpipeline.add_pipe(\n \"pdfminer-extractor\",\n config=dict(\n extract_style=False,\n ),\n)\n
[components.extractor]\n@factory = \"pdfminer-extractor\"\nextract_style = false\n
And use the pipeline on a PDF document:
from pathlib import Path\n\n# Apply on a new document\npipeline(Path(\"path/to/your/pdf/document\").read_bytes())\n
"},{"location":"pipes/extractors/pdfminer/#edspdf.pipes.extractors.pdfminer.PdfMinerExtractor--parameters","title":"Parameters","text":"PARAMETER DESCRIPTION line_overlap
See PDFMiner documentation
TYPE: float
DEFAULT: 0.5
char_margin
See PDFMiner documentation
TYPE: float
DEFAULT: 2.05
line_margin
See PDFMiner documentation
TYPE: float
DEFAULT: 0.5
word_margin
See PDFMiner documentation
TYPE: float
DEFAULT: 0.1
boxes_flow
See PDFMiner documentation
TYPE: Optional[float]
DEFAULT: 0.5
detect_vertical
See PDFMiner documentation
TYPE: bool
DEFAULT: False
all_texts
See PDFMiner documentation
TYPE: bool
DEFAULT: False
extract_style
Whether to extract style (font, size, ...) information for each line of the document. Default: False
TYPE: bool
DEFAULT: False
render_pages
Whether to extract the rendered page as a numpy array in the page.image
attribute (defaults to False)
TYPE: bool
DEFAULT: False
render_dpi
DPI to use when rendering the page (defaults to 200)
TYPE: int
DEFAULT: 200
raise_on_error
Whether to raise an error if the PDF cannot be parsed. Default: False
TYPE: bool
DEFAULT: False
This section goes over a few use-cases for PDF extraction. It is meant as a more hands-on tutorial to get a grip on the library.
"},{"location":"recipes/annotation/","title":"PDF Annotation","text":"In this section, we will cover one methodology to annotate PDF documents.
Data annotation at AP-HP's CDW
At AP-HP's CDW1, we recently moved away from a rule- and Java-based PDF extraction pipeline (using PDFBox) to one using EDS-PDF. Hence, EDS-PDF is used in production, helping extract text from around 100k PDF documents every day.
To train our pipeline presently in production, we annotated around 270 documents, and reached a f1-score of 0.98 on the body classification.
"},{"location":"recipes/annotation/#preparing-the-data-for-annotation","title":"Preparing the data for annotation","text":"We will frame the annotation phase as an image segmentation task, where annotators are asked to draw bounding boxes around the different sections. Hence, the very first step is to convert PDF documents to images. We suggest using the library pdf2image
for that step.
The following script will convert the PDF documents located in a data/pdfs
directory to PNG images inside the data/images
folder.
import pdf2image\nfrom pathlib import Path\n\nDATA_DIR = Path(\"data\")\nPDF_DIR = DATA_DIR / \"pdfs\"\nIMAGE_DIR = DATA_DIR / \"images\"\n\nfor pdf in PDF_DIR.glob(\"*.pdf\"):\n imgs = pdf2image.convert_from_bytes(pdf)\n\n for page, img in enumerate(imgs):\n path = IMAGE_DIR / f\"{pdf.stem}_{page}.png\"\n img.save(path)\n
You can use any annotation tool to annotate the images. If you're looking for a simple way to annotate from within a Jupyter Notebook, ipyannotations might be a good fit.
You will need to post-process the output to convert the annotations to the following format:
Key Descriptionpage
Page within the PDF (0-indexed) x0
Horizontal position of the top-left corner of the bounding box x1
Horizontal position of the bottom-right corner of the bounding box y0
Vertical position of the top-left corner of the bounding box y1
Vertical position of the bottom-right corner of the bounding box label
Class of the bounding box (eg body
, header
...) All dimensions should be normalised by the height and width of the page.
"},{"location":"recipes/annotation/#saving-the-dataset","title":"Saving the dataset","text":"Once the annotation phase is complete, make sure the train/test split is performed once and for all when you create the dataset.
We suggest the following structure:
Directory structuredataset/\n\u251c\u2500\u2500 train/\n\u2502 \u251c\u2500\u2500 <note_id_1>.pdf\n\u2502 \u251c\u2500\u2500 <note_id_1>.json\n\u2502 \u251c\u2500\u2500 <note_id_2>.pdf\n\u2502 \u251c\u2500\u2500 <note_id_2>.json\n\u2502 \u2514\u2500\u2500 ...\n\u2514\u2500\u2500 test/\n \u251c\u2500\u2500 <note_id_n>.pdf\n \u251c\u2500\u2500 <note_id_n>.json\n \u2514\u2500\u2500 ...\n
Where the normalised annotation resides in a JSON file living next to the related PDF, and uses the following schema:
Key Descriptionnote_id
Reference to the document <properties>
Optional property of the document itself annotations
List of annotations, following the schema above This structure presents the advantage of being machine- and human-friendly. The JSON file contains annotated regions as well as any document property that could be useful to adapt the pipeline (typically for the classification step).
"},{"location":"recipes/annotation/#extracting-annotations","title":"Extracting annotations","text":"The following snippet extracts the annotations into a workable format:
from pathlib import Path\nimport pandas as pd\n\n\ndef get_annotations(\n directory: Path,\n) -> pd.DataFrame:\n\"\"\"\n Read annotations from the dataset directory.\n\n Parameters\n ----------\n directory : Path\n Dataset directory\n\n Returns\n -------\n pd.DataFrame\n Pandas DataFrame containing the annotations.\n \"\"\"\n dfs = []\n\n iterator = tqdm(list(directory.glob(\"*.json\")))\n\n for path in iterator:\n meta = json.loads(path.read_text())\n df = pd.DataFrame.from_records(meta.pop(\"annotations\"))\n\n for k, v in meta.items(): # (1)\n df[k] = v\n\n dfs.append(df)\n\n return pd.concat(dfs)\n\n\ntrain_path = Path(\"dataset/train\")\n\nannotations = get_annotations(train_path)\n
The annotations compiled this way can be used to train a pipeline. See the trained pipeline recipe for more detail.
Greater Paris University Hospital's Clinical Data Warehouse\u00a0\u21a9
EDS-PDF is organised around a function registry powered by catalogue and a custom configuration system. The result is a powerful framework that is easy to extend - and we'll see how in this section.
For this recipe, let's imagine we're not entirely satisfied with the aggregation proposed by EDS-PDF. For instance, we might want an aggregator that outputs the text in Markdown format.
Note
Properly converting to markdown is no easy task. For this example, we will limit ourselves to detecting bold and italics sections.
"},{"location":"recipes/extension/#developing-the-new-aggregator","title":"Developing the new aggregator","text":"Our aggregator will inherit from the SimpleAggregator
, and use the style to detect italics and bold sections.
from edspdf import registry\nfrom edspdf.pipes.aggregators.simple import SimpleAggregator\nfrom edspdf.structures import PDFDoc, Text\n\n\n@registry.factory.register(\"markdown-aggregator\") # (1)\nclass MarkdownAggregator(SimpleAggregator):\n def __call__(self, doc: PDFDoc) -> PDFDoc:\n doc = super().__call__(doc)\n\n for label in doc.aggregated_texts.keys():\n text = doc.aggregated_texts[label].text\n\n fragments = []\n\n offset = 0\n for s in doc.aggregated_texts[label].properties:\n if s.begin >= s.end:\n continue\n if offset < s.begin:\n fragments.append(text[offset : s.begin])\n\n offset = s.end\n snippet = text[s.begin : s.end]\n if s.bold:\n snippet = f\"**{snippet}**\"\n if s.italic:\n snippet = f\"_{snippet}_\"\n fragments.append(snippet)\n\n if offset < len(text):\n fragments.append(text[offset:])\n\n doc.aggregated_texts[label] = Text(text=\"\".join(fragments))\n\n return doc\n
__call__
method. It will output a single string, corresponding to the markdown-formatted output.That's it! You can use this new aggregator with the API:
from edspdf import Pipeline\nfrom markdown_aggregator import MarkdownAggregator # (1)\n\nmodel = Pipeline()\n# will extract text lines from a document\nmodel.add_pipe(\n \"pdfminer-extractor\",\n config=dict(\n extract_style=False,\n ),\n)\n# classify everything inside the `body` bounding box as `body`\nmodel.add_pipe(\"mask-classifier\", config={\"x0\": 0.1, \"y0\": 0.1, \"x1\": 0.9, \"y1\": 0.9})\n# aggregates the lines together to generate the markdown formatted text\nmodel.add_pipe(\"markdown-aggregator\")\n
It all works relatively smoothly!
"},{"location":"recipes/extension/#making-the-aggregator-discoverable","title":"Making the aggregator discoverable","text":"Now, how can we instantiate the pipeline using the configuration system? The registry needs to be aware of the new function, but we shouldn't have to import mardown_aggregator.py
just so that the module is registered as a side-effect...
Catalogue solves this problem by using Python entry points.
pyproject.tomlsetup.py[project.entry-points.\"edspdf_factories\"]\n\"markdown-aggregator\" = \"markdown_aggregator:MarkdownAggregator\"\n
from setuptools import setup\n\nsetup(\n name=\"edspdf-markdown-aggregator\",\n entry_points={\n \"edspdf_factories\": [\n \"markdown-aggregator = markdown_aggregator:MarkdownAggregator\"\n ]\n },\n)\n
By declaring the new aggregator as an entrypoint, it will become discoverable by EDS-PDF as long as it is installed in your environment!
"},{"location":"recipes/rule-based/","title":"Rule-based extraction","text":"Let's create a rule-based extractor for PDF documents.
Note
This pipeline will likely perform poorly as soon as your PDF documents come in varied forms. In that case, even a very simple trained pipeline may give you a substantial performance boost (see next section).
First, download this example PDF.
We will use the following configuration:
config.cfg[pipeline]\ncomponents = [\"extractor\", \"classifier\", \"aggregator\"]\ncomponents_config = ${components}\n\n[components.extractor]\n@factory = \"pdfminer-extractor\" # (2)\nextract_style = true\n\n[components.classifier]\n@factory = \"mask-classifier\" # (3)\nx0 = 0.2\nx1 = 0.9\ny0 = 0.3\ny1 = 0.6\nthreshold = 0.1\n\n[components.aggregator]\n@factory = \"styled-aggregator\" # (4)\n
body
label, everything else will be tagged as pollution.Save the configuration as config.cfg
and run the following snippet:
import edspdf\nimport pandas as pd\nfrom pathlib import Path\n\nmodel = edspdf.load(\"config.cfg\") # (1)\n\n# Get a PDF\npdf = Path(\"/Users/perceval/Development/edspdf/tests/resources/letter.pdf\").read_bytes()\npdf = model(pdf)\n\nbody = pdf.aggregated_texts[\"body\"]\n\ntext, style = body.text, body.properties\nprint(text)\nprint(pd.DataFrame(style))\n
This code will output the following results:
VisualisationExtracted TextExtracted StyleCher Pr ABC, Cher DEF,\n\nNous souhaitons remercier le CSE pour son avis favorable quant \u00e0 l\u2019acc\u00e8s aux donn\u00e9es de\nl\u2019Entrep\u00f4t de Donn\u00e9es de Sant\u00e9 du projet n\u00b0 XXXX.\n\nNous avons bien pris connaissance des conditions requises pour cet avis favorable, c\u2019est\npourquoi nous nous engageons par la pr\u00e9sente \u00e0 :\n\n\u2022 Informer individuellement les patients concern\u00e9s par la recherche, admis \u00e0 l'AP-HP\navant juillet 2017, sortis vivants, et non r\u00e9admis depuis.\n\n\u2022 Effectuer une demande d'autorisation \u00e0 la CNIL en cas d'appariement avec d\u2019autres\ncohortes.\n\nBien cordialement,\n
The start
and end
columns refer to the character indices within the extracted text.
In this chapter, we'll see how we can train a deep-learning based classifier to better classify the lines of the document and extract texts from the document.
"},{"location":"recipes/training/#step-by-step-walkthrough","title":"Step-by-step walkthrough","text":"Training supervised models consists in feeding batches of samples taken from a training corpus to a model instantiated from a given architecture and optimizing the learnable weights of the model to decrease a given loss. The process of training a pipeline with EDS-PDF is as follows:
We first start by seeding the random states and instantiating a new trainable pipeline. Here we show two examples of pipeline, the first one based on a custom embedding architecture and the second one based on a pre-trained HuggingFace transformer model.
Custom architecturePre-trained HuggingFace transformerThe architecture of the trainable classifier of this recipe is described in the following figure:
from edspdf import Pipeline\nfrom edspdf.utils.random import set_seed\n\nset_seed(42)\n\nmodel = Pipeline()\nmodel.add_pipe(\"pdfminer-extractor\", name=\"extractor\") # (1)\nmodel.add_pipe(\n \"box-transformer\",\n name=\"embedding\",\n config={\n \"num_heads\": 4,\n \"dropout_p\": 0.1,\n \"activation\": \"gelu\",\n \"init_resweight\": 0.01,\n \"head_size\": 16,\n \"attention_mode\": [\"c2c\", \"c2p\", \"p2c\"],\n \"n_layers\": 1,\n \"n_relative_positions\": 64,\n \"embedding\": {\n \"@factory\": \"embedding-combiner\",\n \"dropout_p\": 0.1,\n \"text_encoder\": {\n \"@factory\": \"sub-box-cnn-pooler\",\n \"out_channels\": 64,\n \"kernel_sizes\": (3, 4, 5),\n \"embedding\": {\n \"@factory\": \"simple-text-embedding\",\n \"size\": 72,\n },\n },\n \"layout_encoder\": {\n \"@factory\": \"box-layout-embedding\",\n \"n_positions\": 64,\n \"x_mode\": \"learned\",\n \"y_mode\": \"learned\",\n \"w_mode\": \"learned\",\n \"h_mode\": \"learned\",\n \"size\": 72,\n },\n },\n },\n)\nmodel.add_pipe(\n \"trainable-classifier\",\n name=\"classifier\",\n config={\n \"embedding\": model.get_pipe(\"embedding\"),\n \"labels\": [],\n },\n)\n
model = Pipeline()\nmodel.add_pipe(\n \"mupdf-extractor\",\n name=\"extractor\",\n config={\n \"render_pages\": True,\n },\n) # (1)\nmodel.add_pipe(\n \"huggingface-embedding\",\n name=\"embedding\",\n config={\n \"model\": \"microsoft/layoutlmv3-base\",\n \"use_image\": False,\n \"window\": 128,\n \"stride\": 64,\n \"line_pooling\": \"mean\",\n },\n)\nmodel.add_pipe(\n \"trainable-classifier\",\n name=\"classifier\",\n config={\n \"embedding\": model.get_pipe(\"embedding\"),\n \"labels\": [],\n },\n)\n
We then load and adapt (i.e., convert into PDFDoc) the training and validation dataset, which is often a combination of JSON and PDF files. The recommended way of doing this is to make a Python generator of PDFDoc objects.
train_docs = list(segmentation_adapter(train_path)(model))\nval_docs = list(segmentation_adapter(val_path)(model))\n
We initialize the missing or incomplete components attributes (such as vocabularies) with the training dataset
model.post_init(train_docs)\n
The training dataset is then preprocessed into features. The resulting preprocessed dataset is then wrapped into a pytorch DataLoader to be fed to the model during the training loop with the model's own collate method.
preprocessed = list(model.preprocess_many(train_docs, supervision=True))\ndataloader = DataLoader(\n preprocessed,\n batch_size=batch_size,\n collate_fn=model.collate,\n shuffle=True,\n)\n
We instantiate an optimizer and start the training loop
from itertools import chain, repeat\n\noptimizer = torch.optim.AdamW(\n params=model.parameters(),\n lr=lr,\n)\n\n# We will loop over the dataloader\niterator = chain.from_iterable(repeat(dataloader))\n\nfor step in tqdm(range(max_steps), \"Training model\", leave=True):\n batch = next(iterator)\n optimizer.zero_grad()\n
The trainable components are fed the collated batches from the dataloader with the TrainablePipe.module_forward
methods to compute the losses. Since outputs of shared subcomponents are reused between components, we enable caching by wrapping this step in a cache context. The training loop is otherwise carried in a similar fashion to a standard pytorch training loop
with model.cache():\n loss = torch.zeros((), device=\"cpu\")\n for name, component in model.trainable_pipes():\n output = component.module_forward(batch[component.name])\n if \"loss\" in output:\n loss += output[\"loss\"]\n\n loss.backward()\n\n optimizer.step()\n
Finally, the model is evaluated on the validation dataset at regular intervals and saved at the end of the training. To score the model, we only want to run \"classifier\" component and not the extractor, otherwise we would overwrite annotated text boxes on documents in the val_docs
dataset, and have mismatching text boxes between the gold and predicted documents. To save the model, although you can use torch.save
to save your model, we provide a safer method to avoid the security pitfalls of pickle models
from edspdf import Pipeline\nfrom sklearn.metrics import classification_report\nfrom copy import deepcopy\n\n\ndef score(golds, preds):\n return classification_report(\n [b.label for gold in golds for b in gold.text_boxes if b.text != \"\"],\n [b.label for pred in preds for b in pred.text_boxes if b.text != \"\"],\n output_dict=True,\n zero_division=0,\n )\n\n\n...\n\nif (step % 100) == 0:\n # we only want to run \"classifier\" component, not overwrite the text boxes\n with model.select_pipes(enable=[\"classifier\"]):\n print(score(val_docs, model.pipe(deepcopy(val_docs))))\n\n# torch.save(model, \"model.pt\")\nmodel.save(\"model\")\n
The first step of training a pipeline is to adapt the dataset to the pipeline. This is done by converting the dataset into a list of PDFDoc objects, using an extractor. The following function loads a dataset of .pdf
and .json
files, where each .json
file contain box annotations represented with page
, x0
, x1
, y0
, y1
and label
.
from edspdf.utils.alignment import align_box_labels\nfrom pathlib import Path\nfrom pydantic import DirectoryPath\nfrom edspdf.registry import registry\nfrom edspdf.structures import Box\nimport json\n\n\n@registry.adapter.register(\"my-segmentation-adapter\")\ndef segmentation_adapter(\n path: DirectoryPath,\n):\n def adapt_to(model):\n for anns_filepath in sorted(Path(path).glob(\"*.json\")):\n pdf_filepath = str(anns_filepath).replace(\".json\", \".pdf\")\n with open(anns_filepath) as f:\n sample = json.load(f)\n pdf = Path(pdf_filepath).read_bytes()\n\n if len(sample[\"annotations\"]) == 0:\n continue\n\n doc = model.components.extractor(pdf)\n doc.id = pdf_filepath.split(\".\")[0].split(\"/\")[-1]\n doc.lines = [\n line\n for page in sorted(set(b.page for b in doc.lines))\n for line in align_box_labels(\n src_boxes=[\n Box(\n page_num=b[\"page\"],\n x0=b[\"x0\"],\n x1=b[\"x1\"],\n y0=b[\"y0\"],\n y1=b[\"y1\"],\n label=b[\"label\"],\n )\n for b in sample[\"annotations\"]\n if b[\"page\"] == page\n ],\n dst_boxes=doc.lines,\n pollution_label=None,\n )\n if line.text == \"\" or line.label is not None\n ]\n yield doc\n\n return adapt_to\n
"},{"location":"recipes/training/#full-example","title":"Full example","text":"Let's wrap the training code in a function, and make it callable from the command line using confit !
train.pyimport itertools\nimport json\nfrom copy import deepcopy\nfrom pathlib import Path\n\nimport torch\nfrom confit import Cli\nfrom pydantic import DirectoryPath\nfrom torch.utils.data import DataLoader\nfrom tqdm import tqdm\n\nfrom edspdf import Pipeline, registry\nfrom edspdf.structures import Box\nfrom edspdf.utils.alignment import align_box_labels\nfrom edspdf.utils.random import set_seed\n\napp = Cli(pretty_exceptions_show_locals=False)\n\n\ndef score(golds, preds):\n return classification_report(\n [b.label for gold in golds for b in gold.text_boxes if b.text != \"\"],\n [b.label for pred in preds for b in pred.text_boxes if b.text != \"\"],\n output_dict=True,\n zero_division=0,\n )\n\n\n@registry.adapter.register(\"my-segmentation-adapter\")\ndef segmentation_adapter(\n path: str,\n):\n def adapt_to(model):\n for anns_filepath in sorted(Path(path).glob(\"*.json\")):\n pdf_filepath = str(anns_filepath).replace(\".json\", \".pdf\")\n with open(anns_filepath) as f:\n sample = json.load(f)\n pdf = Path(pdf_filepath).read_bytes()\n\n if len(sample[\"annotations\"]) == 0:\n continue\n\n doc = model.get_pipe(\"extractor\")(pdf)\n doc.id = pdf_filepath.split(\".\")[0].split(\"/\")[-1]\n doc.content_boxes = [\n line\n for page_num in sorted(set(b.page_num for b in doc.lines))\n for line in align_box_labels(\n src_boxes=[\n Box(\n page_num=b[\"page\"],\n x0=b[\"x0\"],\n x1=b[\"x1\"],\n y0=b[\"y0\"],\n y1=b[\"y1\"],\n label=b[\"label\"],\n )\n for b in sample[\"annotations\"]\n if b[\"page\"] == page_num\n ],\n dst_boxes=doc.lines,\n pollution_label=None,\n )\n if line.text == \"\" or line.label is not None\n ]\n yield doc\n\n return adapt_to\n\n\n@app.command(name=\"train\")\ndef train_my_model(\n train_path: DirectoryPath = \"dataset/train\",\n val_path: DirectoryPath = \"dataset/dev\",\n max_steps: int = 1000,\n batch_size: int = 4,\n lr: float = 3e-4,\n):\n set_seed(42)\n\n # We define the model\n model = Pipeline()\n model.add_pipe(\"mupdf-extractor\", name=\"extractor\")\n model.add_pipe(\n \"box-transformer\",\n name=\"embedding\",\n config={\n \"num_heads\": 4,\n \"dropout_p\": 0.1,\n \"activation\": \"gelu\",\n \"init_resweight\": 0.01,\n \"head_size\": 16,\n \"attention_mode\": [\"c2c\", \"c2p\", \"p2c\"],\n \"n_layers\": 1,\n \"n_relative_positions\": 64,\n \"embedding\": {\n \"@factory\": \"embedding-combiner\",\n \"dropout_p\": 0.1,\n \"text_encoder\": {\n \"@factory\": \"sub-box-cnn-pooler\",\n \"out_channels\": 64,\n \"kernel_sizes\": (3, 4, 5),\n \"embedding\": {\n \"@factory\": \"simple-text-embedding\",\n \"size\": 72,\n },\n },\n \"layout_encoder\": {\n \"@factory\": \"box-layout-embedding\",\n \"n_positions\": 64,\n \"x_mode\": \"learned\",\n \"y_mode\": \"learned\",\n \"w_mode\": \"learned\",\n \"h_mode\": \"learned\",\n \"size\": 72,\n },\n },\n },\n )\n model.add_pipe(\n \"trainable-classifier\",\n name=\"classifier\",\n config={\n \"embedding\": model.get_pipe(\"embedding\"),\n \"labels\": [],\n },\n )\n\n # Loading and adapting the training and validation data\n train_docs = list(segmentation_adapter(train_path)(model))\n val_docs = list(segmentation_adapter(val_path)(model))\n\n # Taking the first `initialization_subset` samples to initialize the model\n model.post_init(train_docs)\n\n # Preprocessing the training dataset into a dataloader\n preprocessed = list(model.preprocess_many(train_docs, supervision=True))\n dataloader = DataLoader(\n preprocessed,\n batch_size=batch_size,\n collate_fn=model.collate,\n shuffle=True,\n )\n\n optimizer = 
torch.optim.AdamW(\n params=model.parameters(),\n lr=lr,\n )\n\n # We will loop over the dataloader\n iterator = itertools.chain.from_iterable(itertools.repeat(dataloader))\n\n for step in tqdm(range(max_steps), \"Training model\", leave=True):\n batch = next(iterator)\n optimizer.zero_grad()\n\n with model.cache():\n loss = torch.zeros((), device=\"cpu\")\n for name, component in model.trainable_pipes():\n output = component.module_forward(batch[component.name])\n if \"loss\" in output:\n loss += output[\"loss\"]\n\n loss.backward()\n\n optimizer.step()\n\n if (step % 100) == 0:\n with model.select_pipes(enable=[\"classifier\"]):\n print(score(val_docs, model.pipe(deepcopy(val_docs))))\n model.save(\"model\")\n\n return model\n\n\nif __name__ == \"__main__\":\n app()\n
python train.py --seed 42\n
At the end of the training, the pipeline is ready to use (with the .pipe
method) since every trained component of the pipeline is self-sufficient, ie contains the preprocessing, inference and postprocessing code required to run it.
To decouple the configuration and the code of our training script, let's define a configuration file where we will describe both our training parameters and the pipeline. You can either write the config of the pipeline by hand, or generate it from an instantiated pipeline by running:
print(pipeline.config.to_str())\n
Custom architecturePretrained Huggingface Transformer config.cfg# This is this equivalent of the API-based declaration at the beginning of the tutorial\n[pipeline]\npipeline = [\"extractor\", \"embedding\", \"classifier\"]\ndisabled = []\ncomponents = ${components}\n\n[components]\n\n[components.extractor]\n@factory = \"pdfminer-extractor\"\n\n[components.embedding]\n@factory = \"box-transformer\"\nnum_heads = 4\ndropout_p = 0.1\nactivation = \"gelu\"\ninit_resweight = 0.01\nhead_size = 16\nattention_mode = [\"c2c\", \"c2p\", \"p2c\"]\nn_layers = 1\nn_relative_positions = 64\n\n[components.embedding.embedding]\n@factory = \"embedding-combiner\"\ndropout_p = 0.1\n\n[components.embedding.embedding.text_encoder]\n@factory = \"sub-box-cnn-pooler\"\nout_channels = 64\nkernel_sizes = (3, 4, 5)\n\n[components.embedding.embedding.text_encoder.embedding]\n@factory = \"simple-text-embedding\"\nsize = 72\n\n[components.embedding.embedding.layout_encoder]\n@factory = \"box-layout-embedding\"\nn_positions = 64\nx_mode = \"learned\"\ny_mode = \"learned\"\nw_mode = \"learned\"\nh_mode = \"learned\"\nsize = 72\n\n[components.classifier]\n@factory = \"trainable-classifier\"\nembedding = ${components.embedding}\nlabels = []\n\n# This is were we define the training script parameters\n# the \"train\" section refers to the name of the command in the training script\n[train]\nmodel = ${pipeline}\ntrain_data = {\"@adapter\": \"my-segmentation-adapter\", \"path\": \"data/train\"}\nval_data = {\"@adapter\": \"my-segmentation-adapter\", \"path\": \"data/val\"}\nmax_steps = 1000\nseed = 42\nlr = 3e-4\nbatch_size = 4\n
config.cfg[pipeline]\npipeline = [\"extractor\", \"embedding\", \"classifier\"]\ndisabled = []\ncomponents = ${components}\n\n[components]\n\n[components.extractor]\n@factory = \"mupdf-extractor\"\nrender_pages = true\n\n[components.embedding]\n@factory = \"huggingface-embedding\"\nmodel = \"microsoft/layoutlmv3-base\"\nuse_image = false\nwindow = 128\nstride = 64\nline_pooling = \"mean\"\n\n[components.classifier]\n@factory = \"trainable-classifier\"\nembedding = ${components.embedding}\nlabels = []\n\n[train]\nmodel = ${pipeline}\nmax_steps = 1000\nlr = 5e-5\nseed = 42\ntrain_data = {\"@adapter\": \"my-segmentation-adapter\", \"path\": \"data/train\"}\nval_data = {\"@adapter\": \"my-segmentation-adapter\", \"path\": \"data/val\"}\nbatch_size = 8\n
and update our training script to use the pipeline and the data adapters defined in the configuration file instead of the Python declaration :
@app.command(name=\"train\")\ndef train_my_model(\n+ model: Pipeline,\n+ train_path: DirectoryPath = \"data/train\",\n- train_data: Callable = segmentation_adapter(\"data/train\"),\n+ val_path: DirectoryPath = \"data/val\",\n- val_data: Callable = segmentation_adapter(\"data/val\"),\n seed: int = 42,\n max_steps: int = 1000,\n batch_size: int = 4,\n lr: float = 3e-4,\n):\n # Seed will be set by the CLI util, before `model` is instanciated\n- set_seed(seed)\n\n # Model will be defined from the config file using registries\n- model = Pipeline()\n- model.add_pipe(\"mupdf-extractor\", name=\"extractor\")\n- model.add_pipe(\n- \"box-transformer\",\n- name=\"embedding\",\n- config={\n- \"num_heads\": 4,\n- \"dropout_p\": 0.1,\n- \"activation\": \"gelu\",\n- \"init_resweight\": 0.01,\n- \"head_size\": 16,\n- \"attention_mode\": [\"c2c\", \"c2p\", \"p2c\"],\n- \"n_layers\": 1,\n- \"n_relative_positions\": 64,\n- \"embedding\": {\n- \"@factory\": \"embedding-combiner\",\n- \"dropout_p\": 0.1,\n- \"text_encoder\": {\n- \"@factory\": \"sub-box-cnn-pooler\",\n- \"out_channels\": 64,\n- \"kernel_sizes\": (3, 4, 5),\n- \"embedding\": {\n- \"@factory\": \"simple-text-embedding\",\n- \"size\": 72,\n- },\n- },\n- \"layout_encoder\": {\n- \"@factory\": \"box-layout-embedding\",\n- \"n_positions\": 64,\n- \"x_mode\": \"learned\",\n- \"y_mode\": \"learned\",\n- \"w_mode\": \"learned\",\n- \"h_mode\": \"learned\",\n- \"size\": 72,\n- },\n- },\n- },\n- )\n- model.add_pipe(\n- \"trainable-classifier\",\n- name=\"classifier\",\n- config={\n- \"embedding\": model.get_pipe(\"embedding\"),\n- \"labels\": [],\n- },\n- )\n\n # Loading and adapting the training and validation data\n- train_docs = list(segmentation_adapter(train_path)(model))\n+ train_docs = list(train_data(model))\n- val_docs = list(segmentation_adapter(val_path)(model))\n+ val_docs = list(val_data(model))\n\n # Taking the first `initialization_subset` samples to initialize the model\n ...\n
That's it ! We can now call the training script with the configuration file as a parameter, and override some of its defaults values:
python train.py --config config.cfg --components.extractor.extract_styles=true --seed 43\n
"},{"location":"reference/edspdf/","title":"edspdf
","text":""},{"location":"reference/edspdf/pipeline/","title":"edspdf.pipeline
","text":""},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline","title":"Pipeline
","text":"Pipeline to build hybrid and multitask PDF processing pipeline. It uses PyTorch as the deep-learning backend and allows components to share subcomponents.
See the documentation for more details.
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONbatch_size
Batch size to use in the .pipe()
method
TYPE: Optional[int]
DEFAULT: 4
meta
Meta information about the pipeline
TYPE: Dict[str, Any]
DEFAULT: None
disabled
property
","text":"The names of the disabled components
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.cfg","title":"cfg: Config
property
","text":"Returns the config of the pipeline, including the config of all components. Updated from spacy to allow references between components.
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.get_pipe","title":"get_pipe
","text":"Get a component by its name.
PARAMETER DESCRIPTIONname
The name of the component to get.
TYPE: str
Pipe
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.has_pipe","title":"has_pipe
","text":"Check if a component exists in the pipeline.
PARAMETER DESCRIPTIONname
The name of the component to check.
TYPE: str
bool
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.create_pipe","title":"create_pipe
","text":"Create a component from a factory name.
PARAMETER DESCRIPTIONfactory
The name of the factory to use
TYPE: str
name
The name of the component
TYPE: str
config
The config to pass to the factory
TYPE: Dict[str, Any]
DEFAULT: None
Pipe
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.add_pipe","title":"add_pipe
","text":"Add a component to the pipeline.
PARAMETER DESCRIPTIONfactory
The name of the component to add or the component itself
TYPE: Union[str, Pipe]
name
The name of the component. If not provided, the name of the component will be used if it has one (.name), otherwise the factory name will be used.
TYPE: Optional[str]
DEFAULT: None
first
Whether to add the component to the beginning of the pipeline. This argument is mutually exclusive with before
and after
.
TYPE: bool
DEFAULT: False
before
The name of the component to add the new component before. This argument is mutually exclusive with after
and first
.
TYPE: Optional[str]
DEFAULT: None
after
The name of the component to add the new component after. This argument is mutually exclusive with before
and first
.
TYPE: Optional[str]
DEFAULT: None
config
The arguments to pass to the component factory.
Note that instead of replacing arguments with the same keys, the config will be merged with the default config of the component. This means that you can override specific nested arguments without having to specify the entire config.
TYPE: Optional[Dict[str, Any]]
DEFAULT: None
Pipe
The component that was added to the pipeline.
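For instance, here is a minimal sketch of this merging behaviour, using the pdfminer-extractor factory documented below (only the key passed in config is overridden; the factory's other defaults are kept):
import edspdf\n\nmodel = edspdf.Pipeline()\n# only \"extract_style\" is overridden; the remaining defaults of the factory are merged in\nmodel.add_pipe(\"pdfminer-extractor\", name=\"extractor\", config={\"extract_style\": True})\n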
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.__call__","title":"__call__
","text":"Apply each component successively on a document.
PARAMETER DESCRIPTIONdoc
The doc to create the PDFDoc from, or a PDFDoc.
TYPE: Any
PDFDoc
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.pipe","title":"pipe
","text":"Process a stream of documents by applying each component successively on batches of documents.
PARAMETER DESCRIPTIONinputs
The inputs to create the PDFDocs from, or the PDFDocs directly.
TYPE: Any
batch_size
The batch size to use. If not provided, the batch size of the pipeline object will be used.
TYPE: Optional[int]
DEFAULT: None
accelerator
The accelerator to use for processing the documents. If not provided, the default accelerator will be used.
TYPE: Optional[Union[str, Accelerator]]
DEFAULT: None
to_doc
The function to use to convert the inputs to PDFDoc objects. By default, the content
field of the inputs will be used if dict-like objects are provided, otherwise the inputs will be passed directly to the pipeline.
TYPE: Optional[ToDoc]
DEFAULT: None
from_doc
The function to use to convert the PDFDoc objects to outputs. By default, the PDFDoc objects will be returned directly.
TYPE: FromDoc
DEFAULT: lambda : doc
Iterable[PDFDoc]
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.cache","title":"cache
","text":"Enable caching for all (trainable) components in the pipeline
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.trainable_pipes","title":"trainable_pipes
","text":"Yields components that are PyTorch modules.
PARAMETER DESCRIPTIONdisable
The names of disabled components, which will be skipped.
TYPE: Sequence[str]
DEFAULT: ()
Iterable[Tuple[str, TrainablePipe]]
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.post_init","title":"post_init
","text":"Completes the initialization of the pipeline by calling the post_init method of all components that have one. This is useful for components that need to see some data to build their vocabulary, for instance.
PARAMETER DESCRIPTIONgold_data
The documents to use for initialization. Each component will not necessarily see all the data.
TYPE: Iterable[PDFDoc]
exclude
The names of components to exclude from initialization. This argument will be gradually updated with the names of initialized components
TYPE: Optional[set]
DEFAULT: None
from_config
classmethod
","text":"Create a pipeline from a config object
PARAMETER DESCRIPTIONconfig
The config to use
TYPE: Dict[str, Any]
DEFAULT: {}
disable
Components to disable
TYPE: Optional[Set[str]]
DEFAULT: None
enable
Components to enable
TYPE: Optional[Set[str]]
DEFAULT: None
exclude
Components to exclude
TYPE: Optional[Set[str]]
DEFAULT: None
meta
Metadata to add to the pipeline
TYPE: Optional[Dict[str, Any]]
DEFAULT: None
Pipeline
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.__get_validators__","title":"__get_validators__
classmethod
","text":"Pydantic validators generator
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.validate","title":"validate
classmethod
","text":"Pydantic validator, used in the validate_arguments
decorated functions
preprocess
","text":"Run the preprocessing methods of each component in the pipeline on a document and returns a dictionary containing the results, with the component names as keys.
PARAMETER DESCRIPTIONdoc
The document to preprocess
TYPE: PDFDoc
supervision
Whether to include supervision information in the preprocessing
TYPE: bool
DEFAULT: False
Dict[str, Any]
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.preprocess_many","title":"preprocess_many
","text":"Runs the preprocessing methods of each component in the pipeline on a collection of documents and returns an iterable of dictionaries containing the results, with the component names as keys.
PARAMETER DESCRIPTIONdocs
TYPE: Iterable[PDFDoc]
compress
Whether to deduplicate identical preprocessing outputs of the results if multiple documents share identical subcomponents. This step is required to enable the cache mechanism when training or running the pipeline over tabular datasets such as pyarrow tables that do not store referential equality information.
DEFAULT: True
supervision
Whether to include supervision information in the preprocessing
DEFAULT: True
Iterable[OutputT]
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.collate","title":"collate
","text":"Collates a batch of preprocessed samples into a single (maybe nested) dictionary of tensors by calling the collate method of each component.
PARAMETER DESCRIPTIONbatch
The batch of preprocessed samples
TYPE: List[Dict[str, Any]]
device
The device to move the tensors to before returning them
TYPE: Optional[device]
DEFAULT: None
Dict[str, Any]
The collated batch
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.parameters","title":"parameters
","text":"Returns an iterator over the Pytorch parameters of the components in the pipeline
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.named_parameters","title":"named_parameters
","text":"Returns an iterator over the Pytorch parameters of the components in the pipeline
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.to","title":"to
","text":"Moves the pipeline to a given device
"},{"location":"reference/edspdf/pipeline/#edspdf.pipeline.Pipeline.train","title":"train
","text":"Enables training mode on pytorch modules
PARAMETER DESCRIPTIONmode
Whether to enable training or not
DEFAULT: True
save
","text":"Save the pipeline to a directory.
PARAMETER DESCRIPTIONpath
The path to the directory to save the pipeline to. Every component will be saved to separated subdirectories of this directory, except for tensors that will be saved to a shared files depending on the references between the components.
TYPE: Union[str, Path]
exclude
The names of the components, or attributes to exclude from the saving process. This list will be gradually filled in place as components are saved
TYPE: Optional[Set[str]]
DEFAULT: None
load_state_from_disk
","text":"Load the pipeline from a directory. Components will be updated in-place.
PARAMETER DESCRIPTIONpath
The path to the directory to load the pipeline from
TYPE: Union[str, Path]
exclude
The names of the components, or attributes to exclude from the loading process. This list will be gradually filled in place as components are loaded
TYPE: Set[str]
DEFAULT: None
select_pipes
","text":"Temporarily disable and enable components in the pipeline.
PARAMETER DESCRIPTIONdisable
The name of the component to disable, or a list of names.
TYPE: Optional[Union[str, Iterable[str]]]
DEFAULT: None
enable
The name of the component to enable, or a list of names.
TYPE: Optional[Union[str, Iterable[str]]]
DEFAULT: None
edspdf.registry
","text":""},{"location":"reference/edspdf/registry/#edspdf.registry.CurriedFactory","title":"CurriedFactory
","text":""},{"location":"reference/edspdf/registry/#edspdf.registry.CurriedFactory.instantiate","title":"instantiate
","text":"We need to support passing in the pipeline object and name to factories from a config file. Since components can be nested, we need to add them to every factory in the config.
"},{"location":"reference/edspdf/registry/#edspdf.registry.FactoryRegistry","title":"FactoryRegistry
","text":" Bases: Registry
A registry that validates the input arguments of the registered functions.
"},{"location":"reference/edspdf/registry/#edspdf.registry.FactoryRegistry.get","title":"get
","text":"Get the registered function for a given name.
name (str): The name. RETURNS (Any): The registered function.
"},{"location":"reference/edspdf/registry/#edspdf.registry.FactoryRegistry.register","title":"register
","text":"This is a convenience wrapper around confit.Registry.register
, that curries the function to be registered, allowing the class to be instantiated later once pipeline
and name
are known.
name
TYPE: str
func
TYPE: Optional[InFunc]
DEFAULT: None
default_config
TYPE: Dict[str, Any]
DEFAULT: FrozenDict()
assigns
TYPE: Iterable[str]
DEFAULT: FrozenList()
requires
TYPE: Iterable[str]
DEFAULT: FrozenList()
retokenizes
TYPE: bool
DEFAULT: False
default_score_weights
TYPE: Dict[str, Optional[float]]
DEFAULT: FrozenDict()
Callable[[InFunc], InFunc]
"},{"location":"reference/edspdf/registry/#edspdf.registry.accepted_arguments","title":"accepted_arguments
","text":"Checks that a function accepts a list of keyword arguments
PARAMETER DESCRIPTIONfunc
Function to check
TYPE: Callable
args
Argument or list of arguments to check
TYPE: Sequence[str]
List[str]
"},{"location":"reference/edspdf/structures/","title":"edspdf.structures
","text":""},{"location":"reference/edspdf/structures/#edspdf.structures.PDFDoc","title":"PDFDoc
","text":" Bases: BaseModel
This is the main data structure of the library to hold PDFs. It contains the content of the PDF, as well as box annotations and text outputs.
ATTRIBUTE DESCRIPTIONcontent
The content of the PDF document.
TYPE: bytes
id
The ID of the PDF document.
TYPE: (str, optional)
pages
The pages of the PDF document.
TYPE: List[Page]
error
Whether there was an error when processing this PDF document.
TYPE: (bool, optional)
content_boxes
The content boxes/annotations of the PDF document.
TYPE: List[Union[TextBox, ImageBox]]
aggregated_texts
The aggregated text outputs of the PDF document.
TYPE: Dict[str, Text]
text_boxes
The text boxes of the PDF document.
TYPE: List[TextBox]
Page
","text":" Bases: BaseModel
The Page
class represents a page of a PDF document.
page_num
The page number of the page.
TYPE: int
width
The width of the page.
TYPE: float
height
The height of the page.
TYPE: float
doc
The PDF document that this page belongs to.
TYPE: PDFDoc
image
The rendered image of the page, stored as a NumPy array.
TYPE: Optional[ndarray]
text_boxes
The text boxes of the page.
TYPE: List[TextBox]
TextProperties
","text":" Bases: BaseModel
The TextProperties
class represents the style properties of a span of text in a TextBox.
italic
Whether the text is italic.
TYPE: bool
bold
Whether the text is bold.
TYPE: bool
begin
The beginning index of the span of text.
TYPE: int
end
The ending index of the span of text.
TYPE: int
fontname
The font name of the span of text.
TYPE: Optional[str]
Box
","text":" Bases: BaseModel
The Box
class represents a box annotation in a PDF document. It is the base class of TextBox.
doc
The PDF document that this box belongs to.
TYPE: PDFDoc
page_num
The page number of the box.
TYPE: Optional[int]
x0
The left x-coordinate of the box.
TYPE: float
x1
The right x-coordinate of the box.
TYPE: float
y0
The top y-coordinate of the box.
TYPE: float
y1
The bottom y-coordinate of the box.
TYPE: float
label
The label of the box.
TYPE: Optional[str]
page
The page object that this box belongs to.
TYPE: Page
Text
","text":" Bases: BaseModel
The TextBox
class represents text object, not bound to any box.
It can be used to store aggregated text from multiple boxes for example.
ATTRIBUTE DESCRIPTIONtext
The text content.
TYPE: str
properties
The style properties of the text.
TYPE: List[TextProperties]
TextBox
","text":" Bases: Box
The TextBox
class represents a text box annotation in a PDF document.
text
The text content of the text box.
TYPE: str
props
The style properties of the text box.
TYPE: List[TextProperties]
edspdf.trainable_pipe
","text":""},{"location":"reference/edspdf/trainable_pipe/#edspdf.trainable_pipe.TrainablePipe","title":"TrainablePipe
","text":" Bases: Module
, Generic[OutputBatch]
A TrainablePipe is a Component that can be trained and inherits torch.nn.Module
. You can use it either as a torch module inside a more complex neural network, or as a standalone component in a Pipeline.
In addition to the methods of a torch module, a TrainablePipe adds a few methods to handle preprocessing and collating features, as well as caching intermediate results for components that share a common subcomponent.
"},{"location":"reference/edspdf/trainable_pipe/#edspdf.trainable_pipe.TrainablePipe.save_extra_data","title":"save_extra_data
","text":"Dumps vocabularies indices to json files
PARAMETER DESCRIPTIONpath
Path to the directory where the files will be saved
TYPE: Path
exclude
The set of component names to exclude from saving This is useful when components are repeated in the pipeline.
TYPE: set
load_extra_data
","text":"Loads vocabularies indices from json files
PARAMETER DESCRIPTIONpath
Path to the directory where the files will be loaded
TYPE: Path
exclude
The set of component names to exclude from loading This is useful when components are repeated in the pipeline.
TYPE: set
post_init
","text":"This method completes the attributes of the component, by looking at some documents. It is especially useful to build vocabularies or detect the labels of a classification task.
PARAMETER DESCRIPTIONgold_data
The documents to use for initialization.
TYPE: Iterable[PDFDoc]
exclude
The names of components to exclude from initialization. This argument will be gradually updated with the names of initialized components
TYPE: set
preprocess
","text":"Preprocess the document to extract features that will be used by the neural network to perform its predictions.
PARAMETER DESCRIPTIONdoc
PDFDocument to preprocess
TYPE: PDFDoc
Dict[str, Any]
Dictionary (optionally nested) containing the features extracted from the document.
"},{"location":"reference/edspdf/trainable_pipe/#edspdf.trainable_pipe.TrainablePipe.collate","title":"collate
","text":"Collate the batch of features into a single batch of tensors that can be used by the forward method of the component.
PARAMETER DESCRIPTIONbatch
Batch of features
TYPE: NestedSequences
device
Device on which the tensors should be moved
TYPE: device
InputBatch
Dictionary (optionally nested) containing the collated tensors
"},{"location":"reference/edspdf/trainable_pipe/#edspdf.trainable_pipe.TrainablePipe.forward","title":"forward
","text":"Perform the forward pass of the neural network, i.e, apply transformations over the collated features to compute new embeddings, probabilities, losses, etc
PARAMETER DESCRIPTIONbatch
Batch of tensors (nested dictionary) computed by the collate method
TYPE: InputBatch
OutputBatch
"},{"location":"reference/edspdf/trainable_pipe/#edspdf.trainable_pipe.TrainablePipe.module_forward","title":"module_forward
","text":"This is a wrapper around torch.nn.Module.__call__
to avoid conflict with the TrainablePipe.__call__
method.
make_batch
","text":"Convenience method to preprocess a batch of documents and collate them Features corresponding to the same path are grouped together in a list, under the same key.
PARAMETER DESCRIPTIONdocs
Batch of documents
TYPE: Sequence[PDFDoc]
supervision
Whether to extract supervision features or not
TYPE: bool
DEFAULT: False
Dict[str, Sequence[Any]]
"},{"location":"reference/edspdf/trainable_pipe/#edspdf.trainable_pipe.TrainablePipe.batch_process","title":"batch_process
","text":"Process a batch of documents using the neural network. This differs from the pipe
method in that it does not return an iterator, but executes the component on the whole batch at once.
docs
Batch of documents
TYPE: Sequence[PDFDoc]
Sequence[PDFDoc]
Batch of updated documents
"},{"location":"reference/edspdf/trainable_pipe/#edspdf.trainable_pipe.TrainablePipe.postprocess","title":"postprocess
","text":"Update the documents with the predictions of the neural network, for instance converting label probabilities into label attributes on the document lines.
By default, this is a no-op.
PARAMETER DESCRIPTIONdocs
Batch of documents
TYPE: Sequence[PDFDoc]
batch
Batch of predictions, as returned by the forward method
TYPE: OutputBatch
Sequence[PDFDoc]
"},{"location":"reference/edspdf/trainable_pipe/#edspdf.trainable_pipe.TrainablePipe.preprocess_supervised","title":"preprocess_supervised
","text":"Preprocess the document to extract features that will be used by the neural network to perform its training. By default, this returns the same features as the preprocess
method.
doc
PDFDocument to preprocess
TYPE: PDFDoc
Dict[str, Any]
Dictionary (optionally nested) containing the features extracted from the document.
"},{"location":"reference/edspdf/trainable_pipe/#edspdf.trainable_pipe.TrainablePipe.__call__","title":"__call__
","text":"Applies the component on a single doc. For multiple documents, prefer batch processing via the batch_process method. In general, prefer the Pipeline methods
PARAMETER DESCRIPTIONdoc
TYPE: PDFDoc
PDFDoc
"},{"location":"reference/edspdf/accelerators/","title":"edspdf.accelerators
","text":""},{"location":"reference/edspdf/accelerators/base/","title":"edspdf.accelerators.base
","text":""},{"location":"reference/edspdf/accelerators/base/#edspdf.accelerators.base.FromDoc","title":"FromDoc
","text":"A FromDoc converter (from a PDFDoc to an arbitrary type) can be either:
edspdf.accelerators.multiprocessing
","text":""},{"location":"reference/edspdf/accelerators/multiprocessing/#edspdf.accelerators.multiprocessing.MultiprocessingAccelerator","title":"MultiprocessingAccelerator
","text":" Bases: Accelerator
If you have multiple CPU cores, and optionally multiple GPUs, we provide a multiprocessing
accelerator that allows to run the inference on multiple processes.
This accelerator dispatches the batches between multiple workers (data-parallelism), and distribute the computation of a given batch on one or two workers (model-parallelism). This is done by creating two types of workers:
CPUWorker
which handles the non deep-learning components and the preprocessing, collating and postprocessing of deep-learning componentsGPUWorker
which handles the forward call of the deep-learning componentsThe advantage of dedicating a worker to the deep-learning components is that it allows to prepare multiple batches in parallel in multiple CPUWorker
, and ensure that the GPUWorker
never wait for a batch to be ready.
The overall architecture described in the following figure, for 3 CPU workers and 2 GPU workers.
Here is how a small pipeline with rule-based components and deep-learning components is distributed between the workers:
"},{"location":"reference/edspdf/accelerators/multiprocessing/#edspdf.accelerators.multiprocessing.MultiprocessingAccelerator--examples","title":"Examples","text":"docs = list(\n pipeline.pipe(\n [content1, content2, ...],\n accelerator={\n \"@accelerator\": \"multiprocessing\",\n \"num_cpu_workers\": 3,\n \"num_gpu_workers\": 2,\n \"batch_size\": 8,\n },\n )\n)\n
"},{"location":"reference/edspdf/accelerators/multiprocessing/#edspdf.accelerators.multiprocessing.MultiprocessingAccelerator--parameters","title":"Parameters","text":"PARAMETER DESCRIPTION batch_size
Number of documents to process at a time in a CPU/GPU worker
TYPE: int
num_cpu_workers
Number of CPU workers. A CPU worker handles the non deep-learning components and the preprocessing, collating and postprocessing of deep-learning components.
TYPE: Optional[int]
DEFAULT: None
num_gpu_workers
Number of GPU workers. A GPU worker handles the forward call of the deep-learning components.
TYPE: Optional[int]
DEFAULT: None
gpu_pipe_names
List of pipe names to accelerate on a GPUWorker, defaults to all pipes that inherit from TrainablePipe
TYPE: Optional[List[str]]
DEFAULT: None
__call__
","text":"Stream of documents to process. Each document can be a string or a tuple
PARAMETER DESCRIPTIONinputs
TYPE: Iterable[Any]
model
TYPE: Any
Any
Processed outputs of the pipeline
"},{"location":"reference/edspdf/accelerators/simple/","title":"edspdf.accelerators.simple
","text":""},{"location":"reference/edspdf/accelerators/simple/#edspdf.accelerators.simple.SimpleAccelerator","title":"SimpleAccelerator
","text":" Bases: Accelerator
This is the simplest accelerator which batches the documents and process each batch on the main process (the one calling .pipe()
).
docs = list(pipeline.pipe([content1, content2, ...]))\n
or, if you want to override the model defined batch size
docs = list(pipeline.pipe([content1, content2, ...], batch_size=8))\n
which is equivalent to passing a confit dict
docs = list(\n pipeline.pipe(\n [content1, content2, ...],\n accelerator={\n \"@accelerator\": \"simple\",\n \"batch_size\": 8,\n },\n )\n)\n
or the instantiated accelerator directly
from edspdf.accelerators.simple import SimpleAccelerator\n\naccelerator = SimpleAccelerator(batch_size=8)\ndocs = list(pipeline.pipe([content1, content2, ...], accelerator=accelerator))\n
If you have a GPU, make sure to move the model to the appropriate device before calling .pipe()
. If you have multiple GPUs, use the multiprocessing accelerator instead.
pipeline.to(\"cuda\")\ndocs = list(pipeline.pipe([content1, content2, ...]))\n
"},{"location":"reference/edspdf/accelerators/simple/#edspdf.accelerators.simple.SimpleAccelerator--parameters","title":"Parameters","text":"PARAMETER DESCRIPTION batch_size
The number of documents to process in each batch.
TYPE: int
DEFAULT: 32
edspdf.layers
","text":""},{"location":"reference/edspdf/layers/box_transformer/","title":"edspdf.layers.box_transformer
","text":""},{"location":"reference/edspdf/layers/box_transformer/#edspdf.layers.box_transformer.BoxTransformerLayer","title":"BoxTransformerLayer
","text":" Bases: Module
BoxTransformerLayer combining a self attention layer and a linear->activation->linear transformation. This layer is used in the BoxTransformerModule module.
"},{"location":"reference/edspdf/layers/box_transformer/#edspdf.layers.box_transformer.BoxTransformerLayer--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONinput_size
Input embedding size
TYPE: int
num_heads
Number of attention heads in the attention layer
TYPE: int
DEFAULT: 2
dropout_p
Dropout probability both for the attention layer and embedding projections
TYPE: float
DEFAULT: 0.0
head_size
Head sizes of the attention layer
TYPE: Optional[int]
DEFAULT: None
activation
Activation function used in the linear->activation->linear transformation
TYPE: ActivationFunction
DEFAULT: 'gelu'
init_resweight
Initial weight of the residual gates. At 0, the layer acts (initially) as an identity function, and at 1 as a standard Transformer layer. Initializing with a value close to 0 can help the training converge.
TYPE: float
DEFAULT: 0.0
attention_mode
Mode of relative position infused attention layer. See the relative attention documentation for more information.
TYPE: Sequence[Literal['c2c', 'c2p', 'p2c']]
DEFAULT: ('c2c', 'c2p', 'p2c')
position_embedding
Position embedding to use as key/query position embedding in the attention computation.
TYPE: Optional[Union[FloatTensor, Parameter]]
DEFAULT: None
forward
","text":"Forward pass of the BoxTransformerLayer
PARAMETER DESCRIPTIONembeds
Embeddings to contextualize Shape: n_samples * n_keys * input_size
TYPE: FloatTensor
mask
Mask of the embeddings. 0 means padding element. Shape: n_samples * n_keys
TYPE: BoolTensor
relative_positions
Position of the keys relatively to the query elements Shape: n_samples * n_queries * n_keys * n_coordinates (2 for x/y)
TYPE: LongTensor
no_position_mask
Key / query pairs for which the position attention terms should be disabled. Shape: n_samples * n_queries * n_keys
TYPE: Optional[BoolTensor]
DEFAULT: None
Tuple[FloatTensor, FloatTensor]
n_samples * n_queries * n_keys
n_samples * n_queries * n_keys * n_heads
BoxTransformerModule
","text":" Bases: Module
Box Transformer architecture combining a multiple BoxTransformerLayer modules. It is mainly used in BoxTransformer.
"},{"location":"reference/edspdf/layers/box_transformer/#edspdf.layers.box_transformer.BoxTransformerModule--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONinput_size
Input embedding size
TYPE: Optional[int]
DEFAULT: None
num_heads
Number of attention heads in the attention layers
TYPE: int
DEFAULT: 2
n_relative_positions
Maximum range of embeddable relative positions between boxes (further distances are capped to \u00b1n_relative_positions // 2)
TYPE: Optional[int]
DEFAULT: None
dropout_p
Dropout probability both for the attention layers and embedding projections
TYPE: float
DEFAULT: 0.0
head_size
Head sizes of the attention layers
TYPE: Optional[int]
DEFAULT: None
activation
Activation function used in the linear->activation->linear transformations
TYPE: ActivationFunction
DEFAULT: 'gelu'
init_resweight
Initial weight of the residual gates. At 0, the layer acts (initially) as an identity function, and at 1 as a standard Transformer layer. Initializing with a value close to 0 can help the training converge.
TYPE: float
DEFAULT: 0.0
attention_mode
Mode of relative position infused attention layer. See the relative attention documentation for more information.
TYPE: Sequence[Literal['c2c', 'c2p', 'p2c']]
DEFAULT: ('c2c', 'c2p', 'p2c')
n_layers
Number of layers in the Transformer
TYPE: int
DEFAULT: 2
forward
","text":"Forward pass of the BoxTransformer
PARAMETER DESCRIPTIONembeds
Embeddings to contextualize Shape: n_samples * n_keys * input_size
TYPE: FoldedTensor
boxes
Layout features of the input elements
TYPE: Dict
Tuple[FloatTensor, List[FloatTensor]]
n_samples * n_queries * n_keys
n_samples * n_queries * n_keys * n_heads
edspdf.layers.relative_attention
","text":""},{"location":"reference/edspdf/layers/relative_attention/#edspdf.layers.relative_attention.RelativeAttention","title":"RelativeAttention
","text":" Bases: Module
A self/cross-attention layer that takes relative position of elements into account to compute the attention weights. When running a relative attention layer, key and queries are represented using content and position embeddings, where position embeddings are retrieved using the relative position of keys relative to queries
"},{"location":"reference/edspdf/layers/relative_attention/#edspdf.layers.relative_attention.RelativeAttention--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONsize
The size of the output embeddings Also serves as default if query_size, pos_size, or key_size is None
TYPE: int
n_heads
The number of attention heads
TYPE: int
query_size
The size of the query embeddings.
TYPE: Optional[int]
DEFAULT: None
key_size
The size of the key embeddings.
TYPE: Optional[int]
DEFAULT: None
value_size
The size of the value embeddings
TYPE: Optional[int]
DEFAULT: None
head_size
The size of each query / key / value chunk used in the attention dot product Default: key_size / n_heads
TYPE: Optional[int]
DEFAULT: None
position_embedding
The position embedding used as key and query embeddings
TYPE: Optional[Union[FloatTensor, Parameter]]
DEFAULT: None
dropout_p
Dropout probability applied on the attention weights Default: 0.1
TYPE: float
DEFAULT: 0.0
same_key_query_proj
Whether to use the same projection operator for content key and queries when computing the pre-attention key and query embedding chunks Default: False
TYPE: bool
DEFAULT: False
same_positional_key_query_proj
Whether to use the same projection operator for content key and queries when computing the pre-attention key and query embedding chunks Default: False
TYPE: bool
DEFAULT: False
n_coordinates
The number of positional coordinates For instance, text is 1D so 1 coordinate, images are 2D so 2 coordinates ... Default: 1
TYPE: int
DEFAULT: 1
head_bias
Whether to learn a bias term to add to the attention logits This is only useful if you plan to use the attention logits for subsequent operations, since attention weights are unaffected by bias terms.
TYPE: bool
DEFAULT: True
do_pooling
Whether to compute the output embedding. If you only plan to use attention logits, you should disable this parameter. Default: True
TYPE: bool
DEFAULT: True
mode
Whether to compute content to content (c2c), content to position (c2p) or position to content (p2c) attention terms. Setting mode=('c2c\")
disable relative position attention terms: this is the standard attention layer. To get a better intuition about these different types of attention, here is a formulation as fictitious search samples from a word in a (1D) text:
TYPE: Sequence[Literal['c2c', 'c2p', 'p2c']]
DEFAULT: ('c2c', 'p2c', 'c2p')
n_additional_heads
The number of additional head logits to compute. Those are not used to compute output embeddings, but may be useful in subsequent operation. Default: 0
TYPE: int
DEFAULT: 0
forward
","text":"Forward pass of the RelativeAttention layer.
PARAMETER DESCRIPTIONcontent_queries
The content query embedding to use in the attention computation Shape: n_samples * n_queries * query_size
TYPE: FloatTensor
content_keys
The content key embedding to use in the attention computation. If None, defaults to the content_queries
Shape: n_samples * n_keys * query_size
TYPE: Optional[FloatTensor]
DEFAULT: None
content_values
The content values embedding to use in the final pooling computation. If None, pooling won't be performed. Shape: n_samples * n_keys * query_size
TYPE: Optional[FloatTensor]
DEFAULT: None
mask
The content key embedding to use in the attention computation. If None, defaults to the content_queries
Shape: either - n_samples * n_keys
- n_samples * n_queries * n_keys
- n_samples * n_queries * n_keys * n_heads
TYPE: Optional[BoolTensor]
DEFAULT: None
relative_positions
The relative position of keys relative to queries If None, positional attention terms won't be computed. Shape: n_samples * n_queries * n_keys * n_coordinates
TYPE: Optional[LongTensor]
DEFAULT: None
no_position_mask
Key / query pairs for which the position attention terms should be disabled. Shape: n_samples * n_queries * n_keys
TYPE: Optional[BoolTensor]
DEFAULT: None
base_attn
Attention logits to add to the computed attention logits Shape: n_samples * n_queries * n_keys * n_heads
TYPE: Optional[FloatTensor]
DEFAULT: None
Union[Tuple[FloatTensor, FloatTensor], FloatTensor]
do_pooling
attribute is set to True) Shape: n_sample * n_keys * size
edspdf.layers.sinusoidal_embedding
","text":""},{"location":"reference/edspdf/layers/sinusoidal_embedding/#edspdf.layers.sinusoidal_embedding.SinusoidalEmbedding","title":"SinusoidalEmbedding
","text":" Bases: Module
A position embedding lookup table that stores embeddings for a fixed number of positions. The value of each of the embedding_dim
channels of the generated embedding is generated according to a trigonometric function (sin for even channels, cos for odd channels). The frequency of the signal in each pair of channels varies according to the temperature parameter.
Any input position above the maximum value num_embeddings
will be capped to num_embeddings - 1
num_embeddings
The maximum number of position embeddings store in this table
TYPE: int
embedding_dim
The embedding size
TYPE: int
temperature
The temperature controls the range of frequencies used by each channel of the embedding
TYPE: float
DEFAULT: 10000.0
forward
","text":"Forward pass of the SinusoidalEmbedding module
PARAMETER DESCRIPTIONindices
Shape: any
TYPE: LongTensor
FloatTensor
Shape: (*input_shape, embedding_dim)
edspdf.layers.vocabulary
","text":""},{"location":"reference/edspdf/layers/vocabulary/#edspdf.layers.vocabulary.Vocabulary","title":"Vocabulary
","text":" Bases: Module
, Generic[T]
Vocabulary layer. This is not meant to be used as a torch.nn.Module
but subclassing torch.nn.Module
makes the instances appear when printing a model, which is nice.
items
Initial vocabulary elements if any. Specific elements such as padding and unk can be set here to enforce their index in the vocabulary.
TYPE: Sequence[T]
DEFAULT: None
default
Default index to use for out of vocabulary elements Defaults to -100
TYPE: int
DEFAULT: -100
initialization
","text":"Enters the initialization mode. Out of vocabulary elements will be assigned an index.
"},{"location":"reference/edspdf/layers/vocabulary/#edspdf.layers.vocabulary.Vocabulary.encode","title":"encode
","text":"Converts an element into its vocabulary index If the layer is in its initialization mode (with vocab.initialization(): ...
), and the element is out of vocabulary, a new index will be created and returned. Otherwise, any oov element will be encoded with the default
index.
item
RETURNS DESCRIPTION
int
"},{"location":"reference/edspdf/layers/vocabulary/#edspdf.layers.vocabulary.Vocabulary.decode","title":"decode
","text":"Converts an index into its original value
PARAMETER DESCRIPTIONidx
RETURNS DESCRIPTION
InputT
"},{"location":"reference/edspdf/pipes/","title":"edspdf.pipes
","text":""},{"location":"reference/edspdf/pipes/aggregators/","title":"edspdf.pipes.aggregators
","text":""},{"location":"reference/edspdf/pipes/aggregators/simple/","title":"edspdf.pipes.aggregators.simple
","text":""},{"location":"reference/edspdf/pipes/aggregators/simple/#edspdf.pipes.aggregators.simple.SimpleAggregator","title":"SimpleAggregator
","text":"Aggregator that returns texts and styles. It groups all text boxes with the same label under the aggregated_text
, and additionally aggregates the styles of the text boxes.
Create a pipeline
API-basedConfiguration-basedpipeline = ...\npipeline.add_pipe(\n \"simple-aggregator\",\n name=\"aggregator\",\n config={\n \"new_line_threshold\": 0.2,\n \"new_paragraph_threshold\": 1.5,\n \"label_map\": {\n \"body\": \"text\",\n \"table\": \"text\",\n },\n },\n)\n
...\n\n[components.aggregator]\n@factory = \"simple-aggregator\"\nnew_line_threshold = 0.2\nnew_paragraph_threshold = 1.5\nlabel_map = { body = \"text\", table = \"text\" }\n\n...\n
and run it on a document:
doc = pipeline(doc)\nprint(doc.aggregated_texts)\n# {\n# \"text\": \"This is the body of the document, followed by a table | A | B |\"\n# }\n
"},{"location":"reference/edspdf/pipes/aggregators/simple/#edspdf.pipes.aggregators.simple.SimpleAggregator--parameters","title":"Parameters","text":"PARAMETER DESCRIPTION pipeline
The pipeline object
TYPE: Pipeline
DEFAULT: None
name
The name of the component
TYPE: str
DEFAULT: 'simple-aggregator'
sort
Whether to sort text boxes inside each label group by (page, y, x) position before merging them.
TYPE: bool
DEFAULT: False
new_line_threshold
Minimum ratio of the distance between two lines to the median height of lines to consider them as being on separate lines
TYPE: float
DEFAULT: 0.2
new_paragraph_threshold
Minimum ratio of the distance between two lines to the median height of lines to consider them as being on separate paragraphs and thus add a newline character between them.
TYPE: float
DEFAULT: 1.5
label_map
A dictionary mapping labels to new labels. This is useful to group labels together, for instance, to output both \"body\" and \"table\" as \"text\".
TYPE: Dict
DEFAULT: {}
edspdf.pipes.classifiers
","text":""},{"location":"reference/edspdf/pipes/classifiers/dummy/","title":"edspdf.pipes.classifiers.dummy
","text":""},{"location":"reference/edspdf/pipes/classifiers/dummy/#edspdf.pipes.classifiers.dummy.DummyClassifier","title":"DummyClassifier
","text":"Dummy classifier, for chaos purposes. Classifies each line to a random element.
"},{"location":"reference/edspdf/pipes/classifiers/dummy/#edspdf.pipes.classifiers.dummy.DummyClassifier--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONpipeline
The pipeline object.
TYPE: Pipeline
DEFAULT: None
name
The name of the component.
TYPE: str
DEFAULT: 'dummy-classifier'
label
The label to assign to each line.
TYPE: str
edspdf.pipes.classifiers.mask
","text":""},{"location":"reference/edspdf/pipes/classifiers/mask/#edspdf.pipes.classifiers.mask.MaskClassifier","title":"MaskClassifier
","text":"Simple mask classifier, that labels every box inside one of the masks with its label.
"},{"location":"reference/edspdf/pipes/classifiers/mask/#edspdf.pipes.classifiers.mask.simple_mask_classifier_factory","title":"simple_mask_classifier_factory
","text":"The simplest form of mask classification. You define the mask, everything else is tagged as pollution.
PARAMETER DESCRIPTIONpipeline
The pipeline object
TYPE: Pipeline
DEFAULT: None
name
The name of the component
TYPE: str
DEFAULT: 'mask-classifier'
x0
The x0 coordinate of the mask
TYPE: float
y0
The y0 coordinate of the mask
TYPE: float
x1
The x1 coordinate of the mask
TYPE: float
y1
The y1 coordinate of the mask
TYPE: float
threshold
The threshold for the alignment
TYPE: float
DEFAULT: 1.0
pipeline.add_pipe(\n \"mask-classifier\",\n name=\"classifier\",\n config={\n \"threshold\": 0.9,\n \"x0\": 0.1,\n \"y0\": 0.1,\n \"x1\": 0.9,\n \"y1\": 0.9,\n },\n)\n
[components.classifier]\n@classifiers = \"mask-classifier\"\nx0 = 0.1\ny0 = 0.1\nx1 = 0.9\ny1 = 0.9\nthreshold = 0.9\n
"},{"location":"reference/edspdf/pipes/classifiers/mask/#edspdf.pipes.classifiers.mask.mask_classifier_factory","title":"mask_classifier_factory
","text":"A generalisation, wherein the user defines a number of regions.
The following configuration produces exactly the same classifier as mask.v1
example above.
Any bloc that is not part of a mask is tagged as pollution
.
pipeline
The pipeline object
TYPE: Pipeline
DEFAULT: None
name
TYPE: str
DEFAULT: 'multi-mask-classifier'
threshold
The threshold for the alignment
TYPE: float
DEFAULT: 1.0
masks
The masks
TYPE: Box
DEFAULT: {}
pipeline.add_pipe(\n \"multi-mask-classifier\",\n name=\"classifier\",\n config={\n \"threshold\": 0.9,\n \"mymask\": {\"x0\": 0.1, \"y0\": 0.1, \"x1\": 0.9, \"y1\": 0.3, \"label\": \"body\"},\n },\n)\n
[components.classifier]\n@factory = \"multi-mask-classifier\"\nthreshold = 0.9\n\n[components.classifier.mymask]\nlabel = \"body\"\nx0 = 0.1\ny0 = 0.1\nx1 = 0.9\ny1 = 0.9\n
The following configuration defines a header
region.
pipeline.add_pipe(\n \"multi-mask-classifier\",\n name=\"classifier\",\n config={\n \"threshold\": 0.9,\n \"body\": {\"x0\": 0.1, \"y0\": 0.1, \"x1\": 0.9, \"y1\": 0.3, \"label\": \"header\"},\n \"header\": {\"x0\": 0.1, \"y0\": 0.3, \"x1\": 0.9, \"y1\": 0.9, \"label\": \"body\"},\n },\n)\n
[components.classifier]\n@factory = \"multi-mask-classifier\"\nthreshold = 0.9\n\n[components.classifier.header]\nlabel = \"header\"\nx0 = 0.1\ny0 = 0.1\nx1 = 0.9\ny1 = 0.3\n\n[components.classifier.body]\nlabel = \"body\"\nx0 = 0.1\ny0 = 0.3\nx1 = 0.9\ny1 = 0.9\n
"},{"location":"reference/edspdf/pipes/classifiers/random/","title":"edspdf.pipes.classifiers.random
","text":""},{"location":"reference/edspdf/pipes/classifiers/random/#edspdf.pipes.classifiers.random.RandomClassifier","title":"RandomClassifier
","text":"Random classifier, for chaos purposes. Classifies each box to a random element.
"},{"location":"reference/edspdf/pipes/classifiers/random/#edspdf.pipes.classifiers.random.RandomClassifier--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONpipeline
The pipeline object.
TYPE: Pipeline
name
The name of the component.
TYPE: str
DEFAULT: 'random-classifier'
labels
The labels to assign to each line. If a list is passed, each label is assigned with equal probability. If a dict is passed, the keys are the labels and the values are the probabilities.
TYPE: Union[List[str], Dict[str, float]]
edspdf.pipes.classifiers.trainable
","text":""},{"location":"reference/edspdf/pipes/classifiers/trainable/#edspdf.pipes.classifiers.trainable.TrainableClassifier","title":"TrainableClassifier
","text":" Bases: TrainablePipe[Dict[str, Any]]
This component predicts a label for each box over the whole document using machine learning.
Note
You must train the model your model to use this classifier. See Model training for more information
"},{"location":"reference/edspdf/pipes/classifiers/trainable/#edspdf.pipes.classifiers.trainable.TrainableClassifier--examples","title":"Examples","text":"The classifier is composed of the following blocks:
In this example, we use a box-embedding
layer to generate the embeddings of the boxes. It is composed of a text encoder that embeds the text features of the boxes and a layout encoder that embeds the layout features of the boxes. These two embeddings are summed and passed through an optional contextualizer
, here a box-transformer
.
pipeline.add_pipe(\n \"trainable-classifier\",\n name=\"classifier\",\n config={\n # simple embedding computed by pooling embeddings of words in each box\n \"embedding\": {\n \"@factory\": \"sub-box-cnn-pooler\",\n \"out_channels\": 64,\n \"kernel_sizes\": (3, 4, 5),\n \"embedding\": {\n \"@factory\": \"simple-text-embedding\",\n \"size\": 72,\n },\n },\n \"labels\": [\"body\", \"pollution\"],\n },\n)\n
[components.classifier]\n@factory = \"trainable-classifier\"\nlabels = [\"body\", \"pollution\"]\n\n[components.classifier.embedding]\n@factory = \"sub-box-cnn-pooler\"\nout_channels = 64\nkernel_sizes = (3, 4, 5)\n\n[components.classifier.embedding.embedding]\n@factory = \"simple-text-embedding\"\nsize = 72\n
"},{"location":"reference/edspdf/pipes/classifiers/trainable/#edspdf.pipes.classifiers.trainable.TrainableClassifier--parameters","title":"Parameters","text":"PARAMETER DESCRIPTION labels
Initial labels of the classifier (will be completed during initialization)
TYPE: Sequence[str]
DEFAULT: ('pollution')
embedding
Embedding module to encode the PDF boxes
TYPE: TrainablePipe[EmbeddingOutput]
edspdf.pipes.embeddings
","text":""},{"location":"reference/edspdf/pipes/embeddings/box_layout_embedding/","title":"edspdf.pipes.embeddings.box_layout_embedding
","text":""},{"location":"reference/edspdf/pipes/embeddings/box_layout_embedding/#edspdf.pipes.embeddings.box_layout_embedding.BoxLayoutEmbedding","title":"BoxLayoutEmbedding
","text":" Bases: TrainablePipe[EmbeddingOutput]
This component encodes the geometrical features of a box, as extracted by the BoxLayoutPreprocessor module, into an embedding. For position modes, use:
\"sin\"
to embed positions with a fixed SinusoidalEmbedding\"learned\"
to embed positions using a learned standard pytorch embedding layerEach produces embedding is the concatenation of the box width, height and the top, left, bottom and right coordinates, each embedded depending on the *_mode
param.
size
Size of the output box embedding
TYPE: int
n_positions
Number of position embeddings stored in the PositionEmbedding module
TYPE: int
x_mode
Position embedding mode of the x coordinates
TYPE: Literal['sin', 'learned']
DEFAULT: 'sin'
y_mode
Position embedding mode of the x coordinates
TYPE: Literal['sin', 'learned']
DEFAULT: 'sin'
w_mode
Position embedding mode of the width features
TYPE: Literal['sin', 'learned']
DEFAULT: 'sin'
h_mode
Position embedding mode of the height features
TYPE: Literal['sin', 'learned']
DEFAULT: 'sin'
edspdf.pipes.embeddings.box_layout_preprocessor
","text":""},{"location":"reference/edspdf/pipes/embeddings/box_layout_preprocessor/#edspdf.pipes.embeddings.box_layout_preprocessor.BoxLayoutPreprocessor","title":"BoxLayoutPreprocessor
","text":" Bases: TrainablePipe[BoxLayoutBatch]
The box preprocessor is singleton since its is not configurable. The following features of each box of an input PDFDoc document are encoded as 1D tensors:
boxes_page
: page index of the boxboxes_first_page
: is the box on the first pageboxes_last_page
: is the box on the last pageboxes_xmin
: left position of the boxboxes_ymin
: bottom position of the boxboxes_xmax
: right position of the boxboxes_ymax
: top position of the boxboxes_w
: width position of the boxboxes_h
: height position of the boxThe preprocessor also returns an additional tensors:
page_boxes_id
: box indices per page to index the above 1D tensors (LongTensor: n_pages * n_boxes)edspdf.pipes.embeddings.box_transformer
","text":""},{"location":"reference/edspdf/pipes/embeddings/box_transformer/#edspdf.pipes.embeddings.box_transformer.BoxTransformer","title":"BoxTransformer
","text":" Bases: TrainablePipe[EmbeddingOutput]
BoxTransformer using BoxTransformerModule under the hood.
Note
This module is a TrainablePipe and can be used in a Pipeline, while BoxTransformerModule is a standard PyTorch module, which does not take care of the preprocessing, collating, etc. of the input documents.
"},{"location":"reference/edspdf/pipes/embeddings/box_transformer/#edspdf.pipes.embeddings.box_transformer.BoxTransformer--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONpipeline
Pipeline instance
TYPE: Pipeline
DEFAULT: None
name
Name of the component
TYPE: str
DEFAULT: 'box-transformer'
num_heads
Number of attention heads in the attention layers
TYPE: int
DEFAULT: 2
n_relative_positions
Maximum range of embeddable relative positions between boxes (further distances are capped to \u00b1n_relative_positions // 2)
TYPE: Optional[int]
DEFAULT: None
dropout_p
Dropout probability both for the attention layers and embedding projections
TYPE: float
DEFAULT: 0.0
head_size
Head sizes of the attention layers
TYPE: Optional[int]
DEFAULT: None
activation
Activation function used in the linear->activation->linear transformations
TYPE: ActivationFunction
DEFAULT: 'gelu'
init_resweight
Initial weight of the residual gates. At 0, the layer acts (initially) as an identity function, and at 1 as a standard Transformer layer. Initializing with a value close to 0 can help the training converge.
TYPE: float
DEFAULT: 0.0
attention_mode
Mode of relative position infused attention layer. See the relative attention documentation for more information.
TYPE: Sequence[Literal['c2c', 'c2p', 'p2c']]
DEFAULT: ('c2c', 'c2p', 'p2c')
n_layers
Number of layers in the Transformer
TYPE: int
DEFAULT: 2
edspdf.pipes.embeddings.embedding_combiner
","text":""},{"location":"reference/edspdf/pipes/embeddings/embedding_combiner/#edspdf.pipes.embeddings.embedding_combiner.EmbeddingCombiner","title":"EmbeddingCombiner
","text":" Bases: TrainablePipe[EmbeddingOutput]
Encodes boxes using a combination of multiple encoders
"},{"location":"reference/edspdf/pipes/embeddings/embedding_combiner/#edspdf.pipes.embeddings.embedding_combiner.EmbeddingCombiner--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONpipeline
The pipeline object
TYPE: Pipeline
DEFAULT: None
name
The name of the pipe
TYPE: str
DEFAULT: 'embedding-combiner'
mode
The mode to use to combine the encoders:
sum
: Sum the outputs of the encoderscat
: Concatenate the outputs of the encoders TYPE: Literal['sum', 'cat']
DEFAULT: 'sum'
dropout_p
Dropout probability used on the output of the box and textual encoders
TYPE: float
DEFAULT: 0.0
encoders
The encoders to use. The keys are the names of the encoders and the values are the encoders themselves.
TYPE: TrainablePipe[EmbeddingOutput]
DEFAULT: {}
edspdf.pipes.embeddings.huggingface_embedding
","text":""},{"location":"reference/edspdf/pipes/embeddings/huggingface_embedding/#edspdf.pipes.embeddings.huggingface_embedding.HuggingfaceEmbedding","title":"HuggingfaceEmbedding
","text":" Bases: TrainablePipe[EmbeddingOutput]
The HuggingfaceEmbeddings component is a wrapper around the Huggingface multi-modal models. Such pre-trained models should offer better results than a model trained from scratch. Compared to using the raw Huggingface model, we offer a simple mechanism to split long documents into strided windows before feeding them to the model.
"},{"location":"reference/edspdf/pipes/embeddings/huggingface_embedding/#edspdf.pipes.embeddings.huggingface_embedding.HuggingfaceEmbedding--windowing","title":"Windowing","text":"The HuggingfaceEmbedding component splits long documents into smaller windows before feeding them to the model. This is done to avoid hitting the maximum number of tokens that can be processed by the model on a single device. The window size and stride can be configured using the window
and stride
parameters. The default values are 510 and 255 respectively, which means that the model will process windows of 510 tokens, each separated by 255 tokens. Whenever a token appears in multiple windows, the embedding of the \"most contextualized\" occurrence is used, i.e. the occurrence that is the closest to the center of its window.
Here is an overview how this works in a classifier model :
"},{"location":"reference/edspdf/pipes/embeddings/huggingface_embedding/#edspdf.pipes.embeddings.huggingface_embedding.HuggingfaceEmbedding--examples","title":"Examples","text":"Here is an example of how to define a pipeline with the HuggingfaceEmbedding component:
from edspdf import Pipeline\n\nmodel = Pipeline()\nmodel.add_pipe(\n \"pdfminer-extractor\",\n name=\"extractor\",\n config={\n \"render_pages\": True,\n },\n)\nmodel.add_pipe(\n \"huggingface-embedding\",\n name=\"embedding\",\n config={\n \"model\": \"microsoft/layoutlmv3-base\",\n \"use_image\": False,\n \"window\": 128,\n \"stride\": 64,\n \"line_pooling\": \"mean\",\n },\n)\nmodel.add_pipe(\n \"trainable-classifier\",\n name=\"classifier\",\n config={\n \"embedding\": model.get_pipe(\"embedding\"),\n \"labels\": [],\n },\n)\n
This model can then be trained following the training recipe.
"},{"location":"reference/edspdf/pipes/embeddings/huggingface_embedding/#edspdf.pipes.embeddings.huggingface_embedding.HuggingfaceEmbedding--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONpipeline
The pipeline instance
TYPE: Pipeline
DEFAULT: None
name
The component name
TYPE: str
DEFAULT: 'huggingface-embedding'
model
The Huggingface model name or path
TYPE: str
DEFAULT: None
use_image
Whether to use the image or not in the model
TYPE: bool
DEFAULT: True
window
The window size to use when splitting long documents into smaller windows before feeding them to the Transformer model (default: 510 = 512 - 2)
TYPE: int
DEFAULT: 510
stride
The stride (distance between windows) to use when splitting long documents into smaller windows: (default: 510 / 2 = 255)
TYPE: int
DEFAULT: 255
line_pooling
The pooling strategy to use when combining the embeddings of the tokens in a line into a single line embedding
TYPE: Literal['mean', 'max', 'sum']
DEFAULT: 'mean'
max_tokens_per_device
The maximum number of tokens that can be processed by the model on a single device. This does not affect the results but can be used to reduce the memory usage of the model, at the cost of a longer processing time.
TYPE: int
DEFAULT: 128 * 128
edspdf.pipes.embeddings.simple_text_embedding
","text":""},{"location":"reference/edspdf/pipes/embeddings/simple_text_embedding/#edspdf.pipes.embeddings.simple_text_embedding.SimpleTextEmbedding","title":"SimpleTextEmbedding
","text":" Bases: TrainablePipe[EmbeddingOutput]
A module that embeds the textual features of the blocks
"},{"location":"reference/edspdf/pipes/embeddings/simple_text_embedding/#edspdf.pipes.embeddings.simple_text_embedding.SimpleTextEmbedding--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONsize
Size of the output box embedding
TYPE: int
pipeline
The pipeline object
TYPE: Pipeline
DEFAULT: None
name
Name of the component
TYPE: str
DEFAULT: 'simple-text-embedding'
word_shape
","text":"Converts a word into its shape following the algorithm used in the spaCy library.
https://github.com/explosion/spaCy/blob/b69d249a/spacy/lang/lex_attrs.py#L118
PARAMETER DESCRIPTIONtext
TYPE: str
str
The word shape
"},{"location":"reference/edspdf/pipes/embeddings/sub_box_cnn_pooler/","title":"edspdf.pipes.embeddings.sub_box_cnn_pooler
","text":""},{"location":"reference/edspdf/pipes/embeddings/sub_box_cnn_pooler/#edspdf.pipes.embeddings.sub_box_cnn_pooler.SubBoxCNNPooler","title":"SubBoxCNNPooler
","text":" Bases: TrainablePipe[EmbeddingOutput]
One dimension CNN encoding multi-kernel layer. Input embeddings are convoluted using linear kernels each parametrized with a (window) size of kernel_size[kernel_i]
The output of the kernels are concatenated together, max-pooled and finally projected to a size of output_size
.
pipeline
Pipeline instance
TYPE: Pipeline
DEFAULT: None
name
Name of the component
TYPE: str
DEFAULT: 'sub-box-cnn-pooler'
output_size
Size of the output embeddings Defaults to the input_size
TYPE: Optional[int]
DEFAULT: None
out_channels
Number of channels
TYPE: Optional[int]
DEFAULT: None
kernel_sizes
Window size of each kernel
TYPE: Sequence[int]
DEFAULT: (3, 4, 5)
activation
Activation function to use
TYPE: ActivationFunction
DEFAULT: 'relu'
edspdf.pipes.extractors
","text":""},{"location":"reference/edspdf/pipes/extractors/pdfminer/","title":"edspdf.pipes.extractors.pdfminer
","text":""},{"location":"reference/edspdf/pipes/extractors/pdfminer/#edspdf.pipes.extractors.pdfminer.PdfMinerExtractor","title":"PdfMinerExtractor
","text":"We provide a PDF line extractor built on top of PdfMiner.
This is the most portable extractor, since it is pure-python and can therefore be run on any platform. Be sure to have a look at their documentation, especially the part providing a bird's eye view of the PDF extraction process.
"},{"location":"reference/edspdf/pipes/extractors/pdfminer/#edspdf.pipes.extractors.pdfminer.PdfMinerExtractor--examples","title":"Examples","text":"API-basedConfiguration-basedpipeline.add_pipe(\n \"pdfminer-extractor\",\n config=dict(\n extract_style=False,\n ),\n)\n
[components.extractor]\n@factory = \"pdfminer-extractor\"\nextract_style = false\n
And use the pipeline on a PDF document:
from pathlib import Path\n\n# Apply on a new document\npipeline(Path(\"path/to/your/pdf/document\").read_bytes())\n
"},{"location":"reference/edspdf/pipes/extractors/pdfminer/#edspdf.pipes.extractors.pdfminer.PdfMinerExtractor--parameters","title":"Parameters","text":"PARAMETER DESCRIPTION line_overlap
See PDFMiner documentation
TYPE: float
DEFAULT: 0.5
char_margin
See PDFMiner documentation
TYPE: float
DEFAULT: 2.05
line_margin
See PDFMiner documentation
TYPE: float
DEFAULT: 0.5
word_margin
See PDFMiner documentation
TYPE: float
DEFAULT: 0.1
boxes_flow
See PDFMiner documentation
TYPE: Optional[float]
DEFAULT: 0.5
detect_vertical
See PDFMiner documentation
TYPE: bool
DEFAULT: False
all_texts
See PDFMiner documentation
TYPE: bool
DEFAULT: False
extract_style
Whether to extract style (font, size, ...) information for each line of the document. Default: False
TYPE: bool
DEFAULT: False
render_pages
Whether to extract the rendered page as a numpy array in the page.image
attribute (defaults to False)
TYPE: bool
DEFAULT: False
render_dpi
DPI to use when rendering the page (defaults to 200)
TYPE: int
DEFAULT: 200
raise_on_error
Whether to raise an error if the PDF cannot be parsed. Default: False
TYPE: bool
DEFAULT: False
edspdf.utils
","text":""},{"location":"reference/edspdf/utils/alignment/","title":"edspdf.utils.alignment
","text":""},{"location":"reference/edspdf/utils/alignment/#edspdf.utils.alignment.align_box_labels","title":"align_box_labels
","text":"Align lines with possibly overlapping (and non-exhaustive) labels.
Possible matches are sorted by covered area. Lines with no overlap at all
PARAMETER DESCRIPTIONsrc_boxes
The labelled boxes that will be used to determine the label of the dst_boxes
TYPE: Sequence[Box]
dst_boxes
The non-labelled boxes that will be assigned a label
TYPE: Sequence[T]
threshold
Threshold to use for discounting a label. Used if the labels
DataFrame does not provide a threshold
column, or to fill NaN
values thereof.
TYPE: float
DEFAULT: 1
pollution_label
The label to use for boxes that are not covered by any of the source boxes
TYPE: Any
DEFAULT: None
List[Box]
A copy of the boxes, with the labels mapped from the source boxes
"},{"location":"reference/edspdf/utils/collections/","title":"edspdf.utils.collections
","text":""},{"location":"reference/edspdf/utils/collections/#edspdf.utils.collections.multi_tee","title":"multi_tee
","text":"Makes copies of an iterable such that every iteration over it starts from 0. If the iterable is a sequence (list, tuple), just returns it since every iter() over the object restart from the beginning
"},{"location":"reference/edspdf/utils/collections/#edspdf.utils.collections.FrozenDict","title":"FrozenDict
","text":" Bases: dict
Copied from spacy.util.SimpleFrozenDict
to ensure compatibility.
Initialize the frozen dict. Can be initialized with pre-defined values.
error (str): The error message when user tries to assign to dict.
"},{"location":"reference/edspdf/utils/collections/#edspdf.utils.collections.FrozenList","title":"FrozenList
","text":" Bases: list
Copied from spacy.util.SimpleFrozenDict
to ensure compatibility
Initialize the frozen list.
error (str): The error message when user tries to mutate the list.
"},{"location":"reference/edspdf/utils/optimization/","title":"edspdf.utils.optimization
","text":""},{"location":"reference/edspdf/utils/package/","title":"edspdf.utils.package
","text":""},{"location":"reference/edspdf/utils/package/#edspdf.utils.package.PoetryPackager","title":"PoetryPackager
","text":""},{"location":"reference/edspdf/utils/package/#edspdf.utils.package.PoetryPackager.ensure_pyproject","title":"ensure_pyproject
","text":"Generates a Poetry based pyproject.toml
"},{"location":"reference/edspdf/utils/random/","title":"edspdf.utils.random
","text":""},{"location":"reference/edspdf/utils/random/#edspdf.utils.random.set_seed","title":"set_seed
","text":"Set seed values for random generators. If used as a context, restore the random state used before entering the context.
"},{"location":"reference/edspdf/utils/random/#edspdf.utils.random.set_seed--parameters","title":"Parameters","text":"PARAMETER DESCRIPTIONseed
Value used as a seed.
cuda
Saves the cuda random states too
DEFAULT: is_available()
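A minimal sketch of both usages, assuming set_seed is imported from edspdf.utils.random:

import random

from edspdf.utils.random import set_seed

set_seed(42)          # seed the global torch / numpy / random generators
print(random.random())

with set_seed(0):     # temporarily re-seed inside the context
    print(random.random())
print(random.random())  # the previous random state is restored on exit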
get_random_generator_state

Get the torch, numpy and random (standard library) generator states.

PARAMETER | DESCRIPTION
---|---
`cuda` | Saves the CUDA random states too. DEFAULT: `torch.cuda.is_available()`

RETURNS | DESCRIPTION
---|---
`RandomGeneratorState` | The captured state.

set_random_generator_state

Set the torch, numpy and random (standard library) generator states.

PARAMETER | DESCRIPTION
---|---
`state` | A `RandomGeneratorState`, as returned by `get_random_generator_state`.
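A hedged sketch of the round trip between the two helpers:

import random

from edspdf.utils.random import (
    get_random_generator_state,
    set_random_generator_state,
)

state = get_random_generator_state()  # snapshot the torch / numpy / random states
first = random.random()

set_random_generator_state(state)     # restore the snapshot
second = random.random()

assert first == second  # the same draw is reproduced after restoring the state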
"},{"location":"reference/edspdf/utils/torch/","title":"
edspdf.utils.torch
","text":""},{"location":"reference/edspdf/utils/torch/#edspdf.utils.torch.compute_pdf_relative_positions","title":"compute_pdf_relative_positions
","text":"Compute relative positions between boxes. Input boxes must be split between pages with the shape n_pages * n_boxes
PARAMETER DESCRIPTIONx0
y0
x1
y1
width
height
n_relative_positions
Maximum range of embeddable relative positions between boxes (further distances will be capped to \u00b1n_relative_positions // 2)
RETURNS DESCRIPTION
LongTensor
Shape: n_pages * n_boxes * n_boxes * 2
"},{"location":"reference/edspdf/visualization/","title":"edspdf.visualization
","text":""},{"location":"reference/edspdf/visualization/annotations/","title":"edspdf.visualization.annotations
","text":""},{"location":"reference/edspdf/visualization/annotations/#edspdf.visualization.annotations.show_annotations","title":"show_annotations
","text":"Show Box annotations on a PDF document.
PARAMETER DESCRIPTIONpdf
Bytes content of the PDF document
TYPE: bytes
annotations
List of Box annotations to show
TYPE: Sequence[Box]
colors
Colors to use for each label. If a list is provided, it will be used to color the first len(colors)
unique labels. If a dictionary is provided, it will be used to color the labels in the dictionary. If None, a default color scheme will be used.
TYPE: Optional[Union[Dict[str, str], List[str]]]
DEFAULT: None
List[PpmImageFile]
List of PIL images with the annotations. You can display them in a notebook with display(*pages)
.
compare_results

Compare two sets of annotations on a PDF document.

PARAMETER | DESCRIPTION
---|---
`pdf` | Bytes content of the PDF document. TYPE: `bytes`
`pred` | List of Box annotations to show on the left side. TYPE: `Sequence[Box]`
`gold` | List of Box annotations to show on the right side. TYPE: `Sequence[Box]`
`colors` | Colors to use for each label. If a list is provided, it will be used to color the first `len(colors)` unique labels. If a dictionary is provided, it will be used to color the labels in the dictionary. If None, a default color scheme will be used. TYPE: `Optional[Union[Dict[str, str], List[str]]]`, DEFAULT: `None`

RETURNS | DESCRIPTION
---|---
`List[PpmImageFile]` | List of PIL images with the annotations. You can display them in a notebook with `display(*pages)`.
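A minimal, hedged sketch of a comparison call, assuming compare_results is exposed from edspdf.visualization alongside show_annotations; pred_boxes and gold_boxes are hypothetical placeholders for annotations you already have.

from pathlib import Path

from edspdf.visualization import compare_results

pdf = Path("path/to/your/pdf/document").read_bytes()
pred_boxes = ...  # e.g. model(pdf).text_boxes
gold_boxes = ...  # e.g. boxes from a labelled dataset

pages = compare_results(pdf=pdf, pred=pred_boxes, gold=gold_boxes)
pages[0]  # in a notebook, shows predictions (left) next to gold (right)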
edspdf.visualization.merge

merge_boxes

Recursively merge boxes that have the same label to form larger non-overlapping boxes.

PARAMETER | DESCRIPTION
---|---
`boxes` | List of boxes to merge. TYPE: `Sequence[Box]`

RETURNS | DESCRIPTION
---|---
`List[Box]` | List of merged boxes
"},{"location":"utilities/","title":"Overview","text":"EDS-PDF provides a few utilities help annotate PDF documents, and debug the output of an extraction pipeline.
"},{"location":"utilities/alignment/","title":"Alignment","text":"To simplify the annotation process, EDS-PDF provides a utility that aligns bounding boxes with text blocs extracted from a PDF document. This is particularly useful for annotating documents.
BlocsBlocs + AnnotationAlignedMerged Blocs "},{"location":"utilities/visualisation/","title":"Visualisation","text":"EDS-PDF provides utilities to help you visualise the output of the pipeline.
"},{"location":"utilities/visualisation/#visualising-a-pipelines-output","title":"Visualising a pipeline's output","text":"You can use EDS-PDF to overlay labelled bounding boxes on top of a PDF document.
import edspdf\nfrom confit import Config\nfrom pathlib import Path\nfrom edspdf.visualization import show_annotations\n\nconfig = \"\"\"\n[pipeline]\npipeline = [\"extractor\", \"classifier\"]\n\n[components]\n\n[components.extractor]\n@factory = \"pdfminer-extractor\"\nextract_style = true\n\n[components.classifier]\n@factory = \"mask-classifier\"\nx0 = 0.25\nx1 = 0.95\ny0 = 0.3\ny1 = 0.9\nthreshold = 0.1\n\"\"\"\n\nmodel = edspdf.load(Config.from_str(config))\n\n# Get a PDF\npdf = Path(\"/Users/perceval/Development/edspdf/tests/resources/letter.pdf\").read_bytes()\n\n# Construct the DataFrame of blocs\ndoc = model(pdf)\n\n# Compute an image representation of each page of the PDF\n# overlaid with the predicted bounding boxes\nimgs = show_annotations(pdf=pdf, annotations=doc.text_boxes)\n\nimgs[0]\n
If you run this code in a Jupyter notebook, you'll see the following:
"},{"location":"utilities/visualisation/#merging-blocs-together","title":"Merging blocs together","text":"To help debug a pipeline (or a labelled dataset), you might want to merge blocs together according to their labels. EDS-PDF provides a merge_lines
method that does just that.
# \u2191 Omitted code above \u2191\nfrom edspdf.visualization import merge_boxes, show_annotations\n\nmerged = merge_boxes(doc.text_boxes)\n\nimgs = show_annotations(pdf=pdf, annotations=merged)\nimgs[0]\n
See the difference:
OriginalMergedThe merge_boxes
method uses the notion of maximal cliques to compute merges. It forbids the combined blocs from overlapping with any bloc from another label.
Trainable pipes allow for deep learning operations to be performed on the PDFDoc object and must be trained to be used. +Such pipes can be used to train a model to predict the label of the lines extracted from a PDF document.
+Building and running deep learning models usually requires preprocessing the input sample into features, batching or "collating" these features together to process multiple samples at once, running deep learning operations over these features (in Pytorch, this step is done in the forward
method) and postprocessing the outputs of these operation to complete the original sample.
In the trainable pipes of EDS-PDF, preprocessing and postprocessing are decoupled from the deep learning code but collocated with the forward method. This is achieved by splitting the class of a trainable component into four methods, which allows us to keep the development of new deep-learning components simple while ensuring efficient models both during training and inference.
+preprocess
Preprocess the document to extract features that will be used by the +neural network to perform its predictions.
+ +PARAMETER | +DESCRIPTION | +
---|---|
doc |
+
+ PDFDocument to preprocess +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Dict[str, Any]
+
+ |
+
+
+
+ Dictionary (optionally nested) containing the features extracted from +the document. + |
+
collate
Collate the batch of features into a single batch of tensors that can be +used by the forward method of the component.
+ +PARAMETER | +DESCRIPTION | +
---|---|
batch |
+
+ Batch of features +
+
+ TYPE:
+ |
+
device |
+
+ Device on which the tensors should be moved +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ InputBatch
+
+ |
+
+
+
+ Dictionary (optionally nested) containing the collated tensors + |
+
forward
Perform the forward pass of the neural network, i.e, apply transformations +over the collated features to compute new embeddings, probabilities, losses, etc
+ +PARAMETER | +DESCRIPTION | +
---|---|
batch |
+
+ Batch of tensors (nested dictionary) computed by the collate method +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ OutputBatch
+
+ |
+
+
+
+
+ |
+
postprocess
Update the documents with the predictions of the neural network, for instance +converting label probabilities into label attributes on the document lines.
+By default, this is a no-op.
+ +PARAMETER | +DESCRIPTION | +
---|---|
docs |
+
+ Batch of documents +
+
+ TYPE:
+ |
+
batch |
+
+ Batch of predictions, as returned by the forward method +
+
+ TYPE:
+ |
+
RETURNS | +DESCRIPTION | +
---|---|
+
+ Sequence[PDFDoc]
+
+ |
+
+
+
+
+ |
+
Additionally, there is a fifth method:
+post_init
This method completes the attributes of the component, by looking at some +documents. It is especially useful to build vocabularies or detect the labels +of a classification task.
+ +PARAMETER | +DESCRIPTION | +
---|---|
gold_data |
+
+ The documents to use for initialization. +
+
+ TYPE:
+ |
+
exclude |
+
+ The names of components to exclude from initialization. +This argument will be gradually updated with the names of initialized +components +
+
+ TYPE:
+ |
+
Here is an example of a trainable component:
+from typing import Any, Dict, Iterable, Sequence
+
+import torch
+from tqdm import tqdm
+
+from edspdf import Pipeline, TrainablePipe, registry
+from edspdf.structures import PDFDoc
+
+
+@registry.factory.register("my-component")
+class MyComponent(TrainablePipe):
+ def __init__(
+ self,
+ # A subcomponent
+ pipeline: Pipeline,
+ name: str,
+ embedding: TrainablePipe,
+ ):
+ super().__init__(pipeline=pipeline, name=name)
+ self.embedding = embedding
+
+ def post_init(self, gold_data: Iterable[PDFDoc], exclude: set):
+ # Initialize the component with the gold documents
+ with self.label_vocabulary.initialization():
+ for doc in tqdm(gold_data, desc="Initializing the component"):
+ # Do something like learning a vocabulary over the initialization
+ # documents
+ ...
+
+ # And post_init the subcomponent
+ exclude.add(self.name)
+ self.embedding.post_init(gold_data, exclude)
+
+ # Initialize any layer that might be missing from the module
+ self.classifier = torch.nn.Linear(...)
+
+ def preprocess(self, doc: PDFDoc, supervision: bool = False) -> Dict[str, Any]:
+ # Preprocess the doc to extract features required to run the embedding
+ # subcomponent, and this component
+ return {
+ "embedding": self.embedding.preprocess_supervised(doc),
+ "my-feature": ...(doc),
+ }
+
+ def collate(self, batch, device: torch.device) -> Dict:
+ # Collate the features of the "embedding" subcomponent
+ # and the features of this component as well
+ return {
+ "embedding": self.embedding.collate(batch["embedding"], device),
+ "my-feature": torch.as_tensor(batch["my-feature"], device=device),
+ }
+
+ def forward(self, batch: Dict, supervision=False) -> Dict:
+ # Call the embedding subcomponent
+ embeds = self.embedding(batch["embedding"])
+
+ # Do something with the embedding tensors
+ output = ...(embeds)
+
+ return output
+
+ def postprocess(self, docs: Sequence[PDFDoc], output: Dict) -> Sequence[PDFDoc]:
+ # Annotate the docs with the outputs of the forward method
+ ...
+ return docs
+
Like pytorch modules, you can compose trainable pipes together to build complex architectures. For instance, a trainable classifier component may delegate some of its logic to an embedding component, which will only be responsible for converting PDF lines into multidimensional arrays of numbers.
+Nesting pipes allows switching parts of the neural networks to test various architectures and keeping the modelling logic modular.
+Sharing parts of a neural network while training on different tasks can be an effective way to improve the network efficiency. For instance, it is common to share an embedding layer between multiple tasks that require embedding the same inputs.
+In EDS-PDF, sharing a subcomponent is simply done by sharing the object between the multiple pipes. You can either refer to an existing subcomponent when configuring a new component in Python, or use the interpolation mechanism of our configuration system.
+pipeline.add_pipe(
+ "my-component-1",
+ name="first",
+ config={
+ "embedding": {
+ "@factory": "box-embedding",
+ # ...
+ }
+ },
+)
+pipeline.add_pipe(
+ "my-component-2",
+ name="second",
+ config={
+ "embedding": pipeline.components.first.embedding,
+ },
+)
+
[components.first]
+@factory = "my-component-1"
+
+[components.first.embedding]
+@factory = "box-embedding"
+...
+
+[components.second]
+@factory = "my-component-2"
+embedding = ${components.first.embedding}
+
To avoid recomputing the preprocess
/ forward
and collate
in the multiple components that use it, we rely on a light cache system.
During the training loop, when computing the loss for each component, the forward calls must be wrapped by the pipeline.cache()
context to enable this caching mechanism between components.
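As an illustration, here is a hedged sketch of what this might look like in a training loop; train_dataloader, optimizer and the loss computation are hypothetical placeholders, and only the pipeline.cache() context comes from the description above.

for batch in train_dataloader:          # hypothetical dataloader of collated batches
    with pipeline.cache():              # share preprocess / collate / forward results
        loss = ...                      # accumulate each trainable component's loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()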
EDS-PDF provides a few utilities to help annotate PDF documents, and debug the output of an extraction pipeline.
+EDS-PDF provides utilities to help you visualise the output of the pipeline.
+You can use EDS-PDF to overlay labelled bounding boxes on top of a PDF document.
+import edspdf
+from confit import Config
+from pathlib import Path
+from edspdf.visualization import show_annotations
+
+config = """
+[pipeline]
+pipeline = ["extractor", "classifier"]
+
+[components]
+
+[components.extractor]
+@factory = "pdfminer-extractor"
+extract_style = true
+
+[components.classifier]
+@factory = "mask-classifier"
+x0 = 0.25
+x1 = 0.95
+y0 = 0.3
+y1 = 0.9
+threshold = 0.1
+"""
+
+model = edspdf.load(Config.from_str(config))
+
+# Get a PDF
+pdf = Path("/Users/perceval/Development/edspdf/tests/resources/letter.pdf").read_bytes()
+
+# Construct the DataFrame of blocs
+doc = model(pdf)
+
+# Compute an image representation of each page of the PDF
+# overlaid with the predicted bounding boxes
+imgs = show_annotations(pdf=pdf, annotations=doc.text_boxes)
+
+imgs[0]
+
If you run this code in a Jupyter notebook, you'll see the following:
+ +To help debug a pipeline (or a labelled dataset), you might want to
+merge blocs together according to their labels. EDS-PDF provides a merge_lines
method
+that does just that.
# ↑ Omitted code above ↑
+from edspdf.visualization import merge_boxes, show_annotations
+
+merged = merge_boxes(doc.text_boxes)
+
+imgs = show_annotations(pdf=pdf, annotations=merged)
+imgs[0]
+
See the difference:
+ +The merge_boxes
method uses the notion of maximal cliques to compute merges.
+It forbids the combined blocs from overlapping with any bloc from another label.