API update (data & processing) #25

Merged
merged 15 commits into from
Feb 24, 2024
7 changes: 7 additions & 0 deletions .github/workflows/tests.yml
@@ -31,6 +31,13 @@ jobs:
python-version: ${{ matrix.python-version }}
architecture: x64

- name: Cache HuggingFace Models
uses: actions/cache@v2
id: cache-huggingface
with:
path: ~/.cache/huggingface/
key: ${{ matrix.python-version }}-huggingface

- name: Install hatch
run: pip install hatch

2 changes: 1 addition & 1 deletion .gitignore
@@ -63,7 +63,7 @@ report.xml
*.pickle
*.joblib
*.pdf
data/
/data/

# MkDocs output
docs/reference
28 changes: 28 additions & 0 deletions changelog.md
@@ -1,5 +1,33 @@
# Changelog



## v0.9.0

### Added

- New unified `edspdf.data` API (PDF files, pandas, parquet) and LazyCollection object
to efficiently read / write data from / to different formats & sources. This API
has been heavily inspired by the `edsnlp.data` API.
- New unified processing API to select the execution backend via `data.set_processing(...)`
to replace the old `accelerators` API (which is now deprecated, but still available).
- `huggingface-embedding` now supports quantization and other `AutoModel.from_pretrained` kwargs
- It is now possible to convert a label to multiple labels in the `simple-aggregator` component:

```ini
# To build the "text" field, we will aggregate "title", "body" and "table" lines,
# and output "title" lines in a separate field as well.
label_map = {
"text" : [ "title", "body", "table" ],
"title": "title",
}
```
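
As a rough illustration of what such a one-to-many mapping implies, the sketch below aggregates labelled lines into output fields in plain Python. It is not the `simple-aggregator` implementation: the `aggregate` and `invert_label_map` helpers and the `(label, text)` line format are assumptions made for this example only.

```python
from collections import defaultdict


def invert_label_map(label_map):
    """Map each source label to the list of output fields it feeds (hypothetical helper)."""
    targets = defaultdict(list)
    for field, sources in label_map.items():
        # A plain string is treated as a one-element list
        if isinstance(sources, str):
            sources = [sources]
        for source in sources:
            targets[source].append(field)
    return dict(targets)


def aggregate(lines, label_map):
    """Join the text of labelled lines into one string per output field (hypothetical helper)."""
    targets = invert_label_map(label_map)
    fields = {field: [] for field in label_map}
    for label, text in lines:
        # Lines whose label is not mapped to any field are dropped
        for field in targets.get(label, []):
            fields[field].append(text)
    return {field: "\n".join(texts) for field, texts in fields.items()}


label_map = {"text": ["title", "body", "table"], "title": "title"}
lines = [("title", "Report"), ("body", "Patient is fine."), ("footer", "Page 1")]
result = aggregate(lines, label_map)
```

Here the "title" line ends up in both the `text` and `title` fields, while the unmapped "footer" line is ignored.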

### Fixed

- `huggingface-embedding` now resizes bbox features for large PDFs, instead of making the model crash
- `huggingface-embedding` and `sub-box-cnn-pooler` now handle empty PDFs correctly

## v0.8.1

### Fixed
Binary file added docs/assets/images/multiprocessing.png
3 changes: 0 additions & 3 deletions docs/assets/images/multiprocessing.svg

This file was deleted.

10 changes: 4 additions & 6 deletions docs/index.md
@@ -99,12 +99,10 @@ See the [rule-based recipe](recipes/rule-based.md) for a step-by-step explanatio
If you use EDS-PDF, please cite us as below.

```bibtex
@software{edspdf,
author = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
doi = {10.5281/zenodo.6902977},
license = {BSD-3-Clause},
title = {{EDS-PDF: Smart text extraction from PDF documents}},
url = {https://github.com/aphp/edspdf}
@article{gerardin_wajsburt_pdf,
title={Bridging Clinical PDFs and Downstream Natural Language Processing: An Efficient Neural Approach to Layout Segmentation},
author={G{\'e}rardin, Christel Ducroz and Wajsburt, Perceval and Dura, Basile and Calliger, Alice and Mouchet, Alexandre and Tannier, Xavier and Bey, Romain},
journal={Available at SSRN 4587624}
}
```

123 changes: 93 additions & 30 deletions docs/inference.md
@@ -1,61 +1,124 @@
# Inference

Once you have obtained a pipeline, either by composing rule-based components, training a model or loading a model from the disk, you can use it to make predictions on documents. This is referred to as inference.
Once you have obtained a pipeline, either by composing rule-based components, training a model or loading a model from the disk, you can use it to make predictions on documents. This is referred to as inference. This page answers the following questions :

> How do we leverage computational resources to run a model on many documents?

> How do we connect to various data sources to retrieve documents?


## Inference on a single document

In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:
In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:

- a sequence of bytes
- or a [PDFDoc][edspdf.structures.PDFDoc] object
- a bytes string
- or a [PDFDoc](https://spacy.io/api/doc) object

```python
```{ .python .no-check }
from pathlib import Path

pipeline = ...
content = Path("path/to/.pdf").read_bytes()
doc = pipeline(content)
model = ...
pdf_bytes = b"..."
doc = model(pdf_bytes)
```

If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline. To leverage multiple GPUs, refer to the [multiprocessing accelerator][edspdf.accelerators.multiprocessing.MultiprocessingAccelerator] description below.
If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline.

```python
pipeline.to("cuda") # same semantics as pytorch
doc = pipeline(content)
```{ .python .no-check }
model.to("cuda") # same semantics as pytorch
doc = model(pdf_bytes)
```

## Inference on multiple documents
To leverage multiple GPUs when processing multiple documents, refer to the [multiprocessing backend][edspdf.processing.multiprocessing.execute_multiprocessing_backend] description below.

## Inference on multiple documents {: #edspdf.lazy_collection.LazyCollection }

When processing multiple documents, we can optimize the inference by parallelizing the computation on a single core, multiple cores and GPUs or even multiple machines.

### Lazy collection

These optimizations are enabled by performing *lazy inference* : the operations (e.g., reading a document, converting it to a PDFDoc, running the different pipes of a model or writing the result somewhere) are not executed immediately but are instead scheduled in a [LazyCollection][edspdf.lazy_collection.LazyCollection] object. It can then be executed by calling the `execute` method, iterating over it or calling a writing method (e.g., `to_pandas`). In fact, data connectors like `edspdf.data.read_files` return a lazy collection, as well as the `model.pipe` method.

A lazy collection contains :

- a `reader`: the source of the data (e.g., a file, a database, a list of strings, etc.)
- the list of operations to perform, stored in a `pipeline` attribute that holds, for each operation, its name (if any), the function / pipe, its keyword arguments and its context
- an optional `writer`: the destination of the data (e.g., a file, a database, a list of strings, etc.)
- the execution `config`, containing the backend to use and its configuration such as the number of workers, the batch size, etc.

All methods (`.map`, `.map_pipeline`, `.set_processing`) of the lazy collection are chainable, meaning that they return a new object (no in-place modification).
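
To make the "scheduled, not executed" and "chainable" ideas concrete, here is a minimal toy sketch in plain Python. It is not the `edspdf` implementation: `ToyLazyCollection` and its fields are hypothetical stand-ins that only mimic the behaviour described above (each method returns a new object, and work happens only on iteration).

```python
from dataclasses import dataclass, field, replace
from typing import Any, Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class ToyLazyCollection:
    """Hypothetical stand-in for a lazy collection (illustration only)."""

    reader: Iterable[Any]
    pipeline: Tuple[Tuple[Callable[..., Any], Dict[str, Any]], ...] = ()
    config: Dict[str, Any] = field(default_factory=dict)

    def map(self, fn, kwargs=None):
        # Return a NEW collection with one more scheduled operation
        return replace(self, pipeline=self.pipeline + ((fn, kwargs or {}),))

    def set_processing(self, **options):
        # Return a NEW collection with an updated execution config
        return replace(self, config={**self.config, **options})

    def __iter__(self):
        # Operations only run here, when the collection is iterated
        for item in self.reader:
            for fn, kwargs in self.pipeline:
                item = fn(item, **kwargs)
            yield item


calls = []


def shout(text, suffix="!"):
    calls.append(text)
    return text.upper() + suffix


base = ToyLazyCollection(reader=["a", "b"])
scheduled = base.map(shout, {"suffix": "?"}).set_processing(num_cpu_workers=4)
calls_before = list(calls)  # still empty: nothing has been executed yet
results = list(scheduled)  # iterating triggers the scheduled operations
```

In this toy version `base` is left untouched by the chained calls, which is the property that makes intermediate collections safe to reuse.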

For instance, the following code will load a model, read a folder of JSON files, apply the model to each document and write the result in a Parquet folder, using 4 CPUs and 2 GPUs.

When processing multiple documents, it is usually more efficient to use the `pipeline.pipe(...)` method, especially when using deep learning components, since this allows matrix multiplications to be batched together. Depending on your computational resources and requirements, EDS-PDF comes with various "accelerators" to speed up inference (see the [Accelerators](#accelerators) section for more details). By default, the `.pipe()` method uses the [`simple` accelerator][edspdf.accelerators.simple.SimpleAccelerator] but you can switch to a different one by passing the `accelerator` argument.
```{ .python .no-check }
import edspdf
from edspdf.structures import PDFDoc

```python
pipeline = ...
docs = pipeline.pipe(
[content1, content2, ...],
batch_size=16, # optional, default to the one defined in the pipeline
accelerator=my_accelerator,
# Load or create a model, for instance following the "Recipes"
model = edspdf.load("path/to/model")

# Read some data (this is lazy, no data will be read until the end of this snippet)
data = edspdf.data.read_files(
"/Users/perceval/Development/edspdf/tests/resources/",
# dict to doc converter function
converter=lambda x: PDFDoc(id=x["id"], content=x["content"]),
)

# Apply each pipe of the model to our documents
data = data.map_pipeline(model)
# or equivalently : data = model.pipe(data)

# Configure the execution
data = data.set_processing(
# 4 CPUs to parallelize rule-based pipes, IO and preprocessing
num_cpu_workers=4,
# 2 GPUs to accelerate deep-learning pipes
num_gpu_workers=2,
)

# Write the result, this will execute the lazy collection
data.write_parquet(
"path/to/output_folder",
# doc to dict converter function
converter=lambda doc: {
"id": doc.id,
"text": (
doc.aggregated_texts["body"].text
if "body" in doc.aggregated_texts
else ""
),
},
)
```

The `pipe` method supports the following arguments :
### Applying operations to a lazy collection

To apply an operation to a lazy collection, you can use the `.map` method. It takes a callable as input and an optional dictionary of keyword arguments. The function will be applied to each element of the collection.

To apply a model, you can use the `.map_pipeline` method. It takes a model as input and will add every pipe of the model to the scheduled operations.

In both cases, the operations will not be executed immediately but will be scheduled to be executed when iterating over the collection, or calling the `.execute`, `.to_*` or `.write_*` methods.
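
The difference between scheduling a single function and scheduling every pipe of a model can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: `schedule_map`, `schedule_pipeline` and `execute` are hypothetical helpers, and the toy "model" is just a list of `(name, pipe)` pairs.

```python
def schedule_map(ops, fn, kwargs=None):
    """Schedule one unnamed function with optional keyword arguments (hypothetical)."""
    return ops + [(None, fn, kwargs or {})]


def schedule_pipeline(ops, model):
    """Schedule every pipe of a toy model, given as (name, pipe) pairs (hypothetical)."""
    return ops + [(name, pipe, {}) for name, pipe in model]


def execute(docs, ops):
    """Generator: the scheduled operations only run while it is being iterated."""
    for doc in docs:
        for _name, fn, kwargs in ops:
            doc = fn(doc, **kwargs)
        yield doc


def decode(raw, encoding):
    return raw.decode(encoding)


# Toy "model" made of two named pipes
toy_model = [("extractor", str.strip), ("classifier", str.lower)]

ops = schedule_map([], decode, {"encoding": "utf-8"})
ops = schedule_pipeline(ops, toy_model)
names = [name for name, _fn, _kwargs in ops]

results = list(execute([b"  Hello ", b"WORLD  "], ops))
```

One scheduled entry is created per pipe, so a backend is free to treat each pipe separately (for example, batching only the deep-learning ones).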

### Execution of a lazy collection {: #edspdf.lazy_collection.LazyCollection.set_processing }

You can configure how the operations performed in the lazy collection are executed by calling its `set_processing(...)` method. The following options are available :

::: edspdf.pipeline.Pipeline.pipe
::: edspdf.lazy_collection.LazyCollection.set_processing
options:
heading_level: 3
only_parameters: true
only_parameters: "no-header"

## Accelerators
## Backends

### Simple accelerator {: #edspdf.accelerators.simple.SimpleAccelerator }
### Simple backend {: #edspdf.processing.simple.execute_simple_backend }

::: edspdf.accelerators.simple.SimpleAccelerator
::: edspdf.processing.simple.execute_simple_backend
options:
heading_level: 3
only_class_level: true
show_source: false

### Multiprocessing accelerator {: #edspdf.accelerators.multiprocessing.MultiprocessingAccelerator }
### Multiprocessing backend {: #edspdf.processing.multiprocessing.execute_multiprocessing_backend }

::: edspdf.accelerators.multiprocessing.MultiprocessingAccelerator
::: edspdf.processing.multiprocessing.execute_multiprocessing_backend
options:
heading_level: 3
only_class_level: true
show_source: false