feat: add multiprocessing accelerator with tests & docs
percevalw committed Sep 7, 2023
1 parent 46c4d46 commit e493d7b
Showing 11 changed files with 758 additions and 32 deletions.
1 change: 1 addition & 0 deletions changelog.md
@@ -6,6 +6,7 @@

- Add multi-modal transformers (`huggingface-embedding`) with windowing options
- Add `render_page` option to `pdfminer` extractor, for multi-modal PDF features
- Add inference utilities (`accelerators`), with simple mono-process support and multi-GPU / multi-CPU support

### Changed

3 changes: 3 additions & 0 deletions docs/assets/images/multiprocessing.svg
4 changes: 2 additions & 2 deletions docs/assets/stylesheets/extra.css
@@ -155,6 +155,6 @@ body, input {
     margin-top: 1.5rem;
 }
 
-.references {
-
+.doc td > code {
     word-break: normal;
 }
61 changes: 61 additions & 0 deletions docs/inference.md
@@ -0,0 +1,61 @@
# Inference

Once you have obtained a pipeline, either by composing rule-based components, by training a model, or by loading one from disk, you can use it to make predictions on documents. This is referred to as inference.

## Inference on a single document

In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:

- a sequence of bytes
- or a [PDFDoc][edspdf.structures.PDFDoc] object

```python
from pathlib import Path

pipeline = ...
content = Path("path/to/.pdf").read_bytes()
doc = pipeline(content)
```

If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline. To leverage multiple GPUs, refer to the [multiprocessing accelerator][edspdf.accelerators.multiprocessing.MultiprocessingAccelerator] description below.

```python
pipeline.to("cuda") # same semantics as pytorch
doc = pipeline(content)
```

## Inference on multiple documents

When processing multiple documents, it is usually more efficient to use the `pipeline.pipe(...)` method, especially with deep learning components, since this allows matrix multiplications to be batched together. Depending on your computational resources and requirements, EDS-PDF comes with various "accelerators" to speed up inference (see the [Accelerators](#accelerators) section for more details). By default, the `.pipe()` method uses the [`simple` accelerator][edspdf.accelerators.simple.SimpleAccelerator], but you can switch to a different one by passing the `accelerator` argument.

```python
pipeline = ...
docs = pipeline.pipe(
[content1, content2, ...],
    batch_size=16,  # optional, defaults to the one defined in the pipeline
accelerator=my_accelerator,
)
```
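
Under the hood, batching simply groups the incoming document stream into fixed-size chunks before each deep-learning component runs its forward pass. A minimal sketch of that chunking logic (an illustration of the idea, not EDS-PDF's actual code):

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream of items into lists of at most `batch_size` elements."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

Each yielded chunk can then be processed in a single forward pass, which is where the speed-up over one-document-at-a-time calls comes from.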

The `pipe` method supports the following arguments:

::: edspdf.pipeline.Pipeline.pipe
options:
heading_level: 3
only_parameters: true

## Accelerators

### Simple accelerator {: #edspdf.accelerators.simple.SimpleAccelerator }

::: edspdf.accelerators.simple.SimpleAccelerator
options:
heading_level: 3
only_class_level: true

### Multiprocessing accelerator {: #edspdf.accelerators.multiprocessing.MultiprocessingAccelerator }

::: edspdf.accelerators.multiprocessing.MultiprocessingAccelerator
options:
heading_level: 3
only_class_level: true
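
To give an intuition of what a multiprocessing accelerator does, here is a toy sketch of the underlying scheme: several workers consume documents from a shared queue in parallel, and results are re-ordered so the output matches the input order. This illustration uses threads for brevity and is not EDS-PDF's implementation, which dispatches work to separate CPU and GPU processes:

```python
import queue
import threading
from typing import Callable, List


def parallel_map(fn: Callable, items: List, num_workers: int = 2) -> List:
    # Fill a shared task queue with (index, item) pairs so that results
    # can be written back in the original order.
    tasks: "queue.Queue" = queue.Queue()
    for idx, item in enumerate(items):
        tasks.put((idx, item))
    results = [None] * len(items)

    def worker():
        while True:
            try:
                idx, item = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, worker exits
            results[idx] = fn(item)  # each worker processes its own share

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```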
2 changes: 2 additions & 0 deletions docs/pipeline.md
@@ -57,6 +57,8 @@ model(pdf_bytes)
model.pipe([pdf_bytes, ...])
```

For more information on how to use the pipeline, refer to the [Inference](../inference) page.

## Hybrid models

EDS-PDF was designed to facilitate the training and inference of hybrid models that
43 changes: 19 additions & 24 deletions edspdf/accelerators/base.py
@@ -25,23 +25,18 @@ def __get_validators__(cls):
     @classmethod
     def validate(cls, value, config=None):
         if isinstance(value, str):
-            return FromDictFieldsToDoc(value)
-        elif isinstance(value, dict):
-            return FromDictFieldsToDoc(**value)
-        elif callable(value):
+            value = {"content_field": value}
+        if isinstance(value, dict):
+            value = FromDictFieldsToDoc(**value)
+        if callable(value):
             return value
-        else:
-            raise TypeError(
-                f"Invalid entry {value} ({type(value)}) for ToDoc, "
-                f"expected string, a dict or a callable."
-            )
+
+        raise TypeError(
+            f"Invalid entry {value} ({type(value)}) for ToDoc, "
+            f"expected string, a dict or a callable."
+        )

def identity(x):
return x
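
The rewritten `validate` above replaces the `if/elif/else` ladder with a cascade in which each step normalizes the value one level closer to a callable: a `str` is promoted to a `dict`, a `dict` to a `FromDictFieldsToDoc` instance, and any callable passes through. A self-contained sketch of the pattern, with `FromDictFieldsToDoc` stubbed out for illustration (the real class builds a `PDFDoc` from the configured fields):

```python
class FromDictFieldsToDoc:
    # Stub for illustration only: here it just reads one field of a dict.
    def __init__(self, content_field):
        self.content_field = content_field

    def __call__(self, item):
        return item[self.content_field]


def validate(value):
    if isinstance(value, str):
        value = {"content_field": value}  # promote str to dict
    if isinstance(value, dict):
        value = FromDictFieldsToDoc(**value)  # promote dict to callable
    if callable(value):
        return value
    raise TypeError(
        f"Invalid entry {value} ({type(value)}) for ToDoc, "
        f"expected string, a dict or a callable."
    )
```

Because every branch funnels into the `callable` check, the `else` clause disappears and the error path is reached only when nothing could be normalized.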


-FROM_DOC_TO_DICT_FIELDS_TEMPLATE = """\
+FROM_DOC_TO_DICT_FIELDS_TEMPLATE = """
 def fn(doc):
     return {X}
 """
@@ -50,8 +45,10 @@ def fn(doc):
 class FromDocToDictFields:
     def __init__(self, mapping):
         self.mapping = mapping
-        dict_fields = ", ".join(f"{k}: doc.{v}" for k, v in mapping.items())
-        self.fn = eval(FROM_DOC_TO_DICT_FIELDS_TEMPLATE.replace("X", dict_fields))
+        dict_fields = ", ".join(f"{repr(k)}: doc.{v}" for k, v in mapping.items())
+        local_vars = {}
+        exec(FROM_DOC_TO_DICT_FIELDS_TEMPLATE.replace("X", dict_fields), local_vars)
+        self.fn = local_vars["fn"]
 
     def __reduce__(self):
         return FromDocToDictFields, (self.mapping,)
@@ -75,14 +72,13 @@ def __get_validators__(cls):
     @classmethod
     def validate(cls, value, config=None):
         if isinstance(value, dict):
-            return FromDocToDictFields(value)
-        elif callable(value):
+            value = FromDocToDictFields(value)
+        if callable(value):
             return value
-        else:
-            raise TypeError(
-                f"Invalid entry {value} ({type(value)}) for ToDoc, "
-                f"expected dict or callable"
-            )
+        raise TypeError(
+            f"Invalid entry {value} ({type(value)}) for ToDoc, "
+            f"expected dict or callable"
+        )


class Accelerator:
@@ -92,7 +88,6 @@ def __call__(
         model: Any,
         to_doc: ToDoc = FromDictFieldsToDoc("content"),
         from_doc: FromDoc = lambda doc: doc,
-        component_cfg: Dict[str, Dict[str, Any]] = None,
     ):
         raise NotImplementedError()

