Accelerators #19

Merged
merged 2 commits into from
Sep 7, 2023
1 change: 1 addition & 0 deletions changelog.md
@@ -6,6 +6,7 @@

- Add multi-modal transformers (`huggingface-embedding`) with windowing options
- Add `render_page` option to `pdfminer` extractor, for multi-modal PDF features
- Add inference utilities (`accelerators`), with simple single-process support and multi-GPU / multi-CPU support

### Changed

3 changes: 3 additions & 0 deletions docs/assets/images/multiprocessing.svg
4 changes: 2 additions & 2 deletions docs/assets/stylesheets/extra.css
@@ -155,6 +155,6 @@ body, input {
margin-top: 1.5rem;
}

.references {

.doc td > code {
word-break: normal;
}
61 changes: 61 additions & 0 deletions docs/inference.md
@@ -0,0 +1,61 @@
# Inference

Once you have obtained a pipeline, either by composing rule-based components, training a model, or loading a model from disk, you can use it to make predictions on documents. This is referred to as inference.

## Inference on a single document

In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:

- a sequence of bytes
- or a [PDFDoc][edspdf.structures.PDFDoc] object

```python
from pathlib import Path

pipeline = ...
content = Path("path/to/.pdf").read_bytes()
doc = pipeline(content)
```

If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline. To leverage multiple GPUs, refer to the [multiprocessing accelerator][edspdf.accelerators.multiprocessing.MultiprocessingAccelerator] description below.

```python
pipeline.to("cuda") # same semantics as pytorch
doc = pipeline(content)
```

## Inference on multiple documents

When processing multiple documents, it is usually more efficient to use the `pipeline.pipe(...)` method, especially when using deep learning components, since this allows matrix multiplications to be batched together. Depending on your computational resources and requirements, EDS-PDF comes with various "accelerators" to speed up inference (see the [Accelerators](#accelerators) section for more details). By default, the `.pipe()` method uses the [`simple` accelerator][edspdf.accelerators.simple.SimpleAccelerator], but you can switch to a different one by passing the `accelerator` argument.

```python
pipeline = ...
docs = pipeline.pipe(
    [content1, content2, ...],
    batch_size=16,  # optional, defaults to the one defined in the pipeline
    accelerator=my_accelerator,
)
```
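
To make the batching idea concrete, here is a minimal, self-contained sketch of how an iterable of inputs can be grouped into fixed-size batches before being handed to a model. The `batchify` helper and `fake_model` function are hypothetical stand-ins written for illustration; they are not part of the EDS-PDF API, and the actual `pipe` implementation may differ.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batchify(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `batch_size` items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(items)
    while True:
        # Pull up to `batch_size` items; an empty list means the input is exhausted
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_model(batch: List[bytes]) -> List[int]:
    """Stand-in for a pipeline: handles a whole batch in a single call."""
    return [len(content) for content in batch]


contents = [b"a", b"bb", b"ccc", b"dddd", b"eeeee"]
# Five inputs with batch_size=2 lead to three model calls instead of five
results = [out for batch in batchify(contents, 2) for out in fake_model(batch)]
```

The helper consumes its input lazily, so it also works with generators that stream documents from disk.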

The `pipe` method supports the following arguments:

::: edspdf.pipeline.Pipeline.pipe
    options:
        heading_level: 3
        only_parameters: true

## Accelerators

### Simple accelerator {: #edspdf.accelerators.simple.SimpleAccelerator }

::: edspdf.accelerators.simple.SimpleAccelerator
    options:
        heading_level: 3
        only_class_level: true

### Multiprocessing accelerator {: #edspdf.accelerators.multiprocessing.MultiprocessingAccelerator }

::: edspdf.accelerators.multiprocessing.MultiprocessingAccelerator
    options:
        heading_level: 3
        only_class_level: true
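
The base `Accelerator` added in this PR accepts `to_doc` and `from_doc` converters (see `edspdf/accelerators/base.py`). The sketch below illustrates that conversion idea in a self-contained way. It assumes a stand-in `Doc` class instead of `PDFDoc`, and the names `Doc`, `DictToDoc` and `DocToDict` are hypothetical. It also swaps the `exec`-generated function used in the PR for `operator.attrgetter`, so it should be read as an illustration rather than the library's implementation.

```python
from operator import attrgetter
from typing import Any, Dict, Optional


class Doc:
    """Stand-in for PDFDoc (assumption: only `content` and `id` matter here)."""

    def __init__(self, content: bytes, id: Optional[str] = None):
        self.content = content
        self.id = id


class DictToDoc:
    """Build a Doc from a dict by reading the configured fields."""

    def __init__(self, content_field: str, id_field: Optional[str] = None):
        self.content_field = content_field
        self.id_field = id_field

    def __call__(self, item: Any) -> Any:
        if isinstance(item, dict):
            return Doc(
                content=item[self.content_field],
                id=item[self.id_field] if self.id_field else None,
            )
        return item  # non-dict inputs are passed through unchanged


class DocToDict:
    """Map output field names to (possibly dotted) doc attribute paths."""

    def __init__(self, mapping: Dict[str, str]):
        self.mapping = mapping

    def __call__(self, doc: Any) -> Dict[str, Any]:
        return {key: attrgetter(path)(doc) for key, path in self.mapping.items()}


to_doc = DictToDoc("content", id_field="name")
from_doc = DocToDict({"doc_id": "id", "raw": "content"})

item = {"name": "doc-1", "content": b"%PDF-1.7"}
doc = to_doc(item)      # dict -> Doc
record = from_doc(doc)  # Doc -> dict with renamed fields
```

Keeping the converters as small classes that store only their configuration makes them easy to rebuild in another process, which is presumably the motivation for the `__reduce__` method in the PR's `FromDocToDictFields`.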
2 changes: 2 additions & 0 deletions docs/pipeline.md
@@ -57,6 +57,8 @@ model(pdf_bytes)
model.pipe([pdf_bytes, ...])
```

For more information on how to use the pipeline, refer to the [Inference](../inference) page.

## Hybrid models

EDS-PDF was designed to facilitate the training and inference of hybrid models that
Empty file added edspdf/accelerators/__init__.py
Empty file.
97 changes: 97 additions & 0 deletions edspdf/accelerators/base.py
@@ -0,0 +1,97 @@
from typing import TYPE_CHECKING, Any, Callable, Dict, Iterable, Union

from ..structures import PDFDoc


class FromDictFieldsToDoc:
    def __init__(self, content_field, id_field=None):
        self.content_field = content_field
        self.id_field = id_field

    def __call__(self, item):
        if isinstance(item, dict):
            return PDFDoc(
                content=item[self.content_field],
                id=item[self.id_field] if self.id_field else None,
            )
        return item


class ToDoc:
    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, value, config=None):
        if isinstance(value, str):
            value = {"content_field": value}
        if isinstance(value, dict):
            value = FromDictFieldsToDoc(**value)
        if callable(value):
            return value
        raise TypeError(
            f"Invalid entry {value} ({type(value)}) for ToDoc, "
            f"expected string, a dict or a callable."
        )


FROM_DOC_TO_DICT_FIELDS_TEMPLATE = """
def fn(doc):
    return {X}
"""


class FromDocToDictFields:
    def __init__(self, mapping):
        self.mapping = mapping
        dict_fields = ", ".join(f"{repr(k)}: doc.{v}" for k, v in mapping.items())
        local_vars = {}
        exec(FROM_DOC_TO_DICT_FIELDS_TEMPLATE.replace("X", dict_fields), local_vars)
        self.fn = local_vars["fn"]

    def __reduce__(self):
        return FromDocToDictFields, (self.mapping,)

    def __call__(self, doc):
        return self.fn(doc)


class FromDoc:
    """
    A FromDoc converter (from a PDFDoc to an arbitrary type) can be either:

    - a dict mapping field names to doc attributes
    - a callable that takes a PDFDoc and returns an arbitrary type
    """

    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, value, config=None):
        if isinstance(value, dict):
            value = FromDocToDictFields(value)
        if callable(value):
            return value
        raise TypeError(
            f"Invalid entry {value} ({type(value)}) for FromDoc, "
            f"expected dict or callable"
        )


class Accelerator:
    def __call__(
        self,
        inputs: Iterable[Any],
        model: Any,
        to_doc: ToDoc = FromDictFieldsToDoc("content"),
        from_doc: FromDoc = lambda doc: doc,
    ):
        raise NotImplementedError()


if TYPE_CHECKING:
    ToDoc = Union[str, Dict[str, Any], Callable[[Any], PDFDoc]]  # noqa: F811
    FromDoc = Union[Dict[str, Any], Callable[[PDFDoc], Any]]  # noqa: F811