aphp · percevalw · Sep 7, 2023 · Sep 6, 2023 · Sep 6, 2023
diff --git a/changelog.md b/changelog.md
@@ -6,6 +6,7 @@
 
 - Add multi-modal transformers (`huggingface-embedding`) with windowing options
 - Add `render_page` option to `pdfminer` extractor, for multi-modal PDF features
+- Add inference utilities (`accelerators`), with simple single-process support and multi-GPU / multi-CPU support
 
 ### Changed
 

diff --git a/docs/assets/images/multiprocessing.svg b/docs/assets/images/multiprocessing.svg
diff --git a/docs/assets/stylesheets/extra.css b/docs/assets/stylesheets/extra.css
@@ -155,6 +155,6 @@ body, input {
     margin-top: 1.5rem;
 }
 
-.references {
-
+.doc td > code {
+    word-break: normal;
 }
diff --git a/docs/inference.md b/docs/inference.md
@@ -0,0 +1,61 @@
+# Inference
+
+Once you have obtained a pipeline, either by composing rule-based components, training a model, or loading a model from disk, you can use it to make predictions on documents. This is referred to as inference.
+
+## Inference on a single document
+
+In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:
+
+- a sequence of bytes
+- or a [PDFDoc][edspdf.structures.PDFDoc] object
+
+```python
+from pathlib import Path
+
+pipeline = ...
+content = Path("path/to/.pdf").read_bytes()
+doc = pipeline(content)
+```
+
+If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline. To leverage multiple GPUs, refer to the [multiprocessing accelerator][edspdf.accelerators.multiprocessing.MultiprocessingAccelerator] description below.
+
+```python
+pipeline.to("cuda")  # same semantics as pytorch
+doc = pipeline(content)
+```
+
+## Inference on multiple documents
+
+When processing multiple documents, it is usually more efficient to use the `pipeline.pipe(...)` method, especially when using deep learning components, since this allows matrix multiplications to be batched together. Depending on your computational resources and requirements, EDS-PDF comes with various "accelerators" to speed up inference (see the [Accelerators](#accelerators) section for more details). By default, the `.pipe()` method uses the [`simple` accelerator][edspdf.accelerators.simple.SimpleAccelerator], but you can switch to a different one by passing the `accelerator` argument.
+
+```python
+pipeline = ...
+docs = pipeline.pipe(
+    [content1, content2, ...],
+    batch_size=16,  # optional, defaults to the one defined in the pipeline
+    accelerator=my_accelerator,
+)
+```
+
+The `pipe` method supports the following arguments:
+
+::: edspdf.pipeline.Pipeline.pipe
+    options:
+        heading_level: 3
+        only_parameters: true
+
+## Accelerators
+
+### Simple accelerator {: #edspdf.accelerators.simple.SimpleAccelerator }
+
+::: edspdf.accelerators.simple.SimpleAccelerator
+    options:
+        heading_level: 3
+        only_class_level: true
+
+### Multiprocessing accelerator {: #edspdf.accelerators.multiprocessing.MultiprocessingAccelerator }
+
+::: edspdf.accelerators.multiprocessing.MultiprocessingAccelerator
+    options:
+        heading_level: 3
+        only_class_level: true
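
The batching that `docs/inference.md` describes for `pipeline.pipe(...)` can be pictured with a small standalone sketch. This is not EDS-PDF's implementation: `batchify` is a hypothetical helper, and the only thing taken from the docs above is the idea that inputs are grouped into chunks of `batch_size` before being processed together.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batchify(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items (hypothetical helper)."""
    iterator = iter(items)
    while True:
        # islice consumes up to batch_size items without loading the whole input
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Five inputs with batch_size=2: the last batch is simply shorter.
batches = list(batchify(range(5), batch_size=2))
```

Because the helper only pulls `batch_size` items at a time, it also works on generators, which matters when the documents are streamed rather than held in a list.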
diff --git a/docs/pipeline.md b/docs/pipeline.md
@@ -57,6 +57,8 @@ model(pdf_bytes)
 model.pipe([pdf_bytes, ...])
 ```
 
+For more information on how to use the pipeline, refer to the [Inference](../inference) page.
+
 ## Hybrid models
 
 EDS-PDF was designed to facilitate the training and inference of hybrid models that

diff --git a/edspdf/accelerators/__init__.py b/edspdf/accelerators/__init__.py
diff --git a/edspdf/accelerators/base.py b/edspdf/accelerators/base.py
@@ -0,0 +1,97 @@
+from typing import TYPE_CHECKING, Any, Callable, Dict, Iterable, Union
+
+from ..structures import PDFDoc
+
+
+class FromDictFieldsToDoc:
+    def __init__(self, content_field, id_field=None):
+        self.content_field = content_field
+        self.id_field = id_field
+
+    def __call__(self, item):
+        if isinstance(item, dict):
+            return PDFDoc(
+                content=item[self.content_field],
+                id=item[self.id_field] if self.id_field else None,
+            )
+        return item
+
+
+class ToDoc:
+    @classmethod
+    def __get_validators__(cls):
+        yield cls.validate
+
+    @classmethod
+    def validate(cls, value, config=None):
+        if isinstance(value, str):
+            value = {"content_field": value}
+        if isinstance(value, dict):
+            value = FromDictFieldsToDoc(**value)
+        if callable(value):
+            return value
+        raise TypeError(
+            f"Invalid entry {value} ({type(value)}) for ToDoc, "
+            f"expected a string, a dict or a callable."
+        )
+
+
+FROM_DOC_TO_DICT_FIELDS_TEMPLATE = """
+def fn(doc):
+    return {X}
+"""
+
+
+class FromDocToDictFields:
+    def __init__(self, mapping):
+        self.mapping = mapping
+        dict_fields = ", ".join(f"{repr(k)}: doc.{v}" for k, v in mapping.items())
+        local_vars = {}
+        exec(FROM_DOC_TO_DICT_FIELDS_TEMPLATE.replace("X", dict_fields), local_vars)
+        self.fn = local_vars["fn"]
+
+    def __reduce__(self):
+        return FromDocToDictFields, (self.mapping,)
+
+    def __call__(self, doc):
+        return self.fn(doc)
+
+
+class FromDoc:
+    """
+    A FromDoc converter (from a PDFDoc to an arbitrary type) can be either:
+
+    - a dict mapping field names to doc attributes
+    - a callable that takes a PDFDoc and returns an arbitrary type
+    """
+
+    @classmethod
+    def __get_validators__(cls):
+        yield cls.validate
+
+    @classmethod
+    def validate(cls, value, config=None):
+        if isinstance(value, dict):
+            value = FromDocToDictFields(value)
+        if callable(value):
+            return value
+        raise TypeError(
+            f"Invalid entry {value} ({type(value)}) for FromDoc, "
+            f"expected dict or callable"
+        )
+
+
+class Accelerator:
+    def __call__(
+        self,
+        inputs: Iterable[Any],
+        model: Any,
+        to_doc: ToDoc = FromDictFieldsToDoc("content"),
+        from_doc: FromDoc = lambda doc: doc,
+    ):
+        raise NotImplementedError()
+
+
+if TYPE_CHECKING:
+    ToDoc = Union[str, Dict[str, Any], Callable[[Any], PDFDoc]]  # noqa: F811
+    FromDoc = Union[Dict[str, Any], Callable[[PDFDoc], Any]]  # noqa: F811
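
To make the converter logic in `base.py` easier to follow, here is a minimal standalone sketch of the two directions. It assumes nothing from the installed package: `StubDoc` is a hypothetical stand-in for `PDFDoc`, and `normalize_to_doc` and `FieldsFromDoc` are illustrative names that mirror `ToDoc.validate` and `FromDocToDictFields` from the diff above, not part of the library's API.

```python
class StubDoc:
    """Hypothetical stand-in for edspdf.structures.PDFDoc (assumption)."""

    def __init__(self, content, id=None):
        self.content = content
        self.id = id


def normalize_to_doc(value):
    """Mirror ToDoc.validate: a str becomes a dict, a dict becomes a callable."""
    if isinstance(value, str):
        value = {"content_field": value}
    if isinstance(value, dict):
        content_field = value["content_field"]
        id_field = value.get("id_field")

        def convert(item):
            # Dict inputs are turned into docs; anything else passes through
            if isinstance(item, dict):
                return StubDoc(
                    content=item[content_field],
                    id=item[id_field] if id_field else None,
                )
            return item

        return convert
    if callable(value):
        return value
    raise TypeError(f"Invalid entry {value!r}, expected a str, dict or callable")


TEMPLATE = """
def fn(doc):
    return {X}
"""


class FieldsFromDoc:
    """Mirror FromDocToDictFields: compile a doc -> dict function once."""

    def __init__(self, mapping):
        self.mapping = mapping
        # {"doc_id": "id"} becomes the source text `'doc_id': doc.id`
        fields = ", ".join(f"{k!r}: doc.{v}" for k, v in mapping.items())
        namespace = {}
        exec(TEMPLATE.replace("X", fields), namespace)
        self.fn = namespace["fn"]

    def __reduce__(self):
        # The generated function cannot be pickled, so rebuild from the mapping
        return type(self), (self.mapping,)

    def __call__(self, doc):
        return self.fn(doc)


to_doc = normalize_to_doc({"content_field": "pdf", "id_field": "name"})
doc = to_doc({"pdf": b"%PDF-1.4", "name": "report-1"})

from_doc = FieldsFromDoc({"doc_id": "id", "raw": "content"})
record = from_doc(doc)

# Rebuild the converter the way pickle would, via __reduce__
cls, args = from_doc.__reduce__()
clone = cls(*args)
```

Since the mapping values are spliced into source code passed to `exec`, this pattern is only appropriate when the mapping comes from trusted configuration, which seems to be the intent here.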