feat: add multiprocessing accelerator with tests & docs
percevalw committed Sep 7, 2023
1 parent 46c4d46 commit e493d7b
Showing 11 changed files with 758 additions and 32 deletions.
1 change: 1 addition & 0 deletions changelog.md
@@ -6,6 +6,7 @@

- Add multi-modal transformers (`huggingface-embedding`) with windowing options
- Add `render_page` option to `pdfminer` extractor, for multi-modal PDF features
- Add inference utilities (`accelerators`), with simple mono-process support and multi-GPU / multi-CPU support

### Changed

3 changes: 3 additions & 0 deletions docs/assets/images/multiprocessing.svg
4 changes: 2 additions & 2 deletions docs/assets/stylesheets/extra.css
@@ -155,6 +155,6 @@ body, input {
     margin-top: 1.5rem;
 }
 
-.references {
-
+.doc td > code {
     word-break: normal;
 }
61 changes: 61 additions & 0 deletions docs/inference.md
@@ -0,0 +1,61 @@
# Inference

Once you have obtained a pipeline, either by composing rule-based components, by training a model, or by loading one from disk, you can use it to make predictions on documents. This is referred to as inference.

## Inference on a single document

In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:

- a sequence of bytes
- or a [PDFDoc][edspdf.structures.PDFDoc] object

```python
from pathlib import Path

pipeline = ...
content = Path("path/to/.pdf").read_bytes()
doc = pipeline(content)
```

If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline. To leverage multiple GPUs, refer to the [multiprocessing accelerator][edspdf.accelerators.multiprocessing.MultiprocessingAccelerator] description below.

```python
pipeline.to("cuda") # same semantics as pytorch
doc = pipeline(content)
```

## Inference on multiple documents

When processing multiple documents, it is usually more efficient to use the `pipeline.pipe(...)` method, especially with deep learning components, since this allows matrix multiplications to be batched together. Depending on your computational resources and requirements, EDS-PDF comes with various "accelerators" to speed up inference (see the [Accelerators](#accelerators) section for more details). By default, the `.pipe()` method uses the [`simple` accelerator][edspdf.accelerators.simple.SimpleAccelerator], but you can switch to a different one by passing the `accelerator` argument.

```python
pipeline = ...
docs = pipeline.pipe(
[content1, content2, ...],
    batch_size=16,  # optional, defaults to the one defined in the pipeline
accelerator=my_accelerator,
)
```
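
Under the hood, batching simply groups the incoming document stream into fixed-size chunks before each deep-learning component runs its forward pass. A minimal sketch of that chunking logic (an illustration of the idea, not EDS-PDF's actual code):

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream of items into lists of at most `batch_size` elements."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

Each yielded chunk can then be processed in a single forward pass, which is where the speed-up over one-document-at-a-time calls comes from.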

The `pipe` method supports the following arguments:

::: edspdf.pipeline.Pipeline.pipe
options:
heading_level: 3
only_parameters: true

## Accelerators

### Simple accelerator {: #edspdf.accelerators.simple.SimpleAccelerator }

::: edspdf.accelerators.simple.SimpleAccelerator
options:
heading_level: 3
only_class_level: true

### Multiprocessing accelerator {: #edspdf.accelerators.multiprocessing.MultiprocessingAccelerator }

::: edspdf.accelerators.multiprocessing.MultiprocessingAccelerator
options:
heading_level: 3
only_class_level: true
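
To give an intuition of what a multiprocessing accelerator does, here is a toy sketch of the underlying scheme: several workers consume documents from a shared queue in parallel, and results are re-ordered so the output matches the input order. This illustration uses threads for brevity and is not EDS-PDF's implementation, which dispatches work to separate CPU and GPU processes:

```python
import queue
import threading
from typing import Callable, List


def parallel_map(fn: Callable, items: List, num_workers: int = 2) -> List:
    # Fill a shared task queue with (index, item) pairs so that results
    # can be written back in the original order.
    tasks: "queue.Queue" = queue.Queue()
    for idx, item in enumerate(items):
        tasks.put((idx, item))
    results = [None] * len(items)

    def worker():
        while True:
            try:
                idx, item = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, worker exits
            results[idx] = fn(item)  # each worker processes its own share

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```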
2 changes: 2 additions & 0 deletions docs/pipeline.md
@@ -57,6 +57,8 @@ model(pdf_bytes)
model.pipe([pdf_bytes, ...])
```

For more information on how to use the pipeline, refer to the [Inference](../inference) page.

## Hybrid models

EDS-PDF was designed to facilitate the training and inference of hybrid models that
43 changes: 19 additions & 24 deletions edspdf/accelerators/base.py
@@ -25,23 +25,18 @@ def __get_validators__(cls):
     @classmethod
     def validate(cls, value, config=None):
         if isinstance(value, str):
-            return FromDictFieldsToDoc(value)
-        elif isinstance(value, dict):
-            return FromDictFieldsToDoc(**value)
-        elif callable(value):
+            value = {"content_field": value}
+        if isinstance(value, dict):
+            value = FromDictFieldsToDoc(**value)
+        if callable(value):
             return value
-        else:
-            raise TypeError(
-                f"Invalid entry {value} ({type(value)}) for ToDoc, "
-                f"expected string, a dict or a callable."
-            )
+
+        raise TypeError(
+            f"Invalid entry {value} ({type(value)}) for ToDoc, "
+            f"expected string, a dict or a callable."
+        )

def identity(x):
return x
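
The rewritten `validate` above replaces the `if/elif/else` ladder with a cascade in which each step normalizes the value one level closer to a callable: a `str` is promoted to a `dict`, a `dict` to a `FromDictFieldsToDoc` instance, and any callable passes through. A self-contained sketch of the pattern, with `FromDictFieldsToDoc` stubbed out for illustration (the real class builds a `PDFDoc` from the configured fields):

```python
class FromDictFieldsToDoc:
    # Stub for illustration only: here it just reads one field of a dict.
    def __init__(self, content_field):
        self.content_field = content_field

    def __call__(self, item):
        return item[self.content_field]


def validate(value):
    if isinstance(value, str):
        value = {"content_field": value}  # promote str to dict
    if isinstance(value, dict):
        value = FromDictFieldsToDoc(**value)  # promote dict to callable
    if callable(value):
        return value
    raise TypeError(
        f"Invalid entry {value} ({type(value)}) for ToDoc, "
        f"expected string, a dict or a callable."
    )
```

Because every branch funnels into the `callable` check, the `else` clause disappears and the error path is reached only when nothing could be normalized.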


-FROM_DOC_TO_DICT_FIELDS_TEMPLATE = """\
+FROM_DOC_TO_DICT_FIELDS_TEMPLATE = """
 def fn(doc):
     return {X}
 """
@@ -50,8 +45,10 @@ def fn(doc):
 class FromDocToDictFields:
     def __init__(self, mapping):
         self.mapping = mapping
-        dict_fields = ", ".join(f"{k}: doc.{v}" for k, v in mapping.items())
-        self.fn = eval(FROM_DOC_TO_DICT_FIELDS_TEMPLATE.replace("X", dict_fields))
+        dict_fields = ", ".join(f"{repr(k)}: doc.{v}" for k, v in mapping.items())
+        local_vars = {}
+        exec(FROM_DOC_TO_DICT_FIELDS_TEMPLATE.replace("X", dict_fields), local_vars)
+        self.fn = local_vars["fn"]
 
     def __reduce__(self):
         return FromDocToDictFields, (self.mapping,)
@@ -75,14 +72,13 @@ def __get_validators__(cls):
     @classmethod
     def validate(cls, value, config=None):
         if isinstance(value, dict):
-            return FromDocToDictFields(value)
-        elif callable(value):
+            value = FromDocToDictFields(value)
+        if callable(value):
             return value
-        else:
-            raise TypeError(
-                f"Invalid entry {value} ({type(value)}) for ToDoc, "
-                f"expected dict or callable"
-            )
+        raise TypeError(
+            f"Invalid entry {value} ({type(value)}) for ToDoc, "
+            f"expected dict or callable"
+        )


class Accelerator:
@@ -92,7 +88,6 @@ def __call__(
         model: Any,
         to_doc: ToDoc = FromDictFieldsToDoc("content"),
         from_doc: FromDoc = lambda doc: doc,
-        component_cfg: Dict[str, Dict[str, Any]] = None,
     ):
         raise NotImplementedError()

