aphp · percevalw · Feb 24, 2024 · Feb 7, 2024 · Feb 7, 2024 · Feb 8, 2024
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -31,6 +31,13 @@ jobs:
           python-version: ${{ matrix.python-version }}
           architecture: x64
 
+      - name: Cache HuggingFace Models
+        uses: actions/cache@v2
+        id: cache-huggingface
+        with:
+          path: ~/.cache/huggingface/
+          key: ${{ matrix.python-version }}-huggingface
+
       - name: Install hatch
         run: pip install hatch
 

diff --git a/.gitignore b/.gitignore
@@ -63,7 +63,7 @@ report.xml
 *.pickle
 *.joblib
 *.pdf
-data/
+/data/
 
 # MkDocs output
 docs/reference

diff --git a/changelog.md b/changelog.md
@@ -1,5 +1,33 @@
 # Changelog
 
+
+
+## v0.9.0
+
+### Added
+
+- New unified `edspdf.data` API (pdf files, pandas, parquet) and LazyCollection object
+  to efficiently read / write data from / to different formats & sources. This API
+  has been heavily inspired by the `edsnlp.data` API.
+- New unified processing API to select the execution backend via `data.set_processing(...)`
+  to replace the old `accelerators` API (which is now deprecated, but still available).
+- `huggingface-embedding` now supports quantization and other `AutoModel.from_pretrained` kwargs
+- It is now possible to convert a label to multiple labels in the `simple-aggregator` component:
+
+```ini
+# To build the "text" field, we will aggregate "title", "body" and "table" lines,
+# and output "title" lines in a separate field as well.
+label_map = {
+    "text" : [ "title", "body", "table" ],
+    "title": "title",
+    }
+```
+
+### Fixed
+
+- `huggingface-embedding` now resizes bbox features for large PDFs, instead of making the model crash
+- `huggingface-embedding` and `sub-box-cnn-pooler` now handle empty PDFs correctly
+
 ## v0.8.1
 
 ### Fixed

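The one-to-many `label_map` behaviour described in the v0.9.0 changelog entry above can be illustrated with a minimal sketch. This is not the `simple-aggregator` implementation: `aggregate_lines` is a hypothetical helper, and representing lines as `(label, text)` pairs joined by newlines is an assumption made only for illustration.

```python
from collections import defaultdict


def aggregate_lines(lines, label_map):
    """Hypothetical helper (not part of edspdf): group (label, text) pairs
    into output fields, where one source label may feed several fields."""
    # A plain string source is treated as a single-element list
    normalized = {
        field: [sources] if isinstance(sources, str) else list(sources)
        for field, sources in label_map.items()
    }
    fields = defaultdict(list)
    for label, text in lines:
        for field, sources in normalized.items():
            if label in sources:
                fields[field].append(text)
    # Assumption: texts of a field are simply joined with newlines
    return {field: "\n".join(texts) for field, texts in fields.items()}


lines = [("title", "Report"), ("body", "Patient is well."), ("footer", "Page 1")]
label_map = {"text": ["title", "body", "table"], "title": "title"}
result = aggregate_lines(lines, label_map)
```

Here the "title" line ends up in both the `text` and `title` fields, while the "footer" line is dropped because no field lists it.
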
diff --git a/docs/assets/images/multiprocessing.png b/docs/assets/images/multiprocessing.png
diff --git a/docs/assets/images/multiprocessing.svg b/docs/assets/images/multiprocessing.svg
diff --git a/docs/index.md b/docs/index.md
@@ -99,12 +99,10 @@ See the [rule-based recipe](recipes/rule-based.md) for a step-by-step explanatio
 If you use EDS-PDF, please cite us as below.
 
 ```bibtex
-@software{edspdf,
-  author  = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
-  doi     = {10.5281/zenodo.6902977},
-  license = {BSD-3-Clause},
-  title   = {{EDS-PDF: Smart text extraction from PDF documents}},
-  url     = {https://github.com/aphp/edspdf}
+@article{gerardin_wajsburt_pdf,
+  title={Bridging Clinical PDFs and Downstream Natural Language Processing: An Efficient Neural Approach to Layout Segmentation},
+  author={G{\'e}rardin, Christel Ducroz and Wajsburt, Perceval and Dura, Basile and Calliger, Alice and Mouchet, Alexandre and Tannier, Xavier and Bey, Romain},
+  journal={Available at SSRN 4587624}
 }
 ```
 

diff --git a/docs/inference.md b/docs/inference.md
@@ -1,61 +1,124 @@
 # Inference
 
-Once you have obtained a pipeline, either by composing rule-based components, training a model or loading a model from the disk, you can use it to make predictions on documents. This is referred to as inference.
+Once you have obtained a pipeline, either by composing rule-based components, training a model or loading a model from the disk, you can use it to make predictions on documents. This is referred to as inference. This page answers the following questions :
+
+> How do we leverage computational resources to run a model on many documents?
+
+> How do we connect to various data sources to retrieve documents?
+
 
 ## Inference on a single document
 
-In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:
+In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:
 
-- a sequence of bytes
-- or a [PDFDoc][edspdf.structures.PDFDoc] object
+- a bytes string
+- or a [PDFDoc][edspdf.structures.PDFDoc] object
 
-```python
+```{ .python .no-check }
 from pathlib import Path
 
-pipeline = ...
-content = Path("path/to/.pdf").read_bytes()
-doc = pipeline(content)
+model = ...
+pdf_bytes = b"..."
+doc = model(pdf_bytes)
 ```
 
-If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline. To leverage multiple GPUs, refer to the [multiprocessing accelerator][edspdf.accelerators.multiprocessing.MultiprocessingAccelerator] description below.
+If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline.
 
-```python
-pipeline.to("cuda")  # same semantics as pytorch
-doc = pipeline(content)
+```{ .python .no-check }
+model.to("cuda")  # same semantics as pytorch
+doc = model(pdf_bytes)
 ```
 
-## Inference on multiple documents
+To leverage multiple GPUs when processing multiple documents, refer to the [multiprocessing backend][edspdf.processing.multiprocessing.execute_multiprocessing_backend] description below.
+
+## Inference on multiple documents {: #edspdf.lazy_collection.LazyCollection }
+
+When processing multiple documents, we can optimize the inference by parallelizing the computation on a single core, multiple cores and GPUs or even multiple machines.
+
+### Lazy collection
+
+These optimizations are enabled by performing *lazy inference* : the operations (e.g., reading a document, converting it to a PDFDoc, running the different pipes of a model or writing the result somewhere) are not executed immediately but are instead scheduled in a [LazyCollection][edspdf.lazy_collection.LazyCollection] object. It can then be executed by calling the `execute` method, iterating over it or calling a writing method (e.g., `to_pandas`). In fact, data connectors like `edspdf.data.read_files` return a lazy collection, as well as the `model.pipe` method.
+
+A lazy collection contains :
+
+- a `reader`: the source of the data (e.g., a file, a database, a list of strings, etc.)
+- the list of operations to perform under a `pipeline` attribute containing the name if any, function / pipe, keyword arguments and context for each operation
+- an optional `writer`: the destination of the data (e.g., a file, a database, a list of strings, etc.)
+- the execution `config`, containing the backend to use and its configuration such as the number of workers, the batch size, etc.
+
+All methods (`.map`, `.map_pipeline`, `.set_processing`) of the lazy collection are chainable, meaning that they return a new object (no in-place modification).
+
+For instance, the following code will load a model, read a folder of PDF files, apply the model to each document and write the result in a Parquet folder, using 4 CPUs and 2 GPUs.
 
-When processing multiple documents, it is usually more efficient to use the `pipeline.pipe(...)` method, especially when using deep learning components, since this allow matrix multiplications to be batched together. Depending on your computational resources and requirements, EDS-PDF comes with various "accelerators" to speed up inference (see the [Accelerators](#accelerators) section for more details). By default, the `.pipe()` method uses the [`simple` accelerator][edspdf.accelerators.simple.SimpleAccelerator] but you can switch to a different one by passing the `accelerator` argument.
+```{ .python .no-check }
+import edspdf
 
-```python
-pipeline = ...
-docs = pipeline.pipe(
-    [content1, content2, ...],
-    batch_size=16,  # optional, default to the one defined in the pipeline
-    accelerator=my_accelerator,
+# Load or create a model, for instance following the "Recipes"
+model = edspdf.load("path/to/model")
+
+# Read some data (this is lazy, no data will be read until the end of this snippet)
+data = edspdf.data.read_files(
+    "/Users/perceval/Development/edspdf/tests/resources/",
+    # dict to doc converter function
+    converter=lambda x: PDFDoc(id=x["id"], content=x["content"]),
+)
+
+# Apply each pipe of the model to our documents
+data = data.map_pipeline(model)
+# or equivalently : data = model.pipe(data)
+
+# Configure the execution
+data = data.set_processing(
+    # 4 CPUs to parallelize rule-based pipes, IO and preprocessing
+    num_cpu_workers=4,
+    # 2 GPUs to accelerate deep-learning pipes
+    num_gpu_workers=2,
+)
+
+# Write the result, this will execute the lazy collection
+data.write_parquet(
+    "path/to/output_folder",
+    # doc to dict converter function
+    converter=lambda doc: {
+        "id": doc.id,
+        "text": (
+            doc.aggregated_texts["body"].text
+            if "body" in doc.aggregated_texts
+            else ""
+        ),
+    },
 )
 ```
 
-The `pipe` method supports the following arguments :
+### Applying operations to a lazy collection
+
+To apply an operation to a lazy collection, you can use the `.map` method. It takes a callable as input and an optional dictionary of keyword arguments. The function will be applied to each element of the collection.
+
+To apply a model, you can use the `.map_pipeline` method. It takes a model as input and will add every pipe of the model to the scheduled operations.
+
+In both cases, the operations will not be executed immediately but will be scheduled to be executed when iterating over the collection, or calling the `.execute`, `.to_*` or `.write_*` methods.
+
+### Execution of a lazy collection {: #edspdf.lazy_collection.LazyCollection.set_processing }
+
+You can configure how the operations performed in the lazy collection are executed by calling its `set_processing(...)` method. The following options are available :
 
-::: edspdf.pipeline.Pipeline.pipe
+::: edspdf.lazy_collection.LazyCollection.set_processing
     options:
         heading_level: 3
-        only_parameters: true
+        only_parameters: "no-header"
 
-## Accelerators
+## Backends
 
-### Simple accelerator {: #edspdf.accelerators.simple.SimpleAccelerator }
+### Simple backend {: #edspdf.processing.simple.execute_simple_backend }
 
-::: edspdf.accelerators.simple.SimpleAccelerator
+::: edspdf.processing.simple.execute_simple_backend
     options:
         heading_level: 3
-        only_class_level: true
+        show_source: false
 
-### Multiprocessing accelerator {: #edspdf.accelerators.multiprocessing.MultiprocessingAccelerator }
+### Multiprocessing backend {: #edspdf.processing.multiprocessing.execute_multiprocessing_backend }
 
-::: edspdf.accelerators.multiprocessing.MultiprocessingAccelerator
+::: edspdf.processing.multiprocessing.execute_multiprocessing_backend
     options:
         heading_level: 3
-        only_class_level: true
+        show_source: false
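
The lazy, chainable behaviour that the `docs/inference.md` changes above describe (operations are scheduled, each method returns a new object, and nothing runs until the collection is consumed) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the `LazyCollection` implementation: `LazySketch` and `shout` are hypothetical names, and no backend, batching or writer is modelled.

```python
class LazySketch:
    """Hypothetical stand-in for a lazy collection (not the edspdf class)."""

    def __init__(self, reader, operations=()):
        self.reader = reader  # any iterable acting as the data source
        self.operations = tuple(operations)  # scheduled (function, kwargs) pairs

    def map(self, fn, kwargs=None):
        # Return a new object instead of modifying this one in place
        return LazySketch(self.reader, self.operations + ((fn, kwargs or {}),))

    def __iter__(self):
        # Scheduled operations only run here, when the collection is consumed
        for item in self.reader:
            for fn, kwargs in self.operations:
                item = fn(item, **kwargs)
            yield item


calls = []  # records when the operation really runs


def shout(text, suffix="!"):
    calls.append(text)
    return text.upper() + suffix


base = LazySketch(["a", "b"])
scheduled = base.map(shout, {"suffix": "?"})
calls_before_iteration = list(calls)  # still empty: nothing has executed yet
results = list(scheduled)  # iterating triggers the scheduled operations
```

Because `map` returns a new object, `base` keeps an empty list of operations and can be reused to build a different chain.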