feat: add pipeline packaging utils
percevalw committed Sep 2, 2023
1 parent 95ba47c commit f47f69e
Showing 9 changed files with 619 additions and 23 deletions.
1 change: 1 addition & 0 deletions .github/workflows/tests.yml
@@ -45,6 +45,7 @@ jobs:
      - name: Install dependencies
        run: |
          pip install -e '.[dev]'
          pip install poetry build
      - name: Test with Pytest on Python ${{ matrix.python-version }}
        run: python -m pytest --cov edspdf --cov-report xml
1 change: 1 addition & 0 deletions changelog.md
@@ -6,6 +6,7 @@

- Add multi-modal transformers (`huggingface-embedding`) with windowing options
- Add `render_page` option to `pdfminer` extractor, for multi-modal PDF features
- Packaging utils (`pipeline.package(...)`) to make a pip installable package from a pipeline

### Changed

30 changes: 29 additions & 1 deletion docs/pipeline.md
@@ -11,6 +11,8 @@ Processing PDFs usually involves many steps such as extracting lines, running OC
can use any technology in static components, we do not provide tools to train
components built with other deep learning frameworks.

## Creating a pipeline

A pipe is a processing block (like a function) that applies a transformation on its input and returns a modified object.

At the moment, four types of pipes are implemented in the library:
@@ -57,7 +59,33 @@ model(pdf_bytes)
model.pipe([pdf_bytes, ...])
```

## Hybrid models
### Hybrid models

EDS-PDF was designed to facilitate the training and inference of hybrid models that
arbitrarily chain static components or trained deep learning components. Static components are callable objects that take a PDFDoc object as input, perform arbitrary transformations over the input, and return the modified object. [Trainable pipes][edspdf.trainable_pipe.TrainablePipe], on the other hand, allow for deep learning operations to be performed on the [PDFDoc][edspdf.structures.PDFDoc] object and must be trained to be used.
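
As a rough illustration of the "callable object" contract described above, here is a minimal sketch of a static component. `StandInDoc` is a hypothetical stand-in for `PDFDoc` so that the snippet runs without EDS-PDF installed, and `StripBlankLines` is an invented example, not a component shipped with the library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StandInDoc:
    """Hypothetical stand-in for edspdf.structures.PDFDoc (illustration only)."""

    lines: List[str] = field(default_factory=list)


class StripBlankLines:
    """A static component: a plain callable with nothing to train."""

    def __call__(self, doc: StandInDoc) -> StandInDoc:
        # Transform the input in place, then return the modified object
        doc.lines = [line for line in doc.lines if line.strip()]
        return doc


component = StripBlankLines()
doc = component(StandInDoc(lines=["Title", "   ", "Body"]))
print(doc.lines)
```

A trainable pipe would follow the same input/output contract but would additionally hold parameters that must be trained before use.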

## Saving and loading a pipeline

Pipelines can be saved and loaded using the `save` and `load` methods. The saved pipeline is not a pickled object but a folder containing the config file, the weights and extra resources for each pipeline. This allows for easy inspection and modification of the pipeline, and avoids the execution of arbitrary code when loading a pipeline.

```python
model.save("path/to/your/model")
model = edspdf.load("path/to/your/model")
```

To share the pipeline and turn it into a pip-installable package, you can use the `package` method, which will use or create a `pyproject.toml` file, fill it accordingly, and create a wheel file. At the moment, we only support the poetry package manager.

```python
model.package(
    name="path/to/your/package",
    version="0.0.1",
    root_dir="path/to/project/root",  # optional, to retrieve an existing pyproject.toml file
    # if you don't have a pyproject.toml, you can provide the metadata here instead
    metadata=dict(
        authors="Firstname Lastname <[email protected]>",
        description="A short description of your package",
    ),
)
```

This will create a wheel file in the `root_dir/dist` folder, which you can share and install with pip.
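
To make the last step concrete, the sketch below locates the newest wheel in a project's `dist` folder and builds the matching `pip install` command. It assumes only the `dist` layout described above; `find_wheel` and `pip_install_command` are hypothetical helper names, not part of EDS-PDF.

```python
import sys
from pathlib import Path
from typing import List


def find_wheel(root_dir: str) -> Path:
    """Return the most recently modified wheel in <root_dir>/dist."""
    dist_dir = Path(root_dir) / "dist"
    wheels = sorted(dist_dir.glob("*.whl"), key=lambda path: path.stat().st_mtime)
    if not wheels:
        raise FileNotFoundError(f"no wheel found in {dist_dir}")
    return wheels[-1]


def pip_install_command(wheel: Path) -> List[str]:
    """Build the pip command for the current interpreter (does not run it)."""
    return [sys.executable, "-m", "pip", "install", str(wheel)]
```

The resulting command list can be passed to `subprocess.run(command, check=True)` to perform the installation in the current environment.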
32 changes: 32 additions & 0 deletions edspdf/pipeline.py
@@ -28,8 +28,10 @@
from confit.utils.collections import join_path, split_path
from confit.utils.xjson import Reference
from tqdm import tqdm
from typing_extensions import Literal

import edspdf
from build import ConfigSettingsType

from .registry import CurriedFactory, registry
from .structures import PDFDoc
@@ -40,6 +42,7 @@
decompress_dict,
multi_tee,
)
from .utils.package import ModuleName, package

EMPTY_LIST = FrozenList()

@@ -944,6 +947,35 @@ def select_pipes(
yield self
self._disabled = disabled_before

    def package(
        self,
        name: ModuleName,
        root_dir: Union[str, Path] = ".",
        artifacts_name: ModuleName = "artifacts",
        check_dependencies: bool = False,
        project_type: Optional[Literal["poetry", "setuptools"]] = None,
        version: str = "0.1.0",
        metadata: Optional[Dict[str, Any]] = {},
        distributions: Optional[Sequence[Literal["wheel", "sdist"]]] = ["wheel"],
        config_settings: Optional[ConfigSettingsType] = None,
        isolation: bool = True,
        skip_build_dependency_check: bool = False,
    ):
        return package(
            pipeline=self,
            name=name,
            root_dir=root_dir,
            artifacts_name=artifacts_name,
            check_dependencies=check_dependencies,
            project_type=project_type,
            version=version,
            metadata=metadata,
            distributions=distributions,
            config_settings=config_settings,
            isolation=isolation,
            skip_build_dependency_check=skip_build_dependency_check,
        )


def load(
config: Union[Path, str, Config],