feat: add pipeline packaging utils
percevalw committed Sep 7, 2023
1 parent 320a062 commit 6888d1b
Showing 10 changed files with 631 additions and 37 deletions.
21 changes: 8 additions & 13 deletions .github/workflows/tests.yml
@@ -10,20 +10,14 @@ jobs:
Linting:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- name: Set PY variable
run: echo "PY=$(python -VV | sha256sum | cut -d' ' -f1)" >> $GITHUB_ENV
- uses: actions/cache@v2
- uses: actions/checkout@v3
with:
path: ~/.cache/pre-commit
key: pre-commit|${{ env.PY }}|${{ hashFiles('.pre-commit-config.yaml') }}
- name: Install pre-commit
run: |
pip install pre-commit
pre-commit install
- name: Run pre-commit
run: SKIP=no-commit-to-branch pre-commit run --all-files
# required to grab the history of the PR
fetch-depth: 0
- uses: actions/setup-python@v3
- uses: pre-commit/[email protected]
with:
extra_args: --color=always --from-ref ${{ github.event.pull_request.base.sha }} --to-ref ${{ github.event.pull_request.head.sha }}

Pytest:
runs-on: ubuntu-latest
@@ -45,6 +39,7 @@ jobs:
- name: Install dependencies
run: |
pip install -e '.[dev]'
pip install poetry build
- name: Test with Pytest on Python ${{ matrix.python-version }}
run: python -m pytest --cov edspdf --cov-report xml
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -18,7 +18,7 @@ repos:
# ruff
- repo: https://github.com/charliermarsh/ruff-pre-commit
# Ruff version.
rev: 'v0.0.245'
rev: 'v0.0.287'
hooks:
- id: ruff
args: ['--config', 'pyproject.toml']
1 change: 1 addition & 0 deletions changelog.md
@@ -7,6 +7,7 @@
- Add multi-modal transformers (`huggingface-embedding`) with windowing options
- Add `render_page` option to `pdfminer` extractor, for multi-modal PDF features
- Add inference utilities (`accelerators`), with simple mono process support and multi gpu / cpu support
- Add packaging utils (`pipeline.package(...)`) to make a pip-installable package from a pipeline

### Changed

30 changes: 29 additions & 1 deletion docs/pipeline.md
@@ -11,6 +11,8 @@ Processing PDFs usually involves many steps such as extracting lines, running OC
can use any technology in static components, we do not provide tools to train
components built with other deep learning frameworks.

## Creating a pipeline

A pipe is a processing block (like a function) that applies a transformation on its input and returns a modified object.

At the moment, four types of pipes are implemented in the library:
@@ -59,7 +61,33 @@ model.pipe([pdf_bytes, ...])

For more information on how to use the pipeline, refer to the [Inference](/inference) page.

## Hybrid models
### Hybrid models

EDS-PDF was designed to facilitate the training and inference of hybrid models that
arbitrarily chain static components or trained deep learning components. Static components are callable objects that take a PDFDoc object as input, perform arbitrary transformations over the input, and return the modified object. [Trainable pipes][edspdf.trainable_pipe.TrainablePipe], on the other hand, allow for deep learning operations to be performed on the [PDFDoc][edspdf.structures.PDFDoc] object and must be trained to be used.
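
As a minimal sketch of the "callable object" contract described above, the example below chains one static component over a document. `FakeDoc` is a hypothetical stand-in for `PDFDoc`, and `StripEmptyLines` and `run_chain` are illustrative names that do not exist in EDS-PDF; the sketch only shows the shape of the contract (take a document, transform it, return it).

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FakeDoc:
    """Hypothetical stand-in for edspdf's PDFDoc, used only for this sketch."""

    lines: List[str] = field(default_factory=list)


class StripEmptyLines:
    """Static component: a callable that takes a document and returns it modified."""

    def __call__(self, doc: FakeDoc) -> FakeDoc:
        # Keep only the lines that contain non-whitespace characters.
        doc.lines = [line for line in doc.lines if line.strip()]
        return doc


def run_chain(doc: FakeDoc, pipes: List[Callable[[FakeDoc], FakeDoc]]) -> FakeDoc:
    # Apply each pipe in order, feeding the output of one into the next.
    for pipe in pipes:
        doc = pipe(doc)
    return doc
```

A trainable pipe would follow the same call contract but would need to be trained before use, which this sketch does not cover.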

## Saving and loading a pipeline

Pipelines can be saved and loaded using the `save` and `load` methods. The saved pipeline is not a pickled object but a folder containing the config file, the weights and extra resources for each pipeline. This allows for easy inspection and modification of the pipeline, and avoids the execution of arbitrary code when loading a pipeline.

```python
model.save("path/to/your/model")
model = edspdf.load("path/to/your/model")
```
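
Because the saved pipeline is a plain folder, it can be inspected with standard tools. The sketch below lists every file under such a folder; `list_saved_files` is a hypothetical helper, not an EDS-PDF function, and the exact file layout produced by `save` is not specified here, so no particular file names are assumed.

```python
from pathlib import Path
from typing import List


def list_saved_files(model_dir: str) -> List[str]:
    """Return the relative paths of all files under a saved pipeline folder."""
    root = Path(model_dir)
    # rglob("*") walks the folder recursively; directories are filtered out.
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )
```

Calling `list_saved_files("path/to/your/model")` after `model.save(...)` would show what was written to disk.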

To share the pipeline as a pip-installable package, use the `package` method. It uses an existing `pyproject.toml` file or creates one, fills it accordingly, and builds a wheel file. At the moment, we only support the poetry package manager.

```python
model.package(
    name="path/to/your/package",
    version="0.0.1",
    root_dir="path/to/project/root",  # optional, to retrieve an existing pyproject.toml file
    # if you don't have a pyproject.toml, you can provide the metadata here instead
    metadata=dict(
        authors="Firstname Lastname <[email protected]>",
        description="A short description of your package",
    ),
)
```

This will create a wheel file in the `root_dir/dist` folder, which you can share and install with pip.
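
As a rough sketch of the install step, the helpers below locate the built wheel under `root_dir/dist` and build the matching pip command. `find_wheels` and `pip_install_command` are hypothetical names introduced for illustration; the wheel file name depends on your package name and version, so it is discovered with a glob rather than hard-coded.

```python
import sys
from pathlib import Path
from typing import List


def find_wheels(root_dir: str) -> List[Path]:
    """Return the wheel files found in root_dir/dist, sorted by name."""
    return sorted((Path(root_dir) / "dist").glob("*.whl"))


def pip_install_command(wheel: Path) -> List[str]:
    """Build a pip install command for the given wheel."""
    # Use the current interpreter so the package lands in the active environment.
    return [sys.executable, "-m", "pip", "install", str(wheel)]
```

The returned command can be passed to `subprocess.run(cmd, check=True)`, or the wheel path can be given to `pip install` directly from a shell.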
33 changes: 33 additions & 0 deletions edspdf/pipeline.py
@@ -13,6 +13,7 @@
Dict,
Iterable,
List,
Mapping,
Optional,
Sequence,
Set,
@@ -25,6 +26,7 @@
from confit.errors import ConfitValidationError, patch_errors
from confit.utils.collections import join_path, split_path
from confit.utils.xjson import Reference
from typing_extensions import Literal

import edspdf

@@ -857,6 +859,37 @@ def __exit__(ctx_self, type, value, traceback):
self._disabled = disable
return context()

    def package(
        self,
        name: Optional[str] = None,
        root_dir: Union[str, Path] = ".",
        artifacts_name: str = "artifacts",
        check_dependencies: bool = False,
        project_type: Optional[Literal["poetry", "setuptools"]] = None,
        version: str = "0.1.0",
        metadata: Optional[Dict[str, Any]] = {},
        distributions: Optional[Sequence[Literal["wheel", "sdist"]]] = ["wheel"],
        config_settings: Optional[Mapping[str, Union[str, Sequence[str]]]] = None,
        isolation: bool = True,
        skip_build_dependency_check: bool = False,
    ):
        from .utils.package import package

        return package(
            pipeline=self,
            name=name,
            root_dir=root_dir,
            artifacts_name=artifacts_name,
            check_dependencies=check_dependencies,
            project_type=project_type,
            version=version,
            metadata=metadata,
            distributions=distributions,
            config_settings=config_settings,
            isolation=isolation,
            skip_build_dependency_check=skip_build_dependency_check,
        )
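
Review note: the `metadata` and `distributions` parameters default to `{}` and `["wheel"]`. Python evaluates default values once, at function definition, so a mutable default is shared by every call that does not override it. Whether this matters here depends on whether the `package` helper mutates these arguments, which this diff does not show. A minimal sketch of the usual `None`-sentinel alternative follows; `normalize_package_args` is a hypothetical helper that is not part of edspdf, and it assumes `None` should mean "use the default".

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple


def normalize_package_args(
    metadata: Optional[Dict[str, Any]] = None,
    distributions: Optional[Sequence[str]] = None,
) -> Tuple[Dict[str, Any], List[str]]:
    # Build fresh objects on every call instead of sharing one default instance.
    # Assumption: None means "use the default" for both arguments.
    metadata = dict(metadata) if metadata is not None else {}
    distributions = list(distributions) if distributions is not None else ["wheel"]
    return metadata, distributions
```

With this pattern, mutating the returned dict or list in one call cannot leak into the next call.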


def load(
config: Union[Path, str, Config],