feat: add pipeline packaging utils
percevalw committed Sep 2, 2023
1 parent 95ba47c commit 3d04436
Showing 10 changed files with 629 additions and 37 deletions.
21 changes: 8 additions & 13 deletions .github/workflows/tests.yml
@@ -10,20 +10,14 @@ jobs:
Linting:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- name: Set PY variable
run: echo "PY=$(python -VV | sha256sum | cut -d' ' -f1)" >> $GITHUB_ENV
- uses: actions/cache@v2
- uses: actions/checkout@v3
with:
path: ~/.cache/pre-commit
key: pre-commit|${{ env.PY }}|${{ hashFiles('.pre-commit-config.yaml') }}
- name: Install pre-commit
run: |
pip install pre-commit
pre-commit install
- name: Run pre-commit
run: SKIP=no-commit-to-branch pre-commit run --all-files
# required to grab the history of the PR
fetch-depth: 0
- uses: actions/setup-python@v3
- uses: pre-commit/[email protected]
with:
extra_args: --color=always --from-ref ${{ github.event.pull_request.base.sha }} --to-ref ${{ github.event.pull_request.head.sha }}

Pytest:
runs-on: ubuntu-latest
@@ -45,6 +39,7 @@ jobs:
- name: Install dependencies
run: |
pip install -e '.[dev]'
pip install poetry build
- name: Test with Pytest on Python ${{ matrix.python-version }}
run: python -m pytest --cov edspdf --cov-report xml
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -18,7 +18,7 @@ repos:
# ruff
- repo: https://github.com/charliermarsh/ruff-pre-commit
# Ruff version.
rev: 'v0.0.245'
rev: 'v0.0.287'
hooks:
- id: ruff
args: ['--config', 'pyproject.toml']
1 change: 1 addition & 0 deletions changelog.md
@@ -6,6 +6,7 @@

- Add multi-modal transformers (`huggingface-embedding`) with windowing options
- Add `render_page` option to `pdfminer` extractor, for multi-modal PDF features
- Packaging utils (`pipeline.package(...)`) to make a pip installable package from a pipeline

### Changed

30 changes: 29 additions & 1 deletion docs/pipeline.md
@@ -11,6 +11,8 @@ Processing PDFs usually involves many steps such as extracting lines, running OC
can use any technology in static components, we do not provide tools to train
components built with other deep learning frameworks.

## Creating a pipeline

A pipe is a processing block (like a function) that applies a transformation on its input and returns a modified object.

At the moment, four types of pipes are implemented in the library:
@@ -57,7 +59,33 @@ model(pdf_bytes)
model.pipe([pdf_bytes, ...])
```

## Hybrid models
### Hybrid models

EDS-PDF was designed to facilitate the training and inference of hybrid models that
arbitrarily chain static components or trained deep learning components. Static components are callable objects that take a PDFDoc object as input, perform arbitrary transformations over the input, and return the modified object. [Trainable pipes][edspdf.trainable_pipe.TrainablePipe], on the other hand, allow for deep learning operations to be performed on the [PDFDoc][edspdf.structures.PDFDoc] object and must be trained to be used.
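
As a rough illustration of the "static component" contract described above (a callable that takes a document, transforms it, and returns it), here is a minimal sketch. `ToyDoc`, `StripBlankLines`, `uppercase_lines` and `run_pipeline` are hypothetical names used only for this example; they are not part of EDS-PDF, and the real `PDFDoc` carries far more structure than a list of strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a PDFDoc: just a list of text lines."""

    lines: List[str] = field(default_factory=list)


class StripBlankLines:
    """A static component written as a callable object."""

    def __call__(self, doc: ToyDoc) -> ToyDoc:
        # Drop lines that are empty or contain only whitespace.
        doc.lines = [line for line in doc.lines if line.strip()]
        return doc


def uppercase_lines(doc: ToyDoc) -> ToyDoc:
    """A static component written as a plain function."""
    doc.lines = [line.upper() for line in doc.lines]
    return doc


def run_pipeline(doc: ToyDoc, pipes: List[Callable[[ToyDoc], ToyDoc]]) -> ToyDoc:
    # Apply each component in order, feeding the output of one into the next.
    for pipe in pipes:
        doc = pipe(doc)
    return doc
```

Because every component has the same "doc in, doc out" shape, components can be chained in any order, which is what makes mixing static and trained pipes straightforward.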

## Saving and loading a pipeline

Pipelines can be saved and loaded using the `save` and `load` methods. The saved pipeline is not a pickled object but a folder containing the config file, the weights and extra resources for each pipeline. This allows for easy inspection and modification of the pipeline, and avoids the execution of arbitrary code when loading a pipeline.

```python
model.save("path/to/your/model")
model = edspdf.load("path/to/your/model")
```
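
To make the "folder instead of pickle" idea concrete, the sketch below stores a configuration as readable JSON and the weights as raw bytes in a directory, then reads them back. This is only an illustration of the general approach: the function names and the `config.json` / `weights.bin` file names are assumptions made for this example, not the layout EDS-PDF actually writes.

```python
import json
from pathlib import Path
from typing import Any, Dict, Tuple


def save_toy_pipeline(config: Dict[str, Any], weights: bytes, path: Path) -> None:
    """Write a human-readable config and a binary weights file into a folder."""
    path.mkdir(parents=True, exist_ok=True)
    (path / "config.json").write_text(json.dumps(config, indent=2))
    (path / "weights.bin").write_bytes(weights)


def load_toy_pipeline(path: Path) -> Tuple[Dict[str, Any], bytes]:
    """Read the folder back; parsing JSON and bytes does not execute any code."""
    config = json.loads((path / "config.json").read_text())
    weights = (path / "weights.bin").read_bytes()
    return config, weights
```

Unlike unpickling, reading these files only parses data, and the config can be opened and edited in any text editor.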

To share the pipeline and turn it into a pip-installable package, you can use the `package` method, which will use or create a `pyproject.toml` file, fill it accordingly, and create a wheel file. At the moment, we only support the poetry package manager.

```python
model.package(
    name="path/to/your/package",
    version="0.0.1",
    root_dir="path/to/project/root",  # optional, to retrieve an existing pyproject.toml file
    # if you don't have a pyproject.toml, you can provide the metadata here instead
    metadata=dict(
        authors="Firstname Lastname <[email protected]>",
        description="A short description of your package",
    ),
)
```

This will create a wheel file in the `root_dir/dist` folder, which you can share and install with pip.
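
As a possible follow-up step, the sketch below looks for built wheels under `<root_dir>/dist` and prepares the `pip install` command for one of them. The helper names `find_wheels` and `pip_install_command` are hypothetical and not part of EDS-PDF; only the `dist` folder location comes from the text above.

```python
import sys
from pathlib import Path
from typing import List, Union


def find_wheels(root_dir: Union[str, Path]) -> List[Path]:
    """Return the wheel files found in <root_dir>/dist, sorted by file name."""
    return sorted((Path(root_dir) / "dist").glob("*.whl"))


def pip_install_command(wheel: Path) -> List[str]:
    """Build a pip command that targets the interpreter currently running."""
    return [sys.executable, "-m", "pip", "install", str(wheel)]
```

The command can then be run with `subprocess.run(pip_install_command(wheel), check=True)`. Note that sorting by file name does not reliably order versions, so pick the wheel explicitly when several are present.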
33 changes: 33 additions & 0 deletions edspdf/pipeline.py
@@ -15,6 +15,7 @@
Dict,
Iterable,
List,
Mapping,
Optional,
Sequence,
Set,
@@ -28,6 +29,7 @@
from confit.utils.collections import join_path, split_path
from confit.utils.xjson import Reference
from tqdm import tqdm
from typing_extensions import Literal

import edspdf

@@ -944,6 +946,37 @@ def select_pipes(
yield self
self._disabled = disabled_before

    def package(
        self,
        name: str,
        root_dir: Union[str, Path] = ".",
        artifacts_name: str = "artifacts",
        check_dependencies: bool = False,
        project_type: Optional[Literal["poetry", "setuptools"]] = None,
        version: str = "0.1.0",
        metadata: Optional[Dict[str, Any]] = {},
        distributions: Optional[Sequence[Literal["wheel", "sdist"]]] = ["wheel"],
        config_settings: Optional[Mapping[str, Union[str, Sequence[str]]]] = None,
        isolation: bool = True,
        skip_build_dependency_check: bool = False,
    ):
        from .utils.package import package

        return package(
            pipeline=self,
            name=name,
            root_dir=root_dir,
            artifacts_name=artifacts_name,
            check_dependencies=check_dependencies,
            project_type=project_type,
            version=version,
            metadata=metadata,
            distributions=distributions,
            config_settings=config_settings,
            isolation=isolation,
            skip_build_dependency_check=skip_build_dependency_check,
        )
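
Review note: the `package` signature uses mutable default values (`metadata={}` and `distributions=["wheel"]`). Whether the helper in `.utils.package` ever mutates these arguments is not visible in this diff, so the following is only a sketch of the general Python pitfall and of the usual `None`-sentinel alternative. Both functions are hypothetical and exist only for this illustration.

```python
from typing import Any, Dict, Optional


def package_shared(name: str, metadata: Dict[str, Any] = {}) -> Dict[str, Any]:
    # The default dict is created once, when the function is defined,
    # so every call that omits `metadata` mutates the same object.
    metadata.setdefault("name", name)
    return metadata


def package_safe(
    name: str, metadata: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    # Copy the caller's dict (or start from an empty one) on every call,
    # so no state is shared between calls.
    metadata = dict(metadata or {})
    metadata.setdefault("name", name)
    return metadata
```

With `package_shared`, a second call that omits `metadata` still sees the `"name"` set by the first call, whereas `package_safe` starts from a fresh dict each time.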


def load(
config: Union[Path, str, Config],