- Support pydantic v2
- Default to fp16 when inferring with gpu
- Support `inputs` parameter in `TrainablePipe.postprocess(...)` method (as in edsnlp)
- We now check that the user isn't trying to write a single file in a split fashion (when `write_in_worker is True` or `num_rows_per_file is not None`) and raise an error if they do
- Batches full of empty content boxes no longer crash the `huggingface-embedding` component
- Ensure models are always loaded in non training mode
- Improved performance of `edspdf.data` methods over a filesystem (`fs` parameter)
- It is now possible to recursively retrieve pdf files in a directory using `edspdf.data.read_files`
- New unified `edspdf.data` api (pdf files, pandas, parquet) and LazyCollection object to efficiently read / write data from / to different formats & sources. This API has been heavily inspired by the `edsnlp.data` API.
- New unified processing API to select the execution backend via `data.set_processing(...)` to replace the old `accelerators` API (which is now deprecated, but still available).
- `huggingface-embedding` now supports quantization and other `AutoModel.from_pretrained` kwargs
- It is now possible to convert a label to multiple labels in the `simple-aggregator` component:
    # To build the "text" field, we will aggregate "title", "body" and "table" lines,
    # and output "title" lines in a separate field as well.
    label_map = {
        "text": ["title", "body", "table"],
        "title": "title",
    }
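
To illustrate what such a mapping means, here is a minimal, library-independent sketch. It is not the `simple-aggregator` implementation: the `aggregate_by_label` helper and the `(label, text)` line format are assumptions made only for this example.

```python
def aggregate_by_label(lines, label_map):
    """Group line texts into output fields according to label_map.

    Hypothetical helper (not part of edspdf): `lines` is a list of
    (label, text) pairs, and each label_map value is either a single
    source label or a list of source labels.
    """
    fields = {field: [] for field in label_map}
    for field, sources in label_map.items():
        # Accept both "title" and ["title", "body"] as map values.
        if isinstance(sources, str):
            sources = [sources]
        for label, text in lines:
            if label in sources:
                fields[field].append(text)
    # Join the collected lines of each field, keeping document order.
    return {field: "\n".join(texts) for field, texts in fields.items()}


label_map = {
    "text": ["title", "body", "table"],
    "title": "title",
}
lines = [
    ("title", "Report"),
    ("body", "First paragraph."),
    ("footer", "Page 1"),
]
result = aggregate_by_label(lines, label_map)
```

In this sketch a "title" line contributes to both the "text" and "title" fields, while lines whose label is not listed anywhere in the map (here "footer") are left out.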
- `huggingface-embedding` now resizes bbox features for large PDFs, instead of making the model crash
- `huggingface-embedding` and `sub-box-cnn-pooler` now handle empty PDFs correctly
- Fix typing to allow passing an accelerator dict to `Pipeline.pipe(...)`
- Removed multiprocessing accelerator debug output
- Fixed absolute links in github-pages docs (e.g. image assets)
- Added auto-links to components in the docs (by comparing span contents with entry points)
- Add multi-modal transformers (`huggingface-embedding`) with windowing options
- Add `render_page` option to `pdfminer` extractor, for multi-modal PDF features
- Add inference utilities (`accelerators`), with simple mono process support and multi gpu / cpu support
- Packaging utils (`pipeline.package(...)`) to make a pip installable package from a pipeline
- Updated API to follow EDS-NLP's refactoring
- Updated `confit` to 0.4.2 (better errors) and `foldedtensor` to 0.3.0 (better multiprocess support)
- Removed `pipeline.score`. You should use `pipeline.pipe`, a custom scorer and `pipeline.select_pipes` instead.
- Better test coverage
- Use `hatch` instead of `setuptools` to build the package / docs and run the tests
- Fixed `attrs` dependency only being installed in dev mode
Major refactoring of the library:
- new pipeline system whose API is inspired by spaCy
- first-class support for pytorch
- hybrid model inference and training (rules + deep learning)
- moved from pandas DataFrame to attrs dataclasses (`PDFDoc`, `Page`, `Box`, ...) for representing PDF documents
- new configuration system based on [config](https://github.com/aphp/config), with support for instantiation of complex deep learning models, off-the-shelf CLI, ...
- new extractors: pymupdf and poppler (separate packages for licensing reasons)
- many deep learning layers (box-transformer, 2d attention with relative position information, ...)
- trainable deep learning classifier
- training recipes for deep learning models
- Allow corrupted PDF to not raise an error by default (they are treated as empty PDFs)
- Fix classification and aggregation for empty PDFs
Cast bytes-like extractor inputs as bytes
Performance and cuda related fixes.
Many, many changes:
- added torch as the main deep learning framework instead of spaCy and thinc 🎉
- added poppler and mupdf as alternatives to pdfminer
- new pipeline / config / registry system to facilitate consistency between training and inference
- standardization of the exchange format between components with dataclass models (attrs more specifically) instead of pandas dataframes
- Add label mapping parameter to aggregators (to merge different types of blocks such as `title` and `body`)
- Improved line aggregation formula
- Fix aggregation for empty documents
- Drop the `pdf2image` dependency, replacing it with `pypdfium2` (easier installation)
- Major refactoring of the library. Moved from concepts (`aggregation`) to plural names (`aggregators`).
- Multi page boxes alignment
- `package-resource.v1` in the misc registry
- Remove `importlib.metadata` dependency, which led to issues with Python 3.7
- Python 3.7 support, by relaxing dependency constraints
- Support for package-resource pipeline for `sklearn-pipeline.v1`
- `compare_results` in visualisation
- Rescale transform now keeps origin on top-left corner
- Styles management within the extractor
- `styled.v1` aggregator, to handle styles
- `rescale.v1` transform, to go back to the original height and width
- Styles and text extraction is handled by the extractor directly
- The PDFMiner `line` object is not carried around any more
- Outdated `params` entry in the EDS-PDF registry.
- Fixed `merge_lines` bug when lines were empty
- Modified the demo consequently
- The extractor always returns a pandas DataFrame, be it empty. It enhances robustness and stability.
- `aggregation` submodule to handle the specifics of aggregating text blocs
- Base classes for better-defined modules
- Uniformise the columns to `labels`
- Add arbitrary contextual information
- `typer` legacy dependency
- `models` submodule, which handled the configurations for Spark distribution (deferred to another package)
- specific `orbis` context, which was APHP-specific
Inception ! 🎉
- spaCy-like configuration system
- Available classifiers:
  - `dummy.v1`, that classifies everything to `body`
  - `mask.v1`, for simple rule-based classification
  - `sklearn.v1`, that uses a Scikit-Learn pipeline
  - `random.v1`, to better sow chaos
- Merge different blocs together for easier visualisation
- Streamlit demo with visualisation