EDS-PDF

EDS-PDF provides a modular framework to extract text information from PDF documents.

You can use it out-of-the-box, or extend it to fit your specific use case. We provide a pipeline system and various utilities for visualizing and processing PDFs, as well as multiple components to build complex models:

📄 Extractors to parse PDFs (based on pdfminer, mupdf or poppler)
🎯 Classifiers to perform text box classification, in order to segment PDFs
🧩 Aggregators to produce an aggregated output from the detected text boxes
🧠 Trainable layers to incorporate machine learning in your pipeline (e.g., embedding building blocks or a trainable classifier)
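
To make the role of the last stage more concrete, here is a minimal sketch of what aggregating classified text boxes could look like. It does not use EDS-PDF: `TextBox`, `aggregate`, the labels and the top-to-bottom, left-to-right reading order are all assumptions made for illustration, not the library's implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class TextBox(NamedTuple):
    """Hypothetical text box: its text, its top-left corner and its label."""

    text: str
    y0: float  # distance from the top of the page, relative (0 to 1)
    x0: float  # distance from the left of the page, relative (0 to 1)
    label: str


def aggregate(boxes):
    """Group box texts by label, in a naive reading order."""
    grouped = defaultdict(list)
    # Assumed reading order: top to bottom, then left to right.
    for box in sorted(boxes, key=lambda b: (b.y0, b.x0)):
        grouped[box.label].append(box.text)
    return {label: " ".join(texts) for label, texts in grouped.items()}


boxes = [
    TextBox("world", 0.5, 0.1, "body"),
    TextBox("Page 1", 0.95, 0.5, "pollution"),
    TextBox("Hello", 0.4, 0.1, "body"),
]
aggregated = aggregate(boxes)
print(aggregated["body"])
```

The result is a mapping from label to text, which is the same shape of output that the `aggregated_texts["body"]` lookup in the example further down relies on.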

Visit the 📖 documentation for more information!

Getting started

Installation

Install the library with pip:

pip install edspdf

Extracting text

Let's build a simple PDF extractor that uses a rule-based classifier. There are two ways to do this: with the configuration system or with the pipeline API.

Create a configuration file:

`config.cfg`

[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"

and load it from Python:

import edspdf
from pathlib import Path

# Build the pipeline described in config.cfg
model = edspdf.load("config.cfg")

Or create a pipeline directly from Python:

from edspdf import Pipeline

model = Pipeline()
model.add_pipe("pdfminer-extractor")
model.add_pipe(
    "mask-classifier",
    config=dict(
        x0=0.2,
        x1=0.9,
        y0=0.3,
        y1=0.6,
        threshold=0.1,
    ),
)
model.add_pipe("simple-aggregator")
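
The configuration file and the Python calls above describe the same pipeline: each `[components.*]` section corresponds to one `add_pipe` call, in the order given under `[pipeline]`. The sketch below illustrates that correspondence using only the standard library. It is not how EDS-PDF loads its configuration, and `pipe_calls` is a hypothetical helper that assumes every value in the file is valid JSON.

```python
import configparser
import json

CONFIG = """
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"
"""


def pipe_calls(config_text):
    """Return (factory name, keyword arguments) pairs in pipeline order."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    names = json.loads(parser["pipeline"]["pipeline"])
    calls = []
    for name in names:
        # Every value is parsed as JSON: strings, numbers and lists.
        section = {
            key: json.loads(value)
            for key, value in parser[f"components.{name}"].items()
        }
        factory = section.pop("@factory")
        calls.append((factory, section))
    return calls


calls = pipe_calls(CONFIG)
for factory, kwargs in calls:
    print(factory, kwargs)
```

Each printed pair lines up with one `model.add_pipe(factory, config=kwargs)` call from the Python version.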

This pipeline can then be applied (for instance with this PDF):

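from pathlib import Path

# Read the PDF as bytes
pdf = Path("/Users/perceval/Development/edspdf/tests/resources/letter.pdf").read_bytes()

# Run the pipeline on the raw bytes
doc = model(pdf)

# Aggregated text for the boxes labelled "body"
body = doc.aggregated_texts["body"]

text, style = body.text, body.properties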

See the rule-based recipe for a step-by-step explanation of what is happening.
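
For intuition about the `x0`, `x1`, `y0`, `y1` and `threshold` parameters used above, the following standalone sketch shows one plausible way a mask-based rule can label a text box: compute the share of the box's area that falls inside the mask, and compare it with the threshold. The `Rect` class, the function names, the `"pollution"` default label and the `>=` comparison are assumptions for illustration, not the library's actual code.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Hypothetical rectangle in relative page coordinates (0 to 1)."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self):
        return max(self.x1 - self.x0, 0.0) * max(self.y1 - self.y0, 0.0)


def overlap_ratio(box, mask):
    """Share of the box's area that lies inside the mask."""
    width = min(box.x1, mask.x1) - max(box.x0, mask.x0)
    height = min(box.y1, mask.y1) - max(box.y0, mask.y0)
    if width <= 0 or height <= 0 or box.area == 0:
        return 0.0
    return width * height / box.area


def label_box(box, mask, threshold, label="body", default="pollution"):
    """Assign `label` when enough of the box overlaps the mask."""
    return label if overlap_ratio(box, mask) >= threshold else default


# Same values as in the configuration above.
mask = Rect(x0=0.2, y0=0.3, x1=0.9, y1=0.6)

inside = Rect(0.3, 0.4, 0.5, 0.5)    # fully inside the mask
outside = Rect(0.0, 0.0, 0.1, 0.1)   # no overlap with the mask
partial = Rect(0.1, 0.4, 0.3, 0.5)   # about half of the box is inside

for box in (inside, outside, partial):
    print(label_box(box, mask, threshold=0.1))
```

With a low threshold such as 0.1, a box only needs a small part of its area inside the mask to be kept, so the partially overlapping box is still labelled as body text in this sketch.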

Citation

If you use EDS-PDF, please cite it as follows.

@software{edspdf,
  author  = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
  doi     = {10.5281/zenodo.6902977},
  license = {BSD-3-Clause},
  title   = {{EDS-PDF: Smart text extraction from PDF documents}},
  url     = {https://github.com/aphp/edspdf}
}

Acknowledgement

We would like to thank Assistance Publique – Hôpitaux de Paris and AP-HP Foundation for funding this project.
