
PDF parsers gateway. Access different parsers using a unified model.


OneOffTech/parxy



Parxy

Parxy is a gateway service that provides a unified approach to accessing PDF parsing services and libraries. It is available as a library and as an HTTP-based application.

Note

Parxy is under active development.

Getting started

The easiest way to get started with Parxy is to use the Docker image provided.

docker pull ghcr.io/oneofftech/parxy:main

A sample docker-compose.yaml file is available within the repository.

Please refer to Releases and Packages for the available tags.

Usage

Parxy exposes a web-based application programming interface (API). The API receives a PDF file via a URL and returns the extracted text as a JSON response.

The exposed service is unauthenticated, so consider exposing it only within a trusted network. If you plan to make it publicly available, consider adding a reverse proxy with authentication in front of it.

Text extraction endpoint

The service exposes a single endpoint, /extract-text, that accepts a POST request with the following input as a JSON body:

  • url: the URL of the PDF file to process.
  • mime_type: the MIME type of the file (it is expected to be application/pdf).
  • driver: the extraction backend to use. Two drivers are currently implemented: pymupdf and pdfact.
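
As an illustration of the request body described above, the following sketch builds the POST request with Python's standard library without sending it. The base URL http://localhost:8000 and the helper name build_extract_request are assumptions made for this example and are not defined by Parxy; adjust them to your deployment.

```python
import json
import urllib.request

# Hypothetical base URL: change it to wherever your Parxy instance is reachable.
PARXY_BASE_URL = "http://localhost:8000"


def build_extract_request(pdf_url, driver, base_url=PARXY_BASE_URL):
    """Build (but do not send) a POST request for the /extract-text endpoint."""
    payload = {
        "url": pdf_url,                   # URL of the PDF file to process
        "mime_type": "application/pdf",   # the only MIME type the service expects
        "driver": driver,                 # "pymupdf" or "pdfact"
    }
    return urllib.request.Request(
        f"{base_url}/extract-text",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_extract_request("https://example.com/sample.pdf", driver="pdfact")
```

The request can then be sent with urllib.request.urlopen(request, timeout=...) and the response decoded with json.load. A generous timeout is advisable, since the service answers only once extraction has finished.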

Warning

The processing is performed synchronously: the request does not return until the extraction has completed.

The response is a JSON structure following the Parse Document Model.

In particular, the structure is as follows:

  • category: A string specifying the node category, which is doc.
  • content: A list of page nodes representing the pages within the document.

Each page node contains the following information:

  • category: A string specifying the node category, which is page.
  • attributes: A list containing attributes of the page. Currently, it includes only page, the page number.
  • content: A list of chunks, each representing a segment of text extracted from the page.

In particular, each chunk contains the following information:

  • role: The role of the chunk in the document (e.g., heading, body, etc.)
  • text: The text extracted from the chunk.
  • marks: A list of marks that characterize the text extracted from the chunk.
  • attributes: A list containing attributes of the chunk, currently including:
    • A list of bounding_box attributes that contain the text. Each bounding box is identified by four coordinates (min_x, min_y, max_x, max_y) and by page, the number of the page where the bounding box is located.
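
To make the nesting concrete, here is a minimal sketch that walks a response from the document node down to its chunks and groups the text by role. The sample dictionary is hand-written to match the description above and is not real service output. It leaves out attributes on purpose, because their exact JSON shape is not spelled out here. The helper names iter_chunks and text_by_role are hypothetical.

```python
def iter_chunks(document):
    """Yield (page_index, chunk) pairs; page_index counts from 1 in document order."""
    for page_index, page in enumerate(document.get("content", []), start=1):
        for chunk in page.get("content", []):
            yield page_index, chunk


def text_by_role(document, role):
    """Return the text of every chunk whose role matches the given one."""
    return [
        chunk["text"]
        for _, chunk in iter_chunks(document)
        if chunk.get("role") == role
    ]


# Hand-written sample shaped like the structure described above (assumption).
sample = {
    "category": "doc",
    "content": [
        {
            "category": "page",
            "content": [
                {"role": "heading", "text": "Introduction", "marks": []},
                {"role": "body", "text": "Parxy extracts text.", "marks": []},
            ],
        },
        {
            "category": "page",
            "content": [
                {"role": "body", "text": "Second page.", "marks": []},
            ],
        },
    ],
}

headings = text_by_role(sample, "heading")
body_text = "\n".join(text_by_role(sample, "body"))
```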

Each mark of a chunk contains:

  • category: the type of the mark, which can be bold, italic, textStyle, or link.

If the mark category is textStyle, it includes additional attributes:

  • font: An object representing the font of the text chunk. Each font is represented by name, id, and size. Available only with the pdfact driver.
  • color: The color of the text chunk. Each color is represented by r, g, b, and id. Available only with the pdfact driver.

If the mark category is link, it provides the url of the link.
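
As a sketch of how marks might be consumed, the function below collects the links found in a response. It assumes that a link mark stores its target under a url key placed directly on the mark; this is an interpretation of the description above, so check it against a real response. The sample data and the name collect_links are illustrative only.

```python
def collect_links(document):
    """Return (chunk_text, url) pairs for every mark whose category is link."""
    links = []
    for page in document.get("content", []):
        for chunk in page.get("content", []):
            for mark in chunk.get("marks", []):
                # Assumption: the link target sits under "url" on the mark itself.
                if mark.get("category") == "link" and "url" in mark:
                    links.append((chunk.get("text", ""), mark["url"]))
    return links


# Hand-written sample (assumption), not real service output.
sample_with_marks = {
    "category": "doc",
    "content": [
        {
            "category": "page",
            "content": [
                {
                    "role": "body",
                    "text": "Project page",
                    "marks": [
                        {"category": "bold"},
                        {"category": "link", "url": "https://example.com/"},
                    ],
                },
                {"role": "body", "text": "Plain text", "marks": []},
            ],
        }
    ],
}

links = collect_links(sample_with_marks)
```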

Error handling

The service can return the following errors:

| code | message | description |
|------|---------|-------------|
| 422 | No url found in request | The url field is missing from the request |
| 422 | No mime_type found in request | The mime_type field is missing from the request |
| 422 | Unsupported file type | The file is not a PDF |
| 500 | Error while saving file | It was not possible to download the file from the specified URL |
| 500 | Error while parsing file | It was not possible to open the file after download |

The body of the response can contain a JSON object with the following fields:

  • code: the error code
  • message: the error description
  • type: the type of the error

{
  "code": 500,
  "message": "Error while parsing file",
  "type": "Internal Server Error"
}
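
On the client side, an error body of this shape could be mapped to an exception as in the sketch below. The ParxyError class and the parse_error_body function are hypothetical names introduced for this example. Since the body can (but need not) contain the JSON fields, the sketch falls back to the HTTP status code when the body is missing or is not valid JSON.

```python
import json


class ParxyError(Exception):
    """Hypothetical client-side exception carrying the documented error fields."""

    def __init__(self, code, message, error_type=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.error_type = error_type


def parse_error_body(status, body):
    """Build a ParxyError from an HTTP status and a raw response body."""
    try:
        data = json.loads(body)
    except (TypeError, ValueError):
        data = {}  # missing or non-JSON body
    if not isinstance(data, dict):
        data = {}  # valid JSON, but not an object
    return ParxyError(
        code=data.get("code", status),
        message=data.get("message", "Unknown error"),
        error_type=data.get("type"),
    )


error = parse_error_body(
    500,
    b'{"code": 500, "message": "Error while parsing file", "type": "Internal Server Error"}',
)
```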

Development

Parxy is built using FastAPI and Python 3.9.

Given the selected stack, development requires a Python 3.9 environment with pip.

Install all the required dependencies:

pip install -r requirements.txt

Run the local development application using:

fastapi dev text_extractor_api/main.py

Testing

To be documented.

Contributing

Thank you for considering contributing to Parxy! The contribution guide can be found in the CONTRIBUTING.md file.

Supporters

The project is provided and supported by OneOff-Tech (UG).

Security Vulnerabilities

If you discover a security vulnerability within Parxy, please send an e-mail to the OneOff-Tech team via [email protected]. All security vulnerabilities will be promptly addressed.