Document Layout Analysis

Document layout analysis repository for development with PdfPig.

Definition

From Wikipedia: Document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis.

In this repository, we do not consider scanned documents, only classic PDF documents, and we leverage all the information they make available (e.g. letter bounding boxes, fonts).

Resources

Research papers on page segmentation, table extraction and chart and diagram extraction are available in the Resources section.

Progress

Done

Recursive XY Cut

Docstrum for bounding boxes
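
As an illustration of the Recursive XY Cut idea listed above, here is a minimal Python sketch. It is not the code in this repository and not PdfPig's API: the `(x0, y0, x1, y1)` box layout, the function names `xy_cut` and `_split`, and the gap thresholds are all assumptions made for the example. It alternates between looking for empty vertical and horizontal gaps between bounding boxes and recurses until no gap wider than the threshold remains.

```python
from typing import List, Tuple

# Hypothetical box layout for this sketch: (x0, y0, x1, y1) in page units.
Box = Tuple[float, float, float, float]


def _split(boxes: List[Box], axis: int, min_gap: float) -> List[List[Box]]:
    """Group boxes separated by empty gaps wider than min_gap along one axis.

    axis=0 projects onto x (vertical cuts), axis=1 onto y (horizontal cuts).
    """
    lo, hi = axis, axis + 2
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # furthest extent covered so far
    for box in ordered[1:]:
        if box[lo] - reach > min_gap:
            groups.append([box])  # empty gap found: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap_x: float = 10.0,
           min_gap_y: float = 5.0) -> List[List[Box]]:
    """Recursively cut a set of boxes into blocks; returns a list of blocks."""
    if not boxes:
        return []
    for axis, gap in ((0, min_gap_x), (1, min_gap_y)):
        groups = _split(boxes, axis, gap)
        if len(groups) > 1:
            blocks: List[List[Box]] = []
            for group in groups:
                blocks.extend(xy_cut(group, min_gap_x, min_gap_y))
            return blocks
    return [boxes]  # no cut possible on either axis: this is a leaf block


if __name__ == "__main__":
    # Two columns; the left column holds two paragraphs.
    words = [
        (0, 100, 40, 110), (0, 88, 40, 98), (0, 50, 40, 60),
        (60, 100, 100, 110), (60, 88, 100, 98),
    ]
    for block in xy_cut(words):
        print(block)
```

The thresholds are fixed here for simplicity. In practice they would more plausibly be derived from the page itself, for example from the dominant font size or from typical letter widths and heights.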

To do

Page segmentation: Constrained text-line detection
Table extraction
Diagram extraction

Pdf page to image converter

A PDF page to image converter is available to help in the research process. It relies on the MuPDF library, which is available in the Sumatra PDF reader.


kapitsa2811/DocumentLayoutAnalysis
