Document Layout Analysis repo for development with PdfPig.

kapitsa2811/DocumentLayoutAnalysis

 
 

Document Layout Analysis

Document Layout Analysis repo for development with PdfPig.

Definition

From Wikipedia: Document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis.

In this repo, we do not consider scanned documents. We work with classic PDF documents and leverage all the information they make available (e.g. letter bounding boxes, fonts).
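To illustrate how letter bounding boxes can be used, here is a minimal, dependency-free sketch that groups letters into text lines by vertical overlap. It is not code from this repo and does not use PdfPig's API: the `LetterBox` record, the `LineGrouping` class, and the `minOverlap` threshold are all hypothetical names chosen for the example. With PdfPig, the boxes would presumably come from the letters of each page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a letter extracted from a PDF page (not a PdfPig type).
// Coordinates follow the PDF convention: origin at the bottom-left, y grows upward.
public record LetterBox(string Value, double Left, double Bottom, double Right, double Top, string FontName);

public static class LineGrouping
{
    // Groups letters into text lines. A letter joins an existing line when its
    // vertical extent overlaps the line's first letter by at least minOverlap
    // (a fraction of the smaller of the two heights); otherwise it starts a new line.
    public static List<string> GroupIntoLines(IEnumerable<LetterBox> letters, double minOverlap = 0.5)
    {
        var lines = new List<List<LetterBox>>();

        // Visit letters from the top of the page down, left to right.
        foreach (var letter in letters.OrderByDescending(l => l.Top).ThenBy(l => l.Left))
        {
            var line = lines.FirstOrDefault(l => VerticalOverlap(l[0], letter) >= minOverlap);
            if (line == null)
            {
                line = new List<LetterBox>();
                lines.Add(line);
            }
            line.Add(letter);
        }

        // Within each line, order letters by their left edge and join their values.
        return lines
            .Select(l => string.Concat(l.OrderBy(x => x.Left).Select(x => x.Value)))
            .ToList();
    }

    // Shared vertical extent of two boxes, as a fraction of the smaller height.
    private static double VerticalOverlap(LetterBox a, LetterBox b)
    {
        double shared = Math.Min(a.Top, b.Top) - Math.Max(a.Bottom, b.Bottom);
        double height = Math.Min(a.Top - a.Bottom, b.Top - b.Bottom);
        return height <= 0 ? 0 : Math.Max(0, shared) / height;
    }
}
```

Calling `LineGrouping.GroupIntoLines(letters)` returns one string per detected line, ordered from the top of the page down. This sketch ignores word spacing, rotated text, and multi-column layouts, which a real page segmentation step has to handle.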

Research papers on page segmentation, table extraction and chart and diagram extraction are available in the Resources section.

Progress

Done

[Image: completed work, not reproduced here]

To do

  • Page segmentation: Constrained text-line detection
  • Table extraction
  • Diagram extraction

PDF page to image converter

A PDF page to image converter is available to help in the research process. It relies on the MuPDF library, available in the SumatraPDF reader.

Languages

  • C# 100.0%