Document Layout Analysis repository for development with PdfPig.
From Wikipedia: Document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis.
In this repository, we do not consider scanned documents but classic PDF documents, and we leverage all the information they make available (e.g. letter bounding boxes, fonts).
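To illustrate why letter bounding boxes are useful, here is a minimal sketch of one basic geometric step: grouping letters into text lines by vertical overlap. It is written in Python for brevity and is not PdfPig code. The `Letter` class, the `group_into_lines` function and the `min_overlap` threshold are hypothetical names introduced only for this example, and the coordinates are assumed to follow the usual PDF convention (origin at the bottom-left of the page).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Letter:
    """Hypothetical letter record: a character and its bounding box."""
    text: str
    left: float
    bottom: float
    right: float
    top: float


def group_into_lines(letters: List[Letter], min_overlap: float = 0.5) -> List[str]:
    """Group letters into lines when their boxes overlap vertically.

    A letter joins an existing line if its vertical overlap with the first
    letter of that line is at least `min_overlap` of the smaller box height.
    Lines are returned top to bottom, letters left to right.
    """
    lines: List[List[Letter]] = []
    # Visit letters from the top of the page downwards.
    for letter in sorted(letters, key=lambda l: -l.top):
        for line in lines:
            ref = line[0]
            overlap = min(ref.top, letter.top) - max(ref.bottom, letter.bottom)
            height = min(ref.top - ref.bottom, letter.top - letter.bottom)
            if height > 0 and overlap / height >= min_overlap:
                line.append(letter)
                break
        else:
            # No existing line matched, so this letter starts a new one.
            lines.append([letter])
    return ["".join(l.text for l in sorted(line, key=lambda l: l.left))
            for line in lines]


if __name__ == "__main__":
    sample = [
        Letter("k", 6, 80, 10, 90),
        Letter("H", 0, 100, 5, 110),
        Letter("o", 0, 80, 5, 90),
        Letter("i", 6, 100, 8, 110),
    ]
    print(group_into_lines(sample))
```

This sketch ignores rotated text, columns and word spacing. The methods described in the research papers are considerably more robust.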
Research papers on page segmentation, table extraction and chart and diagram extraction are available in the Resources section.
- Page segmentation: Constrained text-line detection
- Table extraction
- Diagram extraction
A PDF page to image converter is available to help with the research process. It relies on the MuPDF library, which is available in the Sumatra PDF reader.