How useful is OpenPDF while "curating" pdf files? #1234

Albretch · 2024-11-18T09:58:28Z

I work in corpus research, and for the most part PDF files (whatever "document" means in the very name of the file format) need "cleansing" with unpaper or other utilities before the data can be parsed out of them. Another problem with PDF files is that their content can range from fully image-based, to HTML (containing JavaScript!), to plain text.

Most people see "documents" as a visual thing. Corpus research folks can only analyze actual text. Take, for example, the relatively complex bilingual edition of this very important public-domain text:

// __ Tractatus de signis : the semiotic of John Poinsot by John of St. Thomas, 1589-1644; Deely, John N; Powell, Ralph Austin

https://archive.org/details/tractatusdesigni00johnrich/

https://archive.org/download/tractatusdesigni00johnrich/tractatusdesigni00johnrich.pdf
~
$ date; ifl="tractatusdesigni00johnrich.pdf"; ls -l "${ifl}"; file --brief "${ifl}"; sha256sum --binary "${ifl}"; pdfinfo "${ifl}"

Mon 18 Nov 2024 03:33:51 AM CST

-rwxrwxrwx 1 user user 80891203 Nov 16 17:20 tractatusdesigni00johnrich.pdf

PDF document, version 1.5

5a55fba506e750a602057ba99ae202c26b24503b00d58dd53d6f50ea0e6722b8 *tractatusdesigni00johnrich.pdf

Producer: Recoded by LuraDocument PDF v2.16
CreationDate: Wed Mar 21 00:12:44 2007 CST
ModDate: Wed Mar 21 00:14:02 2007 CST
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 628
Encrypted: no
Page size: 527 x 802 pts
Page rot: 0
File size: 80891203 bytes
Optimized: no
PDF version: 1.5
$

How can I use OpenPDF to read that file, linearize it, and extract all of its text, its constituent elements (pictures, ...), and its metadata (links, styles, ...) as a sort of DAG, so that I can then analyze that data structure describing the text?

If not with OpenPDF, which utility would you suggest?
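
To make the question more concrete, here is a rough sketch of the kind of structure I have in mind. It is plain Java and uses nothing from OpenPDF; the names Kind, Node and linearize are hypothetical, invented only for illustration, and the text and image values are placeholders, not content extracted from the file. Only the Producer and Pages values are taken from the pdfinfo output above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DocumentGraphSketch {

    // Hypothetical node kinds, invented for this sketch (not OpenPDF types).
    enum Kind { DOCUMENT, PAGE, TEXT, IMAGE, LINK }

    // One graph node: a kind, free-form attributes, and outgoing edges.
    static final class Node {
        final Kind kind;
        final Map<String, String> attrs = new LinkedHashMap<>();
        final List<Node> children = new ArrayList<>();

        Node(Kind kind) { this.kind = kind; }

        // Sets an attribute and returns this node, so calls can be chained.
        Node attr(String key, String value) { attrs.put(key, value); return this; }

        // Adds an edge to child and returns this node, so calls can be chained.
        Node add(Node child) { children.add(child); return this; }
    }

    // Depth-first walk in insertion order. Each node is visited once even if
    // it is reachable from several parents, which is what makes this a DAG
    // rather than a tree.
    static String linearize(Node root) {
        StringBuilder out = new StringBuilder();
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        walk(root, seen, out);
        return out.toString();
    }

    private static void walk(Node node, Set<Node> seen, StringBuilder out) {
        if (!seen.add(node)) return; // already emitted via another parent
        if (node.kind == Kind.TEXT) {
            out.append(node.attrs.getOrDefault("text", "")).append('\n');
        }
        for (Node child : node.children) walk(child, seen, out);
    }

    public static void main(String[] args) {
        // Document-level metadata, copied from the pdfinfo output.
        Node doc = new Node(Kind.DOCUMENT)
                .attr("Producer", "Recoded by LuraDocument PDF v2.16")
                .attr("Pages", "628");

        // Placeholder content: a real extractor would have to fill these in.
        Node shared = new Node(Kind.TEXT).attr("text", "running header (placeholder)");
        Node p1 = new Node(Kind.PAGE).attr("number", "1")
                .add(shared)
                .add(new Node(Kind.TEXT).attr("text", "Latin column (placeholder)"))
                .add(new Node(Kind.IMAGE).attr("name", "img0"));
        Node p2 = new Node(Kind.PAGE).attr("number", "2")
                .add(shared)
                .add(new Node(Kind.TEXT).attr("text", "English column (placeholder)"));
        doc.add(p1).add(p2);

        System.out.print(linearize(doc));
    }
}
```

The shared node stands for anything referenced from more than one page (a repeated header, a reused image, a link target). What I am asking is whether OpenPDF, or some other tool, can populate something of this shape from the actual file.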

lbrtchx

Albretch added the enhancement label Nov 18, 2024
