How useful is OpenPDF while "curating" pdf files? #1234

Open
Albretch opened this issue Nov 18, 2024 · 0 comments
I work in corpus research, and most PDF files (whatever "document" means in the very name of the file format) need "cleansing" with unpaper or other utilities before any data can be parsed out of them. Another problem with PDF files is that their content can range from fully image-based pages, to HTML (containing JavaScript!), to plain text.

Most people see "documents" as a visual thing. Corpus research folks can only analyze actual texts. Take, for example, the relatively complex bilingual edition of this very important text in the public domain:

// __ Tractatus de signis : the semiotic of John Poinsot by John of St. Thomas, 1589-1644; Deely, John N; Powell, Ralph Austin

https://archive.org/details/tractatusdesigni00johnrich/

https://archive.org/download/tractatusdesigni00johnrich/tractatusdesigni00johnrich.pdf
~
$ date; ifl="tractatusdesigni00johnrich.pdf"; ls -l "${ifl}"; file --brief "${ifl}"; sha256sum --binary "${ifl}"; pdfinfo "${ifl}"

Mon 18 Nov 2024 03:33:51 AM CST

-rwxrwxrwx 1 user user 80891203 Nov 16 17:20 tractatusdesigni00johnrich.pdf

PDF document, version 1.5

5a55fba506e750a602057ba99ae202c26b24503b00d58dd53d6f50ea0e6722b8 *tractatusdesigni00johnrich.pdf

Producer: Recoded by LuraDocument PDF v2.16
CreationDate: Wed Mar 21 00:12:44 2007 CST
ModDate: Wed Mar 21 00:14:02 2007 CST
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 628
Encrypted: no
Page size: 527 x 802 pts
Page rot: 0
File size: 80891203 bytes
Optimized: no
PDF version: 1.5
$

How can I use OpenPDF to read that file, linearize it, and extract all of its text, its constituent elements (pictures, ...) and its metadata (links, styles, ...) as a sort of DAG, so that I can then analyze that data structure describing the text?

If not with OpenPDF, which utility would you suggest?

lbrtchx
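
For context on the "image-based vs. plain text" problem described above, a quick probe with poppler-utils (the package that provides the pdfinfo used in the transcript) can show which case a given file falls into. This is only a minimal sketch, assuming pdftotext, pdffonts and pdfimages are installed; the helper name text_chars and the page range 1 to 20 are made up for illustration, and it says nothing about what OpenPDF itself can do.

```shell
#!/usr/bin/env bash
# Sketch only: probe whether a PDF carries an extractable text layer.
# Assumes poppler-utils (pdftotext, pdffonts, pdfimages) is installed.

# text_chars is a hypothetical helper: it prints the number of
# non-whitespace characters pdftotext finds on pages first..last.
text_chars() {
  local pdf="$1" first="$2" last="$3"
  [ -r "${pdf}" ] || { echo "cannot read: ${pdf}" >&2; return 1; }
  pdftotext -f "${first}" -l "${last}" -layout "${pdf}" - \
    | tr -d '[:space:]' | wc -c
}

ifl="tractatusdesigni00johnrich.pdf"
if [ -r "${ifl}" ]; then
  # A count near zero suggests image-only pages that would need OCR.
  text_chars "${ifl}" 1 20
  # Fonts referenced on those pages; no rows usually means no text layer.
  pdffonts -f 1 -l 20 "${ifl}"
  # Embedded images (type, size, encoding) without extracting them.
  pdfimages -list -f 1 -l 20 "${ifl}"
fi
```

If the character count is substantial, the text can be dumped page by page with pdftotext and the images with pdfimages; building the DAG on top of that output would still be a separate step.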
