I work on corpora research, and for the most part PDF files (whatever "document" means in the very name of the file format) need "cleansing" with unpaper or similar utilities before the data can be parsed out of them. Another problem with PDF files is that they can range from fully image-based, to HTML (even containing JavaScript!), to plain text.
Most people see "documents" as a visual thing, but corpora research folks can only analyze actual texts. Take, for example, the relatively complex bilingual edition of this very important public-domain text:
// __ Tractatus de signis : the semiotic of John Poinsot by John of St. Thomas, 1589-1644; Deely, John N; Powell, Ralph Austin
https://archive.org/details/tractatusdesigni00johnrich/
https://archive.org/download/tractatusdesigni00johnrich/tractatusdesigni00johnrich.pdf
~
$ date; ifl="tractatusdesigni00johnrich.pdf"; ls -l "${ifl}"; file --brief "${ifl}"; sha256sum --binary "${ifl}"; pdfinfo "${ifl}"
Mon 18 Nov 2024 03:33:51 AM CST
-rwxrwxrwx 1 user user 80891203 Nov 16 17:20 tractatusdesigni00johnrich.pdf
PDF document, version 1.5
5a55fba506e750a602057ba99ae202c26b24503b00d58dd53d6f50ea0e6722b8 *tractatusdesigni00johnrich.pdf
Producer: Recoded by LuraDocument PDF v2.16
CreationDate: Wed Mar 21 00:12:44 2007 CST
ModDate: Wed Mar 21 00:14:02 2007 CST
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 628
Encrypted: no
Page size: 527 x 802 pts
Page rot: 0
File size: 80891203 bytes
Optimized: no
PDF version: 1.5
$
How can I use OpenPDF to read that file, linearize it, and extract all of its text, constitutive data (pictures, ...), and metadata (links, styles, ...) as a sort of DAG, so that I can then analyze that data structure describing the text?
If not with OpenPDF, which utility would you suggest?
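For reference, the kind of extraction I have in mind would start along these lines (a minimal sketch, assuming OpenPDF's com.lowagie.text.pdf.parser.PdfTextExtractor; this only pulls the text layer page by page, which for a LuraDocument-recoded scan like this one may well be empty, and it does not yet walk images, links, or styles into any DAG):

```java
import java.io.IOException;

import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.parser.PdfTextExtractor;

public class ExtractText {
    public static void main(String[] args) throws IOException {
        // Path to the Archive.org scan discussed above
        PdfReader reader = new PdfReader("tractatusdesigni00johnrich.pdf");
        PdfTextExtractor extractor = new PdfTextExtractor(reader);
        StringBuilder text = new StringBuilder();
        // Pages are 1-indexed in OpenPDF
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            text.append(extractor.getTextFromPage(page)).append('\n');
        }
        reader.close();
        System.out.println(text);
    }
}
```

Going beyond the plain text layer (to images, annotations, and so on) would presumably mean walking the page dictionaries via the reader object itself, which is the part I am unsure how to structure.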
lbrtchx