Reading contents of a PDF #195

santiagomed · 2023-09-15T18:49:31Z

Is there an example of how to simply read the contents of a PDF? I tried looking into read.rs, but it seems to be outdated, so I can't run it. Is there any way to read a PDF?

s3bk · 2023-09-17T01:42:46Z

What content do you want?
There is a lot in there.

Content stream? You can get that from the page object.
Text? See the pdf_render and pdf_text crates.

You can use the pdf crate in two versions:

- the release from crates.io, in which case use the examples that match it: https://github.com/pdf-rs/pdf/tree/a6e2abc96b23b64aa1051966bb000aabf1275d9f
- master, with the latest fixes.

The pdf_render and pdf_text crates only work with the latest master.

vjau · 2023-12-08T15:27:37Z

What are the pdf_render and pdf_text crates? crates.io doesn't know anything about them.

s3bk · 2023-12-08T16:21:07Z

They are not on crates.io because they do not meet my stability requirements for publishing there.
pdf_render … renders pdfs.
pdf_text extracts text.

alexis779 · 2024-05-07T19:56:00Z

pdf-extract crate exists, but depends on lopdf, not pdf. This video benchmarks it against poppler, a C library.

I'd be curious to see a C/Rust comparison but with poppler against pdf_text.

Gisbert12843 · 2024-07-19T19:54:04Z

Any chance for an easy example that just converts a PDF file to a String?

I need to search through the valid UTF-8 text of a PDF and not panic if the PDF is formatted in some unexpected way.

The documentation I found on this seems very scarce.
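A minimal sketch of the "search through valid UTF-8 text" half of this request, assuming some extractor has already handed back raw bytes for the text. It does not parse a PDF; `decode_lossy` and `find_lines` are hypothetical helper names, not part of the pdf, pdf_text, or lopdf crates. It only shows how invalid byte sequences can be replaced rather than treated as a fatal error.

```rust
/// Decode bytes produced by some PDF text extractor (assumed to exist
/// elsewhere). Invalid UTF-8 sequences become U+FFFD instead of causing
/// an error or a panic.
fn decode_lossy(raw: &[u8]) -> String {
    String::from_utf8_lossy(raw).into_owned()
}

/// Return the 1-based numbers of all lines that contain `needle`,
/// compared case-insensitively.
fn find_lines(text: &str, needle: &str) -> Vec<usize> {
    let needle = needle.to_lowercase();
    text.lines()
        .enumerate()
        .filter(|(_, line)| line.to_lowercase().contains(&needle))
        .map(|(index, _)| index + 1)
        .collect()
}
```

Calling `find_lines(&decode_lossy(bytes), "invoice")` would then list the matching lines even if some of the bytes were not valid UTF-8.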

s3bk · 2024-07-19T20:34:25Z

If pdf_text does not do what you need, then no, there is no easy example.
This is not an easy problem. I have been working on this for multiple years now and have thrown many algorithms at it, and it is still not perfect.
pdf_render renders the PDF and allows you to capture the drawn strings. That's as good as it gets.

Gisbert12843 · 2024-07-19T21:32:57Z

Ahh thank you for clarifying that!

Unrelated to this project, I was working with lopdf on that task.
Everything worked until a PDF file did not follow a regular encoding, i.e. was corrupted or contained Chinese text xd

Sadly, lopdf just panics in those cases instead of returning an error.
Weird behaviour from my pov.

s3bk · 2024-07-19T21:44:39Z

Oh sure. If everything is in standard encoding, it is easy.
And yes, I tried to not panic in the pdf crate. pdf_render might panic, but that would be a bug and needs fixing.

pdf_render is used in production with "random" PDFs, and it's not great for a server to crash from a user-supplied PDF.
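For callers stuck with an extractor that may panic on malformed input, one possible safety net is to run it behind `std::panic::catch_unwind`. This is only a sketch under assumptions: `extract_guarded` is a hypothetical helper, the closure stands in for whatever extraction call is actually used, and it only helps when the binary is built with the default unwinding panic strategy (not `panic = "abort"`).

```rust
use std::panic::{self, UnwindSafe};

/// Run a text-extraction closure and convert a panic into an `Err`,
/// so that one bad PDF does not take down the whole process.
fn extract_guarded<F>(extract: F) -> Result<String, String>
where
    F: FnOnce() -> Result<String, String> + UnwindSafe,
{
    match panic::catch_unwind(extract) {
        // The closure returned normally: pass its own result through.
        Ok(result) => result,
        // The closure panicked: recover the message if it is a string.
        Err(payload) => {
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            Err(format!("extractor panicked: {}", msg))
        }
    }
}
```

The panic message is still printed to stderr by the default panic hook. A wrapper like this is a workaround, not a substitute for a library that reports errors properly.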

acro5piano · 2024-07-31T09:02:22Z

@santiagomed
I wanted to do the same thing as you did. Thank you for flagging this issue!

@alexis779
Thank you, pdf-extract works like a charm!
