Reading contents of a PDF #195

Open
santiagomed opened this issue Sep 15, 2023 · 9 comments

Comments

@santiagomed

Is there an example of how to simply read the contents of a PDF successfully? I tried looking into read.rs, but it seems to be outdated, so I can't run it. Is there any way to read a PDF?

@s3bk
Contributor

s3bk commented Sep 17, 2023

What content do you want?
There is a lot in there.

  • Content stream? You can get that from the page object.
  • Text? See the pdf_render and pdf_text crates.

You can use the pdf crate in two versions:

The pdf_render and pdf_text crates only work with the latest master.

@vjau

vjau commented Dec 8, 2023

What are the pdf_render and pdf_text crates? crates.io doesn't know anything about them.

@s3bk
Contributor

s3bk commented Dec 8, 2023

They are not on crates.io because they do not meet my stability requirements for publishing there.
pdf_render … renders pdfs.
pdf_text extracts text.

@alexis779

The pdf-extract crate exists, but it depends on lopdf, not pdf. This video benchmarks it against poppler, a C library.

I'd be curious to see a C/Rust comparison but with poppler against pdf_text.

@Gisbert12843

Gisbert12843 commented Jul 19, 2024

Any chance for an easy example that just converts a PDF file to a String?

I need to search through the valid UTF-8 text of a PDF without panicking if the PDF is formatted in an unexpected way.

Documentation on this seems scarce.

@s3bk
Contributor

s3bk commented Jul 19, 2024

If pdf_text does not do what you need, then no, there is no easy example.
This is not an easy problem. I have been working on this for multiple years now and have thrown many algorithms at it, and it is still not perfect.
pdf_render renders the PDF and allows you to capture the drawn strings. That's as good as it gets.
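
To illustrate why turning captured strings into readable text takes real work, here is a minimal sketch of one naive approach: group strings by baseline, then order each group left to right. The `DrawnString` type, its fields, and `assemble_lines` are hypothetical names invented for this sketch. They are not part of pdf_render or pdf_text, and this is not the algorithm those crates use.

```rust
/// Hypothetical record of one string drawn on a page.
/// This is an assumption for illustration, not a pdf_render type.
#[derive(Debug, Clone)]
struct DrawnString {
    x: f32,
    y: f32, // baseline position; in this sketch y grows downward
    text: String,
}

/// Groups strings whose baseline lies within `tolerance` of the first string
/// on the current line, then orders each line left to right.
fn assemble_lines(mut items: Vec<DrawnString>, tolerance: f32) -> Vec<String> {
    // Sort top to bottom so that strings on the same line end up adjacent.
    items.sort_by(|a, b| a.y.total_cmp(&b.y));

    let mut lines: Vec<Vec<DrawnString>> = Vec::new();
    for item in items {
        // Start a new line when there is none yet or the baseline moved too far.
        let starts_new_line = match lines.last() {
            Some(line) => (item.y - line[0].y).abs() > tolerance,
            None => true,
        };
        if starts_new_line {
            lines.push(Vec::new());
        }
        lines
            .last_mut()
            .expect("a line exists at this point")
            .push(item);
    }

    lines
        .into_iter()
        .map(|mut line| {
            // Within a line, order the strings left to right and join with spaces.
            line.sort_by(|a, b| a.x.total_cmp(&b.x));
            line.iter()
                .map(|s| s.text.as_str())
                .collect::<Vec<_>>()
                .join(" ")
        })
        .collect()
}
```

Even this simple version needs a tolerance parameter, and it ignores columns, rotated text, hyphenation, and fonts without a usable encoding, which hints at why no single algorithm handles every PDF.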

@Gisbert12843

Ahh thank you for clarifying that!

Unrelated to this project, I was working with lopdf on that task.
Everything worked until a PDF file did not follow the regular encoding, i.e. it was corrupted or contained Chinese text.

Sadly, lopdf just panics in every such case instead of returning an error.
Weird behaviour from my point of view.

@s3bk
Contributor

s3bk commented Jul 19, 2024

Oh sure. If everything is in standard encoding, it is easy.
And yes, I tried to not panic in the pdf crate. pdf_render might panic, but that would be a bug and needs fixing.

pdf_render is used in production with "random" PDFs, and it's not great for a server to crash from a user-supplied PDF.
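
For callers stuck with a library that panics on bad input, one possible stopgap is to wrap the call in `std::panic::catch_unwind` and turn the panic into an error value. The sketch below assumes a hypothetical `extract_text_may_panic` function standing in for any third-party extractor; it is not an API of pdf, pdf_render, or lopdf. Note that `catch_unwind` only helps when the binary is built with the default `panic = "unwind"` strategy.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical stand-in for a third-party extractor that panics on
/// malformed input. It is NOT a real function of any crate named here.
fn extract_text_may_panic(bytes: &[u8]) -> String {
    if !bytes.starts_with(b"%PDF-") {
        panic!("not a PDF header");
    }
    String::from_utf8_lossy(bytes).into_owned()
}

/// Runs the extractor and converts a panic into an `Err` carrying the
/// panic message, so one bad file does not take down the caller.
fn extract_text_guarded(bytes: &[u8]) -> Result<String, String> {
    panic::catch_unwind(AssertUnwindSafe(|| extract_text_may_panic(bytes))).map_err(|payload| {
        // Panic payloads are usually a &str or a String; anything else is reported generically.
        if let Some(msg) = payload.downcast_ref::<&str>() {
            msg.to_string()
        } else if let Some(msg) = payload.downcast_ref::<String>() {
            msg.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}
```

A request handler could then match on the result of `extract_text_guarded(&uploaded_bytes)` and answer with an error response instead of crashing. This is a workaround rather than a fix: the default panic hook still prints to stderr, and a library that returns errors itself remains the better option.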

@acro5piano

@santiagomed
I wanted to do the same thing as you did. Thank you for flagging this issue!

@alexis779
Thank you, pdf-extract works like a charm!


6 participants