Add proper documentation #152

Open
DoneDeal0 opened this issue Sep 25, 2022 · 8 comments

Comments

@DoneDeal0

Hi,

I'm interested in parsing a PDF file to generate some stats based on its content. Your crate seems great, but I don't know how to use it. Could you add some basic examples to the readme.md?

Thank you!

@s3bk
Contributor

s3bk commented Sep 25, 2022

The examples for the pdf crate are in https://github.com/pdf-rs/pdf/tree/master/pdf/examples

If you want to read the rendered content and not the raw data, you may want to look at
https://github.com/pdf-rs/pdf_render/blob/master/render/examples/trace.rs

If you still can't figure out what to do, join https://type.zulipchat.com/#narrow/stream/209232-pdf

@s3bk
Contributor

s3bk commented Sep 25, 2022

It is also highly recommended to look at the PDF specification.
This crate is basically just a Rust-Typed translation of the specification, and not a high-level abstraction.

@septatrix

> This crate is basically just a Rust-Typed translation of the specification, and not a high-level abstraction.

Do you know of any higher-level wrapper, or is this maybe planned for this project itself? I am especially interested in extracting text including its position.

@s3bk
Contributor

s3bk commented Oct 17, 2022

The example from pdf_render above extracts text and its position.

@septatrix

That already looks very promising after a few tests. It does not detect some of the segments as a single string and instead emits separate characters, but I think that can be fixed. For now I will probably stay with pdf.js, but in the future I might transition to pdf-rs as I already use Rust+WASM in some other places. If everything works well, it would result in a pure Rust alternative to tabula-java ;)

@s3bk
Contributor

s3bk commented Oct 20, 2022

Yes, there are no attempts to combine separate draw calls.
I am actually working on table extraction right now, but it is not open source.
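
For readers who want to combine separate draw calls themselves, here is a minimal sketch of one possible approach. It is not part of pdf or pdf_render: the `TextSpan` type, the `merge_spans` function, and the tolerance parameters are all hypothetical, and it assumes you have already collected each drawn text run with its position and advance width (for example, from the output of the trace example linked above).

```rust
/// One run of text produced by a single draw call.
/// Hypothetical type for illustration; not defined by pdf or pdf_render.
#[derive(Debug, Clone, PartialEq)]
pub struct TextSpan {
    pub text: String,
    pub x: f32,     // left edge of the run
    pub y: f32,     // baseline of the run
    pub width: f32, // horizontal advance of the run
}

/// Merge spans that sit on the same baseline (within `baseline_tol`)
/// and whose horizontal gap is at most `max_gap` in either direction.
///
/// Assumption: spans on one line have nearly identical `y` values,
/// because the sort below orders by exact `y` before `x`.
pub fn merge_spans(mut spans: Vec<TextSpan>, baseline_tol: f32, max_gap: f32) -> Vec<TextSpan> {
    // PDF y coordinates grow upward, so sort by descending y (top to bottom),
    // then by ascending x (left to right).
    spans.sort_by(|a, b| b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x)));

    let mut merged: Vec<TextSpan> = Vec::new();
    for span in spans {
        if let Some(last) = merged.last_mut() {
            let same_line = (last.y - span.y).abs() <= baseline_tol;
            let gap = span.x - (last.x + last.width);
            if same_line && gap.abs() <= max_gap {
                // Extend the previous span to cover this one.
                last.text.push_str(&span.text);
                last.width = span.x + span.width - last.x;
                continue;
            }
        }
        merged.push(span);
    }
    merged
}
```

This only handles horizontal, left-to-right text with no rotation, and it never inserts spaces between merged runs. Real documents would likely need per-font gap thresholds and handling of transformed text.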

@septatrix

Do you plan to eventually open-source it?

@s3bk
Contributor

s3bk commented Oct 20, 2022

I can't, and I don't think it makes sense. This is such an impossible problem that there can only be approximations to a solution, and there will be a never-ending stream of bugs.
I have (low) standards for my open source code, but ... this is not even close.
