Add proper documentation #152

DoneDeal0 · 2022-09-25T15:23:32Z

Hi,

I'm interested in parsing a PDF file to generate some stats based on its content. Your crate seems great, but I don't know how to use it. Could you add some basic examples to the readme.md?

Thank you!

s3bk · 2022-09-25T15:26:54Z

The examples for the pdf crate are in https://github.com/pdf-rs/pdf/tree/master/pdf/examples

If you want to read the rendered content and not the raw data, you may want to look at
https://github.com/pdf-rs/pdf_render/blob/master/render/examples/trace.rs

If you still can't figure out what to do, join https://type.zulipchat.com/#narrow/stream/209232-pdf

s3bk · 2022-09-25T15:30:16Z

It is also highly recommended to look at the PDF specification.
This crate is basically just a Rust-Typed translation of the specification, and not a high-level abstraction.

septatrix · 2022-10-17T08:39:08Z

> This crate is basically just a Rust-Typed translation of the specification, and not a high-level abstraction.

Do you know of any higher-level wrapper, or is this maybe planned for this project itself? I am especially interested in extracting text including its position.

s3bk · 2022-10-17T08:41:37Z

The example from pdf_render above extracts text and its position.

septatrix · 2022-10-20T15:40:58Z

That already looks very promising after a few tests. It does not detect some of the segments as a single string and instead spits out separate chars, but I think that can be fixed. For now I will probably stay with pdf.js, but in the future I might transition to pdf-rs, as I already use Rust+WASM in some other places. If everything works well, it would result in a pure Rust alternative to tabula-java ;)

s3bk · 2022-10-20T17:56:39Z

Yes, there are no attempts to combine separate draw calls.
I am actually working on table extraction right now, but it is not open source.
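
For readers who hit the separate-chars output mentioned above, here is a minimal sketch of one way to merge adjacent pieces of text after extraction. It is not part of pdf or pdf_render: the `Span` type, the `merge_spans` function, and both tolerance parameters are hypothetical names introduced for illustration. It assumes the spans arrive in reading order and that each one carries a left edge, a baseline, and an advance width in the same units.

```rust
/// One piece of text as produced by a single draw call.
/// Hypothetical type for illustration; not a pdf or pdf_render type.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub x: f32,     // left edge
    pub y: f32,     // baseline
    pub width: f32, // horizontal advance
}

/// Merge consecutive spans that sit on the same baseline (within `y_tol`)
/// and whose horizontal gap or overlap is at most `max_gap`.
/// Assumes `spans` is already in reading order; nothing is sorted here.
pub fn merge_spans(spans: &[Span], y_tol: f32, max_gap: f32) -> Vec<Span> {
    let mut merged: Vec<Span> = Vec::new();
    for span in spans {
        if let Some(last) = merged.last_mut() {
            let same_line = (span.y - last.y).abs() <= y_tol;
            // Distance from the end of the previous span to the start of this one.
            let gap = span.x - (last.x + last.width);
            if same_line && gap.abs() <= max_gap {
                // Extend the previous span instead of starting a new one.
                last.text.push_str(&span.text);
                last.width = span.x + span.width - last.x;
                continue;
            }
        }
        merged.push(span.clone());
    }
    merged
}
```

Choosing `max_gap` smaller than a typical space width keeps words apart while joining the characters inside a word. This is only a heuristic: rotated text, multi-column layouts, and out-of-order draw calls would all need more work.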

septatrix · 2022-10-20T20:55:05Z

Do you plan to eventually open-source it?

s3bk · 2022-10-20T20:59:50Z

I can't, and I don't think it makes sense. This is such an impossible problem that there can only be approximations to a solution, and there will be a never-ending stream of bugs.
I have (low) standards for my open source code, but ... this is not even close.
