PdfString.as_str() utf8 errors #117

Open
mike-kfed opened this issue Nov 24, 2021 · 26 comments

Comments

@mike-kfed
Contributor

Hi, I've been playing with the master branch of this amazing crate. My goal was to extract text from a PDF that contains German text. It seems the umlauts in there are in Latin-1 / ISO-8859-1 encoding. I don't know whether the PDF file specifies an encoding somewhere that could be picked up for later decoding. Afaik Latin-1 is the default encoding? My quick hack for now is this:

// loop over contents.operations...
if let TextDrawAdjusted::Text(text) = data {
    match &text.as_str() {
        Ok(s) => buf.push_str(s),
        Err(_) => {
            let from_latin1: String =
                text.as_bytes().iter().map(|&c| c as char).collect();
            buf.push_str(&from_latin1);
        }
    }
}
//...

This works because the ISO-8859-1 byte values coincide with the first 256 Unicode code points, so the as char trick works. I assume that is not a robust solution for your crate? However, the current use of str::from_utf8() in PdfString.as_str() doesn't work for Latin-1 text either. Given some guidance on how you want this fixed, I can provide a PR.
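The fallback described above can be written as a small standalone function. This is only a sketch using the standard library: the name decode_utf8_or_latin1 is made up for illustration and is not part of the pdf crate.

```rust
/// Decode bytes as UTF-8 if they are valid UTF-8; otherwise treat every
/// byte as a Latin-1 (ISO-8859-1) code point.
/// Hypothetical helper for illustration, not an API of the `pdf` crate.
fn decode_utf8_or_latin1(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.to_owned(),
        // Latin-1 byte values equal the first 256 Unicode code points,
        // so `u8 as char` converts each byte without loss.
        Err(_) => bytes.iter().map(|&b| b as char).collect(),
    }
}

fn main() {
    // "Grüße" in ISO-8859-1: ü = 0xFC, ß = 0xDF. This is not valid UTF-8.
    let latin1 = b"Gr\xfc\xdfe";
    println!("{}", decode_utf8_or_latin1(latin1));
}
```

One caveat of this approach: a Latin-1 byte sequence that happens to be valid UTF-8 is decoded as UTF-8, so the guess can be wrong for some inputs.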

Sadly I cannot easily provide example PDFs as they are from my bank, I'll try to come up with a file that shows the same behaviour though.

@mike-kfed
Contributor Author

I looked into this some more: string encoding inside PDF is hilariously complicated. If nobody is working on this, I can give it a go. My idea would be to extend PdfString with an encoding property to allow correct decoding to UTF-8 later.

Otherwise I have identified public PDFs with the same problem that I could share.
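To make the idea of an encoding property concrete, here is a rough sketch. The enum StringEncoding and the functions detect_encoding and decode are hypothetical and do not exist in the pdf crate. It assumes the convention for PDF text strings (for example in metadata), where a leading FE FF byte-order mark signals UTF-16BE; Latin-1 is used as a stand-in for PDFDocEncoding, which differs from it in some positions. Strings shown with text operators in content streams are encoded per font, so this does not cover them.

```rust
/// Hypothetical encoding tag for a PDF string; not part of the `pdf` crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StringEncoding {
    Utf16Be,
    /// Stand-in for PDFDocEncoding (an approximation, see the text above).
    Latin1,
}

/// Guess the encoding from a leading UTF-16BE byte-order mark (FE FF).
fn detect_encoding(bytes: &[u8]) -> StringEncoding {
    if bytes.starts_with(&[0xFE, 0xFF]) {
        StringEncoding::Utf16Be
    } else {
        StringEncoding::Latin1
    }
}

/// Decode according to the detected encoding.
/// Returns `None` if the UTF-16 data is malformed.
fn decode(bytes: &[u8]) -> Option<String> {
    match detect_encoding(bytes) {
        StringEncoding::Utf16Be => {
            // Skip the two BOM bytes; a trailing odd byte is ignored.
            let units: Vec<u16> = bytes[2..]
                .chunks_exact(2)
                .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
                .collect();
            String::from_utf16(&units).ok()
        }
        StringEncoding::Latin1 => Some(bytes.iter().map(|&b| b as char).collect()),
    }
}

fn main() {
    println!("{:?}", decode(&[0xFE, 0xFF, 0x00, 0xFC])); // UTF-16BE input
    println!("{:?}", decode(b"\xfc")); // single-byte input
}
```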

@s3bk
Contributor

s3bk commented Nov 25, 2021

I would HIGHLY recommend that you peek at https://github.com/pdf-rs/pdf_render/blob/master/render/src/cache.rs#L188
and the returned TraceResults.

We use it in production, so it is mostly proven (although not in German). It should work in any language. If not it is a bug and needs fixing.

@neko-para

I've checked this file's history, but I cannot find a revision with more than 188 lines. I've also searched the whole pdf_render directory for 'TraceResults', but only found its definition. How can I get it?

@s3bk
Contributor

s3bk commented Sep 26, 2022

tracer.finish() will give you a Vec<DrawItem>.
TraceResults was unused and I removed it now.

@neko-para

tracer.finish() will give you a Vec<DrawItem>. TraceResults was unused and I removed it now.

I traced the text rendering into TextState::draw_text. If things work as I expect, the PdfString passed to the 'Tj' operator is encoded in either Latin-1 or UTF-16BE? But when I try to parse a PDF with Chinese characters, it just yields unknown codes. I've tested and found that it isn't any common encoding: not Latin-1, not UTF-16BE, and not GBK (which is common in China and encodes each Chinese character in exactly 2 bytes, similar to UTF-16).

Here is an example PDF that contains only 你好, which is 0x4F60 0x597D in UTF-16BE. But when I parse it with the code below, I get

C:\Users\liaoh\Documents\Projects\ScanCashRs>cargo run build
   Compiling scan-cash-rs v0.1.0 (C:\Users\liaoh\Documents\Projects\ScanCashRs)
    Finished dev [unoptimized + debuginfo] target(s) in 0.83s
     Running `target\debug\scan-cash-rs.exe build`
Tf : /F4, 13
Tm : 1, 0, 0, -1, 62, 57
Tj : "\4\x82\5\xf1"

and the string 0x0482 0x05F1 just makes no sense.

use pdf::file::File;

fn main() {
    let file = File::open("2.pdf").unwrap();
    let page = file.get_page(0).unwrap();
    if let Some(ref content) = &page.contents {
        for op in &content.operations {
            // starts_with avoids a panic when the operator string is empty
            if op.operator.starts_with('T') {
                println!("{}", op);
            }
        }
    }
}

2.pdf

@s3bk
Contributor

s3bk commented Sep 26, 2022

The text is encoded as CIDs, basically a font-specific encoding.
The pdf_render crate deals with the translation into Unicode.
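For the example output above, the Tj operand can be read as two-byte big-endian codes. The sketch below uses only the standard library (the function name two_byte_codes is made up) and shows that the resulting codes differ from the UTF-16 code units of 你好, which is consistent with them being font-specific CIDs rather than Unicode values.

```rust
/// Split a string operand into big-endian two-byte codes.
/// A trailing odd byte, if any, is ignored.
/// Hypothetical helper for illustration, not an API of `pdf` or `pdf_render`.
fn two_byte_codes(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect()
}

fn main() {
    // The Tj operand from the output above: "\4\x82\5\xf1".
    let operand = [0x04, 0x82, 0x05, 0xF1];
    let codes = two_byte_codes(&operand);
    println!("codes from the PDF: {:04X?}", codes);

    // UTF-16 code units of the text the PDF is supposed to contain.
    let utf16: Vec<u16> = "你好".encode_utf16().collect();
    println!("UTF-16 of the text: {:04X?}", utf16);
}
```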

@neko-para

Got it!

                TextEncoding::CID(Some(ref to_unicode)) => {
                    match to_unicode.get(&cid) {
                        Some(&(gid, ref unicode)) => (cid, gid, Some(unicode.clone())),
                        None => (cid, None, None)
                    }
                },

This block deals with it, doesn't it? So I have an array of CIDs, and I need to translate them into Unicode via to_unicode?

@s3bk
Contributor

s3bk commented Sep 26, 2022

With Chinese text, yes that is the way.
With English there are a few more...

@neko-para

So I could just follow the earlier steps of the draw_text function (check the CID flag to decide whether to split the string into two-byte chunks or just expand each byte), and the resulting string is the concatenation of the third element of each glyph tuple?

@s3bk
Contributor

s3bk commented Sep 26, 2022

Yes.

@neko-para

Then would it make sense to add a method that translates the vec into a String? It is quite hard to get familiar with this logic. I've tried some other PDF-related crates, but they either return the vec or produce the UTF-8 error.

@s3bk
Contributor

s3bk commented Sep 26, 2022

Have you considered using the tracer and then the TextSpans it produces? They contain the built strings.

The third part of the tuple (from to_unicode) is a String, you can just concatenate them.
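The lookup-and-concatenate step can be sketched with a plain HashMap standing in for the font's to_unicode table. This is a simplified model, not the pdf_render data structure: the function cids_to_string is hypothetical, and the table entries are assumed for illustration from the example PDF discussed earlier.

```rust
use std::collections::HashMap;

/// Map each CID through a ToUnicode-style table and concatenate the results.
/// CIDs without an entry are replaced by U+FFFD (the replacement character).
/// Hypothetical helper; the real table in pdf_render has a different type.
fn cids_to_string(cids: &[u16], to_unicode: &HashMap<u16, String>) -> String {
    let mut out = String::new();
    for cid in cids {
        match to_unicode.get(cid) {
            Some(s) => out.push_str(s),
            None => out.push('\u{FFFD}'),
        }
    }
    out
}

fn main() {
    // Illustrative entries; real ones come from the font's ToUnicode CMap.
    let mut to_unicode: HashMap<u16, String> = HashMap::new();
    to_unicode.insert(0x0482, "你".to_string());
    to_unicode.insert(0x05F1, "好".to_string());

    println!("{}", cids_to_string(&[0x0482, 0x05F1], &to_unicode));
}
```

Values are Strings rather than chars because a single CID may map to more than one Unicode character, for example a ligature.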

@neko-para

neko-para commented Sep 26, 2022

image
I've followed the trace.rs example, but something seems to be wrong.
First, the example doesn't contain line 13 of my code, where p has type PageRc while render_page wants Page. (Maybe when the two Page types match, a reference to the former is automatically converted into a reference to the latter?)

pdf = "*"
pdf_render = { git = "https://github.com/pdf-rs/pdf_render" }

Second, I use git to get pdf_render, which causes the problem above: the compiler reports that Page in the pdf crate doesn't match the Page used in the pdf_render crate. I'm a Rust beginner and have never met this problem before. Could you share some experience in dealing with this kind of problem? I've tried to use pdf_render::pdf, but of course it doesn't work. Maybe I should use pdf from git instead of from the registry?

@s3bk
Contributor

s3bk commented Sep 26, 2022

Yes, you need pdf and pdf_render from git.

I want to release a new version of the pdf crate, but there is a blocker remaining.

@neko-para

Thanks! I've finally parsed my PDF. The program panicked because the STANDARD_FONTS env variable was missing, but I figured out that it just wants the path to fonts.json, which lists the paths of the available fonts.

@neko-para

I'm sorry, but after debugging, I've noticed that the RenderState::text function just drops text items that lack information needed for rendering (e.g. when it cannot find the corresponding font to calculate the bounding rect). It's hard to meet all font requirements, and there doesn't seem to be a fallback mechanism. So is it possible to grab those text items without patching the crate? I noticed that the renderer won't call the tracer until it believes the item is valid, so extending Tracer doesn't seem to work.

@neko-para

It's weird. It seems that pdf_render loaded the required fonts but cannot get any glyphs from them. The program previously worked for some PDFs. It can load "AVUUDS+BWSongTi" and "KHQECP+BWKaiTi", but cannot load "BWSimKai" and "BWSimSun". However, I don't provide any of them in fonts.json, and I cannot find any fonts with the BW prefix on my machine (under C:\Windows\Fonts). But the toUnicode map could be extracted for all of them.

@s3bk
Contributor

s3bk commented Sep 27, 2022

Ideally the fonts are embedded in the PDF. Those are named prefix+name.
If not you will have to provide it with the fonts via the fonts.json for rendering.

to_unicode is part of the font dictionary in the PDF, so that works even when the font program itself is missing.

@neko-para

Yes, but I cannot get at it without patching the source, as it is in RenderState.text_state, a private struct. I've tried to do the font parsing myself, but eventually realized that the huge function FontEntry::build is necessary, and it is too hard for me to rewrite 🤔

@s3bk
Contributor

s3bk commented Sep 27, 2022

You should not need to rewrite anything.
This example gives you everything:
pdf_render/render/examples

Just match against the text variants

@neko-para

You should not need to rewrite anything. This example gives you everything: pdf_render/render/examples

Just match against the text variants

I've tried this. But as I mentioned before, pdf_render cannot find the required font, so it cannot calculate some fields of the TextSpan and just drops it.

@neko-para

The debugger shows that it doesn't enter the if below:

        inner(&mut self.backend, &mut self.text_state, &mut self.graphics_state, &mut span);

        if let (Some(bbox), Some(e)) = (span.bbox.rect(), self.text_state.font_entry.as_ref()) {
            let transform = self.graphics_state.transform * tm * Transform2F::from_scale(Vector2F::new(1.0, -1.0));
            let p1 = origin;
            let p2 = (tm * Transform2F::from_translation(Vector2F::new(span.width, self.text_state.font_size))).translation();

            debug!("text {}", span.text);
            self.backend.add_text(TextSpan {
                rect: self.graphics_state.transform * RectF::from_points(p1.min(p2), p1.max(p2)),
                width: span.width,
                bbox,
                text: span.text,
                chars: span.chars,
                font: e.clone(),
                font_size: self.text_state.font_size,
                color: self.graphics_state.fill_color,
                alpha: self.graphics_state.fill_color_alpha,
                transform,
            });
        }

@s3bk
Contributor

s3bk commented Sep 27, 2022

Ah, I understand now.
I will add support for extracting text when the font is not embedded.

@neko-para

neko-para commented Sep 27, 2022

Thanks! The core problem is that when fonts are not available, the item's span.bbox is None, which leads to it being filtered out.
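The effect can be reproduced with a simplified model. The Span struct and both functions below are made up for illustration and are not the pdf_render types; the thread does not show how the actual fix was implemented, so the lenient variant is only one possible approach.

```rust
/// Simplified stand-in for a text span; not the pdf_render type.
#[derive(Debug, Clone)]
struct Span {
    text: String,
    /// (x0, y0, x1, y1); `None` when no font was available to measure glyphs.
    bbox: Option<(f32, f32, f32, f32)>,
}

/// Keeps only spans that have a bounding box, mirroring the effect of the
/// `if let (Some(bbox), Some(e))` check quoted above.
fn collect_strict(spans: &[Span]) -> Vec<String> {
    spans
        .iter()
        .filter(|s| s.bbox.is_some())
        .map(|s| s.text.clone())
        .collect()
}

/// Keeps the text of every span, whether or not it could be measured.
fn collect_lenient(spans: &[Span]) -> Vec<String> {
    spans.iter().map(|s| s.text.clone()).collect()
}

fn main() {
    let spans = vec![
        Span { text: "embedded font".into(), bbox: Some((0.0, 0.0, 10.0, 5.0)) },
        Span { text: "missing font".into(), bbox: None },
    ];
    println!("strict:  {:?}", collect_strict(&spans));
    println!("lenient: {:?}", collect_lenient(&spans));
}
```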

@s3bk
Contributor

s3bk commented Sep 27, 2022

Done.
