PdfString.as_str() utf8 errors #117

mike-kfed · 2021-11-24T17:19:53Z

Hi, I've been playing with the master branch of this amazing crate. My goal was to extract text from a PDF that contains German text. It seems the umlauts in there are in Latin-1 / ISO-8859-1 encoding. I don't know whether the PDF file specifies an encoding somewhere that could be picked up for later decoding. Afaik Latin-1 is the default encoding? My quick hack for now is this:

// loop over contents.operations...
if let TextDrawAdjusted::Text(text) = data {
    match &text.as_str() {
        Ok(s) => buf.push_str(s),
        Err(_) => {
            let from_latin1: String =
                text.as_bytes().iter().map(|&c| c as char).collect();
            buf.push_str(&from_latin1);
        }
    }
}
//...

This works because the ISO-8859-1 code points coincide with the first 256 Unicode code points, so the as char trick works. I assume that's not a robust solution for your crate, though? However, the current use of str::from_utf8() in PdfString.as_str() doesn't work for Latin-1 text either. Given some guidance on how you want this fixed, I can provide a PR.
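The fallback described above can be written as a standalone function. This is a minimal sketch using only the standard library; the name decode_utf8_or_latin1 is made up for illustration and is not part of the pdf crate. It also assumes the bytes really are Latin-1 when they are not valid UTF-8, which a PDF does not guarantee (PDFDocEncoding and font-specific encodings differ from Latin-1).

```rust
/// Decode bytes as UTF-8, falling back to Latin-1 (ISO-8859-1) on error.
/// Hypothetical helper for illustration; not an API of the pdf crate.
fn decode_utf8_or_latin1(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.to_owned(),
        // Each Latin-1 byte value equals the Unicode code point with the
        // same number, so `u8 as char` converts it without loss.
        Err(_) => bytes.iter().map(|&b| b as char).collect(),
    }
}

fn main() {
    // "für" in Latin-1: the byte 0xFC is 'ü' and is not valid UTF-8.
    let latin1 = [0x66, 0xFC, 0x72];
    println!("{}", decode_utf8_or_latin1(&latin1));

    // Plain ASCII is valid UTF-8 and passes through unchanged.
    println!("{}", decode_utf8_or_latin1(b"abc"));
}
```

Note that a Latin-1 byte sequence that happens to be valid UTF-8 would be decoded as UTF-8 here, so this is a heuristic rather than a real encoding detection.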

Sadly I cannot easily provide example PDFs as they are from my bank. I'll try to come up with a file that shows the same behaviour, though.


mike-kfed · 2021-11-25T14:28:32Z

I looked more into this; string encoding inside PDF is hilariously complicated. If nobody is working on this, I can give it a go. My idea would be to extend PdfString with an encoding property to allow correct decoding to UTF-8 later.

Otherwise I have identified public PDFs with the same problem that I could share.

s3bk · 2021-11-25T16:06:34Z

I would HIGHLY recommend that you peek at https://github.com/pdf-rs/pdf_render/blob/master/render/src/cache.rs#L188
and the returned TraceResults.

We use it in production, so it is mostly proven (although not in German). It should work in any language. If not it is a bug and needs fixing.

neko-para · 2022-09-26T03:48:47Z

I've checked this file's history, but I cannot find a revision where it had more than 188 lines. Also, I've searched for 'TraceResults' in the whole pdf_render directory, but only found the definition. So how can I get it?

s3bk · 2022-09-26T07:32:40Z

tracer.finish() will give you a Vec<DrawItem>.
TraceResults was unused and I removed it now.

neko-para · 2022-09-26T08:43:38Z


I traced the text rendering into TextState::draw_text. If things work as I expect, the PdfString passed to the 'Tj' operator is encoded in either 'latin-1' or 'utf16-be'? But when I try to parse a PDF with Chinese characters, it just yields unknown codes. I've tested and found that it isn't any common encoding: latin-1, utf16-be, or GBK (which is common in the CN region and uses exactly 2 bytes for each of these characters, similar to utf16).

Here is an example PDF that contains only 你好, which in utf16-be is 0x4F60 0x597D; but when I parse it with the code below I get

C:\Users\liaoh\Documents\Projects\ScanCashRs>cargo run build
   Compiling scan-cash-rs v0.1.0 (C:\Users\liaoh\Documents\Projects\ScanCashRs)
    Finished dev [unoptimized + debuginfo] target(s) in 0.83s
     Running `target\debug\scan-cash-rs.exe build`
Tf : /F4, 13
Tm : 1, 0, 0, -1, 62, 57
Tj : "\4\x82\5\xf1"

and the string 0x0482 0x05F1 just makes no sense.

use pdf::file::File;

fn main() {
    let file = File::open("2.pdf").unwrap();
    let page = file.get_page(0).unwrap();
    if let Some(ref content) = &page.contents {
        for op in &content.operations {
            // Print only the text operators (Tf, Tm, Tj, ...).
            // starts_with avoids a panic on an empty operator name.
            if op.operator.starts_with('T') {
                println!("{}", op);
            }
        }
    }
}

2.pdf

s3bk · 2022-09-26T15:04:55Z

The text is encoded as CIDs, basically a font-specific encoding.
The pdf_render crate deals with the translation into Unicode.
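To make the mismatch concrete, the sketch below splits the Tj bytes printed earlier in the thread into big-endian two-byte codes and compares them with the UTF-16 code units of 你好. It uses only the standard library; two_byte_codes is a made-up helper name, and treating every pair of bytes as one code is an assumption that holds for this sample but not for every font.

```rust
/// Split a byte string into big-endian two-byte codes.
/// A trailing odd byte, if any, is ignored. Hypothetical helper.
fn two_byte_codes(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect()
}

fn main() {
    // The Tj operand from the output above: "\4\x82\5\xf1".
    let tj = [0x04, 0x82, 0x05, 0xF1];
    let codes = two_byte_codes(&tj);
    println!("codes in the PDF: {:04X?}", codes);

    // UTF-16 code units of the text that the page actually shows.
    let expected: Vec<u16> = "你好".encode_utf16().collect();
    println!("UTF-16 of 你好:   {:04X?}", expected);

    // The two sequences differ, so the string is not UTF-16BE text.
    println!("same? {}", codes == expected);
}
```

The codes only become text after they are looked up in a font-specific table, which is what the following comments discuss.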

neko-para · 2022-09-26T15:19:36Z

Got it!

                TextEncoding::CID(Some(ref to_unicode)) => {
                    match to_unicode.get(&cid) {
                        Some(&(gid, ref unicode)) => (cid, gid, Some(unicode.clone())),
                        None => (cid, None, None)
                    }
                },

This block deals with it, doesn't it? So I have an array of CIDs, and I need to translate them into Unicode via to_unicode?

s3bk · 2022-09-26T15:21:05Z

With Chinese text, yes that is the way.
With English there are a few more...

neko-para · 2022-09-26T15:26:17Z

So I could just follow the earlier part of the draw_text function (check the CID flag to decide whether to split into two-byte chunks or just expand each byte), and the resulting string is the concatenation of the third element of each glyph tuple?

s3bk · 2022-09-26T15:27:58Z

Yes.

neko-para · 2022-09-26T15:36:12Z

Then would it be worth adding a method to translate the vec into a String? It is quite hard to get familiar with this logic. I've tried some other PDF-related crates, but they either return the vec or cause the utf8 error.

s3bk · 2022-09-26T15:39:55Z

Have you considered using the tracer and then the TextSpans it produces? They contain the built strings.

The third part of the tuple (from to_unicode) is a String, you can just concatenate them.
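The lookup-and-concatenate step can be sketched as follows. This is not the pdf_render API: the table here is a plain HashMap<u16, String>, whereas the snippet quoted above shows pdf_render storing a (gid, unicode) pair per CID. The function name cids_to_string and the two table entries are assumptions chosen to match the sample codes from this thread.

```rust
use std::collections::HashMap;

/// Look up each CID in a ToUnicode-style table and concatenate the results.
/// CIDs without an entry are skipped. Hypothetical helper for illustration.
fn cids_to_string(cids: &[u16], to_unicode: &HashMap<u16, String>) -> String {
    cids.iter()
        .filter_map(|cid| to_unicode.get(cid))
        .map(String::as_str)
        .collect()
}

fn main() {
    // Assumed table contents; a real table comes from the font in the PDF.
    let mut to_unicode = HashMap::new();
    to_unicode.insert(0x0482u16, "你".to_string());
    to_unicode.insert(0x05F1u16, "好".to_string());

    let text = cids_to_string(&[0x0482, 0x05F1], &to_unicode);
    println!("{}", text);
}
```

The values are strings rather than single chars because one CID can stand for more than one Unicode character, for example a ligature.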

neko-para · 2022-09-26T15:58:28Z

I've followed the trace.rs example, but it seems that something is wrong.
Firstly, the example doesn't compile at line 13, where p has type PageRc while render_page wants Page. (Maybe when the two Page types match, a reference to the former is automatically converted into a reference to the latter?)

pdf = "*"
pdf_render = { git = "https://github.com/pdf-rs/pdf_render" }

Secondly, I use git to get pdf_render, which causes the problem above. The compiler reports that the Page in the pdf crate doesn't match the Page used in the pdf_render crate. I'm a Rust beginner and have never met this problem before. Could you please share some experience in dealing with this kind of problem? I've tried to use pdf_render::pdf but of course it doesn't work. Maybe I should use pdf from git instead of from crates.io?

s3bk · 2022-09-26T17:43:24Z

Yes, you need pdf and pdf_render from git.

I want to release a new version of the pdf crate, but there is a blocker remaining.

neko-para · 2022-09-27T01:59:02Z

Thanks! I've finally parsed my PDF. The program panicked because the STANDARD_FONTS env variable was missing, but I figured out that it just wants the path to fonts.json, which contains the paths of the available fonts.

neko-para · 2022-09-27T03:18:43Z

I'm sorry, but after debugging I've noticed that the RenderState::text function just drops text items that are missing information needed for rendering (e.g. when it cannot find the corresponding font to calculate the size rect). But it's hard to satisfy all font requirements, and there doesn't seem to be a fallback mechanism. So is it possible to grab those text items without patching the crate? I noticed that the renderer won't call the tracer until it believes the item is valid, so extending Tracer doesn't seem to work.

neko-para · 2022-09-27T04:43:03Z

It's weird. It seems that pdf_render loaded the required fonts but cannot get any glyphs from them. The program worked for some PDFs previously. It can load "AVUUDS+BWSongTi" and "KHQECP+BWKaiTi", but cannot load "BWSimKai" or "BWSimSun". However, I don't provide any of them in fonts.json, and I cannot find any fonts with the BW prefix on my machine (under C:\Windows\Fonts). Yet the toUnicode map could be extracted for all of them.

s3bk · 2022-09-27T07:14:07Z

Ideally the fonts are embedded in the PDF. Those are named prefix+name.
If not, you will have to provide the fonts via fonts.json for rendering.

to_unicode is part of the font descriptor in the pdf, so that works even when the font itself is missing.

neko-para · 2022-09-27T09:38:14Z

Yes, but I cannot get at it (without patching the source, as it is in RenderState.text_state, a private struct). I've tried to do the font parsing myself, but eventually realised that the huge function FontEntry::build is necessary, which is too hard for me to rewrite 🤔

s3bk · 2022-09-27T10:00:49Z

You should not need to rewrite anything.
This example gives you everything:
pdf_render/render/examples

Just match against the text variants

neko-para · 2022-09-27T10:03:26Z


I've tried this. But as I mentioned before, pdf_render cannot find the required font, so it cannot calculate some fields of the TextSpan and just drops it.

neko-para · 2022-09-27T10:08:50Z

The debugger shows that it doesn't enter the if below:

        inner(&mut self.backend, &mut self.text_state, &mut self.graphics_state, &mut span);

        if let (Some(bbox), Some(e)) = (span.bbox.rect(), self.text_state.font_entry.as_ref()) {
            let transform = self.graphics_state.transform * tm * Transform2F::from_scale(Vector2F::new(1.0, -1.0));
            let p1 = origin;
            let p2 = (tm * Transform2F::from_translation(Vector2F::new(span.width, self.text_state.font_size))).translation();

            debug!("text {}", span.text);
            self.backend.add_text(TextSpan {
                rect: self.graphics_state.transform * RectF::from_points(p1.min(p2), p1.max(p2)),
                width: span.width,
                bbox,
                text: span.text,
                chars: span.chars,
                font: e.clone(),
                font_size: self.text_state.font_size,
                color: self.graphics_state.fill_color,
                alpha: self.graphics_state.fill_color_alpha,
                transform,
            });
        }

s3bk · 2022-09-27T10:12:20Z

Ah, I understand now.
I will add support for extracting text when the font is not embedded.

neko-para · 2022-09-27T10:14:49Z

Thanks! The core problem is that when fonts are not available, the item's span.bbox is None, which leads to the item being filtered out.
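The effect of that filter can be reproduced with a small model. Everything below is hypothetical: Span, text_strict, text_lenient and sample_spans are stand-ins invented for this sketch, not pdf_render types, and the lenient variant only illustrates the idea of a fallback, not the change that was later made in the crate.

```rust
/// Hypothetical stand-in for a traced text span; not the pdf_render type.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    text: String,
    /// None when no font was available to measure the glyphs.
    bbox: Option<(f32, f32, f32, f32)>,
}

/// Keeps only spans that have a bounding box, like the `if let` above.
fn text_strict(spans: &[Span]) -> Vec<&str> {
    spans
        .iter()
        .filter(|s| s.bbox.is_some())
        .map(|s| s.text.as_str())
        .collect()
}

/// Keeps the text of every span, whether or not it could be measured.
fn text_lenient(spans: &[Span]) -> Vec<&str> {
    spans.iter().map(|s| s.text.as_str()).collect()
}

/// Two sample spans: one measured, one whose font was missing.
fn sample_spans() -> Vec<Span> {
    vec![
        Span { text: "Hello".into(), bbox: Some((0.0, 0.0, 30.0, 12.0)) },
        Span { text: "你好".into(), bbox: None },
    ]
}

fn main() {
    let spans = sample_spans();
    println!("strict:  {:?}", text_strict(&spans));
    println!("lenient: {:?}", text_lenient(&spans));
}
```

In this model the strict variant loses the second span even though its text was decoded, which mirrors the behaviour described in the comment above.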

s3bk · 2022-09-27T13:29:02Z

Done.
