PdfString.as_str() utf8 errors #117
Comments
I looked into this some more. String encoding inside PDF is hilariously complicated. If nobody is working on this, I can give it a go. My idea would be to extend PdfString with an encoding property to allow for correct decoding to UTF-8 later. I have also identified public PDFs with the same problem that I could share. |
I would HIGHLY recommend that you peek at https://github.com/pdf-rs/pdf_render/blob/master/render/src/cache.rs#L188 We use it in production, so it is mostly proven (although not in German). It should work in any language. If not it is a bug and needs fixing. |
|
I trace the text rendering into Here is an example pdf, which only contains
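and string
    use pdf::file::File;

    fn main() {
        let file = File::open("2.pdf").unwrap();
        let page = file.get_page(0).unwrap();
        if let Some(content) = &page.contents {
            for op in &content.operations {
                // Print only the text operators (Tj, TJ, Tf, ...).
                // starts_with avoids the panic that slicing [0..1] would
                // cause on an empty operator name.
                if op.operator.starts_with('T') {
                    println!("{}", op);
                }
            }
        }
    }
|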
The text is as CID, basically some font-specific encoding. |
Got it!
    TextEncoding::CID(Some(ref to_unicode)) => {
        match to_unicode.get(&cid) {
            Some(&(gid, ref unicode)) => (cid, gid, Some(unicode.clone())),
            None => (cid, None, None)
        }
    },
This block deals with it, doesn't it? So I have an array of CIDs, and I need to translate them into Unicode via to_unicode? |
With Chinese text, yes that is the way. |
So I could just follow the same process as the draw_text function (check the CID flag to decide whether to split the bytes into two-byte chunks or just widen each byte), and the resulting string is the concatenation of the third part of |
Yes. |
Then is it necessary to add a method to translate the Vec into a String? It is quite hard to get familiar with this logic. I've tried some other PDF-related crates, but they either return the Vec or fail with a UTF-8 error. |
Have you considered using the tracer and then the produced The third part of the tuple (from to_unicode) is a String, you can just concatenate them. |
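For anyone following along, here is a minimal sketch of the process discussed above: split the raw bytes into CIDs, look each one up, and concatenate the resulting strings. It uses only the standard library. The function names, the two_byte flag, and the plain HashMap standing in for the to_unicode map are all assumptions made for illustration, not the API of pdf or pdf_render.

```rust
use std::collections::HashMap;

/// Split a raw PDF string into character codes (CIDs).
/// `two_byte` is a hypothetical flag: true for fonts whose codes are
/// two bytes wide (big-endian), false for single-byte fonts.
fn split_cids(raw: &[u8], two_byte: bool) -> Vec<u16> {
    if two_byte {
        // chunks_exact ignores a trailing odd byte.
        raw.chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect()
    } else {
        // Single-byte codes are simply widened.
        raw.iter().map(|&b| u16::from(b)).collect()
    }
}

/// Look up every CID in a ToUnicode-style map and concatenate the results.
/// CIDs without an entry become U+FFFD so missing mappings stay visible.
fn cids_to_string(cids: &[u16], to_unicode: &HashMap<u16, String>) -> String {
    let mut out = String::new();
    for cid in cids {
        match to_unicode.get(cid) {
            Some(s) => out.push_str(s),
            None => out.push('\u{FFFD}'),
        }
    }
    out
}
```

The map values are strings rather than single chars because one CID can map to more than one Unicode code point (ligatures, for example).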
    pdf = "*"
    pdf_render = { git = "https://github.com/pdf-rs/pdf_render" }
Secondly, I use git to get pdf_render, which causes the problem above. The compiler reports that the Page type in the pdf crate doesn't match the Page type used in the pdf_render crate. I'm a Rust beginner and have never met this problem before. Could you please share some advice on dealing with this kind of problem? I've tried to |
Yes, you need pdf and pdf_render from git. I want to release a new version of the |
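The mismatch most likely comes from Cargo building two separate copies of pdf: the crates.io release requested by pdf = "*" and the git version that pdf_render itself depends on. Types from the two copies are never interchangeable, even when they share a name. A manifest along these lines pulls both crates from git. The pdf repository URL is an assumption based on the project's GitHub organization, so check it before relying on it.

```toml
[dependencies]
# Assumed repository URL for the pdf crate.
pdf = { git = "https://github.com/pdf-rs/pdf" }
pdf_render = { git = "https://github.com/pdf-rs/pdf_render" }
```

Running cargo tree -d afterwards lists any crate that is still present in more than one version.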
Thanks! I've finally parsed my PDF. The program panicked because the STANDARD_FONTS environment variable was missing, but I've figured out that it just wants the path to fonts.json, which contains the paths of the available fonts. |
I'm sorry, but after debugging, I've noticed that the RenderState::text function just drops text items that are missing information needed for rendering (e.g. when it cannot find the corresponding font to calculate the bounding rect). But it's hard to satisfy all font requirements, and there doesn't seem to be a fallback mechanism. So is it possible to grab those text items without patching the crate? I noticed that Render won't call the tracer until it believes the item is valid, so extending Tracer doesn't seem to work. |
It's weird. It seems that pdf_render loaded the required fonts but cannot get any glyphs from them. The program previously worked for some PDFs. It can load "AVUUDS+BWSongTi" and "KHQECP+BWKaiTi", but cannot load "BWSimKai" or "BWSimSun". However, I don't provide any of them in fonts.json, and I cannot find any fonts with the BW prefix on my machine (under C:\Windows\Fonts). Still, the ToUnicode map could be extracted for all of them. |
Ideally the fonts are embedded in the PDF. Those are named prefix+name.
|
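As a side note on the prefix+name convention: subset fonts in a PDF carry a tag of six uppercase letters followed by a plus sign. The sketch below strips that tag, which can help when comparing font names from the PDF with entries in something like fonts.json. The helper name is hypothetical and not part of pdf or pdf_render.

```rust
/// Strip a subset tag such as "AVUUDS+" from a PDF font name.
/// Returns the name unchanged when no valid tag is present.
fn strip_subset_tag(name: &str) -> &str {
    match name.split_once('+') {
        // A valid tag is exactly six ASCII uppercase letters.
        Some((tag, rest))
            if tag.len() == 6 && tag.bytes().all(|b| b.is_ascii_uppercase()) =>
        {
            rest
        }
        _ => name,
    }
}
```

A name without such a tag, like "BWSimKai" above, suggests the font may not be embedded as a subset, which would fit the loading failures described.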
Yes, but I cannot get it (without patching the source, as it is in RenderState.text_state, a private struct). I've tried to do the font parsing myself, but in the end I realized that the huge function FontEntry::build is necessary, and it is too hard for me to rewrite 🤔 |
You should not need to rewrite anything. Just match against the text variants |
I've tried this. But as I mentioned before, pdf_render cannot find the required font, so it cannot compute some fields of the TextSpan and just drops it. |
The debugger shows that it doesn't enter the if below:
    inner(&mut self.backend, &mut self.text_state, &mut self.graphics_state, &mut span);
    if let (Some(bbox), Some(e)) = (span.bbox.rect(), self.text_state.font_entry.as_ref()) {
        let transform = self.graphics_state.transform * tm * Transform2F::from_scale(Vector2F::new(1.0, -1.0));
        let p1 = origin;
        let p2 = (tm * Transform2F::from_translation(Vector2F::new(span.width, self.text_state.font_size))).translation();
        debug!("text {}", span.text);
        self.backend.add_text(TextSpan {
            rect: self.graphics_state.transform * RectF::from_points(p1.min(p2), p1.max(p2)),
            width: span.width,
            bbox,
            text: span.text,
            chars: span.chars,
            font: e.clone(),
            font_size: self.text_state.font_size,
            color: self.graphics_state.fill_color,
            alpha: self.graphics_state.fill_color_alpha,
            transform,
        });
    }
|
Ah, i understand now. |
Thanks! The core problem is that when fonts are not available, an item's span.bbox is None, which causes the item to be filtered out. |
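To illustrate the filtering described here, the sketch below contrasts a strict collection, which requires both a bounding box and a font in the same way as the if let above, with a lenient one that keeps the text regardless. RawSpan and both functions are hypothetical stand-ins made up for this example. They are not pdf_render types and do not show how the crate was actually changed.

```rust
/// Hypothetical stand-in for a text item; this is not pdf_render's TextSpan.
#[derive(Debug, Clone)]
struct RawSpan {
    text: String,
    /// Bounding box as [x0, y0, x1, y1]; None when no glyph metrics were found.
    bbox: Option<[f32; 4]>,
    /// Name of the resolved font; None when the font could not be loaded.
    font: Option<String>,
}

/// Strict collection: a span is kept only when both box and font are present.
fn collect_strict(spans: &[RawSpan]) -> Vec<String> {
    let mut out = Vec::new();
    for span in spans {
        if let (Some(_bbox), Some(_font)) = (span.bbox, span.font.as_ref()) {
            out.push(span.text.clone());
        }
    }
    out
}

/// Lenient collection: the text is kept even when layout data is missing.
fn collect_lenient(spans: &[RawSpan]) -> Vec<String> {
    spans.iter().map(|span| span.text.clone()).collect()
}
```

For pure text extraction the lenient form is usually what one wants, since the characters are known from the ToUnicode map even when no glyph outlines are available.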
Done. |
Hi, I've been playing with the master branch of this amazing crate. My goal was to extract text from a PDF that has text in German. It seems the umlauts in there are in Latin-1 / ISO-8859-1 encoding. I don't know whether the PDF file specifies an encoding somewhere that could be picked up for later decoding. AFAIK Latin-1 is the default encoding? My quick hack as of now is to do this
This works because the ISO-8859-1 code points coincide with the first 256 Unicode code points, therefore the as char trick works. Which is not a robust solution for your crate, I assume? However, the current str::from_utf8() use in PdfString.as_str() doesn't work for Latin-1 text either. Given some guidance on how you want this to be fixed, I can provide a PR.
Sadly I cannot easily provide example PDFs as they are from my bank. I'll try to come up with a file that shows the same behaviour though.
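A sketch of what such an as char decode might look like, together with a UTF-8-first fallback. Both helper names are hypothetical and not part of the pdf crate. Note also that PDF text strings are specified as PDFDocEncoding or UTF-16BE rather than plain Latin-1. PDFDocEncoding matches Latin-1 for the German umlauts but differs in some other byte ranges, so treat this as an approximation.

```rust
/// Decode ISO-8859-1 (Latin-1) bytes into a Rust String.
/// Each byte value equals the Unicode code point of its character,
/// so a plain `as char` cast is a correct Latin-1 decode.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Try strict UTF-8 first and fall back to Latin-1 on failure.
/// This is a heuristic: Latin-1 input that happens to be valid UTF-8
/// will be taken as UTF-8.
fn decode_with_fallback(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.to_owned(),
        Err(_) => latin1_to_string(bytes),
    }
}
```

A lone 0xFC byte (ü in Latin-1) is not valid UTF-8, which is why str::from_utf8() rejects such strings.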