Content decoding does not handle inline images #78

Open
misos1 opened this issue Sep 9, 2019 · 1 comment

misos1 commented Sep 9, 2019

Example pdf file: bi.pdf

Content stream contains:

100 0 0 100 0 0 cm
BI /W 4 /H 4 /CS /RGB /BPC 8
ID
00000z0z00zzz00z0zzz0zzzEI aazazaazzzaazazzzazzz
EI

Inline images are described in section 4.8.6 of the PDF Reference.

extern crate lopdf;

fn main()
{
	let doc = lopdf::Document::load("bi.pdf").unwrap();
	let cont = doc.get_and_decode_page_content(doc.get_pages()[&1]);
	println!("{:#?}", cont);
}

Output:

Ok(
    Content {
        operations: [
            Operation {
                operator: "cm",
                operands: [
                    100,
                    0,
                    0,
                    100,
                    0,
                    0,
                ],
            },
            Operation {
                operator: "BI",
                operands: [],
            },
            Operation {
                operator: "ID",
                operands: [
                    /W,
                    4,
                    /H,
                    4,
                    /CS,
                    /RGB,
                    /BPC,
                    8,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzz",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzz",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzzEI",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "aazazaazzzaazazzzazzz",
                operands: [],
            },
            Operation {
                operator: "EI",
                operands: [],
            },
        ],
    },
)

To handle this properly, the parser needs to calculate the size of the decoded image data from parameters such as width, height, bits per component and color space, and decode the data using any filters. Note the "EI " byte sequence in the middle of the image data: the data can contain any byte sequence, so searching for "EI" is not enough. Unfortunately, inline images have no required "Length" key that could be used to skip the data, as there is for normal PDF streams.
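
A minimal sketch of that size calculation for the unfiltered case, assuming the number of color components has already been derived from the color space (3 for /RGB). The function names `raw_image_len` and `split_image_data` are hypothetical and not part of lopdf; filters are not handled here.

```rust
/// Number of bytes occupied by an unfiltered inline image.
/// Each row of samples is padded to a whole number of bytes.
/// Returns None if a parameter is zero or the arithmetic overflows.
fn raw_image_len(
    width: usize,
    height: usize,
    components: usize,
    bits_per_component: usize,
) -> Option<usize> {
    if width == 0 || height == 0 || components == 0 || bits_per_component == 0 {
        return None;
    }
    let bits_per_row = width
        .checked_mul(components)?
        .checked_mul(bits_per_component)?;
    let bytes_per_row = bits_per_row.checked_add(7)? / 8;
    bytes_per_row.checked_mul(height)
}

/// Splits the bytes that follow the `ID` operator into (image data, rest).
/// Assumes exactly one whitespace byte separates `ID` from the data.
/// Note: this check uses ASCII whitespace only; PDF also treats NUL as whitespace.
fn split_image_data(after_id: &[u8], len: usize) -> Option<(&[u8], &[u8])> {
    let (first, body) = after_id.split_first()?;
    if !first.is_ascii_whitespace() || body.len() < len {
        return None;
    }
    Some(body.split_at(len))
}
```

For the file above (/W 4 /H 4 /CS /RGB /BPC 8) this gives 4 * 3 * 8 / 8 = 12 bytes per row and 48 bytes in total, which is exactly the length of the data line including the embedded "EI ". Taking 48 bytes after `ID` and its whitespace leaves the real `EI` operator as the next token.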

This also affects other lopdf functionality that depends on content decoding, such as text extraction. For example, a false-positive "Tj" can appear inside image data. In some circumstances lopdf might also return an error, for instance when a byte sequence in the image data is not a valid UTF-8 string.
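
To illustrate both failure modes, here is a small sketch with a made-up content stream (not taken from bi.pdf) whose image bytes happen to contain " Tj " and a 0xFF byte. `count_operator_naive` and `sample_content` are hypothetical helpers, not lopdf functions; the scanner stands in for any tokenizer that does not skip inline image data.

```rust
/// Counts whitespace-delimited tokens equal to `op` in raw content bytes.
/// Deliberately naive: it does not skip inline image data.
fn count_operator_naive(content: &[u8], op: &[u8]) -> usize {
    content
        .split(|b| b.is_ascii_whitespace())
        .filter(|token| *token == op)
        .count()
}

/// Hypothetical content stream: a 3x2 grayscale inline image (6 bytes)
/// whose sample data contains " Tj " followed by 0xFF and 0x00.
fn sample_content() -> Vec<u8> {
    let mut c = Vec::new();
    c.extend_from_slice(b"BI /W 3 /H 2 /CS /G /BPC 8 ID\n");
    c.extend_from_slice(&[b' ', b'T', b'j', b' ', 0xFF, 0x00]);
    c.extend_from_slice(b"\nEI\n");
    c
}
```

The stream shows no text at all, yet the naive scanner reports one `Tj` operator, and `std::str::from_utf8` rejects the stream because of the 0xFF byte.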

@Heinenen
Collaborator

Although #356 handles the given file correctly, I'd like to keep this issue open until the most common filter types in inline images are handled as well.
