Content decoding does not handle inline images #78

misos1 · 2019-09-09T19:36:45Z

Example pdf file: bi.pdf

Content stream contains:

100 0 0 100 0 0 cm
BI /W 4 /H 4 /CS /RGB /BPC 8
ID
00000z0z00zzz00z0zzz0zzzEI aazazaazzzaazazzzazzz
EI

Inline images are described in section 4.8.6 of the PDF Reference.

extern crate lopdf;

fn main()
{
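	// Load the example file attached above.
	let doc = lopdf::Document::load("bi.pdf").unwrap();
	// Page numbers start at 1; get_pages() maps them to object ids.
	let cont = doc.get_and_decode_page_content(doc.get_pages()[&1]);
	// Dump the parsed operations to show how the inline image is misread.
	println!("{:#?}", cont);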
}

Ok(
    Content {
        operations: [
            Operation {
                operator: "cm",
                operands: [
                    100,
                    0,
                    0,
                    100,
                    0,
                    0,
                ],
            },
            Operation {
                operator: "BI",
                operands: [],
            },
            Operation {
                operator: "ID",
                operands: [
                    /W,
                    4,
                    /H,
                    4,
                    /CS,
                    /RGB,
                    /BPC,
                    8,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzz",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "z",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzz",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "zzzEI",
                operands: [
                    0,
                ],
            },
            Operation {
                operator: "aazazaazzzaazazzzazzz",
                operands: [],
            },
            Operation {
                operator: "EI",
                operands: [],
            },
        ],
    },
)

To handle this properly, the parser needs to calculate the size of the decoded image data from parameters such as width, height, bits per component and color space, and then decode it using the declared filters. Note the "EI " byte sequence in the middle of the image data above: image data can contain any byte sequence. Unfortunately, inline images have no required "Length" key that could be used to skip the data, as there is for normal PDF streams.
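As a rough illustration of the size calculation, here is a minimal sketch for unfiltered image data. It is not lopdf code: the names `components` and `raw_image_len` are hypothetical, and only the device color spaces (with their inline abbreviations) are covered. Indexed color spaces, image masks and filtered data would need extra handling.

```rust
/// Number of color components per sample for the device color spaces.
/// Assumption: only these spaces are handled; anything else returns None.
fn components(color_space: &str) -> Option<usize> {
    match color_space {
        "G" | "DeviceGray" => Some(1),
        "RGB" | "DeviceRGB" => Some(3),
        "CMYK" | "DeviceCMYK" => Some(4),
        _ => None,
    }
}

/// Byte length of unfiltered image data.
/// Each row is padded to a whole number of bytes.
fn raw_image_len(width: usize, height: usize, bpc: usize, color_space: &str) -> Option<usize> {
    let bits_per_row = width
        .checked_mul(components(color_space)?)?
        .checked_mul(bpc)?;
    let bytes_per_row = bits_per_row.checked_add(7)? / 8;
    bytes_per_row.checked_mul(height)
}
```

For the example above (/W 4 /H 4 /CS /RGB /BPC 8), `raw_image_len(4, 4, 8, "RGB")` gives 48, which matches the number of bytes between ID and the final EI.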

This also affects other lopdf functionality that depends on content decoding, such as text extraction. For example, a false-positive "Tj" could appear inside image data. In some circumstances lopdf might also return an error, for instance when a byte sequence in the image data is not valid UTF-8.
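To make the false-positive problem concrete, the sketch below compares two ways of finding the end of the image data from the example. Both functions are hypothetical helpers, not part of lopdf. The expected length of 48 bytes is assumed to be known already (4 columns x 3 components x 1 byte x 4 rows), and `is_ascii_whitespace` is used as a simplification of the PDF white-space set.

```rust
/// Naive approach: stop at the first "EI" found in the bytes after ID.
/// Returns the offset where the data is assumed to end.
fn naive_end(data: &[u8]) -> Option<usize> {
    data.windows(2).position(|w| w == b"EI")
}

/// Length-based approach: skip `len` bytes of image data, then require
/// optional white space followed by "EI". Returns `len` on success.
fn length_based_end(data: &[u8], len: usize) -> Option<usize> {
    let rest = data.get(len..)?;
    let skip = rest.iter().take_while(|b| b.is_ascii_whitespace()).count();
    if rest[skip..].starts_with(b"EI") {
        Some(len)
    } else {
        None
    }
}
```

On the bytes following ID in the example, `naive_end` stops at offset 24, in the middle of the image, while `length_based_end(data, 48)` accepts the real terminator. Anything the naive approach leaves behind is then parsed as operators, which is how a stray "Tj" could end up in the extracted text.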


Heinenen · 2024-11-24T22:17:56Z

Although #356 handles the given file correctly, I'd like to keep this issue open until the most common filter types in inline images are handled as well.

Heinenen added bug enhancement labels Aug 11, 2024

This was referenced Nov 18, 2024

it seems not get all decoded elements while reading a pdf generated by ghostscript 9.27 #221

Closed

How to extract images from a PDF？ #278

Open

Inline images #356

Merged
