Text Extraction Yields cid and Fails on Mixed Content Pages in PDF #1036

hrhktkbzyy · 2024-08-22T06:23:06Z

Issue:

When I try to extract text from the attached PDF, several pages return (cid:N) placeholders instead of readable text. In addition, pages containing mixed content (text and images) return no text at all.

Affected PDF:

The Phantom of the Opera.pdf

Code Sample:

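from pathlib import Path

from pdfminer.high_level import extract_text


def get_text_from_pdf_by_pdfminer(file_path: Path):
    """Return (text, page_count) for a PDF, or (None, 0) if extraction fails."""
    try:
        text = extract_text(file_path.absolute())
    except Exception as e:
        print(e)
        return None, 0
    # pdfminer's text output ends each page with a form feed ('\f'),
    # so counting form feeds gives the number of pages.
    number_of_pages = text.count('\f')
    return text, number_of_pages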

Output:

The extracted content includes cid values such as:

(cid:11550)(cid:450)(cid:5509)
(cid:12720)(cid:450)(cid:1275)
(cid:20)(cid:450)(cid:55)(cid:75)(cid:72)(cid:3)(cid:71)(cid:68)(cid:81)(cid:70)(cid:72)(cid:85)(cid:86)
(cid:20)(cid:714)(cid:14414)(cid:17528)(cid:9540)(cid:2696)(cid:1308)
(cid:21)(cid:450)(cid:55)(cid:75)(cid:72)(cid:3)(cid:71)(cid:76)(cid:85)(cid:72)(cid:70)(cid:87)(cid:82)(cid:85)(cid:86)(cid:3)(cid:82)(cid:73)(cid:3)(cid:87)(cid:75)(cid:72)(cid:3)(cid:50)(cid:83)(cid:72)(cid:85)(cid:68)(cid:3)(cid:43)(cid:82)(cid:88)(cid:86)(cid:72)
...
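
To see which pages are affected, one option is to split the extracted text on the form feed characters that the code sample above already counts, then tally the (cid:N) placeholders on each page. This is only a minimal sketch: the function name cid_counts_per_page and the sample string are made up for illustration and are not part of pdfminer.

```python
import re

# Matches the "(cid:123)" placeholders seen in the extracted text.
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def cid_counts_per_page(text):
    """Return one (cid_count, other_char_count) tuple per page.

    Assumes pages are terminated by a form feed, as in the text
    returned by extract_text.
    """
    pages = text.split("\f")
    # The final form feed leaves an empty trailing element; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    report = []
    for page in pages:
        cid_count = len(CID_TOKEN.findall(page))
        remaining = CID_TOKEN.sub("", page).strip()
        report.append((cid_count, len(remaining)))
    return report


# Hypothetical three-page sample: readable page, cid-only page, empty page.
sample = "plain text\f(cid:55)(cid:75)(cid:72)\f\f"
print(cid_counts_per_page(sample))
```

A page reported as (0, 0) produced no text at all, which is how the mixed-content pages described above would show up.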

dhdaines · 2024-09-19T15:59:22Z

This PDF has completely arbitrary and corrupt ToUnicode character mappings, so it's unlikely that pdfminer can do much about it. You can see the problem by trying to copy and paste text out of it from your browser's PDF viewer (in my case Chrome). Even the English text is corrupted. For example, "The dancers" on page 3 comes out as:

7KHGDQFHUV
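
The Latin cid values in the output above line up with this garbled copy-paste: chr(55), chr(75), chr(72) give "7KH", and adding 29 to each value gives "The". The sketch below applies that shift only where the result is printable ASCII. The offset of 29 is an assumption inferred from this one sample, and decode_latin_cids is a hypothetical helper rather than a pdfminer API. It leaves the CJK cids untouched, so it is a workaround for this particular file at best.

```python
import re

# Matches the "(cid:123)" placeholders seen in the extracted text.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")

# Assumption: inferred from the sample in this issue, not a general rule.
ASSUMED_OFFSET = 29


def decode_latin_cids(text, offset=ASSUMED_OFFSET):
    """Replace (cid:N) with chr(N + offset) when that is printable ASCII."""

    def replace(match):
        code_point = int(match.group(1)) + offset
        if 32 <= code_point <= 126:
            return chr(code_point)
        # Leave anything outside printable ASCII as a placeholder.
        return match.group(0)

    return CID_PATTERN.sub(replace, text)


# Third line of the output quoted in the issue.
sample = (
    "(cid:20)(cid:450)(cid:55)(cid:75)(cid:72)(cid:3)"
    "(cid:71)(cid:68)(cid:81)(cid:70)(cid:72)(cid:85)(cid:86)"
)
print(decode_latin_cids(sample))
```

On that line this prints 1(cid:450)The dancers, with the one non-Latin cid left as a placeholder.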

hrhktkbzyy changed the title ~~extract_text got~~ Text Extraction Yields cid and Fails on Mixed Content Pages in PDF Aug 22, 2024

dhdaines mentioned this issue Sep 19, 2024

CID characters when extracting text from Korean pdf #1035

Open
