Text Extraction Yields cid and Fails on Mixed Content Pages in PDF #1036
Comments
hrhktkbzyy changed the title from "extract_text got" to "Text Extraction Yields cid and Fails on Mixed Content Pages in PDF" on Aug 22, 2024
This PDF has completely arbitrary and corrupt ToUnicode character mappings, so it's unlikely that pdfminer can do much about it. You can see the problem by copying and pasting text out of it from your browser's PDF viewer (Chrome, in my case). Even the English text is corrupted. For example, "The dancers" on page 3 comes out as:
Issue:
When attempting to extract text from the attached PDF, several pages return cid values instead of readable text. Additionally, pages containing mixed content (text and images) do not return any text at all.
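To narrow down which pages are affected, a per-page check along these lines may help. This is only a sketch, not the code from the original report: `cid_ratio`, `report_pages`, and the `threshold` value are hypothetical names and settings chosen for illustration. It assumes pdfminer.six is installed and relies on pdfminer writing a `(cid:N)` placeholder whenever it cannot map a character code to Unicode.

```python
import re

# pdfminer writes "(cid:N)" when a glyph has no usable Unicode mapping.
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters taken up by (cid:N) tokens."""
    visible = re.sub(r"\s+", "", text)
    if not visible:
        return 0.0
    cid_chars = sum(len(token) for token in CID_TOKEN.findall(visible))
    return cid_chars / len(visible)


def report_pages(pdf_path: str, threshold: float = 0.2):
    """Yield (page_number, status) for each page of the PDF.

    status is "empty" (no text extracted), "cid" (mostly placeholders),
    or "ok". The threshold is an arbitrary illustrative cut-off.
    """
    # Imported here so cid_ratio stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for number, layout in enumerate(extract_pages(pdf_path), start=1):
        text = "".join(
            element.get_text()
            for element in layout
            if isinstance(element, LTTextContainer)
        )
        if not text.strip():
            status = "empty"
        elif cid_ratio(text) >= threshold:
            status = "cid"
        else:
            status = "ok"
        yield number, status
```

Calling `list(report_pages("The Phantom of the Opera.pdf"))` would then separate the pages that return nothing from those that return placeholders. A page reported as "ok" can still contain wrong characters if the font's ToUnicode map is itself incorrect, so this check only catches the failures that are visible in the output.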
Affected PDF:
The Phantom of the Opera.pdf
Code Sample:
Output:
The extracted content includes cid values such as: