Encoding error with non-UTF-8 PDFs #49

PierreMesure · 2024-12-08T11:37:40Z

Hi,

I'm getting errors with PDFs encoded in Latin-1 (ISO-8859-1). Here's an example.

The problem occurs at this line because the byte string isn't UTF-8, so decoding it as UTF-8 fails. If I replace the call with title.decode('iso-8859-1'), it works flawlessly.

I think one solution would be to extract the encoding of the info dictionary using pdfminer, but I couldn't find out how. Another possibility is to use chardet, or to try several encodings in turn and catch the exceptions.
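
To make the last option concrete, here is a minimal sketch of the "try several encodings" approach using only the standard library. The helper name decode_pdf_string and the default encoding order are my own assumptions, not part of this project or of pdfminer. It checks for a byte-order mark first, because PDF text strings may be stored as UTF-16BE with a BOM, then tries each candidate encoding. Note that ISO-8859-1 accepts every byte value, so it has to come last in the list or it will mask the others.

```python
import codecs


# Hypothetical helper: the name and the default encoding order are assumptions.
def decode_pdf_string(raw: bytes, encodings=("utf-8", "iso-8859-1")) -> str:
    """Decode a PDF metadata byte string, trying several encodings in order."""
    # A byte-order mark is unambiguous, so honour it before guessing.
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be", errors="replace")
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8", errors="replace")

    # Strict decoding raises UnicodeDecodeError on a mismatch,
    # which is the signal to move on to the next candidate.
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue

    # Only reached if no candidate accepts every byte (e.g. a custom list
    # without ISO-8859-1); replace undecodable bytes instead of raising.
    return raw.decode("utf-8", errors="replace")
```

The failing line could then call decode_pdf_string(title) instead of decoding as UTF-8 directly. This is a guess-based fallback rather than real encoding detection: a Latin-1 string that happens to be valid UTF-8 would still be decoded as UTF-8.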
