[wishlist]: annotate PDFs #3243

yarikoptic · 2023-12-08T21:04:42Z

codespell is so great, it would be nice if we could get it to check and annotate text in other media too, in particular in PDFs. Here is some background information from chatting with ChatGPT (references added by me):

Yes, some of these libraries can be used to extract text along with their locations from a PDF, which would allow you to add annotations on mistyped words or similar tasks. The two most suitable libraries for this purpose are:

PDFMiner: PDFMiner (the original is dead; community fork: https://github.com/pdfminer/pdfminer.six) is especially powerful for this task. It allows you to extract not just the text but also detailed information about the text layout, including the position of each character. This can be very useful if you want to find specific words and their locations for annotation purposes.
PyMuPDF (Fitz) (https://github.com/pymupdf/PyMuPDF): This is another library that's highly effective for extracting text with layout information. PyMuPDF provides detailed information about each text block, line, and even individual characters, including their positions on the page.

Both of these libraries will let you extract text in a way that retains information about where each piece of text is located on the page. You can then use this information to target specific words or phrases for annotation.
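A minimal sketch of that extraction step, assuming PyMuPDF is installed and using its page.get_text("words") call, which returns one tuple per word with its bounding box. The helper names (extract_words, find_typos) and the tiny misspelling dictionary in the usage note are hypothetical, made up for illustration, and this is not how codespell itself works:

```python
import string

# Tuple layout returned by PyMuPDF's page.get_text("words"):
# (x0, y0, x1, y1, word, block_no, line_no, word_no)


def extract_words(pdf_path):
    """Yield (page_number, word_tuple) for every word in the PDF."""
    # Imported lazily so find_typos() below works without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for word in page.get_text("words"):
                yield page.number, word


def find_typos(words, misspellings):
    """Return (page_no, bbox, word, suggestion) for each misspelled word.

    `misspellings` is a hypothetical {wrong: right} dict with lowercase keys.
    """
    hits = []
    for page_no, (x0, y0, x1, y1, text, *_rest) in words:
        # Normalize: drop surrounding punctuation and ignore case.
        key = text.strip(string.punctuation).lower()
        if key in misspellings:
            hits.append((page_no, (x0, y0, x1, y1), key, misspellings[key]))
    return hits
```

Something like find_typos(extract_words("paper.pdf"), {"teh": "the"}) would then give the page number and bounding box of every match. Words hyphenated across lines would be missed by this simple per-word lookup.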

After extracting the text and its location, you can use a library like PyPDF2 (since continued as pypdf) or ReportLab to add annotations or highlight the text at the identified locations. This two-step process (text extraction with PDFMiner or PyMuPDF and annotation with PyPDF2 or ReportLab) can be quite effective for tasks like marking typos or adding notes to specific sections of text in a PDF document.
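A rough sketch of the annotation step. Note that it swaps in PyMuPDF for this step too (via page.add_highlight_annot) instead of PyPDF2 or ReportLab, since PyMuPDF can both read word positions and write annotations. The function names and the shape of hits, a list of (page_no, (x0, y0, x1, y1), word, suggestion) tuples, are assumptions for illustration; the note text just mimics codespell's "wrong ==> right" output style:

```python
def format_note(word, suggestion):
    """Build the annotation text, mimicking codespell's 'wrong ==> right' style."""
    return f"{word} ==> {suggestion}"


def annotate_typos(in_path, out_path, hits):
    """Highlight each hit in the PDF and attach the suggestion as a note.

    `hits` is a hypothetical list of
    (page_no, (x0, y0, x1, y1), word, suggestion) tuples.
    Returns the number of annotations added.
    """
    # Imported lazily so format_note() is usable without PyMuPDF installed.
    import fitz  # PyMuPDF

    doc = fitz.open(in_path)
    try:
        for page_no, bbox, word, suggestion in hits:
            page = doc[page_no]
            # Highlight the word's bounding box on the page.
            annot = page.add_highlight_annot(fitz.Rect(*bbox))
            # Store the suggested fix as the annotation's popup content.
            annot.set_info(content=format_note(word, suggestion))
            annot.update()
        doc.save(out_path)
    finally:
        doc.close()
    return len(hits)
```

The output goes to a separate file so the original PDF stays untouched; PDF viewers that support highlight annotations should show the suggestion when the highlight is selected.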

DimitriPapadopoulos added the enhancement label Dec 9, 2023
