[wishlist]: annotate PDFs #3243

Open
yarikoptic opened this issue Dec 8, 2023 · 0 comments

codespell is great; it would be nice if it could also check and annotate text in other media, in particular PDFs. Here is some background information from chatting with ChatGPT (references added by me):

Yes, some of these libraries can be used to extract text along with their locations from a PDF, which would allow you to add annotations on mistyped words or similar tasks. The two most suitable libraries for this purpose are:

  • PDFMiner: PDFMiner (dead, community fork: https://github.com/pdfminer/pdfminer.six) is especially powerful for this task. It allows you to extract not just the text but also detailed information about the text layout, including the position of each character. This can be very useful if you want to find specific words and their locations for annotation purposes.

  • PyMuPDF (https://github.com/pymupdf/PyMuPDF) (Fitz): This is another library that's highly effective for extracting text with layout information. PyMuPDF provides detailed information about each text block, line, and even individual characters, including their positions on the page.

Both of these libraries will let you extract text in a way that retains information about where each piece of text is located on the page. You can then use this information to target specific words or phrases for annotation.

After extracting the text and its location, you can use a library like PyPDF2 or ReportLab to add annotations or highlight the text at the identified locations. This two-step process (text extraction with PDFMiner or PyMuPDF and annotation with PyPDF2 or ReportLab) can be quite effective for tasks like marking typos or adding notes to specific sections of text in a PDF document.
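To make the two-step idea concrete, here is a minimal sketch, not a proposed implementation. It assumes PyMuPDF is installed and, for brevity, also uses PyMuPDF's own highlight annotations for the second step instead of PyPDF2 or ReportLab. The `TYPOS` dictionary is a hypothetical stand-in for codespell's dictionaries, and `find_typos` and `annotate_pdf` are illustrative names, not part of codespell.

```python
import re

# Hypothetical stand-in for codespell's dictionary: typo -> suggested fix.
TYPOS = {"teh": "the", "recieve": "receive"}


def find_typos(words, typos=TYPOS):
    """Return (rect, word, suggestion) for each misspelled word.

    `words` uses the tuple layout returned by PyMuPDF's
    page.get_text("words"):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    hits = []
    for x0, y0, x1, y1, text, *_ in words:
        # Strip surrounding punctuation and lowercase before the lookup.
        key = re.sub(r"^\W+|\W+$", "", text).lower()
        if key in typos:
            hits.append(((x0, y0, x1, y1), text, typos[key]))
    return hits


def annotate_pdf(src, dst, typos=TYPOS):
    """Highlight every typo found in `src` and save a copy to `dst`."""
    import fitz  # PyMuPDF; imported lazily so find_typos works without it

    doc = fitz.open(src)
    for page in doc:
        for rect, word, fix in find_typos(page.get_text("words"), typos):
            annot = page.add_highlight_annot(fitz.Rect(rect))
            # Store the suggestion as the annotation's popup text.
            annot.set_info(content=f"{word} ==> {fix}")
            annot.update()
    doc.save(dst)
    doc.close()
```

Calling `annotate_pdf("paper.pdf", "paper-annotated.pdf")` (file names are placeholders) would write a copy with each hit highlighted. Matching is per extracted word, so words hyphenated across lines or split by the PDF's text layout would be missed by this sketch.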
