improve space detection and remove pdfminer high level code #25

Open
metebalci opened this issue Aug 10, 2021 · 4 comments
@metebalci
Owner

Text in a PDF file might not contain a space character; instead, the space may be indicated by an additional horizontal offset between the glyphs before and after it, that is, between the last character of one word and the first character of the next. pdfminer has high-level code that detects this, i.e. it checks whether the gap between characters is greater than a certain threshold (possibly specified in the font file). It would be better to do this manually, and also to insert a space when the vertical position changes (a title spanning more than one line). When this is done, I think the get_title_from_io method can be simplified by removing the TextConverter and PDFPageInterpreter related parts.
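
A minimal sketch of the manual approach described above, assuming each glyph is available with its bounding box and font size. The `Glyph` record, the `join_glyphs` function, and the `gap_ratio` and `line_ratio` thresholds are all hypothetical names and values chosen for illustration; they are not taken from this project or from pdfminer.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Glyph:
    """Hypothetical glyph record: text plus position in page units."""
    text: str
    x0: float    # left edge of the glyph
    x1: float    # right edge of the glyph
    y0: float    # bottom edge (baseline) of the glyph
    size: float  # font size


def join_glyphs(glyphs: Iterable[Glyph],
                gap_ratio: float = 0.25,
                line_ratio: float = 0.5) -> str:
    """Join glyph texts, inserting a space on a wide gap or a line change.

    gap_ratio and line_ratio are assumed thresholds, expressed as a
    fraction of the previous glyph's font size.
    """
    parts = []
    prev = None
    for g in glyphs:
        if prev is not None:
            # Vertical move larger than the threshold: treat as a new line.
            new_line = abs(g.y0 - prev.y0) > line_ratio * prev.size
            # Horizontal gap larger than the threshold: treat as a word break.
            wide_gap = (g.x0 - prev.x1) > gap_ratio * prev.size
            # Do not double up if the PDF already contains a space here.
            already_spaced = prev.text.endswith(" ") or g.text.startswith(" ")
            if (new_line or wide_gap) and not already_spaced:
                parts.append(" ")
        parts.append(g.text)
        prev = g
    return "".join(parts)


example = [
    Glyph("H", 0, 5, 100, 10),
    Glyph("i", 5, 7, 100, 10),
    Glyph("y", 12, 17, 100, 10),  # 5-unit gap after "i"
    Glyph("o", 17, 22, 100, 10),
    Glyph("A", 0, 5, 88, 10),     # next line
]
title = join_glyphs(example)
```

Scaling the thresholds by font size is only one possible choice; a width taken from the font's own metrics may work better for titles set in unusually wide or narrow typefaces.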

@metebalci metebalci self-assigned this Nov 21, 2021
@mdbraber

Is it currently expected that a headline spanning multiple lines fails to produce the right output (in my case, words on different lines are joined without spaces)?

@mdbraber

Figured out what's happening in the current version. I've got PDF titles split over multiple lines, but the lines themselves contain spaces, so the condition on line 564 (if " " not in title) doesn't evaluate to True. When I force it, it works in my case. Maybe add an argument to force space correction (or just always correct spaces)?
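
To illustrate the case described here, a small sketch of why a guard of the form `" " not in title` is skipped for multi-line titles, and what always correcting spaces could look like. The function `join_title_lines` and the sample lines are hypothetical; this is not the project's actual code.

```python
import re


def join_title_lines(lines):
    """Join per-line title fragments with single spaces (hypothetical helper).

    Whitespace inside each line is collapsed, and empty lines are dropped.
    """
    cleaned = (re.sub(r"\s+", " ", line).strip() for line in lines)
    return " ".join(part for part in cleaned if part)


lines = ["A Study of", "PDF Titles"]

# Joining without a separator glues the last and first words together,
# yet the result still contains spaces, so the guard is never triggered.
naive = "".join(lines)
guard_triggers = " " not in naive

# Joining unconditionally avoids depending on that guard.
fixed = join_title_lines(lines)
```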

@metebalci
Owner Author

metebalci commented Jan 29, 2022 via email

@mdbraber

mdbraber commented Jan 29, 2022

I've sent the PDF files via e-mail for validation. It works now on some, but not yet on all articles when forcing space correction.
