improve space detection and remove pdfminer high level code #25
Comments
Is it currently expected that a headline spanning multiple lines fails to produce the right output (in my case, words on different lines are joined without spaces)?
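Figured out what's happening in the current version. I've got PDF titles split over multiple lines, but the lines themselves contain spaces, so the statement on line 564 (if " " not in title) doesn't return True. When forcing this it works in my case. Maybe add an argument to force space correction (or just always correct spaces)?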
Yes, it makes sense to add an argument. If possible, can you share the PDF so it can be used to validate this improvement?
On Sat, 29 Jan 2022 at 18:35, Maarten den Braber wrote:
Figured out what's happening in the current version. I've got PDF titles
split over multiple lines, but the lines themselves contain spaces, so the
statement on line 564 (if " " not in title) doesn't return True. When
forcing this it works in my case. Maybe add an argument to force
space correction (or just always correct spaces)?
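
The suggested argument could look roughly like the sketch below. This is only an illustration of the idea, not the project's code: the function name maybe_fix_spaces, the force parameter, and the fix_spaces callback are all hypothetical stand-ins for whatever the space-correction step around line 564 actually does.

```python
from typing import Callable


def maybe_fix_spaces(
    title: str,
    fix_spaces: Callable[[str], str],  # hypothetical space-correction routine
    force: bool = False,               # hypothetical new argument
) -> str:
    """Run space correction when the title has no spaces, or when forced."""
    # The existing check only fires for titles without any space at all,
    # so a multi-line title whose lines contain spaces is skipped.
    if force or " " not in title:
        return fix_spaces(title)
    return title
```

With force=False this keeps the current behaviour; with force=True the correction also runs on titles that already contain some spaces.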
I've sent the PDF files via e-mail for validation. Forcing space correction now works on some articles, but not yet on all.
Text in a PDF file might not contain space characters. Instead, a space may be indicated only by an additional horizontal offset between the glyphs before and after it, that is, between the last character of one word and the first character of the next. pdfminer has high-level code that detects this, i.e. it checks whether the gap between characters is greater than a certain threshold (possibly specified in the font file). It would be better to do this manually and also insert a space when the vertical position changes (a title spanning more than one line). Once this is done, I think the get_title_from_io method can be simplified by removing the TextConverter and PDFPageInterpreter related parts.
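
A minimal sketch of the manual detection described above, assuming each glyph is available with its bounding box and font size. The Glyph class, the join_glyphs function, and both threshold ratios are hypothetical choices for illustration; they are not taken from pdfminer or from this project, and real PDFs would likely need tuning.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Glyph:
    """Hypothetical container for one character and its position."""
    text: str
    x0: float    # left edge
    x1: float    # right edge
    y0: float    # bottom edge (baseline proxy)
    size: float  # font size


def join_glyphs(
    glyphs: Iterable[Glyph],
    gap_ratio: float = 0.25,   # assumed: horizontal gap threshold, as a fraction of font size
    line_ratio: float = 0.5,   # assumed: vertical shift threshold, as a fraction of font size
) -> str:
    """Join glyphs in reading order, inserting spaces from position changes."""
    parts: List[str] = []
    prev = None
    for g in glyphs:
        if prev is not None:
            # A vertical jump means the title continues on another line.
            new_line = abs(g.y0 - prev.y0) > line_ratio * prev.size
            # A wide horizontal gap means a space that was never encoded.
            wide_gap = (g.x0 - prev.x1) > gap_ratio * prev.size
            if new_line or wide_gap:
                parts.append(" ")
        parts.append(g.text)
        prev = g
    # Collapse runs of whitespace so explicit and inferred spaces do not double up.
    return " ".join("".join(parts).split())
```

If the glyphs came from pdfminer's layout objects, the fields here would presumably map to a character's bounding box and font size, but that mapping is an assumption and is not shown. The sketch does not handle hyphenated line breaks or right-to-left text.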