Couldn't extract title from a PDF with first page image #26
Comments
You have to use the --page-number argument. pdftitle does not check the whole file; it only checks a single page (the first page by default).
@metebalci Since it can't be known beforehand which PDFs will have the title on the first page, don't you think a better option would be to specify the last page that is checked? By default BTW, I use pdftitle in a script that renames PDFs with their titles: https://github.com/dufferzafar/.scripts/blob/master/pdf-titles
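To make the suggestion concrete, here is a rough sketch of the "check up to a last page" idea done from the outside, in a wrapper script. It is not how pdftitle works internally. The names `first_title` and `cli_extractor` are made up for illustration, and the `-p` flag used to pass the file is an assumption about the command line; only `--page-number` comes from this thread.

```python
import subprocess
from typing import Callable, Optional


def first_title(
    extract: Callable[[int], Optional[str]],
    last_page: int = 1,
) -> Optional[str]:
    """Try pages 1..last_page in order and return the first non-empty title."""
    for page in range(1, last_page + 1):
        title = extract(page)
        if title and title.strip():
            return title.strip()
    return None


def cli_extractor(pdf_path: str) -> Callable[[int], Optional[str]]:
    """Build an extractor that shells out to pdftitle for one page at a time.

    Assumption: the file is passed with -p. Adjust to the installed version.
    """
    def extract(page: int) -> Optional[str]:
        result = subprocess.run(
            ["pdftitle", "-p", pdf_path, "--page-number", str(page)],
            capture_output=True,
            text=True,
        )
        # A non-zero exit code is treated as "no title on this page".
        if result.returncode != 0:
            return None
        return result.stdout
    return extract
```

With this, something like `first_title(cli_extractor("test.pdf"), last_page=3)` would stop at the first page that yields a title, and the default `last_page=1` keeps today's first-page-only behaviour.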
For an ultimate tool that extracts a title from anywhere in a PDF file, this would be correct, but I think it is pretty difficult to do with traditional methods (I mean without using something smarter, from gestalt theory etc.). The main purpose of the tool is to extract titles of (peer-reviewed) articles, which do not have a cover page and usually have a simple layout. On the other hand, I am not 100% sure, but it might not be difficult to implement what you suggest, and it might have some use. So I am reopening the issue and will check this when I next do some implementation work. So the changes can be:
There's also the case where the PDF doesn't have any text (not OCR-ed) and is all images. Of course there isn't much the tool can do about that (other than telling the user to e.g. run pdfsandwich), but you can also just exit gracefully when there isn't any text block (instead of telling the user to report a bug; but then, it looks like you hardly get any bug reports anyway).
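A minimal sketch of what that graceful exit could look like, assuming the extraction step hands back a plain list of text blocks for the page. The function names are hypothetical, and picking the first non-empty block is only a stand-in for whatever selection the tool really does.

```python
import sys
from typing import Sequence, Tuple

NO_TEXT_HINT = (
    "no text found on this page; the PDF may be image-only, "
    "try running OCR first (e.g. pdfsandwich)"
)


def title_or_hint(blocks: Sequence[str]) -> Tuple[int, str]:
    """Return (exit_code, message) instead of raising when there is no text."""
    texts = [b.strip() for b in blocks if b.strip()]
    if not texts:
        # An image-only page is an expected condition, not a bug to report.
        return 1, NO_TEXT_HINT
    # Stand-in selection: take the first non-empty block as the title.
    return 0, texts[0]


def main(blocks: Sequence[str]) -> int:
    """Print the title to stdout, or the hint to stderr, and return the code."""
    code, message = title_or_hint(blocks)
    print(message, file=sys.stderr if code else sys.stdout)
    return code
```

A script that renames files can then check the exit code and skip the file, rather than showing the user a traceback.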
Here is the file: test.pdf