Couldn't extract title from a PDF with first page image #26

Open
dufferzafar opened this issue Nov 18, 2021 · 4 comments
@dufferzafar

❯ pdftitle -p .\Downloads\test.pdf

Traceback (most recent call last):
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 701, in run
    title = get_title_from_file(args.pdf)
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 581, in get_title_from_file
    return get_title_from_io(raw_file)
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 476, in get_title_from_io
    dev.recover_last_paragraph()
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 341, in recover_last_paragraph
    raise Exception("current block is None, this might be a bug. " +
Exception: current block is None, this might be a bug. please report it together with the pdf file

# Using pdfminer's pdf2txt
➜ pdf2txt .\Downloads\test.pdf

C++/CLI in Action

# Using poppler/xpdf's pdftotext
➜ pdftotext .\Downloads\test.pdf -

C++/CLI in Action

Here is the file: test.pdf

@metebalci
Owner

You have to use the --page-number argument. pdftitle does not scan the whole file; it checks only a single page (the first page by default).

$ pdftitle -p test.pdf --page-number 2
C++/CLI in Action

@dufferzafar
Author

@metebalci The problem is that it can't be known beforehand which PDFs have the title on the first page.

Don't you think a better option would be to specify the last page that is checked? By default --last-page-number would be 1, so only the first page would be checked. But I could set --last-page-number to something like 2 or 3, and the title would then be looked for in the first 2 or 3 pages.

BTW, I use pdftitle in a script that renames PDFs with their titles: https://github.com/dufferzafar/.scripts/blob/master/pdf-titles
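
Until such an option exists, a wrapper script could approximate it by calling pdftitle once per page. The following is a minimal sketch that uses only the `-p` and `--page-number` flags shown earlier in this thread. The function name `title_from_first_pages` and its `run` parameter are hypothetical, and the sketch assumes that a failed page produces a non-zero exit status or no output on stdout, which has not been verified against pdftitle itself.

```python
import subprocess


def title_from_first_pages(pdf_path, last_page=3, run=subprocess.run):
    """Try pages 1..last_page and return the first title pdftitle reports.

    `run` defaults to subprocess.run and can be swapped out for testing.
    """
    for page in range(1, last_page + 1):
        result = run(
            ["pdftitle", "-p", pdf_path, "--page-number", str(page)],
            capture_output=True,
            text=True,
        )
        title = result.stdout.strip()
        # Assumption: a page without a title gives a non-zero exit
        # status or prints nothing to stdout.
        if result.returncode == 0 and title:
            return title
    return None  # no title found in the pages that were tried
```

Returning `None` rather than raising leaves it to the calling script to decide what to do with a file that has no detectable title, for example keeping its original name.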

@metebalci metebalci self-assigned this Nov 21, 2021
@metebalci
Owner

For an ultimate tool that extracts a title from anywhere in a PDF file, this would be correct, but I think it is pretty difficult to do with traditional methods (I mean without using something smarter, e.g. ideas from gestalt theory). The main purpose of the tool is to extract the titles of (peer-reviewed) articles, which do not have a cover page and usually have a simple layout.

On the other hand, I am not 100% sure, but what you suggest might not be difficult to implement and it might have some use. So I am reopening the issue and will look at it the next time I work on the implementation. The changes could be:

  • deprecate but do not remove --page-number, defaults to 1
  • introduce --first-page-number, defaults to --page-number
  • introduce --last-page-number (inclusive), defaults to --first-page-number. If --last-page-number is set to a different value and the document has fewer pages than that, I guess it makes sense to terminate the process silently at the end of the document.
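
The defaulting rules in the list above could be resolved along these lines. This is only a sketch of the proposal, not pdftitle's actual code: `build_parser`, `resolve_page_range`, and the `page_count` argument are hypothetical names, and the program name is a placeholder.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="pdftitle-sketch")
    parser.add_argument("-p", dest="pdf", required=True)
    # Deprecated but kept for backward compatibility.
    parser.add_argument("--page-number", type=int, default=1)
    # None means "not given", so the fallbacks below can apply.
    parser.add_argument("--first-page-number", type=int, default=None)
    parser.add_argument("--last-page-number", type=int, default=None)
    return parser


def resolve_page_range(args, page_count):
    """Return the 1-based, inclusive range of pages to check."""
    first = args.first_page_number
    if first is None:
        first = args.page_number  # falls back to the deprecated flag
    last = args.last_page_number
    if last is None:
        last = first  # single page by default
    # Stop silently at the end of the document.
    last = min(last, page_count)
    return range(first, last + 1)
```

With no page flags the range is just page 1, which matches the current behaviour. With `--last-page-number 5` on a three-page file the range is pages 1 to 3.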

@metebalci metebalci reopened this Nov 21, 2021
@user202729

There's also the case where the PDF doesn't have any text at all (it was never OCR-ed) and consists only of images.

Of course there isn't much the tool can do about that (other than telling the user to run e.g. pdfsandwich), but it could at least exit gracefully when there isn't any block, instead of telling the user to report a bug (though it looks like you hardly get any bug reports anyway).
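
One way the graceful exit could look is sketched below. It is a self-contained stand-in, not a patch: `NoTextFoundError` and this `main` are hypothetical, and `recover_last_paragraph` here only mimics the check visible in the traceback above rather than reproducing the real method.

```python
import sys


class NoTextFoundError(Exception):
    """Raised when a page yields no text block (hypothetical name)."""


def recover_last_paragraph(current_block):
    # Stand-in for the real method: treat a missing block as
    # "no text on this page" instead of as a bug.
    if current_block is None:
        raise NoTextFoundError(
            "no text found on this page; the PDF may be image-only"
        )
    return current_block


def main(current_block, err=sys.stderr):
    """Print the title and return 0, or print a short message and return 1."""
    try:
        title = recover_last_paragraph(current_block)
    except NoTextFoundError as exc:
        print(f"pdftitle: {exc}", file=err)
        return 1
    print(title)
    return 0
```

A dedicated exception type keeps this expected condition separate from genuine bugs, so the "please report it" message can be reserved for cases that really are unexpected.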
