Couldn't extract title from a PDF with first page image #26

dufferzafar · 2021-11-18T19:01:16Z

❯ pdftitle -p .\Downloads\test.pdf

Traceback (most recent call last):
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 701, in run
    title = get_title_from_file(args.pdf)
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 581, in get_title_from_file
    return get_title_from_io(raw_file)
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 476, in get_title_from_io
    dev.recover_last_paragraph()
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 341, in recover_last_paragraph
    raise Exception("current block is None, this might be a bug. " +
Exception: current block is None, this might be a bug. please report it together with the pdf file

# Using pdfminer's pdf2txt
➜ pdf2txt .\Downloads\test.pdf

C++/CLI in Action

# Using poppler/xpdf's pdftotext
➜ pdftotext .\Downloads\test.pdf -

C++/CLI in Action

Here is the file: test.pdf


metebalci · 2021-11-21T12:21:55Z

You have to use the --page-number argument. pdftitle does not check the whole file; it only checks a single page (the first page by default).

$ pdftitle -p test.pdf --page-number 2
C++/CLI in Action

dufferzafar · 2021-11-21T13:28:43Z

@metebalci The problem is that it can't be known beforehand which PDFs will have the title on the first page.

Don't you think a better option would be to specify the last page that is checked? By default, --last-page-number would be 1, so only the first page would be checked. But I could set --last-page-number to something like 2 or 3, and the title would then be detected within the first 3 pages.

BTW, I use pdftitle in a script that renames PDFs with their titles: https://github.com/dufferzafar/.scripts/blob/master/pdf-titles

metebalci · 2021-11-21T16:30:20Z

For an ultimate tool that extracts a title from anywhere in a PDF file, this would be correct, but I think that is pretty difficult to do with traditional methods (I mean without something smarter, based on gestalt theory etc.). The main purpose of the tool is to extract the titles of (peer-reviewed) articles, which do not have a cover page and usually have a simple layout. On the other hand, I am not 100% sure, but what you suggest might not be difficult to implement and it might have some use. So I am reopening the issue and will look at it the next time I work on the implementation. The changes could be:

deprecate but do not remove --page-number, defaults to 1
introduce --first-page-number, defaults to --page-number
introduce --last-page-number (inclusive), defaults to --first-page-number. If --last-page-number is different and the actual number of pages is less than this, I guess it makes sense to terminate the process silently at the end of the document.
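The page-range behaviour proposed above could be sketched roughly as follows. This is only an illustration, not pdftitle's actual code: `first_title_in_range` and the `extract_title` callable are hypothetical names, and the per-page extractor is replaced by a plain dictionary lookup so the example runs on its own.

```python
from typing import Callable, Optional


def first_title_in_range(
    extract_title: Callable[[int], Optional[str]],  # hypothetical per-page extractor
    first_page: int,
    last_page: int,
    page_count: int,
) -> Optional[str]:
    """Return the first non-empty title on pages first_page..last_page (1-based, inclusive)."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # If last_page is past the end of the document, stop silently at the last real page.
    stop = min(last_page, page_count)
    for page_number in range(first_page, stop + 1):
        title = extract_title(page_number)
        if title and title.strip():
            return title.strip()
    # No page in the range produced a title.
    return None


# Stand-in for a real extractor: page 1 is an image-only cover with no text.
fake_pages = {1: None, 2: "C++/CLI in Action", 3: "Preface"}
print(first_title_in_range(fake_pages.get, 1, 3, page_count=3))
```

With the range limited to page 1 the function returns None, which matches the default of checking only the first page.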

user202729 · 2024-12-21T16:16:43Z

There's also the case where the PDF doesn't have any text (not OCR-ed) and is all images.

Of course there isn't much the tool can do about that (other than telling the user to e.g. run pdfsandwich), but it could at least exit gracefully when there isn't any block (instead of telling the user to report a bug; but then, it looks like you hardly get any bug reports anyway)
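A graceful exit for the no-text case might look something like the sketch below. Everything here is an assumption made for illustration: `NoTextFoundError`, `pick_title` and `run` are hypothetical names, blocks are modelled as simple `(font_size, text)` tuples, and choosing the largest font is just a stand-in heuristic, not a claim about how pdftitle selects a title.

```python
import sys
from typing import List, Tuple


class NoTextFoundError(Exception):
    """Raised when a page yields no text block at all (hypothetical name)."""


def pick_title(blocks: List[Tuple[float, str]]) -> str:
    """Return the text of the block with the largest font size (stand-in heuristic)."""
    if not blocks:
        # An expected condition for scanned or image-only pages, not a bug.
        raise NoTextFoundError("no text found on the page")
    return max(blocks, key=lambda block: block[0])[1]


def run(blocks: List[Tuple[float, str]]) -> int:
    """Print the title and return 0, or print a short hint to stderr and return 1."""
    try:
        print(pick_title(blocks))
        return 0
    except NoTextFoundError as exc:
        print(f"{exc}; the page may be a scanned image that needs OCR", file=sys.stderr)
        return 1
```

The point of the separate exception type is that an empty page becomes a normal, reportable outcome with a non-zero exit code, rather than a traceback asking for a bug report.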

metebalci closed this as completed Nov 21, 2021

metebalci added the enhancement label Nov 21, 2021

metebalci self-assigned this Nov 21, 2021

metebalci reopened this Nov 21, 2021
