Fails with out-of-memory for a very-very large pdf file #125

Open · LudeeD opened this issue Jul 1, 2019 · 7 comments

LudeeD commented Jul 1, 2019

I have a PDF file that is 1.3 GB in size (it's a master's thesis, which is why I am not attaching it here).
Okular can handle it pretty well, but it crashes Adobe.
pdfsizeopt also crashes on it, with a memory error:

info: This is pdfsizeopt ZIP rUNKNOWN size=69734.
info: prepending to PATH: /home/ludee/Programs/pdfsizeopt/pdfsizeopt_libexec
info: loading PDF from: /home/ludee/Desktop/Dissertação_Ana_Antunes_201405897.pdf
info: loaded PDF of 1322590721 bytes
info: separated to 2269032 objs + xref + trailer
Traceback (most recent call last):
  File "/proc/self/exe/runpy.py", line 162, in _run_module_as_main
  File "/proc/self/exe/runpy.py", line 72, in _run_code
  File "./pdfsizeopt.single/__main__.py", line 1, in <module>
  File "./pdfsizeopt.single/m.py", line 6, in <module>
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 5622, in main
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 2664, in Load
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 689, in __init__
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 942, in Get
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 1217, in ParseDict
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 1148, in ParseSimpleValue
MemoryError

zvezdochiot commented Jul 1, 2019

@LudeeD said: "I have a PDF file that is 1.3 GB in size"

More information, please:

pdfinfo /home/ludee/Desktop/Dissertação_Ana_Antunes_201405897.pdf

And see #119

LudeeD commented Jul 1, 2019

More info:

Title:          
Subject:        
Keywords:       
Author:         
Creator:        LaTeX with hyperref
Producer:       pdfTeX-1.40.19
CreationDate:   Sun Jun 30 21:11:45 2019 WEST
ModDate:        Sun Jun 30 21:11:45 2019 WEST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          308
Encrypted:      no
Page size:      595.276 x 841.89 pts (A4)
Page rot:       0
File size:      1322590721 bytes
Optimized:      no
PDF version:    1.5

Following the instructions in #119, cpdf also failed with an out-of-memory error:

Initial file size is 1322590721 bytes
Beginning squeeze: 2269033 objects
Fatal error: out of memory.

zvezdochiot commented Jul 1, 2019

@LudeeD said: "Pages: 308, File size: 1322590721 bytes"

1322590721 / 308 ≈ 4294126 bytes/page. Hmm, that is big!

You can convert /FlateDecode image streams (roughly PNG-style compression) to /DCTDecode (JPEG) using Ghostscript:

ps2pdf /home/ludee/Desktop/Dissertação_Ana_Antunes_201405897.pdf /home/ludee/Desktop/Dissertação_Ana_Antunes_201405897.gs.pdf
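
If plain ps2pdf does not recompress the images aggressively enough, the conversion can be requested explicitly through pdfwrite's distiller parameters. A sketch (these are standard Ghostscript flags; the 150 dpi target is only an example value):

# force JPEG recompression and downsample color/gray images to 150 dpi
gs -o Dissertação_Ana_Antunes_201405897.gs.pdf -sDEVICE=pdfwrite \
   -dAutoFilterColorImages=false -dAutoFilterGrayImages=false \
   -dColorImageFilter=/DCTEncode -dGrayImageFilter=/DCTEncode \
   -dDownsampleColorImages=true -dColorImageResolution=150 \
   Dissertação_Ana_Antunes_201405897.pdf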

LudeeD commented Jul 1, 2019

After running it for 3 hours I gave up on this.
I rebuilt the PDF with compressed versions of the images, and now it is a more reasonable size.

Feel free to close this issue if handling > 1 GB files is not really a priority.

Thanks for the help

rbrito commented Jul 2, 2019 via email

zvezdochiot commented Jul 2, 2019

@rbrito said: "It sure sounds interesting and I would like to have a look at it."

Use pdftk to process the file in parts.
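
For example, something like this splits the 308-page file into two halves (a sketch; the file names are placeholders):

pdftk big.pdf cat 1-154 output part1.pdf
pdftk big.pdf cat 155-end output part2.pdf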

pts changed the title from "Fails with very very large pdf file" to "Fails with out-of-memory for a very-very large pdf file" on Feb 23, 2023

pts commented Feb 23, 2023

pdfsizeopt indeed uses a lot of memory for large PDF files, because it keeps the parsed version of the entire PDF file in memory. It also keeps multiple versions of compressed image data in memory for the current image being optimized.

Throwing more memory at it should make it work. Unfortunately, there is no easy way to estimate the total memory required for a given input file.

In the meantime, splitting the PDF file on some page boundary (with pdftk or qpdf), running pdfsizeopt on the split PDF files individually, and joining the results may work for some PDFs.
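
For a 308-page file like this one, the round trip could look like the sketch below (the file names and the halfway split point are placeholders):

# split into two halves
qpdf --empty --pages big.pdf 1-154 -- part1.pdf
qpdf --empty --pages big.pdf 155-z -- part2.pdf
# optimize each half separately
pdfsizeopt part1.pdf part1.opt.pdf
pdfsizeopt part2.pdf part2.opt.pdf
# join the optimized halves
qpdf --empty --pages part1.opt.pdf part2.opt.pdf -- big.opt.pdf

Note that resources shared across the parts (fonts, images) get duplicated by the split, so the joined result is not guaranteed to be smaller.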

I'm keeping this issue open as a reminder to add memory optimizations.
