change /Filter [/FlateDecode /DCTDecode] to /Filter /DCTDecode #127

Open
maadjordan opened this issue Jul 20, 2019 · 14 comments

Comments

@maadjordan

I've found this https://www.usmodernist.org/AF/AF-1928-01-1.PDF

It seems that all scanned JFIF images in it are stored as deflated DCT streams. Is it possible to strip the deflate layer safely, or to convert these into DCT-only streams? Uncompressing the file still preserves the deflate filter, and PSO will still run it through deflate optimization.

@zvezdochiot

zvezdochiot commented Jul 21, 2019

$ mutool info AF-1928-01-1.PDF 
AF-1928-01-1.PDF:

PDF-1.4

Pages: 202

Retrieving info from pages 1-202...
Mediaboxes (135):
	1	(97 0 R):	[ 0 0 619.56 878.04 ]
	3	(7 0 R):	[ 0 0 623.16 884.16 ]
...
	199	(1633 0 R):	[ Flate DCT ] 1691x2457 8bpc DevRGB (1637 0 R)
	200	(1639 0 R):	[ Flate DCT ] 1687x2454 8bpc DevRGB (1643 0 R)
	201	(1645 0 R):	[ Flate DCT ] 1681x2448 8bpc DevRGB (1649 0 R)
	202	(1651 0 R):	[ Flate DCT ] 1673x2443 8bpc DevRGB (1655 0 R)
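
The per-image lines of the `mutool info` output above can be tallied with a short script. This is only a minimal sketch: it assumes the output has been saved as text and that each image line carries its filter chain in square brackets followed by the pixel dimensions, as in the excerpt. `count_filter_chains` is a hypothetical helper name, not part of mutool or pdfsizeopt.

```python
import re
from collections import Counter

# Matches image lines such as:
#   199  (1633 0 R):  [ Flate DCT ] 1691x2457 8bpc DevRGB (1637 0 R)
# The bracketed filter chain must be followed by WIDTHxHEIGHT, which
# distinguishes image lines from the numeric Mediabox lines.
IMAGE_LINE = re.compile(r"\[\s*([A-Za-z0-9 ]+?)\s*\]\s+\d+x\d+\s")

def count_filter_chains(info_text):
    """Count images per filter chain in `mutool info` text (hypothetical helper)."""
    counts = Counter()
    for line in info_text.splitlines():
        match = IMAGE_LINE.search(line)
        if match:
            counts[tuple(match.group(1).split())] += 1
    return counts

# Sample lines copied from the output above (fields are tab-separated).
sample = (
    "\t1\t(97 0 R):\t[ 0 0 619.56 878.04 ]\n"
    "\t199\t(1633 0 R):\t[ Flate DCT ] 1691x2457 8bpc DevRGB (1637 0 R)\n"
    "\t200\t(1639 0 R):\t[ Flate DCT ] 1687x2454 8bpc DevRGB (1643 0 R)\n"
)
print(count_filter_chains(sample))
```

A large count under the `("Flate", "DCT")` key would indicate how many images are stored as deflated DCT streams.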

See #95

@zvezdochiot

zvezdochiot commented Jul 21, 2019

@maadjordan say> using PSO will still run it through deflate optimizing

You can:

pdfimages -j AF-1928-01-1.PDF i
mkdir jq25; for tjpg in *.jpg; do jpegquant -q 25 "$tjpg" "jq25/$tjpg"; done
mkdir jr; for tjpg in *.jpg; do jpegrescan "$tjpg" "jr/$tjpg"; done
for tjpg in *.jpg; do img2pdf -d 200 -o "$tjpg.pdf" "$tjpg"; done
pdftk *.pdf cat output book.pdf

PS: OCR layer will be lost.

$ ls -l
-rw-r--r-- 1 user user 94997395 Jul 21 09:23 AF-1928-01-1.PDF
-rw-r--r-- 1 user user 34663639 Jul 21 09:50 book.pdf

@maadjordan
Author

maadjordan commented Jul 21, 2019

Thanks for the prompt reply. I managed to find a Windows build of "pdfimages", but not of img2pdf, jpegquant or jpegrescan.

jpegquant and jpegrescan can be replaced with jpeg-recompress and mozjpeg for lossy or lossless optimization.

Can you provide a link to the latest compiled version of img2pdf?

Also, some images are CCITT, which XnView cannot display. Is there a way to view these? And are they not recognized by PSO, so that they are not passed through to the JBIG2 encoder?

@zvezdochiot

@maadjordan say> Can you provide a link to latest compiled version of img2pdf ?

img2pdf is a Python script that uses the PIL library. I don't know how Python support works on your OS. There is no such problem on Debian.

@maadjordan
Author

It could be packaged like the PSO exe files: Python wrapped into an exe.

@zvezdochiot

@maadjordan say> it could be like pso exe files.

Maybe. Ask the developer: https://gitlab.mister-muffin.de/josch/img2pdf

@maadjordan
Author

I managed to compile img2pdf into a Windows exe file using https://gitlab.mister-muffin.de/josch/img2pdf/issues/8

@zvezdochiot

zvezdochiot commented Jul 22, 2019

@maadjordan say> I managed to compile img2pdf

Instead of jpegrescan use Voralent Antelope.

https://www.google.com/search?q=Voralent+Antelope

@pts
Owner

pts commented Jul 22, 2019

FYI pdfsizeopt doesn't have any features right now to do JPEG (re)compression.

@maadjordan
Author

@maadjordan say> I managed to compile img2pdf

Instead of jpegrescan use Voralent Antelope.

https://www.google.com/search?q=Voralent+Antelope

It's a GUI for jpegtran, pngquant and other tools. Nothing special.

@maadjordan
Author

maadjordan commented Jul 22, 2019

FYI pdfsizeopt doesn't have any features right now to do JPEG (re)compression.

I know and I will be waiting for this feature.

My main question was about simplifying file processing. The JPEG files are wrapped in a deflate stream, which means that a reader needs to inflate the stream and then decode the JPEG, and both steps require RAM. Removing the deflate layer would reduce RAM use considerably, so such a feature would be good to add.

Also, on the same pages I found deflated CCITT streams, and PSO failed to pass those streams to the JBIG2 encoder.

@zvezdochiot

@maadjordan say> I know and I will be waiting for this feature.

See #95

@pts say> It would be possible to add lossy optimizations (which can be enabled with a command-line flag) in general and lossy image optimizations with external tools such as jpeg-recompress in particular, but that would need substantial software development and maintenance work, and that would need either funding or volunteering (i.e. pull requests).

@pts
Owner

pts commented Jul 24, 2019

Also, on the same pages I found deflated CCITT streams, and PSO failed to pass those streams to the JBIG2 encoder.

This shouldn't be happening. maadjordan@, please report this as a separate issue, and attach the offending PDF file to the issue.

pts changed the title from "decoding deflated DCT streams" to "change /Filter [/FlateDecode /DCTDecode] to /Filter /DCTDecode" on Jul 24, 2019
@pts
Owner

pts commented Jul 24, 2019

simplifying file processing. The JPEG files are wrapped in a deflate stream, which means that a reader needs to inflate the stream and then decode the JPEG, and both steps require RAM. Removing the deflate layer would reduce RAM use considerably, so such a feature would be good to add.

OK, if I understand you correctly, you want pdfsizeopt to change /Filter [/FlateDecode /DCTDecode] to /Filter /DCTDecode (and also similarly for /Filter [/FlateDecode /JPXDecode]) after decompressing the flate-compressed stream.
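
The transformation described above could look roughly like the following. This is a minimal sketch, not pdfsizeopt's actual code: `strip_outer_flate` is a hypothetical helper, filters are modeled as a plain list of name strings, and it assumes that no `/DecodeParms` (for example a predictor) is attached to the flate layer.

```python
import zlib

# Image codecs whose data is kept as is once the flate layer is removed.
PASSTHROUGH_FILTERS = ("/DCTDecode", "/JPXDecode")

def strip_outer_flate(filters, data):
    """Return (new_filters, new_data) with a leading /FlateDecode removed.

    Hypothetical helper: only the exact two-element chains
    [/FlateDecode /DCTDecode] and [/FlateDecode /JPXDecode] are rewritten;
    anything else is returned unchanged.
    """
    if (len(filters) == 2 and filters[0] == "/FlateDecode"
            and filters[1] in PASSTHROUGH_FILTERS):
        # Undo only the flate layer; the JPEG or JPEG 2000 bytes are not decoded.
        return [filters[1]], zlib.decompress(data)
    return list(filters), data

# Stand-in bytes beginning with the JPEG SOI marker (not a real image).
fake_jpeg = b"\xff\xd8\xff\xe0" + bytes(range(256)) * 4
wrapped = zlib.compress(fake_jpeg)
new_filters, new_data = strip_outer_flate(["/FlateDecode", "/DCTDecode"], wrapped)
print(new_filters, len(wrapped), len(new_data))
```

In a real PDF the stream dictionary's /Length entry would also have to be updated to the new data length when the stream is written back.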

This is possible to do, but it's unlikely to make the PDF file any smaller, and the overall goal of pdfsizeopt (with its default settings) is to make PDF files smaller.
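
The size argument can be illustrated with a rough experiment. This is a sketch under a loud assumption: random bytes stand in for entropy-coded JPEG data, which is similarly hard to compress; real JPEG files also contain headers and markers that may compress slightly.

```python
import os
import zlib

# Random bytes stand in for entropy-coded JPEG scan data (an assumption,
# not a measurement on the PDF discussed in this issue).
payload = os.urandom(1_000_000)
deflated = zlib.compress(payload, 9)

ratio = len(deflated) / len(payload)
print(f"raw: {len(payload)} bytes, deflated: {len(deflated)} bytes, ratio: {ratio:.4f}")
```

With a ratio this close to 1, adding or removing the flate layer changes the stored size very little in either direction.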

To make this happen, the check

if ('/DCTDecode' in filter_value or '/JPXDecode' in filter_value):

needs to be adjusted to allow /DCTDecode and /JPXDecode, and GetUncompressedStream also needs to be extended so that it won't try to decompress those streams. The code that causes such images to be automatically ignored also needs to be removed.

I'm keeping this issue open in case anyone wants to pick up this work.
