change `/Filter [/FlateDecode /DCTDecode]` to `/Filter /DCTDecode` #127

maadjordan · 2019-07-20T21:39:11Z

I've found this https://www.usmodernist.org/AF/AF-1928-01-1.PDF

It seems that all scanned JFIF images in it are stored as deflated DCT streams. Is it possible to strip the deflate layer safely, i.e. to convert them into DCT-only streams? Uncompressing the file still preserves this layer, and PSO will still run it through deflate optimization.

zvezdochiot · 2019-07-21T06:31:10Z

$ mutool info AF-1928-01-1.PDF 
AF-1928-01-1.PDF:

PDF-1.4

Pages: 202

Retrieving info from pages 1-202...
Mediaboxes (135):
	1	(97 0 R):	[ 0 0 619.56 878.04 ]
	3	(7 0 R):	[ 0 0 623.16 884.16 ]
...
	199	(1633 0 R):	[ Flate DCT ] 1691x2457 8bpc DevRGB (1637 0 R)
	200	(1639 0 R):	[ Flate DCT ] 1687x2454 8bpc DevRGB (1643 0 R)
	201	(1645 0 R):	[ Flate DCT ] 1681x2448 8bpc DevRGB (1649 0 R)
	202	(1651 0 R):	[ Flate DCT ] 1673x2443 8bpc DevRGB (1655 0 R)

See #95

zvezdochiot · 2019-07-21T07:08:59Z

@maadjordan say> using PSO will still run it through deflate optimizing

You can:

use pdfimages (https://github.com/freedesktop/poppler) to extract images:

pdfimages -j AF-1928-01-1.PDF i

use jpegquant (https://github.com/ImageProcessing-ElectronicPublications/jpegquant) to reduce DCT coefficients (lossy):

mkdir jq25; for tjpg in *.jpg; do jpegquant -q 25 "$tjpg" "jq25/$tjpg"; done

use jpegrescan (https://github.com/kud/jpegrescan) to optimize compression of DCT coefficients (lossless):

mkdir jr; for tjpg in *.jpg; do jpegrescan "$tjpg" "jr/$tjpg"; done

use img2pdf (https://github.com/josch/img2pdf) to convert (lossless):

for tjpg in *.jpg; do img2pdf -d 200 -o "$tjpg.pdf" "$tjpg"; done

use pdftk (https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/) to merge pages:

pdftk *.pdf cat output book.pdf

PS: OCR layer will be lost.

$ ls -l
-rw-r--r-- 1 user user 94997395 Jul 21 09:23 AF-1928-01-1.PDF
-rw-r--r-- 1 user user 34663639 Jul 21 09:50 book.pdf
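
The lossless part of the steps above can also be scripted. Below is a minimal sketch in Python that only builds the command lines; `build_commands` and `run_all` are hypothetical helper names, the JPEG file names are assumed to be whatever pdfimages wrote, and the flags are copied from the commands above. Unlike the shell loops, it feeds the jpegrescan output in `jr/` to img2pdf and passes the page PDFs to pdftk in sorted order instead of relying on `*.pdf`.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, jpg_names, out_pdf="book.pdf"):
    """Build argv lists for: extract, optimize losslessly, convert, merge.

    jpg_names is an assumed input: the JPEG file names written by pdfimages.
    """
    commands = [["pdfimages", "-j", pdf_path, "i"]]
    page_pdfs = []
    for name in sorted(jpg_names):
        optimized = "jr/" + name          # jpegrescan output
        page_pdf = optimized + ".pdf"     # one single-page PDF per image
        commands.append(["jpegrescan", name, optimized])
        commands.append(["img2pdf", "-d", "200", "-o", page_pdf, optimized])
        page_pdfs.append(page_pdf)
    # Merge the pages in a fixed, explicit order.
    commands.append(["pdftk", *page_pdfs, "cat", "output", out_pdf])
    return commands


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    Path("jr").mkdir(exist_ok=True)
    for argv in commands:
        subprocess.run(argv, check=True)
```

In practice the first command has to run before the JPEG names are known, so one would run pdfimages, list the directory, and then build the remaining commands.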

maadjordan · 2019-07-21T10:09:16Z

Thanks for the prompt reply. I managed to find a Windows build of pdfimages, but not of img2pdf, jpegquant or jpegrescan.

jpegquant and jpegrescan can be replaced with jpeg-recompress and mozjpeg for lossy or lossless optimization.

Can you provide a link to the latest compiled version of img2pdf?

Also, some images are CCITT, which XnView cannot display. Is there a way to view these? Are these not recognized by PSO and passed on to the JBIG2 encoder?

zvezdochiot · 2019-07-21T10:29:44Z

@maadjordan say> Can you provide a link to latest compiled version of img2pdf ?

img2pdf is a Python script using the PIL library. I don't know how Python support works on your OS. There is no such problem in Debian.

maadjordan · 2019-07-21T10:34:59Z

It could be like the PSO exe files: Python wrapped into an exe.

zvezdochiot · 2019-07-21T10:38:45Z

@maadjordan say> it could be like pso exe files.

Maybe. Ask the developer: https://gitlab.mister-muffin.de/josch/img2pdf

maadjordan · 2019-07-21T23:40:27Z

I managed to compile img2pdf into a Windows exe file using https://gitlab.mister-muffin.de/josch/img2pdf/issues/8

zvezdochiot · 2019-07-22T03:10:36Z

@maadjordan say> I managed to compile img2pdf

Instead of jpegrescan use Voralent Antelope.

https://www.google.com/search?q=Voralent+Antelope

pts · 2019-07-22T09:32:33Z

FYI pdfsizeopt doesn't have any features right now to do JPEG (re)compression.

maadjordan · 2019-07-22T11:31:58Z

> @maadjordan say> I managed to compile img2pdf
>
> Instead of jpegrescan use Voralent Antelope.
>
> https://www.google.com/search?q=Voralent+Antelope

It's a GUI for jpegtran, pngquant and other tools. Nothing special.

maadjordan · 2019-07-22T11:36:21Z

> FYI pdfsizeopt doesn't have any features right now to do JPEG (re)compression.

I know, and I will be waiting for this feature.

My main question was about simplifying the file processing: the JPEG files are wrapped in a deflate stream, so a reader has to inflate the stream and then decode the JPEG, and both steps require RAM. Removing the deflate layer would reduce RAM use considerably, so this would be a good feature to add.
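
To make the two-step decoding concrete, here is a rough sketch in Python. It is only an illustration under stated assumptions: pseudo-random bytes stand in for the entropy-coded body of a JPEG (real JPEG data is not random, but it is similarly hard to deflate), and `flate_roundtrip` is a hypothetical helper name. It compares the size of the flate-wrapped stream with the size of the inflated buffer that a reader has to produce before the JPEG decoder can start.

```python
import random
import zlib


def flate_roundtrip(payload):
    """Return (stored_size, inflated_size) for payload wrapped in a flate layer."""
    stored = zlib.compress(payload, 9)      # what the PDF would store
    inflated = zlib.decompress(stored)      # what the reader must rebuild first
    assert inflated == payload
    return len(stored), len(inflated)


# Assumption: high-entropy bytes as a stand-in for JPEG scan data.
rng = random.Random(0)
fake_jpeg_body = bytes(rng.getrandbits(8) for _ in range(256 * 1024))

stored_size, inflated_size = flate_roundtrip(fake_jpeg_body)
print(stored_size, inflated_size)
```

For data like this the two sizes are nearly equal, so the flate layer saves almost nothing on disk while still costing an extra inflate pass.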

Also, on the same pages I found deflated CCITT streams, and PSO failed to pass them to JBIG2.

zvezdochiot · 2019-07-22T14:53:47Z

@maadjordan say> I know and I will be waiting for this feature.

See #95

@pts say> It would be possible to add lossy optimizations (which can be enabled with a command-line flag) in general and lossy image optimizations with external tools such as jpeg-recompress in particular, but that would need substantial software development and maintenance work, and that would need either funding or volunteering (i.e. pull requests).

pts · 2019-07-24T13:29:05Z

> Also on same pages i found ccitt streams deflated and PSO missed to pass the stream to Jbig2

This shouldn't be happening. maadjordan@, please report this as a separate issue, and attach the offending PDF file to the issue.

pts · 2019-07-24T13:40:15Z

> simplify the file processing as jpg files are backed with deflate stream which means that reader need to inflate then read jpg files and both steps requires ram ! simplifying it would reduce ram considerably .. such feature is good to add.

OK, if I understand you correctly, you want pdfsizeopt to change /Filter [/FlateDecode /DCTDecode] to /Filter /DCTDecode (and also similarly for /Filter [/FlateDecode /JPXDecode]) after decompressing the flate-compressed stream.

This is possible to do, but it's unlikely to make the PDF file any smaller, and the overall goal of pdfsizeopt (with its default settings) is to make PDF files smaller.

To make this happen,

pdfsizeopt/lib/pdfsizeopt/main.py

Line 8143 in 33ec5e5

if ('/DCTDecode' in filter_value or '/JPXDecode' in filter_value):

needs to be adjusted to allow /DCTDecode and /JPXDecode, and GetUncompressedStream also needs to be extended so that it won't try to decompress those streams. Also

pdfsizeopt/lib/pdfsizeopt/main.py

Line 8131 in 33ec5e5

continue

needs to be removed so that images are not automatically ignored.
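
The filter rewrite itself, independent of pdfsizeopt's internals, could look roughly like the sketch below. `strip_outer_flate` is a hypothetical helper, not an existing pdfsizeopt function, and it assumes that the flate stage has no /DecodeParms (e.g. no predictor). The caller would still have to update /Length and drop any /DecodeParms array entry for the removed filter.

```python
import zlib


def strip_outer_flate(filter_names, stream_data):
    """Remove a leading /FlateDecode in front of /DCTDecode or /JPXDecode.

    filter_names: PDF filter names in decoding order,
        e.g. ['/FlateDecode', '/DCTDecode'].
    stream_data: the raw stream bytes as stored in the PDF.
    Returns (new_filter_names, new_stream_data).
    """
    image_filters = ('/DCTDecode', '/JPXDecode')
    if (len(filter_names) == 2 and
            filter_names[0] == '/FlateDecode' and
            filter_names[1] in image_filters):
        # Undo only the outer flate layer; the JPEG or JPEG 2000
        # data inside is passed through without being decoded.
        return [filter_names[1]], zlib.decompress(stream_data)
    # Any other filter chain is left untouched.
    return list(filter_names), stream_data
```

For example, calling it with `['/FlateDecode', '/DCTDecode']` and the stored bytes yields `['/DCTDecode']` and the plain JPEG data, which can then be written back with `/Filter /DCTDecode`.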

I'm keeping this issue open in case anyone wants to pick up this work.

pts changed the title ~~decoding deflated DCT streams~~ change /Filter [/FlateDecode /DCTDecode] to /Filter /DCTDecode Jul 24, 2019

pts added the enhancement label Jul 24, 2019
