Deal with PDFs that embed fonts #26

wanghaisheng · 2017-09-06T11:11:38Z

0708 test: running gs optimization on the originally problematic PDF failed

(py3.5) ➜  pdftabextract git:(master) ✗ pdf2ps 111.pdf 111.ps     
(py3.5) ➜  pdftabextract git:(master) ✗ ps2pdf -dPDFSETTINGS=/ebook 111.ps 111-optimized.pdf

(py3.5) ➜  pdftabextract git:(master) ✗ pdffonts 111-optimized.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
(py3.5) ➜  pdftabextract git:(master) ✗  gs -o 111-optim.pdf -sDEVICE=pdfwrite -dDetectDuplicateImages=true 111.pdf

GPL Ghostscript 9.21 (2017-03-16)
Copyright (C) 2017 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1
(py3.5) ➜  pdftabextract git:(master) ✗ pdffonts 111-optim.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
OHXVUR+SimSun                        TrueType          WinAnsi          yes yes yes     10  0
(py3.5) ➜  pdftabextract git:(master) ✗ pdftocairo 111.pdf 
Error: one of the output format options (-png, -jpeg, -ps, -eps, -pdf, -print, -printdlg, -svg) must be used.
(py3.5) ➜  pdftabextract git:(master) ✗ pdftocairo -pdf 111.pdf
Error: an output filename or '-' must be supplied when the output format is PDF and input PDF file is a local file.
(py3.5) ➜  pdftabextract git:(master) ✗ pdftocairo -o pdf 111.pdf
Error: one of the output format options (-png, -jpeg, -ps, -eps, -pdf, -print, -printdlg, -svg) must be used.
(py3.5) ➜  pdftabextract git:(master) ✗ pdftocairo -pdf 111.pdf 111-pdftocario.pdf
(py3.5) ➜  pdftabextract git:(master) ✗ pdffonts 111-pdftocario.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
JKQTOQ+SimSun                        CID TrueType      Identity-H       yes yes yes      9  0
OKUCXI+SimSun                        TrueType          WinAnsi          yes yes yes     10  0
(py3.5) ➜  pdftabextract git:(master) ✗ pdftotext 111-pdftocario.pdf 
(py3.5) ➜  pdftabextract git:(master) ✗ $ gs -sDEVICE=pdfwrite -o 111.pdf -dBATCH -f mypg3out.pdf Adobe-GB1-UCS2
zsh: command not found: $
(py3.5) ➜  pdftabextract git:(master) ✗ gs -sDEVICE=pdfwrite -o 111.pdf -dBATCH -f mypg3out.pdf Adobe-GB1-UCS2 
GPL Ghostscript 9.21 (2017-03-16)
Copyright (C) 2017 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Error: /undefinedfilename in (mypg3out.pdf)
Operand stack:

Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push
Dictionary stack:
   --dict:1204/1684(ro)(G)--   --dict:0/20(G)--   --dict:78/200(L)--
Current allocation mode is local
Last OS error: No such file or directory
GPL Ghostscript 9.21: Unrecoverable error, exit code 1
(py3.5) ➜  pdftabextract git:(master) ✗ gs -sDEVICE=pdfwrite -o 111-out.pdf -dBATCH -f 111.pdf Adobe-GB1-UCS2
GPL Ghostscript 9.21 (2017-03-16)
Copyright (C) 2017 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1
Error: /rangecheck in defineresource
Operand stack:
   Adobe-GB1-UCS2   --dict:10/12(L)--   CMap   Adobe-GB1-UCS2   --dict:10/12(L)--
Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1983   1   3   %oparray_pop   1982   1   3   %oparray_pop   1966   1   3   %oparray_pop   1852   1   3   %oparray_pop   --nostringval--   %errorexec_pop   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   1929   3   5   %oparray_pop   defineresource   %errorexec_pop   --nostringval--   --nostringval--   --nostringval--
Dictionary stack:
   --dict:1204/1684(ro)(G)--   --dict:1/20(G)--   --dict:78/200(L)--   --dict:38/38(ro)(G)--   --dict:10/12(L)--   --dict:16/25(ro)(G)--
Current allocation mode is local
Current file position is 231880
GPL Ghostscript 9.21: Unrecoverable error, exit code 1
(py3.5) ➜  pdftabextract git:(master) ✗ gs -sDEVICE=pdfwrite -o 111-out.pdf -dBATCH -f 111.pdf Adobe-CNS1-UCS2 
GPL Ghostscript 9.21 (2017-03-16)
Copyright (C) 2017 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1
Error: /rangecheck in defineresource
Operand stack:
   Adobe-CNS1-UCS2   --dict:10/12(L)--   CMap   Adobe-CNS1-UCS2   --dict:10/12(L)--
Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1983   1   3   %oparray_pop   1982   1   3   %oparray_pop   1966   1   3   %oparray_pop   1852   1   3   %oparray_pop   --nostringval--   %errorexec_pop   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   1929   3   5   %oparray_pop   defineresource   %errorexec_pop   --nostringval--   --nostringval--   --nostringval--
Dictionary stack:
   --dict:1204/1684(ro)(G)--   --dict:1/20(G)--   --dict:78/200(L)--   --dict:38/38(ro)(G)--   --dict:10/12(L)--   --dict:16/25(ro)(G)--
Current allocation mode is local
Current file position is 265113
GPL Ghostscript 9.21: Unrecoverable error, exit code 1
(py3.5) ➜  pdftabextract git:(master) ✗ pdftotext 111-out.pdf 
(py3.5) ➜  pdftabextract git:(master) ✗ pdffonts 111-out.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
OHXVUR+SimSun                        TrueType          WinAnsi          yes yes yes     10  0
(py3.5) ➜  pdftabextract git:(master) ✗ gs -sDEVICE=pdfwrite -o mypg3o2-111.pdf -dBATCH \                 
-c '/CIDSystemInfo << /Registry (Adobe) /Ordering (Unicode) /Supplement 1 >>' \
-f 111.pdf
GPL Ghostscript 9.21 (2017-03-16)
Copyright (C) 2017 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1
(py3.5) ➜  pdftabextract git:(master) ✗ pdftotext mypg3o2-111.pdf 
(py3.5) ➜  pdftabextract git:(master) ✗ pdffonts mypg3o2-111.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
OHXVUR+SimSun                        TrueType          WinAnsi          yes yes yes     10  0
(py3.5) ➜  pdftabextract git:(master) ✗ pdffonts 11-reprint-osx.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
SRPUEP+SimSun                        TrueType          WinAnsi          yes yes yes     13  0
(py3.5) ➜  pdftabextract git:(master) ✗ pdftotext 11-reprint-osx.pdf 
(py3.5) ➜  pdftabextract git:(master) ✗ pdffonts 11-reprint-osx.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ETOBOE+SimSun                        TrueType          WinAnsi          yes yes yes     13  0
(py3.5) ➜  pdftabextract git:(master) ✗ pdftotext 11-reprint-osx.pdf
(py3.5) ➜  pdftabextract git:(master) ✗ pstopdf 11-reprint-osx.ps 11-reprint-osx-2.pdf
/AppleBraille-Outline6Dot
/AppleBraille-Outline8Dot
/AppleBraille-Pinpoint6Dot
/AppleBraille-Pinpoint8Dot
/AppleBraille
/AppleColorEmoji
/.AppleColorEmojiUI
/AppleSymbols
/AquaKana
/AquaKana-Bold
/ArialHebrew
/ArialHebrew-Bold
/ArialHebrew-Light
/.ArialHebrewDeskInterface
/.ArialHebrewDeskInterface-Bold
/.ArialHebrewDeskInterface-Light
/ArialHebrewScholar
/ArialHebrewScholar-Bold
/ArialHebrewScholar-Light
/AvenirNextCondensed-Bold
/AvenirNextCondensed-BoldItalic
/AvenirNextCondensed-DemiBold
/AvenirNextCondensed-DemiBoldItalic
/AvenirNextCondensed-Italic
/AvenirNextCondensed-Medium
/AvenirNextCondensed-MediumItalic
/AvenirNextCondensed-Regular
/AvenirNextCondensed-Heavy
/AvenirNextCondensed-HeavyItalic
/AvenirNextCondensed-UltraLight
/AvenirNextCondensed-UltraLightItalic
/AvenirNext-Bold
/AvenirNext-BoldItalic
/AvenirNext-DemiBold
/AvenirNext-DemiBoldItalic
/AvenirNext-Italic
/AvenirNext-Medium
/AvenirNext-MediumItalic
/AvenirNext-Regular
/AvenirNext-Heavy
/AvenirNext-HeavyItalic
/AvenirNext-UltraLight
/AvenirNext-UltraLightItalic
/Avenir-Book
/Avenir-BookOblique
/Avenir-Black
/Avenir-BlackOblique
/Avenir-Heavy
/Avenir-HeavyOblique
/Avenir-Light
/Avenir-LightOblique
/Avenir-Medium
/Avenir-MediumOblique
/Avenir-Oblique
/Avenir-Roman
/Courier
/Courier-Bold
/Courier-Oblique
/Courier-BoldOblique
/GeezaPro
/GeezaPro-Bold
/.GeezaProInterface
/.GeezaProInterface-Bold
/.GeezaProInterface-Light
/.GeezaProPUA-Regular
/.GeezaProPUA-Bold
/Geneva
/Helvetica
/Helvetica-Bold
/Helvetica-Oblique
/Helvetica-BoldOblique
/Helvetica-Light
/Helvetica-LightOblique
/HelveticaNeue-Bold
/HelveticaNeue
/HelveticaNeue-UltraLight
/HelveticaNeue-Italic
/HelveticaNeue-Light
/HelveticaNeue-UltraLightItalic
/HelveticaNeue-CondensedBlack
/HelveticaNeue-CondensedBold
/HelveticaNeue-BoldItalic
/HelveticaNeue-LightItalic
/HelveticaNeue-Medium
/HelveticaNeue-Thin
/HelveticaNeue-ThinItalic
/HelveticaNeue-MediumItalic
/.HelveticaNeueDeskInterface-Regular
/.HelveticaNeueDeskInterface-Bold
/.HelveticaNeueDeskInterface-Italic
/.HelveticaNeueDeskInterface-BoldItalic
/.HelveticaNeueDeskInterface-MediumP4
/.HelveticaNeueDeskInterface-MediumItalicP4
/.HelveticaNeueDeskInterface-Light
/.HelveticaNeueDeskInterface-Thin
/.HelveticaNeueDeskInterface-UltraLightP2
/.HelveticaNeueDeskInterface-Heavy
/.Keyboard
/LastResort
/LucidaGrande
/LucidaGrande-Bold
/.LucidaGrandeUI
/.LucidaGrandeUI-Bold
/MarkerFelt-Thin
/MarkerFelt-Wide
/Menlo-Regular
/Menlo-Bold
/Menlo-Italic
/Menlo-BoldItalic
/Monaco
/Noteworthy-Light
/Noteworthy-Bold
/Optima-Regular
/Optima-Bold
/Optima-Italic
/Optima-BoldItalic
/Optima-ExtraBlack
/Palatino-Roman
/Palatino-Italic
/Palatino-Bold
/Palatino-BoldItalic
/.SFCompactDisplay-Black
/.SFCompactDisplay-Bold
/.SFCompactDisplay-Heavy
/.SFCompactDisplay-Light
/.SFCompactDisplay-Medium
/.SFCompactDisplay-Regular
/.SFCompactDisplay-Semibold
/.SFCompactDisplay-Thin
/.SFCompactDisplay-Ultralight
/.SFCompactRounded-Black
/.SFCompactRounded-Bold
/.SFCompactRounded-Heavy
/.SFCompactRounded-Light
/.SFCompactRounded-Medium
/.SFCompactRounded-Regular
/.SFCompactRounded-Semibold
/.SFCompactRounded-Thin
/.SFCompactRounded-Ultralight
/.SFCompactText-Bold
/.SFCompactText-BoldItalic
/.SFCompactText-Heavy
/.SFCompactText-HeavyItalic
/.SFCompactText-Light
/.SFCompactText-LightItalic
/.SFCompactText-Medium
/.SFCompactText-MediumItalic
/.SFCompactText-Regular
/.SFCompactText-Italic
/.SFCompactText-Semibold
/.SFCompactText-SemiboldItalic
/.SFNSDisplay
/.SFNSDisplayCondensed-Black
/.SFNSDisplayCondensed-Bold
/.SFNSDisplayCondensed-Heavy
/.SFNSDisplayCondensed-Light
/.SFNSDisplayCondensed-Medium
/.SFNSDisplayCondensed-Regular
/.SFNSDisplayCondensed-Semibold
/.SFNSDisplayCondensed-Thin
/.SFNSDisplayCondensed-Ultralight
/.SFNSText
/.SFNSTextCondensed-Bold
/.SFNSTextCondensed-Heavy
/.SFNSTextCondensed-Light
/.SFNSTextCondensed-Medium
/.SFNSTextCondensed-Regular
/.SFNSTextCondensed-Semibold
/.SFNSText-Italic
/STHeitiTC-Light
/STHeitiSC-Light
/STHeitiTC-Medium
/STHeitiSC-Medium
/Symbol
/Thonburi
/Thonburi-Bold
/Thonburi-Light
/Times-Roman
/Times-Bold
/Times-Italic
/Times-BoldItalic
/ZapfDingbatsITC
/ZapfDingbats
/ACaslonPro-Bold
/ACaslonPro-BoldItalic
/ACaslonPro-Italic
/ACaslonPro-Regular
/ACaslonPro-Semibold
/ACaslonPro-SemiboldItalic
/AdobeArabic-Bold
/AdobeArabic-BoldItalic
/AdobeArabic-Italic
/AdobeArabic-Regular
/AdobeDevanagari-Bold
/AdobeDevanagari-BoldItalic
/AdobeDevanagari-Italic
/AdobeDevanagari-Regular
/AdobeFangsongStd-Regular
/AdobeFanHeitiStd-Bold
/AdobeGothicStd-Bold
/AdobeHebrew-Bold
/AdobeHebrew-BoldItalic
/AdobeHebrew-Italic
/AdobeHebrew-Regular
/AdobeHeitiStd-Regular
/AdobeKaitiStd-Regular
/AdobeMingStd-Light
/AdobeMyungjoStd-Medium
/AdobeNaskh-Medium
/AdobeSongStd-Light
/AGaramondPro-Bold
/AGaramondPro-BoldItalic
/AGaramondPro-Italic
/AGaramondPro-Regular
/AlNile
/AlNile-Bold
/.AlNilePUA
/.AlNilePUA-Bold
/AlTarikh
/.AlTarikhPUA
/AlBayan
/.AlBayanPUA
/AlBayan-Bold
/.AlBayanPUA-Bold
/AmericanTypewriter
/AmericanTypewriter-Light
/AmericanTypewriter-Bold
/AmericanTypewriter-Semibold
/AmericanTypewriter-Condensed
/AmericanTypewriter-CondensedBold
/AmericanTypewriter-CondensedLight
/AndaleMono
/Apple-Chancery
/AppleGothic
/AppleMyungjo
/Arial-Black
/Arial-BoldItalicMT
/Arial-BoldMT
/Arial-ItalicMT
/ArialNarrow-BoldItalic
/ArialNarrow-Bold
/ArialNarrow-Italic
/ArialNarrow
/ArialRoundedMTBold
/ArialUnicodeMS
/ArialMT
/Athelas-Regular
/Athelas-Italic
/Athelas-BoldItalic
/Athelas-Bold
/Ayuthaya
/Baghdad
/.BaghdadPUA
/BanglaMN
/BanglaMN-Bold
/BanglaSangamMN
/BanglaSangamMN-Bold
/Baskerville
/Baskerville-Bold
/Baskerville-Italic
/Baskerville-BoldItalic
/Baskerville-SemiBold
/Baskerville-SemiBoldItalic
/Beirut
/.BeirutPUA
/BigCaslon-Medium
/BirchStd
/BlackoakStd
/BodoniSvtyTwoOSITCTT-Book
/BodoniSvtyTwoOSITCTT-BookIt
/BodoniSvtyTwoOSITCTT-Bold
/BodoniSvtyTwoSCITCTT-Book
/BodoniSvtyTwoITCTT-Book
/BodoniSvtyTwoITCTT-BookIta
/BodoniSvtyTwoITCTT-Bold
/BodoniOrnamentsITCTT
/BradleyHandITCTT-Bold
/BrushScriptMT
/BrushScriptStd
/Chalkboard
/Chalkboard-Bold
/ChalkboardSE-Light
/ChalkboardSE-Regular
/ChalkboardSE-Bold
/Chalkduster
/ChaparralPro-Bold
/ChaparralPro-BoldIt
/ChaparralPro-Italic
/ChaparralPro-LightIt
/ChaparralPro-Regular
/CharlemagneStd-Bold
/Charter-Roman
/Charter-Italic
/Charter-BoldItalic
/Charter-Bold
/Charter-BlackItalic
/Charter-Black
/Cochin
/Cochin-Bold
/Cochin-Italic
/Cochin-BoldItalic
/ComicSansMS-Bold
/ComicSansMS
/CooperBlackStd-Italic
/CooperBlackStd
/Copperplate
/Copperplate-Light
/Copperplate-Bold
/CorsivaHebrew
/CorsivaHebrew-Bold
/CourierNewPS-BoldItalicMT
/CourierNewPS-BoldMT
/CourierNewPS-ItalicMT
/CourierNewPSMT
/Damascus
/.DamascusPUA
/DamascusLight
/.DamascusPUALight
/DamascusMedium
/.DamascusPUAMedium
/DamascusBold
/.DamascusPUABold
/DamascusSemiBold
/.DamascusPUASemiBold
/DecoTypeNaskh
/.DecoTypeNaskhPUA
/DevanagariSangamMN
/DevanagariSangamMN-Bold
/DevanagariMT
/DevanagariMT-Bold
/Didot
/Didot-Italic
/Didot-Bold
/DINAlternate-Bold
/DINCondensed-Bold
/DiwanKufi
/.DiwanKufiPUA
/DiwanThuluth
/EuphemiaUCAS
/EuphemiaUCAS-Bold
/EuphemiaUCAS-Italic
/Farah
/.FarahPUA
/Farisi
/Futura-Medium
/Futura-MediumItalic
/Futura-Bold
/Futura-CondensedMedium
/Futura-CondensedExtraBold
/Georgia-BoldItalic
/Georgia-Bold
/Georgia-Italic
/Georgia
/GiddyupStd
/GillSans
/GillSans-Bold
/GillSans-Italic
/GillSans-BoldItalic
/GillSans-SemiBold
/GillSans-SemiBoldItalic
/GillSans-UltraBold
/GillSans-Light
/GillSans-LightItalic
/GujaratiSangamMN
/GujaratiSangamMN-Bold
/GujaratiMT
/GujaratiMT-Bold
/GurmukhiMN
/GurmukhiMN-Bold
/GurmukhiSangamMN
/GurmukhiSangamMN-Bold
/MonotypeGurmukhi
/Herculanum
/HoboStd
/HoeflerText-Ornaments
/HoeflerText-Regular
/HoeflerText-Black
/HoeflerText-Italic
/HoeflerText-BlackItalic
/Impact
/InaiMathi
/IowanOldStyle-Roman
/IowanOldStyle-Bold
/IowanOldStyle-Italic
/IowanOldStyle-BoldItalic
/IowanOldStyle-Black
/IowanOldStyle-BlackItalic
/IowanOldStyle-Titling
/Kailasa
/Kailasa-Bold
/KannadaMN
/KannadaMN-Bold
/KannadaSangamMN
/KannadaSangamMN-Bold
/Kefa-Regular
/Kefa-Bold
/KhmerMN
/KhmerMN-Bold
/KhmerSangamMN
/Kokonor
/KozGoPr6N-Bold
/KozGoPr6N-ExtraLight
/KozGoPr6N-Heavy
/KozGoPr6N-Light
/KozGoPr6N-Medium
/KozGoPr6N-Regular
/KozGoPro-Bold
/KozGoPro-ExtraLight
/KozGoPro-Heavy
/KozGoPro-Light
/KozGoPro-Medium
/KozGoPro-Regular
/KozMinPr6N-Bold
/KozMinPr6N-ExtraLight
/KozMinPr6N-Heavy
/KozMinPr6N-Light
/KozMinPr6N-Medium
/KozMinPr6N-Regular
/KozMinPro-Bold
/KozMinPro-ExtraLight
/KozMinPro-Heavy
/KozMinPro-Light
/KozMinPro-Medium
/KozMinPro-Regular
/Krungthep
/KufiStandardGK
/.KufiStandardGKPUA
/LaoMN
/LaoMN-Bold
/LaoSangamMN
/LetterGothicStd-Bold
/LetterGothicStd-BoldSlanted
/LetterGothicStd-Slanted
/LetterGothicStd
/LithosPro-Black
/LithosPro-Regular
/Luminari-Regular
/MalayalamMN
/MalayalamMN-Bold
/MalayalamSangamMN
/MalayalamSangamMN-Bold
/Marion-Regular
/Marion-Italic
/Marion-Bold
/MesquiteStd
/MicrosoftSansSerif
/MinionPro-Bold
/MinionPro-BoldCn
/MinionPro-BoldCnIt
/MinionPro-BoldIt
/MinionPro-It
/MinionPro-Medium
/MinionPro-MediumIt
/MinionPro-Regular
/MinionPro-Semibold
/MinionPro-SemiboldIt
/DiwanMishafiGold
/DiwanMishafi
/Mshtakan
/MshtakanOblique
/MshtakanBold
/MshtakanBoldOblique
/Muna
/.MunaPUA
/MunaBold
/.MunaPUABold
/MunaBlack
/.MunaPUABlack
/MyanmarMN
/MyanmarMN-Bold
/MyanmarSangamMN
/MyanmarSangamMN-Bold
/MyriadArabic-Bold
/MyriadArabic-BoldIt
/MyriadArabic-It
/MyriadArabic-Regular
/MyriadHebrew-Bold
/MyriadHebrew-BoldIt
/MyriadHebrew-It
/MyriadHebrew-Regular
/MyriadPro-Bold
/MyriadPro-BoldCond
/MyriadPro-BoldCondIt
/MyriadPro-BoldIt
/MyriadPro-Cond
/MyriadPro-CondIt
/MyriadPro-It
/MyriadPro-Regular
/MyriadPro-Semibold
/MyriadPro-SemiboldIt
/Nadeem
/.NadeemPUA
/NewPeninimMT
/NewPeninimMT-Inclined
/NewPeninimMT-BoldInclined
/NewPeninimMT-Bold
/NuevaStd-Bold
/NuevaStd-BoldCond
/NuevaStd-BoldCondItalic
/NuevaStd-Cond
/NuevaStd-CondItalic
/NuevaStd-Italic
/OCRAStd
/OratorStd-Slanted
/OratorStd
/OriyaMN
/OriyaMN-Bold
/OriyaSangamMN
/OriyaSangamMN-Bold
/Papyrus-Condensed
/Papyrus
/Phosphate-Inline
/Phosphate-Solid
/PlantagenetCherokee
/PoplarStd
/PrestigeEliteStd-Bd
/PTMono-Bold
/PTMono-Regular
/PTSans-Regular
/PTSans-Italic
/PTSans-NarrowBold
/PTSans-Narrow
/PTSans-CaptionBold
/PTSans-Caption
/PTSans-BoldItalic
/PTSans-Bold
/PTSerif-Regular
/PTSerif-Italic
/PTSerif-BoldItalic
/PTSerif-Bold
/PTSerif-Caption
/PTSerif-CaptionItalic
/Raanana
/RaananaBold
/RosewoodStd-Regular
/Sana
/.SanaPUA
/Sathu
/SavoyeLetPlain
/.SavoyeLetPlainCC
/Seravek
/Seravek-Italic
/Seravek-MediumItalic
/Seravek-Medium
/Seravek-LightItalic
/Seravek-Light
/Seravek-ExtraLightItalic
/Seravek-ExtraLight
/Seravek-BoldItalic
/Seravek-Bold
/ShreeDev0714
/ShreeDev0714-Bold
/ShreeDev0714-Italic
/ShreeDev0714-Bold-Italic
/Silom
/SinhalaMN
/SinhalaMN-Bold
/SinhalaSangamMN
/SinhalaSangamMN-Bold
/Skia-Regular
/SnellRoundhand
/SnellRoundhand-Bold
/SnellRoundhand-Black
/STSongti-SC-Black
/STSongti-SC-Bold
/STSongti-TC-Bold
/STSongti-SC-Light
/STSong
/STSongti-TC-Light
/STSongti-SC-Regular
/STSongti-TC-Regular
/StencilStd
/STIXGeneral-Regular
/STIXGeneral-Bold
/STIXGeneral-BoldItalic
/STIXGeneral-Italic
/STIXIntegralsD-Bold
/STIXIntegralsD-Regular
/STIXIntegralsSm-Bold
/STIXIntegralsSm-Regular
/STIXIntegralsUp-Bold
/STIXIntegralsUpD-Bold
/STIXIntegralsUpD-Regular
/STIXIntegralsUp-Regular
/STIXIntegralsUpSm-Bold
/STIXIntegralsUpSm-Regular
/STIXNonUnicode-Regular
/STIXNonUnicode-Bold
/STIXNonUnicode-BoldItalic
/STIXNonUnicode-Italic
/STIXSizeFiveSym-Regular
/STIXSizeFourSym-Bold
/STIXSizeFourSym-Regular
/STIXSizeOneSym-Bold
/STIXSizeOneSym-Regular
/STIXSizeThreeSym-Bold
/STIXSizeThreeSym-Regular
/STIXSizeTwoSym-Bold
/STIXSizeTwoSym-Regular
/STIXVariants-Regular
/STIXVariants-Bold
/SukhumvitSet-Thin
/SukhumvitSet-Light
/SukhumvitSet-Text
/SukhumvitSet-Medium
/SukhumvitSet-SemiBold
/SukhumvitSet-Bold
/Superclarendon-Regular
/Superclarendon-Italic
/Superclarendon-LightItalic
/Superclarendon-Light
/Superclarendon-BoldItalic
/Superclarendon-Bold
/Superclarendon-BlackItalic
/Superclarendon-Black
/Tahoma-Bold
/Tahoma
/TamilMN
/TamilMN-Bold
/TamilSangamMN
/TamilSangamMN-Bold
/TeamViewer10
/TektonPro-Bold
/TektonPro-BoldCond
/TektonPro-BoldExt
/TektonPro-BoldObl
/TeluguMN
/TeluguMN-Bold
/TeluguSangamMN
/TeluguSangamMN-Bold
/TimesNewRomanPS-BoldItalicMT
/TimesNewRomanPS-BoldMT
/TimesNewRomanPS-ItalicMT
/TimesNewRomanPSMT
/TrajanPro-Bold
/TrajanPro-Regular
/Trattatello
/Trebuchet-BoldItalic
/TrebuchetMS-Bold
/TrebuchetMS-Italic
/TrebuchetMS
/Verdana-BoldItalic
/Verdana-Bold
/Verdana-Italic
/Verdana
/Waseem
/WaseemLight
/Webdings
/Wingdings2
/Wingdings3
/Wingdings-Regular
/Zapfino
/AbadiMT-CondensedExtraBold
/AbadiMT-CondensedLight
/AndaleMono
/Arial-Black
/ArialNarrow
/ArialNarrow-Bold
/ArialNarrow-Italic
/ArialNarrow-BoldItalic
/ArialRoundedMTBold
/BaskOldFace
/Bauhaus93
/BellMT
/BellMTBold
/BellMTItalic
/BernardMT-Condensed
/BookAntiqua
/BookAntiqua-Bold
/BookAntiqua-Italic
/BookAntiqua-BoldItalic
/BookmanOldStyle
/BookmanOldStyle-Bold
/BookmanOldStyle-Italic
/BookmanOldStyle-BoldItalic
/Braggadocio
/BritannicBold
/Calibri-Light
/CalistoMT
/CalisMTBol
/CalistoMT-Italic
/CalistoMT-BoldItalic
/Century
/CenturyGothic
/CenturyGothic-Bold
/CenturyGothic-Italic
/CenturyGothic-BoldItalic
/CenturySchoolbook
/CenturySchoolbook-Bold
/CenturySchoolbook-Italic
/CenturySchoolbook-BoldItalic
/ColonnaMT
/ComicSansMS
/ComicSansMS-Bold
/CooperBlack
/CopperplateGothic-Bold
/CopperplateGothic-Light
/CurlzMT
/Desdemona
/EdwardianScriptITC
/EngraversMT
/EngraversMT-Bold
/EurostileRegular
/EurostileBold
/FootlightMTLight
/Garamond
/Garamond-Bold
/Garamond-Italic
/Georgia
/Georgia-Bold
/Georgia-Italic
/Georgia-BoldItalic
/GillSans-UltraBold
/GloucesterMT-ExtraCondensed
/GoudyOldStyleT-Regular
/GoudyOldStyleT-Bold
/GoudyOldStyleT-Italic
/Haettenschweiler
/Harrington
/Impact
/ImprintMT-Shadow
/KinoMT
/LucidaBlackletter
/LucidaBright
/LucidaBright-Demi
/LucidaBright-Italic
/LucidaBright-DemiItalic
/LucidaCalligraphy-Italic
/LucidaFax
/LucidaFax-Demi
/LucidaFax-Italic
/LucidaFax-DemiItalic
/LucidaHandwriting-Italic
/LucidaSans
/LucidaSans-Demi
/LucidaSans-Italic
/LucidaSans-DemiItalic
/LucidaSans-Typewriter
/LucidaSans-TypewriterBold
/LucidaSans-TypewriterOblique
/LucidaSans-TypewriterBoldOblique
/MaturaMTScriptCapitals
/Mistral
/Modern-Regular
/MonotypeCorsiva
/MonotypeSorts
/MT-Extra
/NewsGothicMT
/NewsGothicMT-Bold
/NewsGothicMT-Italic
/Onyx
/PerpetuaTitlingMT-Light
/PerpetuaTitlingMT-Bold
/Playbill
/Rockwell
/Rockwell-Bold
/Rockwell-Italic
/Rockwell-BoldItalic
/Rockwell-ExtraBold
/Stencil
/Tahoma
/Tahoma-Bold
/TrebuchetMS
/TrebuchetMS-Bold
/TrebuchetMS-Italic
/Trebuchet-BoldItalic
/LatinWide
/Courier
/NotDefFont
(py3.5) ➜  pdftabextract git:(master) ✗ pdffonts 11-reprint-osx.pdf  
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
(py3.5) ➜  pdftabextract git:(master) ✗ pdftotext 11-reprint-osx.pdf 
(py3.5) ➜  pdftabextract git:(master) ✗ pstopdf
Usage: pstopdf [inputfile] [-o outname] [-l] [-p] [-i]
Try: man pstopdf
(py3.5) ➜  pdftabextract git:(master) ✗ man pstopdf
(py3.5) ➜  pdftabextract git:(master) ✗ gs -dBATCH -dNOPAUSE -dSAFER  \
-dEmbedAllFonts -dSubsetFonts=true -dMaxSubsetPct=99 \
-dAutoFilterMonoImages=false \
-dAutoFilterGrayImages=false \
-dAutoFilterColorImages=false \
-dDownsampleColorImages=false \
-dDownsampleGrayImages=false \
-dDownsampleMonoImages=false \
-sDEVICE=pdfwrite \
-dFirstPage=3 -dLastPage=3 \
-sOutputFile=mypg3out-111.pdf -f 111.pdf 
GPL Ghostscript 9.21 (2017-03-16)
Copyright (C) 2017 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.

Requested FirstPage is greater than the number of pages in the file: 1
   No pages will be processed (FirstPage > LastPage).


wanghaisheng · 2017-09-06T11:12:13Z

https://stackoverflow.com/questions/2926159/copypasting-text-from-pdf-results-in-garbage

Very often in such cases, where you can't select, copy'n'paste text
from the Acrobat (Reader) window, there is another option which may work
nevertheless:

Open 'File' menu,
select 'Save as...',
select 'Text (normal) (*.txt)',
browse to the target directory,
type the name you want to use for the text file.

You'll have all text from all pages in the file and need to locate
the spot you wanted to copy'n'paste initially -- insofar it is not as
comfortable as direct copy'n'paste. But it works more reliably....

It also works with acroread on Linux (but you have to choose 'Save as text...' from the file menu).

Update

You can use the pdffonts command line utility to get a quick-shot analysis of the fonts used by a PDF.

Here is an example output, which demonstrates where a problem for
text extraction will very likely occur. It uses one of these hand-coded
PDF files from a GitHub-Repository which was created to provide PDF sample files which are well commented and may easily be opened in a text editor:

$ pdffonts textextract-bad2.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
BAAAAA+Helvetica                     TrueType          WinAnsi          yes yes yes     12  0
CAAAAA+Helvetica-Bold                TrueType          WinAnsi          yes yes no      13  0

How to interpret this table?

The above PDF file uses two subsetted fonts, Helvetica and Helvetica-Bold (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column).
Both fonts are of type TrueType.
Both fonts use a WinAnsi encoding (a font encoding maps the character identifiers used in the PDF source code to the glyphs that should be drawn).
However, a /ToUnicode table is available inside the PDF only for the /Helvetica font; for /Helvetica-Bold there is none (as indicated by the yes/no entries in the uni column).

The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters.

A missing /ToUnicode table for a specific font is almost
always a sure indicator that text strings using this font cannot be
extracted or copied'n'pasted from the PDF. (Even if a /ToUnicode table is
there, text extraction may still pose a problem, because this table may
be damaged, incorrect or incomplete -- as seen in many real-world PDF
files, and as also demonstrated by a few companion files in the above
linked GitHub repository.)
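
The check described above can be automated. The following is a minimal sketch, not part of the quoted answer: it assumes the fixed-width table layout that `pdffonts` prints in the logs in this thread, and the helper names `parse_pdffonts` and `fonts_without_tounicode` are made up for illustration. It only reports fonts whose `uni` column is `no`; as noted above, a `yes` does not guarantee that the /ToUnicode table is correct.

```python
import re
from typing import Dict, List


def parse_pdffonts(output: str) -> List[Dict[str, str]]:
    """Turn the fixed-width table printed by `pdffonts` into one dict per font.

    Assumption: column boundaries are taken from the dashed separator row,
    so rows whose font name overflows its column will be parsed incorrectly.
    """
    lines = output.splitlines()
    # The dashed separator row defines where every column starts and ends.
    sep_idx = next((i for i, ln in enumerate(lines) if ln.startswith("---")), None)
    if not sep_idx:  # None (no table found) or 0 (no header row above the dashes)
        return []
    spans = [m.span() for m in re.finditer(r"-+", lines[sep_idx])]
    last = len(spans) - 1

    def cut(line: str) -> List[str]:
        # Slice the line at the column boundaries; the final column runs to the end.
        return [
            line[start:(None if i == last else end)].strip()
            for i, (start, end) in enumerate(spans)
        ]

    header = cut(lines[sep_idx - 1])
    return [dict(zip(header, cut(ln))) for ln in lines[sep_idx + 1:] if ln.strip()]


def fonts_without_tounicode(output: str) -> List[str]:
    """Names of fonts whose `uni` column is `no` (no /ToUnicode table)."""
    return [f["name"] for f in parse_pdffonts(output) if f.get("uni") == "no"]
```

Fed the table for textextract-bad2.pdf shown above, `fonts_without_tounicode` would be expected to report only CAAAAA+Helvetica-Bold. An empty result for a file whose table has no rows at all (as in some of the logs in this issue) means no fonts were listed, not that extraction is safe.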

wanghaisheng · 2017-09-06T11:12:35Z

Native PDF (not scanned) – Most likely the font in the PDF file is embedded. Embedded fonts cannot be perfectly extracted. To verify the font is embedded, open the PDF with Acrobat Reader, copy some text and paste it into another application such as Word or Notepad. If the text is not recognized, the font is embedded. To work around this problem please select ‘OCR’ from the view menu to force the OCR recognition.

Scanned PDF – The OCR Engine is sensitive to poor quality scans. In order to improve the OCR recognition quality you can either:

Rescan your document in higher resolution.
Try to remove any hand written text, watermarks, etc.
Try to change the OCR advanced settings to achieve better results by going to Start->Options->OCR. You can read more about it here: User Interface Dialogs Options OCR.
If any of the columns have a specific format (Number, date etc.) you can set the column format in advance by right clicking the columns and choosing the correct format from “Column Format in Output”. You can read more about it here: Concepts-Conversion Formats.
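
The manual copy-and-paste check from the FAQ can be approximated in code. This is only a rough sketch under stated assumptions: the helper names `suspicious_ratio` and `needs_ocr` and the 0.3 threshold are hypothetical, and the input is assumed to be text already extracted from the PDF (for example the .txt file written by `pdftotext`). It counts replacement, control, private-use and unassigned characters, so it cannot detect text that was mapped to the wrong but otherwise valid characters.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like extraction debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # nothing extracted at all is treated as a failure
    bad = sum(
        1
        for c in chars
        # U+FFFD is the replacement character; Cc/Co/Cn are control,
        # private-use and unassigned code points.
        if c == "\ufffd" or unicodedata.category(c) in {"Cc", "Co", "Cn"}
    )
    return bad / len(chars)


def needs_ocr(text: str, threshold: float = 0.3) -> bool:
    """Heuristic: suggest OCR when too much of the extracted text is debris."""
    return suspicious_ratio(text) >= threshold
```

The threshold is arbitrary and would need tuning against real documents before being relied on.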

https://www.cogniview.com/support/faq#
https://www.cogniview.com/help/pdf2xl-ocr/html/UserInterface-Dialogs-Options-OCR.php

wanghaisheng · 2017-09-06T11:13:36Z

http://marc.info/?l=cairo-bugs&m=134283298609591

Use poppler-utils to replace the fonts

wanghaisheng · 2017-09-06T11:14:15Z

Problem
https://superuser.com/questions/137824/pdf-has-garbled-text-when-copy-pasting/268348#268348

Indeed, custom font encoding was the culprit for me. However, Chrome wasn't the solution. I solved the problem partially with Ghostscript regenerating a PDF from the PS (I was lucky to have the PS source). Any character groups to which LaTeX applies ligatures (e.g. ff, c, fi, etc.) don't show up in the copied text of the PDF, which requires some editing when you copy/paste. – Fuhrmanator Jan 28 '15 at 19:43

mozilla/pdf.js#6330
This PDF file looks like a scanned document, where OCR software was used to (try to) enable text-selection and copying. Unfortunately the PDF file itself actually specifies the text in the broken way seen in e.g. #6330 (comment) above.
Since this unfortunately is an issue with some OCR software incorrectly recognizing the text, there's really not much we can do about it (and please note that other PDF viewers have the exact same problem).

Closing as invalid, since the PDF file itself is causing the issue.

https://stackoverflow.com/questions/37870719/ghostscript-preserve-pdf-inputs-font

https://blog.idrsolutions.com/2010/01/embedded-pdf-truetype-fonts-are-always-mac-encoded-unless-they-are-not/

Embedded PDF Truetype fonts are always MAC encoded unless they are not

It is one of these features which is broken but it is now too late to fix.

Inside a PDF file, all text data is stored as a binary number and this value is decoded into the actual glyph value (i.e. the value 65 is converted into the text value ‘A’). Because the PDF file format is ‘multiplatform’, there are several possible sets of Standard Encoding Formats to use for this conversion (e.g. WinAnsi for Windows, and MacRoman for standard MAC values). This is because Windows and MAC originally evolved with different character sets and values. Most of the time values are identical (A is value 65 in both MAC and WIN encoding) but certain accented characters have different values. So value 132 is Ntilde (letter N with a wavy line above) in MAC encoding but quotedblbase (double quotes at the bottom of the line) on Windows. So long as we know which translation table to use, this is not a problem of course….
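
The difference for value 132 can be reproduced with Python's built-in codecs. This is a small illustration rather than anything from the blog post: the function name `decode_both` is made up, and `cp1252` is used here as a close stand-in for the PDF WinAnsi encoding, which is an assumption (the two are not guaranteed to agree on every code).

```python
from typing import Tuple


def decode_both(code: int) -> Tuple[str, str]:
    """Decode one byte value as MacRoman and as Windows-1252.

    Returns (mac_character, win_character) for a value in the range 0-255.
    """
    raw = bytes([code])
    return raw.decode("mac_roman"), raw.decode("cp1252")


# 65 is 'A' in both tables, so picking the wrong table goes unnoticed.
print(decode_both(65))
# 132 is N with tilde in MacRoman but a low double quote in Windows-1252.
print(decode_both(132))
```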

The issue comes with embedded Truetype fonts because they will always list them as MAC encoded in the PDF file (which is what the specification says they should be) when they are actually WIN encoded. Using the wrong look-up table does not matter for most values (as the results are identical) but it does break certain letters.

So what you need to do is figure out for yourself whether the font is actually WIN or MAC encoded and ignore the setting in the PDF file. There is (of course) no documented way to do this, and several codes can map to different characters in either encoding…

What we did was develop some heuristics to work it out, which we continually test against known files and tweak as needed: looking at the actual font values present, seeing whether WIN or MAC encoding gives a ‘better fit’, and checking certain key values. The heuristics also need to factor in that the font may be subsetted, so only a selection of values will be present.
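The blog post does not publish its heuristics, so the following is only a rough sketch of the "better fit" idea, not their method. It assumes you already know, for some character codes, which character the embedded glyph really draws (the `observed` mapping is hypothetical input obtained elsewhere, e.g. from the font's own tables), and it again uses Python's `cp1252` and `mac_roman` codecs as stand-ins for WinAnsi and MacRoman.

```python
CANDIDATES = ("cp1252", "mac_roman")  # stand-ins for WinAnsi / MacRoman


def guess_encoding(observed):
    """Pick the candidate encoding that agrees with the most codes.

    observed: dict mapping a byte value (0-255) to the character the
    glyph actually represents. Only codes present in the (possibly
    subsetted) font need to be listed.
    """
    scores = {}
    for enc in CANDIDATES:
        hits = 0
        for code, char in observed.items():
            # Undefined bytes decode to U+FFFD and so never match.
            decoded = bytes([code]).decode(enc, errors="replace")
            if decoded == char:
                hits += 1
        scores[enc] = hits
    # On a tie, max() keeps the first candidate (cp1252).
    best = max(CANDIDATES, key=lambda enc: scores[enc])
    return best, scores


best, scores = guess_encoding({65: "A", 132: "\u201e", 233: "\u00e9"})
print(best, scores)
```

Codes below 128 match under both tables and so cannot separate them; only the "key values" in the upper half, such as 132, carry any signal.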

So if you get some odd characters working with PDF files containing Truetype fonts, this may well be the reason. And if you come across a file displayed in our PDF viewer which has some odd characters, please do send us the file so we can continue to improve our code.
https://ghostscript.com/pipermail/gs-bugs/2013-November/034047.html

Re-optimizing the PDF with gs (Ghostscript) might fix the PDF fonts problem:
https://stackoverflow.com/questions/10450120/optimize-pdf-files-with-ghostscript-or-other
http://blogs.datalogics.com/2016/06/30/pdf-optimization-fonts-and-font-subsets/
https://stackoverflow.com/questions/2926159/copypasting-text-from-pdf-results-in-garbage

pdf2ps file.pdf file.ps
ps2pdf -dPDFSETTINGS=/ebook file.ps file-optimized.pdf

gs -o p3-optim.pdf -sDEVICE=pdfwrite -dDetectDuplicateImages=true p3.pdf

gs \
  -o output.pdf \
  [...other options...] \
  -dEmbedAllFonts=false \
  -dSubsetFonts=true \
  -dConvertCMYKImagesToRGB=true \
  -dCompressFonts=true \
  -c ".setpdfwrite <</AlwaysEmbed [ ]>> setdistillerparams" \
  -c ".setpdfwrite <</NeverEmbed [/Courier /Courier-Bold /Courier-Oblique /Courier-BoldOblique /Helvetica /Helvetica-Bold /Helvetica-Oblique /Helvetica-BoldOblique /Times-Roman /Times-Bold /Times-Italic /Times-BoldItalic /Symbol /ZapfDingbats /Arial]>> setdistillerparams" \
  -f input.pdf

https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/1678470

wanghaisheng · 2017-09-06T11:17:36Z

http://markmail.org/thread/43tb7q4qwor42fhy#query:+page:1+mid:7uu4wms3mdcv3y32+state:results

t output
device and on Windows platforms in general. It consists mainly of three
parts:

1. Adapt bug fix for bug 11413 to the postscript device
2. A small bug fix when locateFont doesn't find a suitable font and
   returns a null pointer
3. CJK substitute implementation on Windows platforms.

to 1.:
Adapting the implementation of the bug fix for splash and cairo to the
postscript device was quite easy. But my first proofs of the output with
ghostscript 8.71 showed some regressions where the CJK chars have a
smaller height than the default square of the font. But the "48" in the
output of bug-poppler11413.pdf, which is set in a "normal" font but
rotated, was at the right position. Then I moved to ghostscript 9.04,
and now the CJK chars were shown correctly, but the "48" was positioned
wrong. Because of these differing results I think that it is still a
problem in ghostscript when using a mix of CJK fonts and "normal" fonts.
BTW, Acrobat X Distiller also has problems with the position of the "48"!

I understand this applies to non-windows too, right?

Albert

to 2.:
In my first tests with PDFs that use non-embedded CJK fonts on Windows
I got crashes. The reason was that GlobalParamsWin returns Helvetica,
which is not a CID font, but locateFont accepts only CID fonts here and
therefore returns a NULL pointer. I first fixed that and then decided to
return MS Mincho as the default if a CID font is expected.

to 3.:
When you install Ghostscript on Windows, you can switch on CJK
support. This creates a cidfmap file in the gs lib directory. The
PostScript file that creates it (mkcidfm.ps) runs over the Windows font
directory and tries to create a suitable substitution table for missing
CJK fonts. The cidfmap file is more or less PDF-like, so it's quite easy
to parse it with our parser, create a substitution table in
GlobalParamsWin and use that table. But I expect it in the poppler data
dir instead of searching for a Ghostscript installation. If it is not
there, it always returns the default CID font of point 2.
You can either copy it from the gs lib directory or create it with the
Ghostscript tool by calling
gswin32c -q -dBATCH -sFONTDIR=
-sCIDFMAP=/cidfmap mkcidfm.ps

To clarify the format of cidfmap I attach the file produced on my
installation. Take care: I do not have a default Windows installation with
Windows on c:/windows; my Windows installation is on drive f!

wanghaisheng · 2017-09-06T11:24:09Z



$ gs -dBATCH -dNOPAUSE -dSAFER  \
-dEmbedAllFonts -dSubsetFonts=true -dMaxSubsetPct=99 \
-dAutoFilterMonoImages=false \
-dAutoFilterGrayImages=false \
-dAutoFilterColorImages=false \
-dDownsampleColorImages=false \
-dDownsampleGrayImages=false \
-dDownsampleMonoImages=false \
-sDEVICE=pdfwrite \
-dFirstPage=3 -dLastPage=3 \
-sOutputFile=mypg3out.pdf -f fontspec.pdf
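
For scripted use, the same invocation could be assembled as an argument list in Python. This is only a sketch: the helper name `build_gs_command` is hypothetical, the flags are copied unchanged from the command above, and actually running the result assumes a `gs` binary is on the PATH.

```python
def build_gs_command(src, dst, first_page=None, last_page=None):
    """Return the gs argument list for re-writing a PDF with fonts embedded.

    Flags are taken verbatim from the shell command above; the function
    itself is an illustrative helper, not part of any library.
    """
    cmd = [
        "gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-dEmbedAllFonts", "-dSubsetFonts=true", "-dMaxSubsetPct=99",
        "-dAutoFilterMonoImages=false",
        "-dAutoFilterGrayImages=false",
        "-dAutoFilterColorImages=false",
        "-dDownsampleColorImages=false",
        "-dDownsampleGrayImages=false",
        "-dDownsampleMonoImages=false",
        "-sDEVICE=pdfwrite",
    ]
    # Page range is optional; omit both to process the whole file.
    if first_page is not None:
        cmd.append("-dFirstPage={}".format(first_page))
    if last_page is not None:
        cmd.append("-dLastPage={}".format(last_page))
    cmd += ["-sOutputFile={}".format(dst), "-f", src]
    return cmd


print(" ".join(build_gs_command("fontspec.pdf", "mypg3out.pdf", 3, 3)))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it without going through a shell, which avoids quoting problems with file names.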

http://stackoverflow.com/questions/11093051/handling-remapping-missing-problematic-cid-cjk-fonts-in-pdf-with-ghostscript?rq=1