Deal with PDFs that embed fonts #26
Comments
https://stackoverflow.com/questions/2926159/copypasting-text-from-pdf-results-in-garbage
Open the 'File' menu. You'll have all text from all pages in the file and need to locate ... It also works with acroread on Linux (but you have to choose 'Save as text...' from the File menu).

Update: You can use the pdffonts command-line utility to get a quick analysis of the fonts used by a PDF. Here is an example output, which demonstrates where a problem for ...

$ pdffonts textextract-bad2.pdf
BAAAAA+Helvetica TrueType WinAnsi yes yes yes 12 0

How to interpret this table? The above PDF file uses two subsetted fonts (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column), Helvetica and Helvetica-Bold. The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters. A missing /ToUnicode table for a specific font is almost ...
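To make the "how to interpret this table" step concrete, here is a minimal Python sketch that reads pdffonts-style output and lists the fonts whose uni column is "no" (i.e. no /ToUnicode table). It assumes the usual layout of a header line followed by a ruler of dashes. The sample text is built in code, the column widths are an assumption, and the second row is hypothetical, added only so there is a font to flag.

```python
import re

# Column widths for the sample; an assumption chosen to mimic pdffonts layout.
WIDTHS = (36, 17, 16, 3, 3, 3, 9)


def _format_row(cells):
    """Lay out one row in fixed-width columns separated by single spaces."""
    return " ".join(cell.ljust(width) for cell, width in zip(cells, WIDTHS))


SAMPLE = "\n".join([
    _format_row(["name", "type", "encoding", "emb", "sub", "uni", "object ID"]),
    _format_row(["-" * width for width in WIDTHS]),
    _format_row(["BAAAAA+Helvetica", "TrueType", "WinAnsi", "yes", "yes", "yes", "12 0"]),
    # Hypothetical row (not from the issue) with no /ToUnicode table.
    _format_row(["CAAAAA+Helvetica-Bold", "TrueType", "WinAnsi", "yes", "yes", "no", "8 0"]),
])


def parse_pdffonts(text):
    """Parse pdffonts-style text into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The ruler line consists only of dashes and spaces; it defines the columns.
    ruler_idx = next(
        i for i, ln in enumerate(lines)
        if "-" in ln and set(ln.strip()) <= {"-", " "}
    )
    spans = [m.span() for m in re.finditer(r"-+", lines[ruler_idx])]
    header = [lines[ruler_idx - 1][a:b].strip() for a, b in spans]
    return [
        {name: ln[a:b].strip() for name, (a, b) in zip(header, spans)}
        for ln in lines[ruler_idx + 1:]
    ]


def fonts_missing_tounicode(rows):
    """Return names of fonts whose 'uni' column says there is no /ToUnicode map."""
    return [row["name"] for row in rows if row["uni"] == "no"]


if __name__ == "__main__":
    print(fonts_missing_tounicode(parse_pdffonts(SAMPLE)))
```

Slicing by the ruler's dash runs, rather than splitting on whitespace, keeps font names that contain spaces intact.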
Native PDF (not scanned) – Most likely the font in the PDF file is embedded. Embedded fonts cannot be perfectly extracted. To verify the font is embedded, open the PDF with Acrobat Reader, copy some text and paste it into another application such as Word or Notepad. If the text is not recognized, the font is embedded. To work around this problem please select ‘OCR’ from the view menu to force the OCR recognition. Scanned PDF – The OCR Engine is sensitive to poor quality scans. In order to improve the OCR recognition quality you can either:
https://www.cogniview.com/support/faq#
http://marc.info/?l=cairo-bugs&m=134283298609591 Use poppler-utils to replace the fonts
Problem
mozilla/pdf.js#6330 Closing as invalid, since the PDF file itself is causing the issue. https://stackoverflow.com/questions/37870719/ghostscript-preserve-pdf-inputs-font
This is one of those features that is broken but is now too late to fix. Inside a PDF file, all text data is stored as binary numbers, and each value is decoded into the actual glyph (e.g. the value 65 is converted into the text value ‘A’). Because the PDF file format is ‘multiplatform’, there are several possible Standard Encodings to use for this conversion (e.g. WinAnsi for Windows, and MacRoman for standard Mac values). This is because Windows and Mac originally evolved with different character sets and values. Most of the time the values are identical (A is value 65 in both Mac and Win encoding), but certain accented characters have different values. So value 132 is Ntilde (the letter N with a wavy line above) in Mac encoding but quotedblbase (double quotes at the bottom of the line) on Windows. So long as we know which translation table to use, this is not a problem, of course...

The issue comes with embedded TrueType fonts, because they will always be listed as Mac encoded in the PDF file (which is what the specification says they should be) when they are actually Win encoded. Using the wrong look-up table does not matter for most values (as the results are identical), but it does break certain letters. So what you need to do is figure out yourself whether the font is actually Win or Mac encoded and ignore the setting in the PDF file. There is (of course) no documented way to do this, and several values can appear as different values in either...

What we did was develop some heuristics to work it out, which we continually test against known files and tweak as needed: looking at the actual font values present, seeing whether Win or Mac encoding gives a ‘better fit’, and checking certain key values. It also needs to factor in the fact that the font may be subsetted, so only a selection of values will be present. So if you get some odd characters when working with PDF files containing TrueType fonts, this may well be the reason.
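The difference between the two look-up tables can be checked with a small Python sketch. It uses Python's built-in `cp1252` and `mac_roman` codecs as stand-ins for the PDF WinAnsi and MacRoman encodings. That is an approximation, since the PDF encodings are not guaranteed to match these codecs at every position. The function names are illustrative only.

```python
def decode_both(code):
    """Decode one byte value with a Windows-style and a Mac-style table.

    cp1252 and mac_roman are approximations of PDF WinAnsi / MacRoman.
    """
    raw = bytes([code])
    # cp1252 leaves a few byte values undefined, so replace instead of raising.
    win = raw.decode("cp1252", errors="replace")
    mac = raw.decode("mac_roman")
    return win, mac


def differing_codes():
    """Return every byte value that the two tables decode differently."""
    result = []
    for code in range(256):
        win, mac = decode_both(code)
        if win != mac:
            result.append(code)
    return result


if __name__ == "__main__":
    for code in (65, 132):
        win, mac = decode_both(code)
        print(code, "Win:", win, "Mac:", mac)
    print(len(differing_codes()), "byte values differ between the two tables")
```

Any heuristic of the kind described above only has the byte values in `differing_codes()` to work with, and a subsetted font may use very few of them.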
And if you come across a file displayed in our PDF viewer which has some odd characters, please do send us the file so we can continue to improve our code.

Re-optimizing the PDF with gs might resolve the PDF font problem:
pdf2ps file.pdf file.ps
gs -o p3-optim.pdf -sDEVICE=pdfwrite -dDetectDuplicateImages=true p3.pdf
gs https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/1678470
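For running the gs rewrite over several files, a minimal Python wrapper might look like the sketch below. It uses only the gs options quoted above. The function names are hypothetical, and it assumes a `gs` executable is on the PATH. Whether the rewrite actually repairs text extraction depends on the file.

```python
import shutil
import subprocess


def build_gs_command(src, dst):
    """Build the Ghostscript argument list used in the note above."""
    return [
        "gs",
        "-o", dst,
        "-sDEVICE=pdfwrite",
        "-dDetectDuplicateImages=true",
        src,
    ]


def rewrite_pdf(src, dst):
    """Run Ghostscript to rewrite src into dst; raises if gs is missing or fails."""
    if shutil.which("gs") is None:
        raise FileNotFoundError("gs executable not found on PATH")
    subprocess.run(build_gs_command(src, dst), check=True)


if __name__ == "__main__":
    # Only prints the command; call rewrite_pdf() to actually run it.
    print(" ".join(build_gs_command("p3.pdf", "p3-optim.pdf")))
```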
http://markmail.org/thread/43tb7q4qwor42fhy#query:+page:1+mid:7uu4wms3mdcv3y32+state:results
to 1.: I understand this applies to non-Windows too, right? Albert
to 2.:
to 3.: gswin32c -q -dBATCH -sFONTDIR=
To clarify the format of cidfmap I attach the file produced on my
0708: Tested gs optimization on the PDF that originally had the problem; it failed.