xps/pdf/png/json conversion #18
Overall pipeline for processing xps, pdf, and images
for xpdf: this library overlaps with poppler-utils in functionality but is no longer maintained. It can mainly extract txt, extract html, inspect fonts, show basic document info, and extract embedded images.
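As a minimal sketch of those extraction steps (assuming the xpdf/poppler command-line tools are installed and on PATH; the file names are only placeholders):

import subprocess

pdf = "1.pdf"

# The usual xpdf / poppler-utils command-line tools: plain text, html,
# font listing, basic document info, and embedded image extraction.
subprocess.check_call(["pdftotext", pdf, "1.txt"])
subprocess.check_call(["pdftohtml", pdf, "1.html"])
subprocess.check_call(["pdffonts", pdf])
subprocess.check_call(["pdfinfo", pdf])
subprocess.check_call(["pdfimages", "-j", pdf, "img"])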
xpdfrc
The original file is:
After conversion we get:
pdfminer can process this one normally.
The result obtained is shown below. The same report as another pdf:
pdfminer's output for it is CID mojibake.
Surprisingly,
a very good result can be obtained.
The solution seems to be to configure fontconfig so that all of the fonts detected by pdffonts 1.pdf are replaced with standard fonts, i.e. the few we expect: https://lists.freedesktop.org/archives/poppler-bugs/2013-November/010909.html
1. For this particular pdf file, even the text copied out of Adobe Reader is mojibake; there is no usable text at all.
4. Following the suggestion there, use gs to rebuild the pdf (see the sketch below).
For now, running OCR over it works around the problem; it looks as if the fonts need to be converted.
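A rough sketch of the "rebuild with gs" suggestion (the input/output names are placeholders; whether this actually repairs the text layer depends on the fonts in the file):

import subprocess

# Re-distill the pdf through Ghostscript's pdfwrite device, which rewrites
# the whole file and can normalize some problematic font embeddings.
subprocess.check_call([
    "gs", "-o", "rebuilt.pdf",
    "-sDEVICE=pdfwrite",
    "-dPDFSETTINGS=/prepress",
    "1.pdf",
])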
for libgxps: libgxps2, libgxps-utils
The current problem is that for test file 3, no matter whether the pdf comes from the online converter or from libgxps, the layout analysis in the html produced by pdfminer is wrong.
from xps <--> pdf <--> html <--> json, for pdfminer, based on ALPINE LINUX
docker run -it --rm --name pdf-miner-demo -v /home/wanghs/dockerfiles-repo/docker-for-fun/docker-alpine/projects/pdf-parser:/tmp dc/alpine-python2 /bin/sh
Embed pdfminer in the image and expose it as an http service.
https://github.com/felipeochoa/minecart based on UBUNTU
Step 1:
Step 2: pdf2txt.py -o output.html -Y exact 3.xps.pdf
Step 3: import subprocess (see the sketch below)
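A minimal sketch of step 3, simply wrapping the pdf2txt.py call from step 2 (file names are the ones used in the example above):

import subprocess

# Run pdfminer's pdf2txt.py in "exact" layout mode and write html output.
subprocess.check_call([
    "pdf2txt.py",
    "-o", "output.html",
    "-Y", "exact",
    "3.xps.pdf",
])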
from xps <--> pdf <--> png/jpeg <--> html <--> json
from xps <--> png/jpeg
The number of distinct top values determines how many lines the page layout has. If, after a given top value, a smaller top appears and the original top value then shows up again, the two groups of tops have to be merged into one group, taking the smaller top as the reference. Likewise, if after a given top value a much larger top appears and the original top value then shows up again, that much larger top has to be moved to its proper position in the ordering.
Using (ruled) lines as separators, divide the whole page layout into blocks.
4. Sort all top values that occur inside a block. If the difference between two adjacent top values is smaller than the sum of the font-sizes belonging to those two tops, treat the two tops as the same line; if the difference is larger than that sum, treat them as different lines.
5. Within each block, normalize the font size (take the smaller value as the reference) and sort all left values that share the same top. If the difference between two consecutive left values is smaller than the font-size, add the font-size to the larger left value and to every left value after it (including that left value).
6. If the difference between two left values with the same top is larger than two font-sizes (or a configurable threshold), treat it as a meaningful separator between blocks of content; otherwise treat it as a separator inside a block, and remove the whitespace between values inside a block. (Alternatively, all left values could be normalized: keep the smallest one and add the font-size to it cumulatively in order; but if the element marked by a top is itself a space, the concatenated string would still need its spaces removed, so this approach was abandoned.)
7. Within the same top, i.e. the same line, to decide whether the next character belongs to the same segment as the previous one, add to the left value of the first character the font-sizes of all characters between the first character and the character under consideration; if that character's left value is smaller than this expected value, it belongs to the same segment as the previous character, and if it is larger, it belongs to a different segment (see the sketch below).
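A rough sketch of the line/segment grouping in steps 4 and 7 above, assuming the input is a list of (top, left, font_size, text) tuples already pulled out of pdfminer's html; the names and exact comparisons are illustrative, not the final implementation:

def group_lines(chars):
    # chars: (top, left, font_size, text) tuples belonging to one block.
    chars = sorted(chars, key=lambda c: (c[0], c[1]))
    lines = []
    for ch in chars:
        # Step 4: same line if the top difference is below the sum of the two font sizes.
        if lines and abs(ch[0] - lines[-1][-1][0]) < ch[2] + lines[-1][-1][2]:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return lines

def split_segments(line):
    # Step 7: compare each left with the expected left built up from font sizes.
    segments = [[line[0]]]
    expected = line[0][1]                # left of the segment's first character
    for ch in line[1:]:
        expected += segments[-1][-1][2]  # add the previous character's font size
        if ch[1] <= expected:
            segments[-1].append(ch)      # still within the running segment
        else:
            segments.append([ch])        # a meaningful gap: start a new segment
            expected = ch[1]
    return segments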
The json after this preliminary processing looks like this:
Other references: Fonts in PDF (quoted from the minecart project wiki). With the goal of getting at the fonts, the question then becomes: how do we extract the font family name and the embedded font program (if any) from the PDF document? I'm putting together this wiki page to keep track of my efforts towards that question.

How are fonts stored/referenced in PDF? When drawing text on a PDF page, the application keeps track of what's known as the text state. In the text state, there is a parameter Tf called text font. Whenever text is drawn on the page, it is drawn using the font stored in the Tf field of the text state. The text font is set and updated through the use of the Tf operator. When using minecart:

import minecart
import pdfminer.pdfpage
doc = minecart.Document(open("path/to/sample.pdf", 'rb'))
page = next(pdfminer.pdfpage.PDFPage.create_pages(doc.doc))
fonts = page.resources['Font']
print fonts
# {'F0': <PDFObjRef:7>}
font = fonts['F0'].resolve()
print font
# {'Encoding': /Identity-H,
# 'BaseFont': /HDIABS+AlbanyWTTC-Identity-H,
# 'DescendantFonts': [<PDFObjRef:26>],
# 'Subtype': /Type0,
# 'ToUnicode': <PDFObjRef:25>,
# 'Type': /Font}

At this point, the exercise becomes more of a choose-your-own-adventure, since it will largely depend on the fonts that are referenced in your document.

The different types of PDF fonts: PDF allows documents to use a variety of font formats, which can be embedded with the document, included in the viewer application, or found elsewhere in the system. Font types are identified by the Subtype entry of the font dictionary.

Type 1:
Multiple Master:
TrueType: TrueType fonts must also have a …
Type 3: Type 3 fonts have no font program to embed or reference. Instead, they specify PDF graphics procedures for rendering each character as a PDF shape. Rendering the text is thus a job for the shape engine and not for the text engine. I'd have to investigate how …
Type 0: Type 0 fonts are also called "composite fonts" in the spec. They have a "subfont" that's stored in the DescendantFonts entry.
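Continuing the snippet above, a rough sketch of descending into such a Type 0 font to pull out the embedded font program; FontDescriptor/FontFile* are standard PDF dictionary keys, but the resolve()/get_data() calls are pdfminer's, and the whole thing is an assumption-laden illustration rather than minecart's own API:

# Continues from the font dict printed above.
descendant = font['DescendantFonts'][0].resolve()     # the CIDFont "subfont"
descriptor = descendant['FontDescriptor'].resolve()   # metrics + font file references

# The embedded program sits under FontFile (Type 1), FontFile2 (TrueType)
# or FontFile3 (CFF/OpenType), depending on the font format.
for key in ('FontFile', 'FontFile2', 'FontFile3'):
    if key in descriptor:
        stream = descriptor[key].resolve()
        with open('embedded_font.bin', 'wb') as out:
            out.write(stream.get_data())               # raw font program bytes
        break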
for pdf2htmlEX (this approach was dropped)
https://hub.docker.com/r/bwits/pdf2htmlex-alpine/
based on Ubuntu
The number of leading spaces in front of each span differs.
https://github.com/fmalina/transcript helps in understanding the meaning of some of the fields in pdf2htmlEX output. pdf2htmlEX --external-hint-tool=ttfautohint --auto-hint 1 --zoom 2. The css and html produced by the conversion are separate files, which means that, unlike with pdfminer, processing the output directly requires turning the css styles into inline html styles with some tool; keyword: inline style attributes to style tags, e.g. https://www.npmjs.com/package/gulp-inline-css
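gulp-inline-css is a node tool; as a Python-side alternative sketch, the premailer package can push stylesheet rules into style attributes (premailer is a substitute suggested here, not something used elsewhere in these notes, and it assumes the css is embedded in, or reachable from, the html):

from premailer import transform

# Inline the css rules into style="..." attributes so that each span in the
# pdf2htmlEX output carries its own positioning/font information.
with open("out.html", encoding="utf-8") as f:
    html = f.read()

inlined = transform(html)

with open("out.inline.html", "w", encoding="utf-8") as f:
    f.write(inlined)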
Result:
html to json https://github.com/inikulin/parse5
Question 1: w/px = PDFUnit.toFixedFloat(maxWidth). The unit for all width, height, length, etc. values is the "Form Unit"; if you need pixel values, you can use the converter here: https://github.com/modesty/pdf2json/blob/3fe724db05659ad12c2c0f1b019530c906ad23de/lib/pdffont.js
3. How to reuse the algorithms of some other libraries built on top of pdf2json. pdf2json's output after URL decoding:
ocrmypdf
Test it on pdf input whose encoding is broken (see the sketch below).
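A minimal sketch of that test; ocrmypdf's --force-ocr rasterizes the pages and rebuilds the text layer from OCR even though a (garbled) text layer already exists. The file names and language packs are just examples:

import subprocess

# Discard the broken text layer and regenerate it with tesseract via ocrmypdf.
subprocess.check_call([
    "ocrmypdf",
    "--force-ocr",          # OCR even if the pdf already contains text
    "-l", "chi_sim+eng",    # tesseract language packs; adjust as needed
    "bad-encoding.pdf",
    "ocred.pdf",
])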
1. xps <--> pdf <--> html
docker build -t portia .
A set of tools for extracting tables from PDF files, helping to do data mining on (OCR-processed) scanned documents. https://datascience.blog.wzb.eu/2017/… Source code: https://github.com/WZBSocialScienceCenter/pdftabextract Tested 2017-07-05.
https://stackoverflow.com/questions/2926159/copypasting-text-from-pdf-results-in-garbage
Open the 'File' menu and choose 'Save as text...'; you'll then have all text from all pages in the file and need to locate the part you want. It also works with acroread on Linux (but you have to choose 'Save as text...' from the file menu).

Update: You can use the pdffonts command line utility to get a quick-shot analysis of the fonts used by a PDF. Here is an example output, which demonstrates where a problem for text extraction lies:

$ pdffonts textextract-bad2.pdf
name                  type      encoding  emb sub uni object ID
--------------------- --------- --------- --- --- --- ---------
BAAAAA+Helvetica      TrueType  WinAnsi   yes yes yes     12  0

How to interpret this table? The above PDF file uses two subsetted fonts (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column), Helvetica and Helvetica-Bold. The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters; a missing /ToUnicode table for a specific font makes it almost certain that text using that font cannot be extracted correctly.
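A small sketch of automating that check around pdffonts; the output is parsed naively on whitespace and the position of the 'uni' column is counted from the right, which is only an approximation of the real table layout:

import subprocess

def fonts_missing_tounicode(pdf_path):
    # Return the names of fonts whose 'uni' column in pdffonts output is 'no'.
    out = subprocess.check_output(["pdffonts", pdf_path], text=True)
    missing = []
    for line in out.splitlines()[2:]:             # skip the two header lines
        cols = line.split()
        if len(cols) >= 6 and cols[-3] == "no":   # 'uni' is the 3rd column from the right
            missing.append(cols[0])
    return missing

print(fonts_missing_tounicode("textextract-bad2.pdf"))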
http://tm.durusau.net/?cat=1480
Herve Dejean and colleagues at XEROX; a lab at Peking University
With the advancements in information and communication technology, various forms of paper documents are being scanned in order to be interpreted and indexed. The bigger vision, however, is to treat paper as a legitimate form of media (like magnetic tapes and optical discs) which can be both machine and human readable. One challenge is that the variety of paper documents being scanned today is much more diverse than what it was several years ago. Many new scripts, more complex, non-Manhattan page layouts and various font styles are making this vision challenging. Furthermore, a much larger percentage of handwritten material is being acquired which does not adhere to traditional layout constraints. Character recognition as well as various established pre-processing modules such as noise removal, layout analysis and zone classification are affected by this increased complexity. The process of identifying structures of a document image can be based on the physical (process of dividing the document into physical homogeneous zones) or logical (process of assigning logical roles and relations to detected zones) layout. Page segmentation algorithms fall into the category of physical layout analysis. They perform segmentation of a document page into homogeneous zones, each consisting of only one physical layout structure such as text, graphics, equations, logos, stamps. Physical layout analysis can be pixel-based or texture-based segmentation, but here the goal is that the final result is a region segmentation. In texture-based segmentation, isolated points or small areas could be classified as zonal objects disregarding the connectivity aspect of an object. In contrast, the work is concerned with non-overlapping geometric zones where document components are separated by white space. Such connected-component-based approaches use macro level content information, and can be further classified into Manhattan and non-Manhattan layouts.
http://www.jacobfenton.com/
I’m a journalist and software developer based in Portland, Oregon. I've spent the last decade working as a reporter, editor, and programmer in newsrooms and nonprofits in the U.S.
During the 2015-16 academic year I was a John S. Knight Journalism Fellow at Stanford University researching ways to make complex document processing affordable to reporters. I’m especially interested in turning unstructured images into data, and building tools to mine actionable news tips from some of the dullest corners of the web. You can read more about that project here.
Previously I was editorial engineer at The Sunlight Foundation, where I worked extensively on campaign finance, TV ad disclosure, and House and Senate expenditure reporting. Prior to that I was Director of Computer-Assisted Reporting at the Investigative Reporting Workshop, a nonprofit at American University. I also reported for several newspapers in Pennsylvania.
Long ago I was an undergraduate physics major, and got my first real taste of programming hacking on C++ code to look at engineering runs at LIGO Hanford.
I can be reached at jsfenfen at gmail dot com.
dannyedel/dspdfviewer#163
This pdf viewer can correctly display pdf files whose Chinese text the pdftohtml bundled with poppler-utils cannot convert properly.
table detection
https://github.com/Booppey/table-detection
https://github.com/transpect/evolve-hub
pdf ocr
OCRmyPDF
pdfsandwich
3. xps<-->png/jpeg
My work-around is to save the PDF as a lossless or near lossless image such as .tiff format, then create a new PDF from the image and run OCR. Thus I lose no clarity/sharpness in the PDF image and get accurate OCR content that can be copied and pasted. And, yes, lots of folks do something similar with screenshots from protected PDFs to grab all the text (without the need to retype it). Simple non-expert scripts (such as Tornado's "Do It Again" freeware) and PDF generating software make it easy to process hundreds of pages quickly and accurately (at least as accurately as OCR from images can be from relatively high-res images - not screenshots of documents you are not zooming in on or otherwise capturing with tremendously low spatial resolution relative to the original document).
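A rough sketch of that work-around with free command-line tools (pdftoppm, img2pdf and ocrmypdf are assumed to be installed; 300 dpi and the file names are only illustrative):

import glob
import subprocess

# 1. Rasterize every page to lossless PNG at a reasonably high resolution.
subprocess.check_call(["pdftoppm", "-r", "300", "-png", "protected.pdf", "page"])

# 2. Rebuild a pdf from the page images.
pages = sorted(glob.glob("page-*.png"))
subprocess.check_call(["img2pdf", "-o", "images.pdf"] + pages)

# 3. OCR the rebuilt pdf to get an accurate, copyable text layer.
subprocess.check_call(["ocrmypdf", "images.pdf", "searchable.pdf"])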
https://github.com/wanghaisheng/pdfconvertme-public
4. pdf<-->png/jpeg
5. png/jpeg<-->json
1. Online service for xps<--->pdf
2. Library for xps<--->pdf
(1) gs/gxps
(2) xpdf
(2.1) BePDF: This is a PDF reader that is based on XPDF 3.04. It handles PDF files up to PDF version 1.7 (Adobe Reader 9+).
(2.2) poppler-utils: Poppler is a PDF rendering library based on the xpdf-3.0 code base
(2.3) pdf2htmlEX: based on poppler and FontForge
(3) libgxps
(4) Aspose.Pdf: not free (commercial SDK)
(5) mupdf
reference for xps<--->pdf
libgxps-utils
3. pdf<--->html<-->json
pdfminer
pdfminer pdf to html/txt demo
pdf2htmlEX
reference for pdf<--->html
https://github.com/Micka33/content-extractor
https://github.com/euske/pdfminer
https://github.com/galkahana/HummusJS
https://github.com/EbenZhang/PdfSharp.XPS
https://github.com/modesty/p2jsvc
https://github.com/modesty/pdf2json
https://github.com/coolwanglu/pdf2htmlEX
others
Handling of embedded fonts in pdf
http://stackoverflow.com/questions/11093051/handling-remapping-missing-problematic-cid-cjk-fonts-in-pdf-with-ghostscript?rq=1
https://github.com/pts/pdfsizeopt
http://stackoverflow.com/questions/2656329/linux-pdf-postscript-optimizing
http://www.aivosto.com/vbtips/pdf-optimize.html
http://stackoverflow.com/questions/21279548/facing-issues-on-extracting-text-from-pdf-file-using-java
http://stackoverflow.com/questions/29633504/embedded-fonts-in-pdf-copy-and-paste-problems?rq=1
http://stackoverflow.com/questions/18762625/get-information-whether-text-is-extractable-from-pdf?rq=1
http://stackoverflow.com/questions/30222424/copy-text-from-pdf-with-custom-font?rq=1
http://stackoverflow.com/questions/3488042/how-can-i-extract-embedded-fonts-from-a-pdf-as-valid-font-files/3489099#3489099
http://stackoverflow.com/questions/7140476/pdf-font-mapping-error?rq=1
http://stackoverflow.com/questions/25602262/ghostscript-re-encoding-embedded-font?rq=1
http://stackoverflow.com/questions/28797418/replace-all-font-glyphs-in-a-pdf-by-converting-them-to-outline-shapes?rq=1
http://stackoverflow.com/questions/15722099/issues-decoding-flate-from-pdf-embedded-font?rq=1
http://stackoverflow.com/questions/3647940/pdf-on-linux-combine-font-subsets-and-replace-type-3-with-type-1?rq=1
http://stackoverflow.com/questions/3036373/altering-an-embedded-truetype-font-so-it-will-be-usable-by-windows-gdi?rq=1