ocr-transform alto hocr: HTML, but xmlns=xhtml #184

Open
jbarth-ubhd opened this issue May 3, 2024 · 2 comments

jbarth-ubhd commented May 3, 2024

Example input, bzip2-compressed, base64-encoded:

QlpoOTFBWSZTWQ0I/UwAAvTfgERUUGf/97/n3sC/7//6UAVedhYMQaNAaaXQklNMRNU80yaan6ph
Q9J6mnlB6mjRtRoPKepoyAaSNT0j9UAMjQZAA00GgGgAAAkSVNlPSeoaA0xDQyMQDQyA0AaGg4yZ
MmIxMAJkwTIAaMIwBDAFSSaE0yTFMT0Jqj2pih6nlD0jygPUyGCMThMpEeCSRUBGUiRxfz+10Z7I
+0Y9KwRhSQ+tKkfWIpKhhZ887jv+f/E1/zrNFZihDCc0yJfNatKwMMcxblFLX9EiDV903RNeU07g
5HfRKyslJN+VTQ+U9cYNFYJojmaDOePfOYkCyupbBFQwAy9FAIS0Fw4PaXDjatC91faiJxJH7P4T
Dqy9xKB+jITgtHis3WoyDxTmshBE1KqxJ8ufqDfpcukQKDk1hfaExvTwx3rrSQ896qGuUNuFDE1D
SkL4KnXXecQmyAtmtCdCOcoiJzjJl8WK89FWzPZsfBi2zs47nh1O9rdyfKM3kyxrRn1MMGWCr3Yx
jd5WnJkz546K0a8WmE/aXtyfeCeHvr+W/T4o6+/FK+Ky9PkEsi16Sr/CR6aSqT2+W2CT2VPXTCoM
uXLAj3CkZlQUVI+n3+Pu9mqYWvfPJnUxRjkZlMEqllLFIQzMyZk93ZF0IbOK6unoY4iuJkORHct3
r3TpGHh2qSdT9LpATyBUcqxIB3DucJRAdhzMdeFygHAX41Cb2GGlZGx0/Pd5Mevw6N/fszNq+4/F
EYqO6o2hQJN2kFnUylknybsvbhVpYybg9Xb7ZHWkS41TXnGibTWZRhwcUmLB+TSjScK3ZPq6r568
Z6yDveTP6tcWHVAnYru78qRtMp+GJvY6M+T57rHCMkTNazOY7t7Okj9ULTeyDLMmTDRpj2srRl9N
ettH3KtVqOlzu4VGMg86p1gxoUDRDabbaBwEJKAY1notPaqu1C3pc0xZccWSfh/4JeYyhlgLUSyc
6TfHoO7jeqKeapZNitG05YQWKTDaTnvWzl5jJ5ke9JvS+c9RlOAdHVi/Tzt4Wy7pyTiVIZkzMyVM
QInIzp5asS2kyqZGu/ROVtJE54WYK+v2aTGYmjV4xm7WG3dhXAqk06dvLZVYlSVTm3TN4+nHB6Dw
3KwnNbS5TsCZbC3FUToHDrU9erN6GK695Zln4uWK+rdvziLyMvbZIma6Hq2TbEdia2S5+U3ZdSrs
IUpGjrCngl4w4aY3OdBg2W9LRyYZyMIq1ef9eqMqOCR3uc+gV0vfzYOXAXeM8F2PGQ5rRxHl1NK+
gpdvfBtuj7ju885uzzpm3l5sdaw3bJjqMbbQJMQxttJibb6KaUkKrwIr2HLLgy4F5xLJgluZVqrx
lpSVe0wXSKHQ6OLYymMWnq6LxvguijSKkwqPjPTZIC2TnaZkzs4tJMnDDwTE92NSzQSH0rIQMtGe
S25ZGcZMM5jUG05OGnNtqvo5YYJZ2YMtyTXTVJjuK1ZK8iXj3JtBw3J+LXibLdmzw5gmQrYgTF4d
lepMTGUNKSFXzKwSF/dlVvP40vSMEQJEYyaASoKSTrS6TYCkBZAihirCYiA1nY5kQSLCYcr0kKtC
pX+X/F3JFOFCQDQj9TA=
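
To recover the input file from the block above, a minimal sketch along these lines may help. The payload begins with `QlpoOTFB`, which base64-decodes to the bzip2 magic `BZh91A`, so the helper picks the decompressor from the magic bytes instead of assuming gzip. The function name `decode_payload` is hypothetical and not part of ocr-transform.

```python
import base64
import bz2
import gzip


def decode_payload(b64_text: str) -> bytes:
    """Base64-decode the text, then decompress it based on its magic bytes.

    Hypothetical helper for illustration; not part of ocr-transform.
    """
    # b64decode ignores the line breaks of a wrapped base64 block by default.
    raw = base64.b64decode(b64_text)
    if raw.startswith(b"BZh"):       # bzip2 stream header
        return bz2.decompress(raw)
    if raw.startswith(b"\x1f\x8b"):  # gzip stream header
        return gzip.decompress(raw)
    return raw                       # not compressed, return as-is


if __name__ == "__main__":
    # Round-trip demo with a stand-in document (not the issue's real payload).
    sample = base64.encodebytes(bz2.compress(b"<alto/>")).decode("ascii")
    print(decode_payload(sample))
```

Pasting the block from this comment into a string and passing it to `decode_payload` should yield the original input document, assuming the paste is complete.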

Output (PoCoTo complains about missing closing tags in <meta> etc.):

<!DOCTYPE HTML><html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <link rel="stylesheet" href="">
      <title>OCR Output</title>
      <meta name="description" content="OCR Output via XSLT of pageXML.">
   </head>
   <body>
      <div class="ocr_page" title="bbox 0 0 5553 7287; image 'image/OCR-D-IMG/00001.tif';"></div>
   </body>
@jbarth-ubhd

(Not your problem: html2xhtml writes xmlns= twice; even after correcting this, PoCoTo complains about missing page segmentation.)
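
Both complaints can be reproduced outside PoCoTo with a strict XML parser. The sketch below assumes PoCoTo reads hOCR as XML (the thread does not say how it parses, so this is an assumption) and uses Python's standard `xml.etree.ElementTree`. The name `is_well_formed_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML, else False."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


XHTML_NS = 'xmlns="http://www.w3.org/1999/xhtml"'

# HTML-style void element: <meta> is never closed, so an XML parser rejects it.
html_style = f'<html {XHTML_NS}><head><meta name="description" content="x"></head></html>'

# XHTML-style: the same element written as self-closing.
xhtml_style = f'<html {XHTML_NS}><head><meta name="description" content="x"/></head></html>'

# The same attribute twice on one element is also not well-formed XML.
double_xmlns = f'<html {XHTML_NS} {XHTML_NS}><head/></html>'

if __name__ == "__main__":
    for label, doc in [("html", html_style), ("xhtml", xhtml_style), ("2x xmlns", double_xmlns)]:
        print(label, is_well_formed_xml(doc))
```

Only the self-closing variant passes, which matches the reported behaviour: output that declares the XHTML namespace but is serialized with HTML rules is not usable by XML-based consumers.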

jbarth-ubhd commented May 3, 2024

page→alto & alto→hocr works with PoCoTo
