Extracting Chinese text from .doc files with PyMuPDF Pro can return some characters as variant forms. These variants may look identical to the naked eye, but they are different characters with different code points. For example, “⼈” (incorrectly extracted by PyMuPDF Pro) versus “人” (the correct character).
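The difference between the two look-alike characters can be made visible with Python's standard `unicodedata` module. The sketch below assumes the extracted character is U+2F08, as pasted in this report; the helper names `describe` and `find_compat_chars` are made up for illustration and are not part of PyMuPDF.

```python
import unicodedata


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def find_compat_chars(text):
    """List (index, char, NFKC form) for every character NFKC would change."""
    hits = []
    for i, ch in enumerate(text):
        normalized = unicodedata.normalize("NFKC", ch)
        if normalized != ch:
            hits.append((i, ch, normalized))
    return hits


extracted = "\u2f08"  # assumption: the variant form shown in this report
expected = "\u4eba"   # the ordinary CJK ideograph

print(describe(extracted))
print(describe(expected))
print(find_compat_chars("中" + extracted + "文"))
```

As a possible workaround, `unicodedata.normalize("NFKC", text)` maps the variant back to the ordinary ideograph, but NFKC also rewrites other compatibility characters (full-width forms, ligatures), so it may not suit every use case.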
How to reproduce the bug
Code Sample
import pymupdf.pro
pymupdf.pro.unlock()
doc = pymupdf.open("demo.doc")
for page in doc:
    print(page.get_text())
    break
I don't see your problem. In demo.doc the complete line is written using one font, FangSong.
In Linux, the result is also only one single font, namely "Droid Sans Fallback":
Are you saying that that character is different in these two fonts?
In general, the routine in pymupdf.pro that converts an Office document looks up fonts available in the OS environment whenever it needs one.
Please have a look at the output of this script to see which font was chosen as an adequate replacement for FangSong on your machine:
You have also submitted another bug, #3998, of which this one is probably a duplicate.
We are investigating how the search for suitable system fonts can be adjusted to make sure that fonts present in an office document are matched adequately.
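To see which fonts the converted document actually reports, something along the following lines may help. This is only a sketch and not the script referred to earlier in this thread: `fonts_in_page_dict` and `report_fonts` are hypothetical helper names, and the code assumes the usual layout of `page.get_text("dict")` output (blocks, then lines, then spans that carry `font` and `text` keys).

```python
from collections import Counter


def fonts_in_page_dict(page_dict):
    """Count extracted characters per font name in a get_text("dict") result."""
    counts = Counter()
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                counts[span["font"]] += len(span["text"])
    return counts


def report_fonts(path):
    """Print the per-page font usage of a document (needs PyMuPDF Pro for .doc)."""
    import pymupdf.pro  # imported lazily so the helper above works without it

    pymupdf.pro.unlock()
    doc = pymupdf.open(path)
    for page in doc:
        print(page.number, dict(fonts_in_page_dict(page.get_text("dict"))))
```

Calling `report_fonts("demo.doc")` on the machine in question would then list the font names the text spans ended up with, which can be compared with the font used in the original document.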
DOCX File
demo.doc.zip
PyMuPDF version
1.24.12
Operating system
MacOS
Python version
3.10