The issue of identifying variant Chinese characters #3999

Open
maxyou2090 opened this issue Oct 29, 2024 · 2 comments

@maxyou2090

Description of the bug

Extracting Chinese text from DOC files using PyMuPDF Pro may result in some characters being extracted as variant forms. These variant forms may look identical to the naked eye, but they are different characters with different code points. For example, “⼈” (incorrectly extracted by PyMuPDF Pro) versus “人” (the correct character).
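
To make the difference visible, the two characters can be compared by code point. This is a minimal sketch that assumes the extracted character is U+2F08 (KANGXI RADICAL MAN), which is what the sample output appears to contain, and that the expected one is U+4EBA. The helper name `describe` is only illustrative.

```python
import unicodedata

# Assumption: these are the two characters from the example above.
extracted = "\u2f08"  # what the extraction appears to return
expected = "\u4eba"   # what the document actually contains


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(extracted))
print(describe(expected))
print("same character:", extracted == expected)
```

Running the same helper over each character of the real `page.get_text()` output would show exactly which code points were extracted.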

How to reproduce the bug

Code Sample

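import pymupdf.pro

pymupdf.pro.unlock()  # activate the PyMuPDF Pro features
doc = pymupdf.open("demo.doc")
for page in doc:
    print(page.get_text())  # plain-text extraction
    break  # the first page is enough to show the problem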

Output

法定代表⼈

DOC Content

法定代表人

DOC File
demo.doc.zip

PyMuPDF version

1.24.12

Operating system

macOS

Python version

3.10

@JorjMcKie
Collaborator

JorjMcKie commented Oct 29, 2024

I don't see your problem. In demo.doc the complete line is written using one font, FangSong.
On Linux, the result is also only one single font, namely "Droid Sans Fallback":

[{'size': 10.999500274658203,
  'flags': 4,
  'font': 'Droid_Sans_Fallback',
  'color': 0,
  'ascender': 1.04296875,
  'descender': -0.265625,
  'text': '法定代表人',
  'origin': (90.0, 94.343017578125),
  'bbox': (90.0, 82.87088012695312, 144.9974822998047, 97.26476287841797)}]

Under Windows, the corresponding output is analogous:

[{'size': 10.999500274658203,
  'flags': 4,
  'font': 'Microsoft_YaHei',
  'color': 0,
  'ascender': 1.05810546875,
  'descender': -0.26171875,
  'text': '法定代表人',
  'origin': (90.0, 94.447998046875),
  'bbox': (90.0, 82.80936431884766, 144.9974822998047, 97.32677459716797)}]

Are you saying that that character is different in these two fonts?

In general, the conversion routine in pymupdf.pro that makes an Office document available looks for fonts in the OS environment whenever it needs them.
Please have a look at the output of this script to see which font was chosen as an adequate replacement for FangSong on your machine:

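from pprint import pp
import pymupdf.pro

pymupdf.pro.unlock()
doc = pymupdf.open("demo.doc")
page = doc[0]
# Flatten the page's text structure (blocks -> lines -> spans) into one list.
# Image blocks have no "lines" key, so fall back to an empty list for them.
spans = [
    span
    for block in page.get_text("dict")["blocks"]
    for line in block.get("lines", [])
    for span in line["spans"]
]
pp(spans)  # each span names the font that was actually used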
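
If the span list is long, a small filter can narrow it down to the spans that matter. The following is only a sketch: it assumes the unwanted characters come from the Unicode Kangxi Radicals block (U+2F00 to U+2FD5) and that each span is a dict with `'font'` and `'text'` keys, as in the outputs shown above. The function name `spans_with_kangxi_radicals` is hypothetical and not part of PyMuPDF.

```python
def spans_with_kangxi_radicals(spans):
    """Return (font, text, code points) for every span containing Kangxi radicals."""
    hits = []
    for span in spans:
        # Collect characters that fall inside the Kangxi Radicals block.
        radicals = [ch for ch in span["text"] if 0x2F00 <= ord(ch) <= 0x2FD5]
        if radicals:
            code_points = [f"U+{ord(ch):04X}" for ch in radicals]
            hits.append((span["font"], span["text"], code_points))
    return hits


# Hand-written sample spans for illustration; real input would be the
# `spans` list produced by the script above.
sample = [
    {"font": "Droid_Sans_Fallback", "text": "法定代表\u4eba"},
    {"font": "SomeOtherFont", "text": "法定代表\u2f08"},
]
for font, text, code_points in spans_with_kangxi_radicals(sample):
    print(font, text, code_points)
```

The font names in the sample are placeholders; the point is to see which replacement font, if any, is associated with the radical forms.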

@JorjMcKie
Collaborator

You have also submitted another bug, #3998, of which this one is probably a duplicate.
We are investigating how the search for suitable system fonts can be adjusted so that fonts present in an Office document are matched adequately.
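
Independently of how font matching is resolved, a caller could post-process the extracted text. The sketch below is a possible workaround, not something PyMuPDF provides: it assumes the unwanted characters are Kangxi radicals (U+2F00 to U+2FD5) and relies on their Unicode compatibility decomposition, applying NFKC normalization to those characters only so that other text (for example fullwidth punctuation) is left untouched. The name `fold_kangxi_radicals` is hypothetical.

```python
import unicodedata

KANGXI_FIRST, KANGXI_LAST = 0x2F00, 0x2FD5  # Unicode Kangxi Radicals block


def fold_kangxi_radicals(text):
    """Replace Kangxi radical characters with their unified ideographs."""
    return "".join(
        unicodedata.normalize("NFKC", ch)
        if KANGXI_FIRST <= ord(ch) <= KANGXI_LAST
        else ch
        for ch in text
    )


extracted = "法定代表\u2f08"  # assumed form of the extracted text
print(fold_kangxi_radicals(extracted) == "法定代表\u4eba")
```

In practice the function would be applied to the string returned by `page.get_text()`. Normalizing the whole string with NFKC would also work but changes more than the radicals.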
