The issue of identifying variant Chinese characters #3999

maxyou2090 · 2024-10-29T08:00:38Z

Description of the bug

Extracting Chinese text from .doc files with PyMuPDF Pro may return some characters as variant forms. These variants look identical to the naked eye, but they are different characters with different Unicode code points. For example, “⼈” (incorrectly extracted by PyMuPDF Pro) versus “人” (the correct character).

How to reproduce the bug

Code Sample

import pymupdf.pro

pymupdf.pro.unlock()
doc = pymupdf.open("demo.doc")
for page in doc:
    print(page.get_text())
    break

Output

法定代表⼈

DOC Content

法定代表人
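
The difference between the extracted and the expected string can be made visible by printing code points. This is a minimal sketch, assuming the extracted character is the Kangxi radical U+2F08 (which is what the glyph in the output appears to be) rather than the CJK ideograph U+4EBA. The helper name `describe` is hypothetical. NFKC normalization from the standard library maps Kangxi radicals to their unified ideographs, so it may serve as a post-extraction workaround.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# Assumption: the last extracted character is KANGXI RADICAL MAN (U+2F08).
extracted = "法定代表\u2f08"
# The expected last character is the CJK unified ideograph U+4EBA.
expected = "法定代表\u4eba"

print(describe(extracted)[-1])
print(describe(expected)[-1])

# NFKC folds compatibility characters such as Kangxi radicals.
normalized = unicodedata.normalize("NFKC", extracted)
print(normalized == expected)
```

Note that NFKC also folds other compatibility characters (for example full-width Latin letters), so it may change more than the radicals.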

DOC File
demo.doc.zip

PyMuPDF version

1.24.12

Operating system

macOS

Python version

3.10

JorjMcKie · 2024-10-29T12:10:00Z

I don't see your problem. In demo.doc the complete line is written using one font, FangSong.
In Linux, the result is also only one single font, namely "Droid Sans Fallback":

[{'size': 10.999500274658203,
  'flags': 4,
  'font': 'Droid_Sans_Fallback',
  'color': 0,
  'ascender': 1.04296875,
  'descender': -0.265625,
  'text': '法定代表人',
  'origin': (90.0, 94.343017578125),
  'bbox': (90.0, 82.87088012695312, 144.9974822998047, 97.26476287841797)}]

Under Windows, the corresponding output is analogous:

[{'size': 10.999500274658203,
  'flags': 4,
  'font': 'Microsoft_YaHei',
  'color': 0,
  'ascender': 1.05810546875,
  'descender': -0.26171875,
  'text': '法定代表人',
  'origin': (90.0, 94.447998046875),
  'bbox': (90.0, 82.80936431884766, 144.9974822998047, 97.32677459716797)}]

Are you saying that that character is different in these two fonts?

In general, the routine in pymupdf.pro that converts an Office document looks for available fonts in the OS environment whenever needed.
Please have a look at the output of this script to see which font was chosen as an adequate replacement for FangSong on your machine:

from pprint import pp
import pymupdf.pro

pymupdf.pro.unlock()
doc = pymupdf.open("demo.doc")
page = doc[0]
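# collect all text spans; image blocks carry no "lines" key, so skip them
spans = [
    span
    for block in page.get_text("dict")["blocks"]
    for line in block.get("lines", [])
    for span in line["spans"]
]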
pp(spans)
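
To see the chosen replacement font and the extracted code points side by side, the span list printed by the script above can be condensed per span. This is a sketch only: `summarize_spans` is a hypothetical helper, and the sample span copies just the 'font' and 'text' keys from the Linux output quoted earlier, so it runs without PyMuPDF.

```python
def summarize_spans(spans):
    """Map each span to (font name, list of code points in its text)."""
    return [
        (s["font"], [f"U+{ord(ch):04X}" for ch in s["text"]])
        for s in spans
    ]


# Sample span reduced to the two keys used here (values from the Linux output).
# The last character is written as an escape to make the code point explicit.
sample = [{"font": "Droid_Sans_Fallback", "text": "法定代表\u4eba"}]

result = summarize_spans(sample)
for font, codepoints in result:
    print(font, codepoints)
```

In a real run, `spans` from the script above would be passed in place of `sample`. A last code point of U+2F08 instead of U+4EBA would point to the variant form described in the report.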

JorjMcKie · 2024-10-29T12:17:25Z

You have also submitted another bug, #3998, of which this one is probably a duplicate.
We are investigating how the search for suitable system fonts can be adjusted to make sure that fonts present in an office document are matched adequately.

JorjMcKie added the duplicate label Oct 30, 2024
