Inconsistent naming of fonts depending on insertion method; inconsistent runtime #4081

alexander-fried · 2024-11-22T16:26:43Z

Description of the bug

There are four classes of bugs:

fitz.Font(fontbuffer=fitz.Font('Helvetica').buffer).name is not Helvetica but Nimbus Sans Regular
The font names produced by page.insert_text and by the TextWriter are inconsistent
The font names returned by page.get_fonts() differ from those returned by page.get_text("dict")
The runtime of inserting text with page.insert_text differs significantly from that of the TextWriter in the current version of PyMuPDF; the two were much closer in version 1.24.0
font_bug.md

How to reproduce the bug

font_bug.md
Rename the file extension to .py
Run the code with python -m pytest -s font_bug.py

PyMuPDF version

1.24.14

Operating system

MacOS

Python version

3.10

julian-smith-artifex-com · 2024-11-22T16:53:16Z

Could you paste the text of font_bug.py into this issue page (inside triple-quotes) so we can read it directly?

alexander-fried · 2024-11-22T16:58:47Z

# Run as python script

import fitz
import pandas as pd
import time
import pprint

# Helper function
def get_text_fonts(page: fitz.Page) -> set:
    """
    Get all fonts named in the page.get_text("dict") function
    """
    fontnames = set()
    for block in page.get_text("dict")["blocks"]:
        if "lines" not in block:
            continue 
        for line in block["lines"]:
            for span in line["spans"]:
                fontnames.add(span['font'])
    return fontnames

# Helper function
def get_fonts(page: fitz.Page) -> list:
    """
    Get all basefonts named in the page.get_fonts() function
    From the xref in page.get_fonts, extract the font data, and get the font name.
    Assert that the basefont from page.parent.extract_font is the same as the basefont from page.get_fonts()
    """
    data = []
    for xref, _, _, basefont, _, _ in set(page.get_fonts()):
        basefont2, _, _, content = page.parent.extract_font(xref)
        font = fitz.Font(fontbuffer=content)

        # This assertion passes
        assert basefont2 == basefont

        data.append({"basefont": basefont, "buffer_font_name": font.name})

    return data

def test_font_buffer_helv():
    font = fitz.Font("helv")
    buffer_font = fitz.Font(fontbuffer=font.buffer)
    assert font.name == buffer_font.name, f"{font.name} != {buffer_font.name}"

    # RETURNS
    # AssertionError: Helvetica != Nimbus Sans Regular

    # EXPECT both to be 'Helvetica'

def test_font_buffer_zadb():
    font = fitz.Font("zadb")
    buffer_font = fitz.Font(fontbuffer=font.buffer)
    assert font.name == buffer_font.name, f"{font.name} != {buffer_font.name}"

    # RETURNS
    # AssertionError: ZapfDingbats != Dingbats Regular

    # EXPECT both to be 'ZapfDingbats'

def test_textwriter():
    pdf = fitz.open()
    page = pdf.new_page(width=600, height=712)
    tw = fitz.TextWriter(page.mediabox)
    tw.append((100,100), "Hello World", font=fitz.Font("helv"), fontsize=12)
    tw.append((100,200), "Hello Symbol", font=fitz.Font("zadb"), fontsize=12)
    tw.append((50,200), "Hello Text", font=fitz.Font("Helvetica"), fontsize=12)  # Redundant, but for testing purposes since helv==Helvetica
    tw.write_text(page)

    text_fonts = get_text_fonts(page)
    page_fonts = get_fonts(page)
    print("")
    print("Using textwriter")
    print("a) Fonts found in page.get_text('dict'): \n\t", text_fonts)
    print("b) Fonts found in page.get_fonts() and in the buffer:")
    print(pd.DataFrame(page_fonts))
    assert text_fonts == {'Helvetica', 'ZapfDingbats'}
    assert all([x["buffer_font_name"] == x["basefont"] for x in page_fonts])
    assert text_fonts == set([x["basefont"] for x in page_fonts])
    assert "Noto Serif Regular" not in [x["basefont"] for x in page_fonts]

    # RETURNS:
    # Using textwriter
    # a) Fonts found in page.get_text('dict'):  
    #          {'NimbusSans-Regular', 'Dingbats', 'NotoSerif-Regular'}   <- note the slightly different font names 'NotoSerif-Regular' vs 'Noto Serif Regular'; 'NimbusSans-Regular' vs 'Nimbus Sans Regular' for basefont Helvetica;  and "Dingbats" vs "Dingbats Regular" vs the basefont 'ZapfDingbats'
    # b) Fonts found in page.get_fonts() and in the buffer:
    #              basefont     buffer_font_name                         <- note that the buffer_font_name difference seemingly comes from the bug in fitz.Font described above
    # 0           Helvetica  Nimbus Sans Regular                         <- but compare with the buffer_font_names to the insertion method with insert_text
    # 1  Noto Serif Regular   Noto Serif Regular                         <- Note the introduction of the new font 'NotoSerif-Regular' (presumably because an unsupported character is inserted with the zadb font)
    # 2        ZapfDingbats     Dingbats Regular

    # EXPECT:
    # a) Fonts found in page.get_text('dict'):  
    #          {'Helvetica', 'ZapfDingbats'}
    # b) Fonts found in page.get_fonts() and in the buffer:
    #              basefont     buffer_font_name
    # 0           Helvetica       Helvetica
    # 1        ZapfDingbats     ZapfDingbats

def test_insert_text():
    pdf = fitz.open()
    page = pdf.new_page(width=600, height=712)
    page.insert_text(
        (100,100), 
        "Hello World",
        fontname="helv",
        fontsize=12,
    )

    page.insert_text(
        (100,200), 
        "Hello Symbol",
        fontname="zadb",
        fontsize=12,
    )

    # Redundant, but for testing purposes since helv==Helvetica
    page.insert_text(
        (50,200), 
        "Hello Text",
        fontname="Helvetica",
        fontsize=12,
    )

    text_fonts = get_text_fonts(page)
    page_fonts = get_fonts(page)
    print("")
    print("Using page.insert_text")
    print("a) Fonts found in page.get_text('dict'): \n\t", text_fonts)
    print("b) Fonts found in page.get_fonts() and in the buffer:")
    print(pd.DataFrame(page_fonts))
    assert text_fonts == {'Helvetica', 'ZapfDingbats'}
    assert all([x["buffer_font_name"] == x["basefont"] for x in page_fonts])
    assert text_fonts == set([x["basefont"] for x in page_fonts])
    assert "Noto Serif Regular" not in [x["basefont"] for x in page_fonts]  # Not sure if this is a bug or a feature

    # RETURNS:
    # Using page.insert_text
    # a) Fonts found in page.get_text('dict'):
    #          {'Helvetica', 'ZapfDingbats'}    <- note that the insert_text method uses the basefont name as the font name retrieved by page.get_text('dict'), in contrast to when using textwriter
    # b) Fonts found in page.get_fonts() and in the buffer:
    #     basefont    buffer_font_name
    # 0     Helvetica  Noto Serif Regular
    # 1  ZapfDingbats  Noto Serif Regular       <- note that this method of insertion has different font names than the textwriter
    # 2     Helvetica  Noto Serif Regular

    # EXPECT:
    # Using page.insert_text
    # a) Fonts found in page.get_text('dict'):
    #          {'Helvetica', 'ZapfDingbats'}
    # b) Fonts found in page.get_fonts() and in the buffer:
    #     basefont    buffer_font_name
    # 0     Helvetica  Helvetica
    # 1  ZapfDingbats  ZapfDingbats

def test_show_full_font_mapping():
    # For the basefont names
    pprint.pp([(fitz.Font(fontbuffer=fitz.Font(f).buffer).name, f) for f in fitz.Base14_fontnames])

    # For the basefont abbreviations
    pprint.pp([(fitz.Font(fontbuffer=fitz.Font(f).buffer).name, f) for f in fitz.Base14_fontdict.keys()])

    # Finding the named font in the page.get_text("dict") function after inserting each font with the textwriter
    text_map = {}
    for f in fitz.Base14_fontnames:
        with fitz.open() as pdf:
            page = pdf.new_page(width=600, height=712)
            tw = fitz.TextWriter(page.mediabox)
            tw.append((100,100), "Hello World", font=fitz.Font(f), fontsize=12)
            tw.write_text(page)
            extracted_fonts = get_text_fonts(page)
            text_map[f] = extracted_fonts

    pprint.pp(text_map)


def test_runtime_insert_text():
    print("")
    with fitz.open() as pdf:
        page = pdf.new_page(width=600, height=712)
        start = time.time()
        for _ in range(1000):
            page.insert_text(
                (100,100), 
                "Hello World",
                fontname="helv", 
                fontsize=12,
            )
        stop = time.time()
        print("Time to insert 1000 text strings with insert_text: ", stop-start)

    with fitz.open() as pdf:
        page = pdf.new_page(width=600, height=712)
        start = time.time()
        tw = fitz.TextWriter(page.mediabox)
        for _ in range(1000):
            tw.append((100,100), "Hello World", font=fitz.Font("helv"), fontsize=12)

        tw.write_text(page)
        stop = time.time()
        print("Time to insert 1000 text strings with textwriter:  ", stop-start)

    # RETURNS for 
    # PyMuPDF==1.24.0
    # PyMuPDFb==1.24.0
    # Time to insert 1000 text strings with insert_text:  0.13602900505065918
    # Time to insert 1000 text strings with textwriter:   0.08923721313476562
        
    # versus for 
    # PyMuPDF==1.24.14
    # PyMuPDFb==1.24.9        
    # Time to insert 1000 text strings with insert_text:  0.685333251953125
    # Time to insert 1000 text strings with textwriter:   0.09531903266906738
        
    # For the updated packages, the insertion times are longer and the difference in times is significantly larger
    # Not sure if this is a bug or a feature
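
A side note on the timing comparison above: each figure comes from a single run measured with time.time(), so some of the difference may be noise. Below is a minimal sketch of a steadier measurement. It assumes a hypothetical helper named best_of (not part of PyMuPDF) and uses a stand-in workload; in practice the callable would wrap the insert_text or TextWriter loop from the test above.

```python
import time
from typing import Callable


def best_of(func: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` calls of func.

    Hypothetical helper for illustration; not part of PyMuPDF.
    """
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        timings.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by other system activity.
    return min(timings)


def workload() -> None:
    """Stand-in workload; replace with the text insertion loop being measured."""
    sum(i * i for i in range(10_000))


print(f"best of 5 runs: {best_of(workload):.6f} s")
```

Taking the minimum rather than the mean is a design choice: it reports the run least affected by background activity, which makes comparisons between library versions less sensitive to one slow outlier.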

JorjMcKie · 2024-11-22T20:10:16Z

This is not a bug; it works as designed. "Nimbus Sans Regular" is the font used to implement "Helvetica", so there is no inconsistency here.

JorjMcKie · 2024-11-22T20:12:20Z

Helvetica is one of the Base14 fonts and is non-embeddable.
These 3 fonts (Helvetica, Times-Roman and Courier) are implemented by metrically identical Nimbus* fonts made by the company URW.
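
For anyone who needs to compare font names across the two insertion methods, one option is to fold the substitute names back to their Base14 names before comparing. The sketch below is only an illustration: canonical_font_name and the mapping table are hypothetical helpers, not part of PyMuPDF, and the table covers only the names reported in this issue.

```python
# Substitute font names seen in this issue, keyed by a normalized form
# (lowercase, letters and digits only). Assumption: the table is incomplete
# and would need extending for the other Base14 fonts.
SUBSTITUTE_TO_BASE14 = {
    "nimbussansregular": "Helvetica",
    "dingbatsregular": "ZapfDingbats",
    "dingbats": "ZapfDingbats",
}


def _normalize(name: str) -> str:
    """Lowercase the name and drop spaces, hyphens and other punctuation."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def canonical_font_name(name: str) -> str:
    """Map a known substitute name to its Base14 name; return others unchanged."""
    return SUBSTITUTE_TO_BASE14.get(_normalize(name), name)


print(canonical_font_name("NimbusSans-Regular"))   # Helvetica
print(canonical_font_name("Nimbus Sans Regular"))  # Helvetica
print(canonical_font_name("Dingbats Regular"))     # ZapfDingbats
```

Normalizing before the lookup lets both spellings from the report ('NimbusSans-Regular' from page.get_text("dict") and 'Nimbus Sans Regular' from the font buffer) resolve to the same entry. Names that are not in the table, such as 'Noto Serif Regular', pass through untouched.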

JorjMcKie added the "not a bug" label (not a bug / user error / unable to reproduce) on Nov 22, 2024
