Inconsistent naming of fonts depending on insertion method; inconsistent runtime #4081

Open
alexander-fried opened this issue Nov 22, 2024 · 4 comments
Labels
not a bug not a bug / user error / unable to reproduce

Comments

@alexander-fried

Description of the bug

There are four classes of bugs:

  1. fitz.Font(fontbuffer=fitz.Font('Helvetica').buffer).name does not equal Helvetica; it is Nimbus Sans Regular instead.
  2. The font names produced by page.insert_text and by the TextWriter are inconsistent with each other.
  3. The font names returned by page.get_fonts() differ from those returned by page.get_text("dict").
  4. The runtimes of inserting text with page.insert_text and with the TextWriter differ significantly in the current version of PyMuPDF, but were much closer in version 1.24.0.

Attachment: font_bug.md

How to reproduce the bug

  1. Download the attached font_bug.md.
  2. Rename the file extension to .py.
  3. Run the code with python -m pytest -s font_bug.py

PyMuPDF version

1.24.14

Operating system

MacOS

Python version

3.10

@julian-smith-artifex-com
Collaborator

julian-smith-artifex-com commented Nov 22, 2024

Could you paste the text of font_bug.py into this issue page (inside triple-quotes) so we can read it directly?

@alexander-fried
Author

alexander-fried commented Nov 22, 2024

# Run as python script

import fitz
import pandas as pd
import time
import pprint

# Helper function
def get_text_fonts(page: fitz.Page) -> set:
    """
    Get all fonts named in the page.get_text("dict") function
    """
    fontnames = set()
    for block in page.get_text("dict")["blocks"]:
        if "lines" not in block:
            continue 
        for line in block["lines"]:
            for span in line["spans"]:
                fontnames.add(span['font'])
    return fontnames

# Helper function
def get_fonts(page: fitz.Page) -> dict:
    """
    Get all basefonts named in the page.get_fonts() function
    From the xref in page.get_fonts, extract the font data, and get the font name.
    Assert that the basefont from page.parent.extract_font is the same as the basefont from page.get_fonts()
    """
    data = []
    for xref, _, _, basefont, _, _ in set(page.get_fonts()):
        basefont2, _, _, content = page.parent.extract_font(xref)
        font = fitz.Font(fontbuffer=content)

        # This assertion passes
        assert basefont2 == basefont

        data.append({"basefont": basefont, "buffer_font_name": font.name})

    return data

def test_font_buffer_helv():
    font = fitz.Font("helv")
    buffer_font = fitz.Font(fontbuffer=font.buffer)
    assert font.name == buffer_font.name, f"{font.name} != {buffer_font.name}"

    # RETURNS
    # AssertionError: Helvetica != Nimbus Sans Regular

    # EXPECT both to be 'Helvetica'

def test_font_buffer_zadb():
    font = fitz.Font("zadb")
    buffer_font = fitz.Font(fontbuffer=font.buffer)
    assert font.name == buffer_font.name, f"{font.name} != {buffer_font.name}"

    # RETURNS
    # AssertionError: ZapfDingbats != Dingbats Regular

    # EXPECT both to be 'ZapfDingbats'

def test_textwriter():
    pdf = fitz.open()
    page = pdf.new_page(width=600, height=712)
    tw = fitz.TextWriter(page.mediabox)
    tw.append((100,100), "Hello World", font=fitz.Font("helv"), fontsize=12)
    tw.append((100,200), "Hello Symbol", font=fitz.Font("zadb"), fontsize=12)
    tw.append((50,200), "Hello Text", font=fitz.Font("Helvetica"), fontsize=12)  # Redundant, but for testing purposes since helv==Helvetica
    tw.write_text(page)

    text_fonts = get_text_fonts(page)
    page_fonts = get_fonts(page)
    print("")
    print("Using textwriter")
    print("a) Fonts found in page.get_text('dict'): \n\t", text_fonts)
    print("b) Fonts found in page.get_fonts() and in the buffer:")
    print(pd.DataFrame(page_fonts))
    assert text_fonts == {'Helvetica', 'ZapfDingbats'}
    assert all([x["buffer_font_name"] == x["basefont"] for x in page_fonts])
    assert text_fonts == set([x["basefont"] for x in page_fonts])
    assert "Noto Serif Regular" not in [x["basefont"] for x in page_fonts]

    # RETURNS:
    # Using textwriter
    # a) Fonts found in page.get_text('dict'):  
    #          {'NimbusSans-Regular', 'Dingbats', 'NotoSerif-Regular'}   <- note the slightly different font names 'NotoSerif-Regular' vs 'Noto Serif Regular'; 'NimbusSans-Regular' vs 'Nimbus Sans Regular' for basefont Helvetica;  and "Dingbats" vs "Dingbats Regular" vs the basefont 'ZapfDingbats'
    # b) Fonts found in page.get_fonts() and in the buffer:
    #              basefont     buffer_font_name                         <- note that the buffer_font_name difference seemingly comes from the bug in fitz.Font described above
    # 0           Helvetica  Nimbus Sans Regular                         <- but compare with the buffer_font_names to the insertion method with insert_text
    # 1  Noto Serif Regular   Noto Serif Regular                         <- Note the introduction of the new font 'NotoSerif-Regular' (presumably because an unsupported character is inserted with the zadb font)
    # 2        ZapfDingbats     Dingbats Regular

    # EXPECT:
    # a) Fonts found in page.get_text('dict'):  
    #          {'Helvetica', 'ZapfDingbats'}
    # b) Fonts found in page.get_fonts() and in the buffer:
    #              basefont     buffer_font_name
    # 0           Helvetica       Helvetica
    # 1        ZapfDingbats     ZapfDingbats

def test_insert_text():
    pdf = fitz.open()
    page = pdf.new_page(width=600, height=712)
    page.insert_text(
        (100,100), 
        "Hello World",
        fontname="helv",
        fontsize=12,
    )

    page.insert_text(
        (100,200), 
        "Hello Symbol",
        fontname="zadb",
        fontsize=12,
    )

    # Redundant, but for testing purposes since helv==Helvetica
    page.insert_text(
        (50,200), 
        "Hello Text",
        fontname="Helvetica",
        fontsize=12,
    )

    text_fonts = get_text_fonts(page)
    page_fonts = get_fonts(page)
    print("")
    print("Using page.insert_text")
    print("a) Fonts found in page.get_text('dict'): \n\t", text_fonts)
    print("b) Fonts found in page.get_fonts() and in the buffer:")
    print(pd.DataFrame(page_fonts))
    assert text_fonts == {'Helvetica', 'ZapfDingbats'}
    assert all([x["buffer_font_name"] == x["basefont"] for x in page_fonts])
    assert text_fonts == set([x["basefont"] for x in page_fonts])
    assert "Noto Serif Regular" not in [x["basefont"] for x in page_fonts]  # Not sure if this is a bug or a feature

    # RETURNS:
    # Using page.insert_text
    # a) Fonts found in page.get_text('dict'):
    #          {'Helvetica', 'ZapfDingbats'}    <- note that the insert_text method uses the basename as the fontname retrieved by page.get_text('dict'), in contrast to when using textwriter
    # b) Fonts found in page.get_fonts() and in the buffer:
    #     basefont    buffer_font_name
    # 0     Helvetica  Noto Serif Regular
    # 1  ZapfDingbats  Noto Serif Regular       <- note that this method of insertion has different font names than the textwriter
    # 2     Helvetica  Noto Serif Regular

    # EXPECT:
    # Using page.insert_text
    # a) Fonts found in page.get_text('dict'):
    #          {'Helvetica', 'ZapfDingbats'}
    # b) Fonts found in page.get_fonts() and in the buffer:
    #     basefont    buffer_font_name
    # 0     Helvetica  Helvetica
    # 1  ZapfDingbats  ZapfDingbats

def test_show_full_font_mapping():
    # For the basefont names
    pprint.pp([(fitz.Font(fontbuffer=fitz.Font(f).buffer).name, f) for f in fitz.Base14_fontnames])

    # For the basefont abbreviations
    pprint.pp([(fitz.Font(fontbuffer=fitz.Font(f).buffer).name, f) for f in fitz.Base14_fontdict.keys()])

    # Finding the named font in the page.get_text("dict") function after inserting each font with the textwriter
    text_map = {}
    for f in fitz.Base14_fontnames:
        with fitz.open() as pdf:
            page = pdf.new_page(width=600, height=712)
            tw = fitz.TextWriter(page.mediabox)
            tw.append((100,100), "Hello World", font=fitz.Font(f), fontsize=12)
            tw.write_text(page)
            extracted_fonts = get_text_fonts(page)
            text_map[f] = extracted_fonts

    pprint.pp(text_map)


def test_runtime_insert_text():
    print("")
    with fitz.open() as pdf:
        page = pdf.new_page(width=600, height=712)
        start = time.time()
        for _ in range(1000):
            page.insert_text(
                (100,100), 
                "Hello World",
                fontname="helv", 
                fontsize=12,
            )
        stop = time.time()
        print("Time to insert 1000 text strings with insert_text: ", stop-start)

    with fitz.open() as pdf:
        page = pdf.new_page(width=600, height=712)
        start = time.time()
        tw = fitz.TextWriter(page.mediabox)
        for _ in range(1000):
            tw.append((100,100), "Hello World", font=fitz.Font("helv"), fontsize=12)

        tw.write_text(page)
        stop = time.time()
        print("Time to insert 1000 text strings with textwriter:  ", stop-start)

    # RETURNS for 
    # PyMuPDF==1.24.0
    # PyMuPDFb==1.24.0
    # Time to insert 1000 text strings with insert_text:  0.13602900505065918
    # Time to insert 1000 text strings with textwriter:   0.08923721313476562
        
    # versus for 
    # PyMuPDF==1.24.14
    # PyMuPDFb==1.24.9        
    # Time to insert 1000 text strings with insert_text:  0.685333251953125
    # Time to insert 1000 text strings with textwriter:   0.09531903266906738
        
    # For the updated packages, the insertion times are longer and the difference in times is significantly larger
    # Not sure if this is a bug or a feature
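The timings above come from a single run each, so they can be noisy. A minimal sketch of a steadier measurement, assuming nothing about PyMuPDF itself: `best_of` is a hypothetical helper (not part of any library) that runs a callable several times with `time.perf_counter` and keeps the fastest run. The workload shown is only a stand-in; in the script above it would be replaced by the `insert_text` loop or the `TextWriter` loop.

```python
import time
from typing import Callable


def best_of(func: Callable[[], object], repeat: int = 5) -> float:
    """Return the fastest wall-clock time (in seconds) over `repeat` runs.

    Hypothetical helper for illustration; not part of PyMuPDF.
    """
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def stand_in_workload() -> int:
    # Placeholder for the 1000-insert loop that is being measured.
    return sum(range(1000))


if __name__ == "__main__":
    print(f"best of 5: {best_of(stand_in_workload):.6f} s")
```

Taking the minimum rather than the mean is a common choice for micro-benchmarks, since slower runs usually reflect interference from other processes rather than the code under test.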

@JorjMcKie
Collaborator

This is not a bug; it works as designed. The font "Nimbus Sans Regular"
is the font used to implement "Helvetica",
so there is no inconsistency here.

@JorjMcKie
Collaborator

Helvetica is one of the Base14 fonts and is not embeddable.
Three of these fonts (Helvetica, Times-Roman and Courier) are implemented by the metrically identical Nimbus* fonts made by the company URW.
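If a script needs to compare font names across the APIs discussed in this thread, one option is to normalize the names before comparing them. The following is a minimal sketch, not PyMuPDF functionality: `SUBSTITUTE_TO_BASE14` and `canonical_font_name` are hypothetical names, and the table contains only the Helvetica and ZapfDingbats pairs reported in the output earlier in this issue.

```python
import re

# Substitute-font names observed in this thread, mapped to the Base14 names
# they implement. Keys are stored in normalized form (see _normalize).
# Assumption: this table is illustrative and deliberately incomplete.
SUBSTITUTE_TO_BASE14 = {
    "nimbussansregular": "Helvetica",
    "dingbatsregular": "ZapfDingbats",
    "dingbats": "ZapfDingbats",
}


def _normalize(name: str) -> str:
    """Lower-case the name and drop spaces and hyphens, so that
    'Nimbus Sans Regular' and 'NimbusSans-Regular' compare equal."""
    return re.sub(r"[\s\-]+", "", name).lower()


def canonical_font_name(name: str) -> str:
    """Return the Base14 name for a known substitute, else the name unchanged."""
    return SUBSTITUTE_TO_BASE14.get(_normalize(name), name)


if __name__ == "__main__":
    for reported in ("Nimbus Sans Regular", "NimbusSans-Regular", "Dingbats"):
        print(reported, "->", canonical_font_name(reported))
```

Names that are not in the table, such as "Noto Serif Regular", pass through unchanged, so the helper only smooths over the known substitutions and does not hide an unexpected fallback font.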

@JorjMcKie added the "not a bug" label (not a bug / user error / unable to reproduce) on Nov 22, 2024