Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRAFT: Expand zero-width support for ZWJ and emoji sequences #85

Closed
wants to merge 26 commits into from

Conversation

jquast
Copy link
Owner

@jquast jquast commented Sep 30, 2023

  • Bugfix: wcswidth() for zero-width join characters such as used in emoji
    sequences, and variations such as emoji skin tone, as well as many other
    non-printable characters now correctly identified as zero-width.
  • Enhancement: new function width(), a copy of wcswidth() without
    the POSIX compliance to return total width of -1 for any C0 or C1 control
    characters, those characters measured as width 0 by this new function.

Discussion,

  • This implementation of ZJW for emojis does not qualify whether the sequence is supported !! This is possible because of the publication of "recommended" sequences in emoji-zwj-sequences.txt. I experimented with creating a large optimized regex of all supported emoji sequences as part of code generation, but I refrained from implementing it -- it seems like a large cost to lookup every possible sequence, even an optimized regex (using triregex library) is many tens of thousands of characters. Whereas this implementation increments idx past the next character, assuming they are cojoined is very high-performing and easy to implement
  • It is possible to provide a EMOJI_ZWJ_VERSION environment variable, which when undefined is set accordingly to the same release of UNICODE_VERSION. This would require implementing "recommended sequence" checking for the given version level, and returning the length as assumed it would be rendered when supported, for example 2 wide co-joined vs. unsupported, 4-wide with two side-by-side.
  • I hope the results of automatic testing with popular terminals will help decide what weight the performance vs. correctness tradeoff. I predict most popular terminals or libraries do not support any ZWJ at all.

Work remaining,

  • Update bin/wcwidth-browser.py to use and render tables of EMOJI_ZWJ_SEQUENCES by version.
  • Write automatic tests for ZWJ behavior and and new width() function, returning to 100% coverage.
  • Write interactive program that automatically checks all of those sequences, and some other ZWJ such as Bengali phrases with the current terminal, and writes some kind of report file about current os+terminal and detected wide-character and ZWJ emoji support level, to support the activity like was done with this tool and report https://github.com/jquast/ucs-detect#wcwidth-support-in-terminals
  • test some terminals, the most popular, how well do they support these sequences, at what "Emoji version" level, if any?
  • update docs/intro.rst to suggest wcwidth.width() function in the example. Write small section about POSIX wcwidth and wcswidth functions, why they exist for function signature compatibility for POSIX, but why they are not suggested due to the final -1 return value on error.
  • Update docstrings of wcwidth and wcswidth with deprecation warning in docstring only, to suggest using width() instead, with reference to the POSIX section for the reason.

@jquast jquast changed the title Broaden zero-width definition and add zjw support Expand zero-width coverage, add support for ZWJ/emoji sequences Sep 30, 2023
wcwidth/wcwidth.py Fixed Show fixed Hide fixed
wcwidth/wcwidth.py Fixed Show fixed Hide fixed
wcwidth/wcwidth.py Fixed Show fixed Hide fixed
@codecov
Copy link

codecov bot commented Sep 30, 2023

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Files Coverage Δ
wcwidth/emoji_zwj_sequences.py 100.00% <ø> (ø)
wcwidth/table_wide.py 100.00% <ø> (ø)
wcwidth/table_zero.py 100.00% <ø> (ø)
wcwidth/unicode_versions.py 100.00% <ø> (ø)
wcwidth/wcwidth.py 94.68% <83.33%> (-5.32%) ⬇️

📢 Thoughts on this report? Let us know!.

@jquast jquast changed the title Expand zero-width coverage, add support for ZWJ/emoji sequences DRAFT: Expand zero-width coverage and support for ZWJ/emoji sequences Sep 30, 2023
@jquast jquast changed the title DRAFT: Expand zero-width coverage and support for ZWJ/emoji sequences DRAFT: Expand zero-width support for ZWJ and emoji sequences Sep 30, 2023
@jquast jquast marked this pull request as ready for review September 30, 2023 07:12
Comment on lines 790 to +794
return fmt.format(name_len=style.name_len,
ucs_printlen=UCS_PRINTLEN,
delimiter=delimiter,
name=name,
ucs=disp_ucs,
val=val)
name_val=name_val,
ucs=ucs)

Check warning

Code scanning / CodeQL

Unused named argument in formatting call

Surplus named argument for string format. An argument named 'name_len' is provided, but it is not required by [format "{{delimiter}}{{ucs}}{{delimiter}}{{name_val}}"](1). Surplus named argument for string format. An argument named 'ucs_printlen' is provided, but it is not required by [format "{{delimiter}}{{ucs}}{{delimiter}}{{name_val}}"](1).
bin/wcwidth-browser.py Fixed Show fixed Hide fixed

# local
from wcwidth import ZERO_WIDTH, wcwidth, list_versions, _wcmatch_version
import wcwidth
from wcwidth import ZERO_WIDTH, wcwidth, list_versions, _wcmatch_version, EMOJI_ZWJ_SEQUENCES

Check notice

Code scanning / CodeQL

Unused import

Import of 'ZERO_WIDTH' is not used.

# local
from wcwidth import ZERO_WIDTH, wcwidth, list_versions, _wcmatch_version
import wcwidth

Check notice

Code scanning / CodeQL

Module is imported with 'import' and 'import from'

Module 'wcwidth' is imported with both 'import' and 'import from'.

def run(self, writer, reader):
def run(self, writer, reader, automatic=False):

Check notice

Code scanning / CodeQL

Explicit returns mixed with implicit (fall through) returns

Mixing implicit and explicit returns may indicate an error as implicit returns always return None.
# sequence clears to end-of-line
clear_eol = self.term.clear_eol
# sequence clears to end-of-screen
clear_eos = self.term.clear_eos

Check notice

Code scanning / CodeQL

Unused local variable

Variable clear_eos is not used.
bin/wcwidth-browser.py Fixed Show fixed Hide fixed
bin/wcwidth-browser.py Fixed Show fixed Hide fixed
it's too bad the tables aren't more consistent.
ZOC terminal gets nutty with vertical position instead of horizontal
around some characters. I think it is some terrible mistake, but it is
closed source, bugs are private, and it costs $80 !
writer(self.text_entry(ucs, name), end='')
ny, nx = self.term.get_location()
assert y == ny, ("y position is out of control on ucs value=", hex(ord(ucs)) if len(ucs) == 1 else repr(ucs))
if y == ny:

Check warning

Code scanning / CodeQL

Redundant comparison

Test is always true, because of [this condition](1).
attr_text=self.term.bright_red,
name_len=self.screen.style.name_len)

def measure_distance(ucs: str, name: str):

Check notice

Code scanning / CodeQL

Explicit returns mixed with implicit (fall through) returns

Mixing implicit and explicit returns may indicate an error as implicit returns always return None.
@@ -59,10 +61,11 @@
expect_length_phrase = sum(expect_length_each)

# exercise,
length_phrase = wcwidth.wcswidth(phrase, end)
length_each = tuple(map(wcwidth.wcwidth, phrase))

Check notice

Code scanning / CodeQL

Unused local variable

Variable length_each is not used.
for test_type in ('narrow', 'wide', 'zero', 'emoji-zwj'):
self.initialize_page_data(test_type)
self.screen.change_test_type(test_type)
page_idx = page_offset = idx = offset = 0

Check notice

Code scanning / CodeQL

Unused local variable

Variable page_idx is not used.
for test_type in ('narrow', 'wide', 'zero', 'emoji-zwj'):
self.initialize_page_data(test_type)
self.screen.change_test_type(test_type)
page_idx = page_offset = idx = offset = 0

Check notice

Code scanning / CodeQL

Unused local variable

Variable page_offset is not used.
@jquast
Copy link
Owner Author

jquast commented Oct 17, 2023

Sorry, no need to review, I have been using this branch for multiple ideas. I will carve them into smaller PR's, beginning with the automatic tests.

I am just performing automatic tests against popular terminals, now, and committing their results, when I'm finished with that I'll start smaller PR's, thank you!

@jquast jquast closed this Oct 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant