Interpreting grapheme clusters when calculating width breaks on most terminal emulators #826

qsantos · 2024-11-16T18:04:23Z

tl;dr: calculate_position should not use the widths of grapheme clusters as reported by unicode-width, but instead the sum of the widths of the individual codepoints.

At least on Unix, when calculating the width of displayed characters, rustyline uses grapheme segmentation.

However, using the minimal example and pasting 👨‍👩‍👧‍👦 (\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}), and then typing A, results in the following output:

This is because my terminal does not interpret the ZERO WIDTH JOINER (U+200D). In fact, I was able to reproduce this behavior in the following terminal emulators:

xfce4-terminal
gnome-terminal
rxvt-unicode
mate-terminal
blackbox-terminal
PuTTY
kitty
Mac's Terminal
iTerm2
VS Code's built-in terminal
IntelliJ IDEA's terminal

Edit: Regarding the sentence below, UAX #11 actually says nothing about graphemes. It mostly talks about CJK characters and half-width variants, which do not require grapheme handling either. In fact, UTS #51 says that the handling of the ZERO WIDTH JOINER can vary by platform. So what we are seeing is a choice made by unicode-width. rustyline might not want to follow it, and could use the sum of the widths of the individual codepoints instead.

~~Unicode does say that the full grapheme should be considered, and~~ unicode-width implements it this way:

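use unicode_width::{UnicodeWidthChar, UnicodeWidthStr};

fn main() {
    // Family emoji: four people joined by ZERO WIDTH JOINER (U+200D).
    let s = "\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}";
    // Width of the whole string, as reported by unicode-width.
    println!("{} {s}", s.width());
    // Width of each codepoint taken separately.
    for c in s.chars() {
        // width() returns None for control characters; 9 is only a fallback
        // marker and never shows up for this input.
        println!("{} {c}", c.width().unwrap_or(9));
    }
}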

outputs:

2 👨‍👩‍👧‍👦
2 👨
0 ‍
2 👩
0 ‍
2 👧
0 ‍
2 👦

The first line looks correct in a graphical browser, but this is what I actually see:

Also note that this is not about legacy vs extended graphemes. ZERO WIDTH JOINER is considered in both.

If I remove the .graphemes(true) part from calculate_position (and adapt the code to use codepoints instead of grapheme clusters), I achieve the expected behavior:
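To make the idea concrete, here is a minimal sketch of a per-codepoint position calculation. It is not rustyline's actual calculate_position: position_by_codepoints and char_width are hypothetical names, char_width is a rough stand-in for UnicodeWidthChar::width that only covers the characters in this report, and the wrapping rule is simplified.

```rust
/// Hypothetical stand-in for `UnicodeWidthChar::width`, covering only what
/// this example needs. Real code would call unicode-width instead.
fn char_width(c: char) -> usize {
    match c {
        '\u{200d}' => 0,                // ZERO WIDTH JOINER
        '\u{1f300}'..='\u{1faff}' => 2, // rough emoji range (assumption)
        _ => 1,
    }
}

/// Returns the (col, row) reached after printing `s` starting at `start`,
/// on a terminal `cols` columns wide that renders every codepoint on its
/// own, i.e. one that does not combine ZWJ sequences into a single glyph.
fn position_by_codepoints(s: &str, cols: usize, start: (usize, usize)) -> (usize, usize) {
    let (mut col, mut row) = start;
    for c in s.chars() {
        if c == '\n' {
            col = 0;
            row += 1;
            continue;
        }
        let w = char_width(c);
        // Simplified wrapping: a character that does not fit on the
        // current row moves to the start of the next one.
        if col + w > cols {
            col = 0;
            row += 1;
        }
        col += w;
    }
    (col, row)
}

fn main() {
    let family = "\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}";
    let (col, row) = position_by_codepoints(family, 80, (0, 0));
    // The four emoji contribute 2 columns each and the joiners 0.
    println!("col={col} row={row}"); // prints "col=8 row=0"
}
```

With this approach the cursor is assumed to sit at column 8 after the pasted emoji, which matches what terminals that ignore the ZERO WIDTH JOINER display, whereas a grapheme-based calculation would place it at column 2.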

Are there cases where we do need to use grapheme clusters when calculating widths? That is, either:

Common terminal emulators that follow the Unicode specification more closely.
Other grapheme clusters whose visual width, in common terminal emulators, differs from the sum of the widths of their codepoints taken separately.


gwenn · 2024-11-16T19:58:04Z

See #184

qsantos · 2024-11-16T20:00:46Z

Thanks, I should have searched for “width” instead of “grapheme”.

gwenn · 2024-11-17T13:53:54Z

Cannot reproduce with iTerm2 or WezTerm:

But with kitty:

And Mac terminal:

And Alacritty:

qsantos changed the title ~~Interpreting graphene clusters when calculating width breaks on most terminal emulators~~ Interpreting grapheme clusters when calculating width breaks on most terminal emulators Nov 16, 2024
