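tl;dr: calculate_position should not use the widths of graphemes as provided by unicode-width, but instead use the sum of the widths of the codepoints.

At least on Unix, when calculating the width of displayed characters, rustyline uses grapheme segmentation.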
However, using the minimal example and pasting 👨👩👧👦 (\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}), and then typing A, results in the following output:
This is because my terminal does not interpret the ZERO WIDTH JOINER (U+200D). In fact, I was able to reproduce this behavior in the following terminal emulators:
Edit: Regarding the sentence below, UAX #11 actually says nothing about graphemes. It mostly talks about CJK characters and half-width variants, which do not require grapheme handling either. In fact, UTS #51 says that the handling of the ZERO WIDTH JOINER can vary by platform. So what we are seeing is a choice made by unicode-width. rustyline might not want to follow it, and could use the sum of the widths of the individual code points instead.
Unicode does say that the full grapheme should be considered, and unicode-width implements it so:
    use unicode_width::{UnicodeWidthChar, UnicodeWidthStr};

    fn main() {
        let s = "\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}";
        println!("{} {s}", s.width());
        for c in s.chars() {
            println!("{} {c}", c.width().unwrap_or(9));
        }
    }
outputs:
2 👨👩👧👦
2 👨
0
2 👩
0
2 👧
0
2 👦
The first line looks correct in a graphical browser, but this is what I actually see:
Also note that this is not about legacy vs extended graphemes. ZERO WIDTH JOINER is considered in both.
If I remove the .graphemes(true) part from calculate_position (and adapt the code to use codepoints instead of grapheme clusters), I achieve the expected behavior:
Are there cases where we do need to use grapheme clusters when calculating widths? That is, either:
- Common terminal emulators that follow the Unicode specification more closely.
- Other grapheme clusters whose visual width, in common terminal emulators, differs from the sum of the widths of their codepoints taken separately.
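To make the per-codepoint alternative concrete, here is a minimal sketch of how the column position could be computed by summing codepoint widths. This is not rustyline's actual `calculate_position`. To stay self-contained, `approx_char_width` is a hypothetical stand-in for `UnicodeWidthChar::width` that only covers the characters from this issue, and `column_after` is likewise an illustrative name.

```rust
/// Hypothetical stand-in for `UnicodeWidthChar::width` from the
/// `unicode-width` crate. It only handles the characters used in this
/// issue and is NOT a general width table.
fn approx_char_width(c: char) -> usize {
    match c {
        '\u{200d}' => 0,                // ZERO WIDTH JOINER
        '\u{1f300}'..='\u{1faff}' => 2, // rough emoji range (assumption)
        _ => 1,                         // everything else treated as narrow
    }
}

/// Column reached after printing `s` on a terminal that renders every
/// codepoint on its own, i.e. one that does not interpret ZWJ sequences.
fn column_after(s: &str) -> usize {
    s.chars().map(approx_char_width).sum()
}

const FAMILY: &str = "\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}";

fn main() {
    // Four wide emoji plus three zero-width joiners: 2 + 0 + 2 + 0 + 2 + 0 + 2.
    println!("family: {} columns", column_after(FAMILY));

    // Cursor column after additionally typing `A`.
    let typed = format!("{}A", FAMILY);
    println!("after typing A: {} columns", column_after(&typed));
}
```

With these assumed widths, the family sequence occupies 8 columns, whereas the grapheme-based width reported by unicode-width above is 2. That 6-column gap is the cursor misplacement described in this issue.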
qsantos changed the title from "Interpreting graphene clusters when calculating width breaks on most terminal emulators" to "Interpreting grapheme clusters when calculating width breaks on most terminal emulators" on Nov 16, 2024