Merge pull request #142 from aphillips/aphillips-grapheme
Add a note about EGCs
aphillips authored Nov 7, 2024
2 parents 8e43f3b + 16ec996 commit beb132c
Showing 1 changed file with 11 additions and 7 deletions.
18 changes: 11 additions & 7 deletions index.html
@@ -1237,10 +1237,16 @@ <h2>Characters</h2>

<p>At their simplest, user-perceived characters are a single shape that can be tied one-to-one to the underlying computing representation. But a user-perceived character can be formed, in some scripts, from more than one character. And a given logical character can take many different shapes due to such influences as font selection, style, or the surrounding context (such as adjacent characters). In some cases, a single user-perceived character might be formed from a long sequence of logical characters. And some logical characters (so-called "combining marks") are always used in conjunction with another character.</p>

<p>When user-perceived characters are represented visibly (on screen or in print), they are represented by individual rendering units. This visual unit is called a <a>grapheme</a> (the word <a>glyph</a> is also used). Graphemes are the visual units found in fonts and rendering software.</p>
<p>When user-perceived characters are represented visibly (on screen or in print), they are represented by individual rendering units. This visual unit is called a [=grapheme=] (the word [=glyph=] is also used). Graphemes are the visual units found in fonts and rendering software.</p>

<aside class=note>
<p>[[Unicode]] provides a definition for approximately computing [=grapheme=] boundaries. The boundaries defined by [[Unicode]] are called [=grapheme clusters=]. Unless otherwise specified, the term [=grapheme cluster=] in this document refers to what [[UAX29]] refers to as an "extended default grapheme cluster".</p>
<p>For many languages and scripts, there is little difference between a [=grapheme=] and a [=grapheme cluster=]. However, for a number of languages and scripts, particularly those found in South Asia, the difference can be important.</p>
<p>For example, the Bangla user-perceived character <em>kshī</em> <span class="codepoint" translate="no"><bdi lang="bn">ক্ষী</bdi></span> is composed of four characters: <span class="codepoint" translate="no"><code class="uname">U+0995 BENGALI LETTER KA</code> + <code class="uname">U+09CD BENGALI SIGN VIRAMA</code> + <code class="uname">U+09B7 BENGALI LETTER SSA</code> + <code class="uname">U+09C0 BENGALI VOWEL SIGN II</code></span>.</p>
<p>Unicode splits these into two grapheme clusters, unless language-specific tailoring is applied. For more information, see our article <a href="https://www.w3.org/International/articles/definitions-characters/index.en.html#characters">Character encodings: Essential concepts</a>.</p>
</aside>
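
<p>As an illustration of the note above, the following minimal Python sketch lists the code points that make up the Bangla sequence. The names <code>KSHI</code> and <code>describe</code> are hypothetical and chosen only for this example; the character names come from Python's standard <code>unicodedata</code> module.</p>

```python
import unicodedata

# The four-code-point Bangla sequence from the note, written with escapes
# so the source stays unambiguous. KSHI is a hypothetical name.
KSHI = "\u0995\u09CD\u09B7\u09C0"


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# One user-perceived character, but four code points.
for line in describe(KSHI):
    print(line)
```

<p>The length of the string in code points (four) is what most programming languages report, which is why code-point counts are a poor proxy for the number of user-perceived characters.</p>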

<aside class=example>
<h5>Examples of user-perceived characters</h5>
<aside class=example title="Examples of graphemes and user-perceived characters">
<p>Here is the word for "Unicode" in the Latin, Katakana, Arabic, and Devanagari scripts.</p>
<p class=bigtext>Unicode
<span lang=ja>&#x30E6;&#x30CB;&#x30B3;&#x30FC;&#x30C9;</span>
@@ -1259,8 +1265,7 @@ <h5>Examples of user-perceived characters</h5>

<p>The relationship between code points and graphemes can be complex. In most cases, a code point sequence that forms a single grapheme should be treated as a single textual unit. For example, when cursoring across text, an entire grapheme should be selected together. It shouldn't be possible to cursor into the "middle" of a grapheme or delete only part of a user-perceived character. Because the relationship between code points and graphemes is not one-to-one and can be somewhat complex, [[Unicode]] defines a specific type of grapheme: the <a>extended grapheme cluster</a>, which most closely matches the mapping of the underlying logical character sequence to a user-perceived character. When referring to 'graphemes' in this document, we mean extended grapheme clusters (unless otherwise called out).</p>
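
<p>The idea of treating a code point sequence as one textual unit can be sketched in Python. This is a deliberately simplified approximation, not the extended grapheme cluster algorithm: it only attaches combining marks (general categories <code>Mn</code>, <code>Mc</code>, and <code>Me</code>) to the preceding character and ignores the other segmentation rules, such as those for joiners, Hangul syllables, and emoji sequences. The names <code>approximate_clusters</code> and <code>HINDI_UNICODE</code> are hypothetical.</p>

```python
import unicodedata


def approximate_clusters(text: str) -> list[str]:
    """Group each combining mark with the character before it.

    Simplified illustration only; real code should use a full
    grapheme cluster segmentation implementation.
    """
    clusters: list[str] = []
    for ch in text:
        # Categories starting with "M" (Mn, Mc, Me) are combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# The Hindi word for "Unicode" used in this section's examples.
HINDI_UNICODE = "\u092F\u0942\u0928\u093F\u0915\u094B\u0921"

clusters = approximate_clusters(HINDI_UNICODE)
print(len(HINDI_UNICODE), len(clusters))  # prints: 7 4
```

<p>Operations such as cursor movement or deletion would then step over the items of <code>clusters</code> rather than over individual code points.</p>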

<aside class=example>
<h5>Hindi example showing mapping from graphemes to code points</h5>
<aside class=example title="Example of the difference between graphemes and code points">
<p>Returning to the example above, the Hindi word for Unicode is made of four graphemes:</p>
<p class=bigtext lang=hi>&#x092F;&#x0942;&nbsp;<span style="color:red">&#x0928;&#x093F;</span>&nbsp;&#x0915;&#x094B;&nbsp;&#x0921;</p>
<p>Several of these graphemes are made up of more than one Unicode character because of the way that the Devanagari script works. In Devanagari, the basic set of "letters" are syllables ending with the short 'a' vowel sound. When you want to use a different vowel, you add a combining vowel character that changes the shape of the grapheme. The red text in the example above is the syllable "ni" in "Unicode". It is made of two characters: U+0928 (the syllable "na") and U+093F (combining "short i" sound):</p>
@@ -1294,8 +1299,7 @@ <h5>Hindi example showing mapping from graphemes to code points</h5>

<p>A set of rules for converting code points to or from code units is called a <a>character encoding form</a> (or just "character encoding" for short).</p>

<aside class=example>
<h2>UTF-8 Character Encoding Form</h2>
<aside class=example title="UTF-8 Character Encoding Form">

<p>The most common character encoding used on the Web is UTF-8. UTF-8 uses 8-bit bytes as its code unit. Each Unicode code point encoded into UTF-8 takes between one and four bytes to encode. ASCII characters take one byte to encode. Code points from 0x80 to 0x7FF take two bytes. Code points from 0x800 to 0xFFFF take three bytes. And code points from 0x10000 to 0x10FFFF (that is, the rest of Unicode) take four bytes each.</p>
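
<p>The ranges above can be sketched as a small Python function. The name <code>utf8_length</code> is hypothetical; the sketch checks its result against Python's built-in UTF-8 encoder and rejects surrogate code points, which cannot be encoded in UTF-8.</p>

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses to encode one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point < 0x80:      # ASCII
        return 1
    if code_point < 0x800:     # 0x80 to 0x7FF
        return 2
    if code_point < 0x10000:   # 0x800 to 0xFFFF
        return 3
    return 4                   # 0x10000 to 0x10FFFF


# Compare the table-driven answer with the built-in encoder.
for cp in (0x41, 0xE9, 0x30E6, 0x1F600):
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s), encoded as {encoded.hex()}")
    assert utf8_length(cp) == len(encoded)
```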
