From ee3539ad67330a9aaaea6c52f9adc6ea900a356b Mon Sep 17 00:00:00 2001
From: Addison Phillips
Date: Thu, 14 Dec 2023 08:47:42 -0800
Subject: [PATCH] Add requirement for character encoding in truncation

Addresses #124
Addresses w3c/i18n-actions#62

- Add a requirement with explanation such that byte length truncation
  needs to specify a character encoding (and that legacy encodings
  should be avoided)
- Add links to glossary terms in this section in some places
- Small tweaks to other text
---
 index.html | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/index.html b/index.html
index 5b06603..dd9e8aa 100644
--- a/index.html
+++ b/index.html
@@ -2969,7 +2969,7 @@

Text truncation in UTF-8

-

Specifications that limit the length of a string SHOULD require truncation on grapheme boundaries, as truncation in the midst of a combining or joining sequence can alter the meaning of the string.

+

Specifications that limit the length of a string SHOULD require truncation on grapheme boundaries, as truncation in the midst of a grapheme or combining character sequence can alter the meaning of the string.

@@ -2977,8 +2977,14 @@

Text truncation in UTF-8

-

When specifying a length limitation in code units (such as bytes), specifications SHOULD set the maximum length in a way that accommodates users whose language requires multibyte code unit sequences.

+

When specifying a length limitation in code units (such as bytes), specifications SHOULD set the limit in a way that accommodates users whose language requires multibyte code unit sequences.

+ +
+

If a specification specifies a length limit in code units (such as bytes), it MUST specify the character encoding used in measuring the limit; such a limit SHOULD NOT specify a legacy character encoding.

+
+ +

If a specification permits or requires truncation of a field, the character encoding is important in knowing what the limit means. If the limit is in bytes and legacy character encodings are permitted, note that conversion of Unicode data to a non-Unicode encoding can also result in data loss (since most legacy character encodings encode only a subset of Unicode).
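
Illustration (not part of the patch): a minimal Python sketch of what the requirements above imply for an implementer, assuming the limit is stated in UTF-8 bytes. The names `truncate_utf8` and `survives_encoding` are hypothetical. The first function never splits a code point and backs off when the cut would separate a base character from its combining marks; this only approximates grapheme boundaries, since full grapheme cluster segmentation (UAX #29, covering ZWJ emoji sequences, Hangul jamo, and so on) is not in the Python standard library. The second function checks whether text round-trips through a given encoding, which is where legacy encodings lose data.

```python
import unicodedata


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate text so that its UTF-8 encoding fits in max_bytes.

    Assumption: the limit is measured in UTF-8 bytes. Combining marks are
    detected with unicodedata.combining(), which is only an approximation
    of grapheme cluster boundaries.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    if len(text.encode("utf-8")) <= max_bytes:
        return text

    # Find the longest prefix made of whole code points that fits the limit.
    used = 0
    end = 0
    for ch in text:
        size = len(ch.encode("utf-8"))
        if used + size > max_bytes:
            break
        used += size
        end += 1

    # If the first dropped character is a combining mark, the cut falls
    # inside a combining sequence: drop the whole sequence, base included.
    if end < len(text) and unicodedata.combining(text[end]):
        while end > 0 and unicodedata.combining(text[end - 1]):
            end -= 1
        if end > 0:
            end -= 1
    return text[:end]


def survives_encoding(text: str, encoding: str) -> bool:
    """Return True if text round-trips through the given encoding unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        return False
```

With these definitions, `truncate_utf8("ne\u0301e", 2)` returns `"n"` rather than a bare `"ne"` with the accent stripped, and `truncate_utf8("日本語", 4)` returns `"日"`, because each of those characters takes three bytes in UTF-8. `survives_encoding("日本語", "iso-8859-1")` returns `False`, which is the data loss the note above warns about.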