Handling of UTF-16 surrogate pairs / 2x 16-bit code units, for code points outside Unicode BMP (Basic Multilingual Plane) #42

Open
danielweck opened this issue Oct 2, 2015 · 0 comments

@danielweck
Member

This is not a bug report, but a reminder to check whether we are "doing the right thing" in our CFI generator / processor code, and in the "annotations" (highlighter) plugin which handles DOM selections / ranges.

See:
w3c/epub-specs#555 (comment)

Verbatim reproduction of the comment linked above:


Yes, it is also my understanding that CFI "character offsets" are expressed relative to the number of 16-bit code units within the strings of characters encoded as UTF-16, not in terms of the actual number of code points (which are commonly referred to as Unicode "characters").

This way, the processing of surrogate pairs (i.e. two 16-bit code units) for code points outside the Unicode BMP (Basic Multilingual Plane) must be explicit (no implicit normalization / conversion), which is compatible with the DOM Ranges API ( http://www.w3.org/TR/DOM-Level-2-Traversal-Range/ranges.html#Level-2-Range-Position-h3 ), and the JavaScript String API (e.g. .length, .substr(), see http://www.ecma-international.org/ecma-262/6.0/#sec-ecmascript-language-types-string-type + http://www.ecma-international.org/ecma-262/6.0/#sec-string-objects, and .charAt() http://www.ecma-international.org/ecma-262/6.0/#sec-string.prototype.charat vs. .codePointAt() http://www.ecma-international.org/ecma-262/6.0/#sec-string.prototype.codepointat ).
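To make the code unit vs. code point distinction concrete, here is a minimal sketch using only standard JavaScript String methods. The sample string and variable names are illustrative assumptions, not taken from the Readium code base.

```javascript
// U+1D306 lies outside the BMP, so UTF-16 needs a surrogate pair to encode it.
const text = "a\u{1D306}b";

// .length counts 16-bit code units: 1 + 2 + 1.
const codeUnitCount = text.length;

// Iterating a string (here via Array.from) walks code points instead.
const codePointCount = Array.from(text).length;

// charCodeAt() indexes by code unit, so it can return one half of a pair.
const highSurrogate = text.charCodeAt(1);
const lowSurrogate = text.charCodeAt(2);

// codePointAt() decodes the whole pair when pointed at the high surrogate.
const codePoint = text.codePointAt(1);

console.log(codeUnitCount, codePointCount);
console.log(highSurrogate.toString(16), lowSurrogate.toString(16), codePoint.toString(16));
```

A CFI character offset of 3 into this string therefore points between the astral character and "b", even though only two "characters" precede that position.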

Additional literature on the subject:
https://mathiasbynens.be/notes/javascript-encoding
http://www.2ality.com/2013/09/javascript-unicode.html

Popular library to deal with Unicode in JavaScript:
https://github.com/bestiejs/punycode.js#punycodeucs2
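For code that has to move between the two ways of counting, a rough sketch without any library dependency could look like the following. The function names `codePointsOf` and `codeUnitOffsetFromCodePointIndex` are hypothetical helpers invented for this illustration; they are not part of punycode.js or readium-cfi-js.

```javascript
// Hypothetical helper: decode a string into an array of Unicode code points.
function codePointsOf(text) {
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(text, (ch) => ch.codePointAt(0));
}

// Hypothetical helper: map a code point index to the UTF-16 code unit
// offset that a CFI character offset (or a DOM Range offset) would use.
function codeUnitOffsetFromCodePointIndex(text, codePointIndex) {
  let offset = 0;
  let seen = 0;
  for (const ch of text) {
    if (seen === codePointIndex) return offset;
    offset += ch.length; // 1 for a BMP character, 2 for a surrogate pair
    seen += 1;
  }
  if (seen === codePointIndex) return offset; // position after the last character
  throw new RangeError("code point index is past the end of the text");
}

const sample = "a\u{1D306}b";
console.log(codePointsOf(sample));
console.log(codeUnitOffsetFromCodePointIndex(sample, 2));
```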

In other words, a CFI library (such as Readium's own https://github.com/readium/readium-cfi-js ) effectively treats strings of characters as though they were encoded using UCS-2 (16-bit / 2-byte Universal Character Set), unaware of any UTF-16 surrogate pairs potentially contained within.

Note that CFI "assertions" for text locations / ranges (i.e. based on the aforementioned CFI character offsets, which are counts of UTF-16 code units) are URI-escaped via UTF-8 encoding. See:
http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-path-text-location
http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-epubcfi-escaping
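The contrast between the two encodings can be seen with the standard `encodeURIComponent` and `TextEncoder` APIs. This is only an illustration of UTF-8 percent-encoding for a single astral character, under the assumption that the assertion text is escaped this way; it does not implement the other escaping rules of the CFI specification.

```javascript
// One astral character: two UTF-16 code units, four UTF-8 bytes.
const assertionText = "\u{1D306}";

// Offsets into this text are counted in UTF-16 code units.
const utf16Units = assertionText.length;

// Percent-encoding works on the UTF-8 byte sequence instead.
const utf8Bytes = new TextEncoder().encode(assertionText).length;
const escaped = encodeURIComponent(assertionText);

console.log(utf16Units, utf8Bytes, escaped);
```

So the same character counts as 2 when computing an offset, but is serialized as 4 escaped bytes when it appears inside an assertion.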

So, back to the proposed specification updates:

http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-path-terminating-char

For XML character data, the offset is zero-based and always refers to a position between characters, so 0 means before the first character and a number equal to the total UTF-16 length means after the last character. A character offset value greater than the UTF-16 length of the available text must not be specified.

May I suggest the following edits? (I added a non-normative note)

In this specification, the definition of an "offset" within XML character data is based on the UTF-16 text encoding, whereby each "character" (Unicode code point) may be represented using a single 16-bit code unit, or two units (surrogate pairs, for Unicode characters outside of BMP / Basic Multilingual Plane) [ http://www.unicode.org ]. A CFI "character offset" is a zero-based number that refers to a position between UTF-16 code units. Here, the "length" of the text is the total count of 16-bit units. Offset zero therefore means before the first 16-bit unit, and a number equal to the "length" of the text means after the last 16-bit unit. An offset value greater than the "length" of the text must not be specified.

NOTE to implementors: counting the number of text "characters" based on UTF-16 code units (instead of Unicode code points) is compatible with the DOM Range model [ http://www.w3.org/TR/DOM-Level-2-Traversal-Range/ranges.html#Level-2-Range-Position-h3 ], and with the ECMA / JavaScript String API [ http://www.ecma-international.org/ecma-262/6.0/#sec-ecmascript-language-types-string-type ].
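As a rough sketch of what the proposed wording implies for a processor, the two checks below validate an offset against the UTF-16 length and detect an offset that lands between the two halves of a surrogate pair. Both function names are hypothetical. The second check is an assumption about what an implementation might want to flag; the proposed text itself only constrains the offset range.

```javascript
// Hypothetical check: valid offsets run from 0 (before the first code unit)
// up to and including text.length (after the last code unit).
function isValidCfiCharacterOffset(text, offset) {
  return Number.isInteger(offset) && offset >= 0 && offset <= text.length;
}

// Hypothetical check: does the offset fall between a high and a low
// surrogate, i.e. in the middle of a single code point?
function splitsSurrogatePair(text, offset) {
  if (offset <= 0 || offset >= text.length) return false;
  const before = text.charCodeAt(offset - 1);
  const after = text.charCodeAt(offset);
  const isHigh = before >= 0xD800 && before <= 0xDBFF;
  const isLow = after >= 0xDC00 && after <= 0xDFFF;
  return isHigh && isLow;
}

const chars = "a\u{1D306}b"; // 4 code units, 3 code points
console.log(isValidCfiCharacterOffset(chars, 4), isValidCfiCharacterOffset(chars, 5));
console.log(splitsSurrogatePair(chars, 2), splitsSurrogatePair(chars, 3));
```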

