This is not a bug report, but a reminder to check whether we are "doing the right thing" in our CFI generator / processor code, and in the "annotations" (highlighter) plugin which handles DOM selections / ranges.
Verbatim reproduction of the linked comment (w3c/epub-specs#555):
Yes, it is also my understanding that CFI "character offsets" are expressed relative to the number of 16-bit code units within the strings of characters encoded as UTF-16, not in terms of the actual number of code points (which are commonly referred to as Unicode "characters").
The specification currently reads:
For XML character data, the offset is zero-based and always refers to a position between characters, so 0 means before the first character and a number equal to the total UTF-16 length means after the last character. A character offset value greater than the UTF-16 length of the available text must not be specified.
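To make the distinction concrete, here is a minimal JavaScript sketch of the two ways of counting, together with the offset range the specification text allows. The sample string and the helper name `isValidOffset` are illustrative assumptions, not taken from the specification or from readium-cfi-js.

```javascript
// "a", then U+1F600 (outside the BMP, so a surrogate pair in UTF-16), then "b".
const text = "a\u{1F600}b";

// Count of UTF-16 code units: this is what String.prototype.length reports.
const codeUnitLength = text.length;

// Count of Unicode code points: the string iterator advances by code point.
const codePointLength = Array.from(text).length;

// Hypothetical helper: under the code-unit reading, valid character offsets
// run from 0 (before the first unit) to text.length (after the last unit).
function isValidOffset(str, offset) {
  return Number.isInteger(offset) && offset >= 0 && offset <= str.length;
}

console.log(codeUnitLength, codePointLength); // 4 3
console.log(isValidOffset(text, 4), isValidOffset(text, 5)); // true false
```

Under a code-point reading the last valid offset for this string would be 3 rather than 4, which is why the two interpretations are not interchangeable.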
May I suggest the following edits? (I added a non-normative note)
In this specification, the definition of an "offset" within XML character data is based on the UTF-16 text encoding, whereby each "character" (Unicode code point) may be represented using a single 16-bit code unit, or two units (a surrogate pair, for Unicode characters outside of the BMP / Basic Multilingual Plane) [ http://www.unicode.org ]. A CFI "character offset" is a zero-based number that refers to a position between UTF-16 code units. Here, the "length" of the text is the total count of 16-bit units. Offset zero therefore means before the first 16-bit unit, and a number equal to the "length" of the text means after the last 16-bit unit. An offset value greater than the "length" of the text must not be specified.
NOTE (to implementors): counting the number of text "characters" based on UTF-16 code units (instead of Unicode code points) is compatible with the DOM Range model [ http://www.w3.org/TR/DOM-Level-2-Traversal-Range/ranges.html#Level-2-Range-Position-h3 ], and with the ECMA / JavaScript String API [ http://www.ecma-international.org/ecma-262/6.0/#sec-ecmascript-language-types-string-type ].
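One practical consequence of the code-unit definition is that an offset can legally point between the two halves of a surrogate pair. The following sketch shows how a processor might detect that case, and how a code-point index could be converted to a code-unit offset. Both function names are hypothetical and are not part of any existing CFI library.

```javascript
// Hypothetical helper: true when `offset` falls between the high and low
// halves of a surrogate pair, i.e. in the middle of one code point.
function isInsideSurrogatePair(str, offset) {
  if (offset <= 0 || offset >= str.length) return false;
  const prev = str.charCodeAt(offset - 1);
  const next = str.charCodeAt(offset);
  return prev >= 0xD800 && prev <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF;
}

// Hypothetical helper: convert a zero-based code-point index into the
// equivalent UTF-16 code-unit offset (the unit used by CFI and DOM Range).
function codePointIndexToOffset(str, index) {
  let offset = 0;
  let seen = 0;
  for (const ch of str) {
    if (seen === index) return offset;
    offset += ch.length; // 1 for BMP characters, 2 for surrogate pairs
    seen += 1;
  }
  if (seen === index) return offset; // position after the last code point
  throw new RangeError("code point index out of range");
}

const sample = "a\u{1F600}b";
console.log(isInsideSurrogatePair(sample, 2)); // true
console.log(codePointIndexToOffset(sample, 2)); // 3
```

Because boundary offsets within a DOM Text node are also counted in UTF-16 code units, an offset obtained this way can be passed to Range.setStart() or Range.setEnd() on that node without further conversion.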
See:
w3c/epub-specs#555 (comment)
Counting offsets in UTF-16 code units means that the processing of surrogate pairs (i.e. two 16-bit code units) for code points outside of the Unicode BMP (Basic Multilingual Plane) must be explicit (no implicit normalization / conversion). This is compatible with the DOM Ranges API ( http://www.w3.org/TR/DOM-Level-2-Traversal-Range/ranges.html#Level-2-Range-Position-h3 ) and with the JavaScript String API (e.g. .length and .substr(), see http://www.ecma-international.org/ecma-262/6.0/#sec-ecmascript-language-types-string-type + http://www.ecma-international.org/ecma-262/6.0/#sec-string-objects, and .charAt() http://www.ecma-international.org/ecma-262/6.0/#sec-string.prototype.charat vs. .codePointAt() http://www.ecma-international.org/ecma-262/6.0/#sec-string.prototype.codepointat ).
Additional literature on the subject:
https://mathiasbynens.be/notes/javascript-encoding
http://www.2ality.com/2013/09/javascript-unicode.html
A popular library for dealing with Unicode in JavaScript:
https://github.com/bestiejs/punycode.js#punycodeucs2
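The difference between the code-unit methods and .codePointAt() can be seen in a short sketch. It uses only built-in String methods. The final line approximates, with built-ins, the kind of code-point decoding that a helper library provides; it is an illustrative stand-in and not the punycode.js API itself.

```javascript
// U+1F600 followed by "x": three UTF-16 code units, two code points.
const s = "\u{1F600}x";

const units = s.length;               // 3
const firstUnit = s.charCodeAt(0);    // 0xD83D, the high surrogate alone
const firstChar = s.charAt(0);        // a one-unit string holding that lone surrogate
const firstPoint = s.codePointAt(0);  // 0x1F600, the pair is combined
const secondUnit = s.codePointAt(1);  // 0xDE00, index 1 lands on the low surrogate
const head = s.substr(0, 1);          // legacy method; also cuts the pair in half

// Decode the whole string into code points using the string iterator.
const codePoints = Array.from(s, (ch) => ch.codePointAt(0)); // [0x1F600, 0x78]

console.log(units, firstUnit.toString(16), firstPoint.toString(16));
```

Note that .codePointAt() still takes a code-unit index as its argument, so it does not by itself turn a string into a sequence addressed by code point.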
In other words, a CFI library (such as Readium's own https://github.com/readium/readium-cfi-js ) effectively treats strings of characters as though they were encoded using UCS-2 (16-bit / 2-byte Universal Character Set), unaware of any UTF-16 surrogate pairs contained within.
Note that CFI "assertions" for text locations / ranges (i.e. based on the aforementioned CFI character offsets, which are counts of UTF-16 code units) are URI-escaped via UTF-8 encoding. See:
http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-path-text-location
http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-epubcfi-escaping
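The two encodings involved can be contrasted in a brief sketch: offsets count UTF-16 code units, while percent-escaping operates on UTF-8 bytes. The example uses only the built-in encodeURIComponent and TextEncoder. The helper `tryPercentEncode` is a hypothetical name, and the CFI-specific circumflex escaping described in the specification is deliberately left out.

```javascript
const smiley = "\u{1F600}";

const codeUnits = smiley.length;                           // 2 UTF-16 code units
const utf8Bytes = new TextEncoder().encode(smiley).length; // 4 UTF-8 bytes
const escaped = encodeURIComponent(smiley);                // "%F0%9F%98%80"

// Hypothetical helper: returns null when the text cannot be percent-encoded,
// which happens when it contains a lone (unpaired) surrogate.
function tryPercentEncode(str) {
  try {
    return encodeURIComponent(str);
  } catch (e) {
    if (e instanceof URIError) return null;
    throw e;
  }
}

// Cutting the text at a code-unit offset inside the pair leaves a lone surrogate.
const half = smiley.slice(0, 1);
console.log(escaped, tryPercentEncode(half)); // %F0%9F%98%80 null
```

This suggests that code extracting assertion text at CFI offsets may want to check for split surrogate pairs before URI-escaping the result.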
So, back to the proposed specification updates quoted above, which concern the following section:
http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-path-terminating-char