Add Unicode escaping #2461

privat · 2017-05-23T15:37:35Z

Nit should support literal Unicode escape sequences such as \u008B and \U0000080B in escape_to_nit and unescape_nit. I assume a change in the lib will automatically be picked up by the Nit compiler and tools.

#2459 (comment)

lbajolet · 2017-05-24T02:49:51Z

I think this should only be added to unescape_nit. Nothing should be done C-wise, except maybe compiling every non-ASCII Unicode character to its byte representation as \x sequences (this could probably become a compatibility option at some point).
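To make the \x idea concrete, here is a minimal sketch in Python (not Nit, and not the compiler's actual code). The helper name `to_hex_escaped` is hypothetical, and it assumes the source text is encoded as UTF-8. It deliberately ignores quotes, backslashes and control characters, and it does not handle the C rule that a \x escape keeps consuming any hex digits that follow it, so a real emitter would need to split the literal in that case.

```python
def to_hex_escaped(text: str) -> str:
    # Hypothetical helper: keep ASCII as-is and write every non-ASCII
    # character as the \x escapes of its UTF-8 bytes.
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # Raises UnicodeEncodeError for lone surrogates, which have
            # no UTF-8 representation.
            out.extend(f"\\x{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


print(to_hex_escaped("1\u00A0000"))  # 1\xC2\xA0000
```

The printed result also shows the caveat above: in C, the digits `000` after `\xA0` would be read as part of the escape.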

There is however one point that may need some discussion: code points that are invalid in UTF-8. Should we replace invalid sequences like surrogates? My guess would be yes, but this would mean that valid strings in languages like Java or the .NET languages might not be understood in Nit.
Also, adding surrogate support will likely incur some performance hit due to the lookahead, though it will probably be minor since \u escape sequences are not that popular, and surrogate ones even less so.
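For reference, the lookahead in question could look roughly like the sketch below. It is written in Python rather than Nit, the function name `combine_surrogates` is made up for illustration, and replacing lone surrogates with U+FFFD is only one possible policy, not a decision taken in this thread.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def combine_surrogates(units: list[int]) -> list[int]:
    # Hypothetical helper: `units` holds the numeric values of consecutive
    # escape sequences. A high surrogate followed by a low surrogate is
    # merged into one code point; a lone surrogate becomes U+FFFD.
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        # One-element lookahead, only needed after a high surrogate.
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(u)
            i += 1
    return out


print([hex(c) for c in combine_surrogates([0xD83D, 0xDE00])])  # ['0x1f600']
```

The lookahead only happens after a high surrogate, which is why the cost should stay small for ordinary strings.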

Other than that, since Unicode is limited to 10FFFF, no \U escape sequence should support more than that.

And tool-wise, there will probably be one minor modification to the grammar if we want to support this kind of sequence, but this should not be too much of a hassle to implement.

A PR will likely follow in the next couple of days.

jcbrinfo · 2017-05-24T14:22:35Z

One problem I think we have to consider is that people are used to \u and \U always taking 4 digits (so they are limited to the BMP, or UCS-2/UTF-16 wydes). This is especially important for strings like "1\u00A0000\u00A0000" (1 000 000).
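The ambiguity can be illustrated with two regular expressions in Python. This is only a demonstration of the two readings, not Nit's lexer: `VARIABLE` assumes a greedy 1-to-6-digit escape, `FIXED4` assumes exactly 4 digits, and both names are invented for this example.

```python
import re

# Two hypothetical readings of the same escape syntax.
VARIABLE = re.compile(r"\\u([0-9A-Fa-f]{1,6})")  # greedy, 1 to 6 hex digits
FIXED4 = re.compile(r"\\u([0-9A-Fa-f]{4})")      # exactly 4 hex digits

# Raw string: the backslashes stay in the text, as in unparsed source code.
source = r"1\u00A0000\u00A0000"

variable_hits = VARIABLE.findall(source)
fixed_hits = FIXED4.findall(source)

print(variable_hits)  # ['00A000', '00A000']
print(fixed_hits)     # ['00A0', '00A0']
```

With the fixed-width reading the string is "1", U+00A0, "000", U+00A0, "000". With the greedy reading each escape swallows two extra zeros and becomes U+A000, leaving a single "0" after it.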

lbajolet · 2017-05-24T14:35:41Z

This is linked with what I pointed out yesterday. IMO the spec should be something along the lines of:

* Allow \(u|U)[0-9A-Fa-f]{1,6}
* Disallow characters above the Unicode maximum (0x10FFFF)

The only question remaining is what to do with surrogate pairs: should we allow them?

In some languages (C# being one example, if my memory serves me right), \u and \U have different masks. The capital one expects 8 digits while the other expects 4. This feels like an example of what not to do, if you ask me.
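A rough sketch of the two rules above, in Python rather than Nit. The function name `unescape_unicode` is hypothetical and is not the signature of unescape_nit. Surrogates are deliberately not special-cased, since that question is still open.

```python
import re

UNICODE_MAX = 0x10FFFF

# Rule 1: \u or \U followed by 1 to 6 hex digits (greedy).
ESCAPE = re.compile(r"\\[uU]([0-9A-Fa-f]{1,6})")


def unescape_unicode(text: str) -> str:
    # Hypothetical helper: replace each escape with the character it names.
    def replace(match: re.Match) -> str:
        value = int(match.group(1), 16)
        # Rule 2: nothing above the Unicode maximum.
        if value > UNICODE_MAX:
            raise ValueError(f"escape {match.group(0)} exceeds U+10FFFF")
        return chr(value)

    return ESCAPE.sub(replace, text)


print(unescape_unicode(r"caf\uE9"))  # café
```

Because the match is greedy, any hex digits that directly follow an escape are taken as part of it, up to 6 digits.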

jcbrinfo · 2017-05-24T15:31:16Z

IMO, since you allow non-BMP code points, the behavior should be in sync with Int::code_point, in order to avoid confusion.

privat · 2017-05-25T14:52:01Z

What is the behavior of JS and Python for surrogate pairs, and for \u vs \U?

jcbrinfo · 2017-05-25T16:19:36Z

JS and JSON were designed for UCS-2/UTF-16, so they simply handle surrogate pairs as UTF-16 prescribes. Furthermore, \u must always be written with a lowercase u (U+0075).

Sources:

ECMAScript 2016, section 11.8.4
ECMA-404, section 9
RFC 7159, section 7
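As a quick check of the JSON behavior, and of the Python side of the question, the snippet below uses only Python's standard `json` module and plain string literals. It shows what CPython 3 does, which may or may not be what Nit should do.

```python
import json

# JSON only has \uXXXX, so a non-BMP character is written as a surrogate
# pair, which the parser merges into a single code point.
decoded = json.loads('"\\ud83d\\ude00"')
assert decoded == "\U0001F600"
assert len(decoded) == 1

# json.dumps escapes non-ASCII by default and emits the pair again.
encoded = json.dumps("\U0001F600")
assert encoded == '"\\ud83d\\ude00"'

# Python string literals: \u takes exactly 4 hex digits, \U exactly 8.
assert "\u00e9" == "\U000000e9"

# Surrogate escapes in a Python literal are NOT merged: this is a string
# of two lone surrogates, not one character.
literal_pair = "\ud83d\ude00"
assert len(literal_pair) == 2
```

So Python parses JSON the UTF-16 way, but its own literals keep \u at 4 digits and \U at 8 and do not merge surrogates.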

privat assigned lbajolet May 23, 2017

xymus mentioned this issue May 23, 2017

gamnit: intro support for Angel Code bitmap font #2459

Merged
