Add Unicode escaping #2461

Open
privat opened this issue May 23, 2017 · 6 comments
@privat
Member

privat commented May 23, 2017

Nit should support literal Unicode escape sequences such as \u008B and \U0000080B in escape_to_nit and unescape_nit. I assume the change in the lib will automatically be picked up by the Nit compiler and tools.

#2459 (comment)

@lbajolet
Contributor

I think this should only be added to unescape_nit; nothing should be done C-wise, except maybe compiling every non-ASCII Unicode character to its byte representation as \x sequences (this could probably become a compatibility option at some point).
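To make the "\x sequences" idea concrete, here is a minimal sketch in Python (not Nit, and not the compiler's actual code). The function name `to_c_literal_body` is hypothetical, and the sketch assumes UTF-8 as the byte representation. It ignores quotes and backslashes in the ASCII part, which a real emitter would also have to escape.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def to_c_literal_body(text: str) -> str:
    """Emit the body of a C string literal, with every non-ASCII
    character written as the \\x escapes of its UTF-8 bytes.

    Hypothetical helper, for illustration only.
    """
    out = []
    prev_was_hex_escape = False
    for ch in text:
        if ord(ch) < 0x80:
            # In C, a \x escape consumes every following hex digit, so
            # close and reopen the literal before an ASCII hex digit.
            if prev_was_hex_escape and ch in HEX_DIGITS:
                out.append('""')
            out.append(ch)
            prev_was_hex_escape = False
        else:
            out.extend(f"\\x{b:02x}" for b in ch.encode("utf-8"))
            prev_was_hex_escape = True
    return "".join(out)
```

For example, `to_c_literal_body("é")` gives the six characters `\xc3\xa9`, while `"éa"` needs the literal to be split so that the `a` is not read as part of the escape.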

There is, however, one point that may need some discussion: code points that are invalid in UTF-8. Should we replace invalid sequences such as surrogates? My guess would be yes, but this would mean that strings that are valid in languages like Java or the .NET languages might not be understood in Nit.
Also, adding surrogate support will likely incur a performance hit due to the lookahead, though it will probably be minor since \u escape sequences are not that popular, surrogate ones especially.

Other than that, since Unicode code points stop at U+10FFFF, a \U escape sequence should not accept any value above that.
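The surrogate and range rules discussed above can be sketched in Python (not Nit). Everything here is an assumption for illustration: the name `unescape_unicode`, the choice of exactly 4 hex digits for \u and 8 for \U, and the policy of combining a valid high/low surrogate pair while replacing a lone surrogate with U+FFFD. It is not the proposed unescape_nit implementation, and it does not handle escaped backslashes.

```python
import re

# Assumed syntax: \u takes exactly 4 hex digits, \U exactly 8.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")
REPLACEMENT = "\uFFFD"


def unescape_unicode(text: str) -> str:
    """Decode \\u and \\U escapes (hypothetical helper)."""
    out = []
    pos = 0
    while pos < len(text):
        m = _ESCAPE.match(text, pos)
        if m is None:
            out.append(text[pos])
            pos += 1
            continue
        cp = int(m.group(1) or m.group(2), 16)
        pos = m.end()
        if cp > 0x10FFFF:
            raise ValueError(f"escape above U+10FFFF: {m.group(0)}")
        if 0xD800 <= cp <= 0xDBFF:
            # High surrogate: look ahead for a low surrogate escape.
            nxt = _ESCAPE.match(text, pos)
            if nxt is not None:
                low = int(nxt.group(1) or nxt.group(2), 16)
                if 0xDC00 <= low <= 0xDFFF:
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                    out.append(chr(cp))
                    pos = nxt.end()
                    continue
            out.append(REPLACEMENT)  # lone high surrogate
        elif 0xDC00 <= cp <= 0xDFFF:
            out.append(REPLACEMENT)  # lone low surrogate
        else:
            out.append(chr(cp))
    return "".join(out)
```

The lookahead only runs after a high surrogate has been seen, which is why the cost should stay small for ordinary strings.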

And tool-wise, there will probably be one minor modification to the grammar if we want to support this kind of sequence, but this should not be too much of a hassle to implement.

A PR will likely follow in the next couple of days.

@jcbrinfo
Contributor

One problem I think we have to consider is that people are used to \u and \U always taking 4 digits (so they are limited to the BMP, or UCS-2/UTF-16 wydes). This is especially important for strings like "1\u00A0000\u00A0000" (1 000 000).
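The "1\u00A0000\u00A0000" example can be checked with a small Python sketch. The two patterns below are assumptions made up for the comparison: `FIXED` reads exactly 4 hex digits after \u, and `GREEDY` is a hypothetical variable-length rule that reads up to 6. Neither is the Nit grammar.

```python
import re


def decode(text: str, pattern: str) -> str:
    """Replace every escape matched by `pattern` with its code point."""
    return re.sub(pattern, lambda m: chr(int(m.group(1), 16)), text)


FIXED = r"\\u([0-9A-Fa-f]{4})"     # always 4 hex digits
GREEDY = r"\\u([0-9A-Fa-f]{1,6})"  # hypothetical: up to 6 hex digits

source = r"1\u00A0000\u00A0000"

# Fixed width: "1", U+00A0, "000", U+00A0, "000" (9 characters).
fixed = decode(source, FIXED)

# Greedy: each escape swallows "00A000", giving U+A000 followed by "0".
greedy = decode(source, GREEDY)
```

With the fixed rule the digits after the no-break space stay ordinary digits; with the greedy rule they silently become part of the code point, which is the confusion described above.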

@lbajolet
Contributor

lbajolet commented May 24, 2017 via email

@jcbrinfo
Contributor

IMO, since you allow non-BMP code points, the behavior should be in sync with Int::code_point, in order to avoid confusion.

@privat
Member Author

privat commented May 25, 2017

What is the behavior of JS and Python on surrogate pairs, and on \u vs \U?

@jcbrinfo
Contributor

JS and JSON were designed for UCS-2/UTF-16, so they simply handle surrogate pairs as UTF-16 prescribes. Furthermore, \u must always be written with a lowercase u (U+0075).
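For the Python side of the question, a short sketch using only the standard library shows the difference. It relies on CPython 3 behavior: the `json` module combines an escaped surrogate pair into one code point, whereas a Python string literal treats each \u escape as its own code point and needs \U (8 digits) for anything outside the BMP.

```python
import json

# JSON text containing an escaped UTF-16 surrogate pair.
from_json = json.loads(r'"\ud83d\ude00"')

# Python literal: two separate (lone) surrogate code points, no pairing.
from_literal = "\ud83d\ude00"

# Python literal using the 8-digit \U form for the same character.
astral = "\U0001F600"

print(len(from_json), len(from_literal), len(astral))
print(from_json == astral, from_literal == astral)
```

The lone surrogates in `from_literal` are kept inside the `str`, but encoding that string to UTF-8 raises `UnicodeEncodeError`.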

Sources:
