Parsing should be less Unicode-aware #8
Comments
It would be nice to have a consistent spec that can deal with non-ASCII characters.
The format should definitely permit non-ASCII characters, but IMHO not as syntactically significant characters. Whitespace has syntactic meaning, and it's non-trivial to keep track of all Unicode whitespace codepoints. We don't want to force all POSE implementations to ship with big Unicode tables.
@johnwcowan is a Unicode expert; any advice?
I've had some pains in dealing with different encodings. I agree on "should not be syntactically significant".
The Go spec says:
So you can write code like
I think Go also allows non-ASCII identifiers. If we have vertical bar symbols in POSE, IMHO we should permit non-ASCII in them.
So POSE would permit non-ASCII in:
and nowhere else. Is this reasonable?
Comments and quoted strings, yes, provided the only encoding is UTF-8. For my view on symbols, see #3.
Agreed.
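One reason the UTF-8 condition matters: in UTF-8, every byte of a multi-byte (non-ASCII) character is 0x80 or higher, so a byte-level scanner looking for ASCII delimiters cannot mistake part of a non-ASCII character for one. A minimal Python sketch of that property follows; the function name and the sample string are hypothetical and not taken from any POSE implementation.

```python
def non_ascii_bytes(text: str) -> list[int]:
    """Return the UTF-8 bytes of every non-ASCII character in text."""
    return [b for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")]


# Hypothetical sample: non-ASCII text outside and inside a quoted string.
sample = 'caf\u00e9 \u2603 "\u65e5\u672c\u8a9e"'
data = sample.encode("utf-8")

# No byte of a non-ASCII character falls in the ASCII range.
all_high = all(b >= 0x80 for b in non_ascii_bytes(sample))

# So every 0x22 byte in the input is a real double quote.
quote_positions = [i for i, b in enumerate(data) if b == 0x22]
```

This does not hold for encodings such as UTF-16, where ASCII-range byte values can appear inside other characters.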
In the F# code (and possibly others) we're using the host language's native char functions, which I assume are Unicode-aware.
The grammar in the current draft has
whitespace = HT | LF | VT | FF | CR | space
Expressed as ASCII codepoints, that is 0x09..0x0D (HT..CR) and 0x20 (space). To get consistent parsing across languages, we should detect these bytes explicitly.
Here's an example of ASCII-only character detection from the SML code:
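The SML snippet is not reproduced in this text. As a stand-in, here is a minimal Python sketch of the same idea (not the SML code itself): test against the six bytes from the draft grammar explicitly instead of calling the host language's Unicode-aware predicate. The names `POSE_WHITESPACE` and `is_pose_whitespace` are hypothetical.

```python
# Whitespace bytes from the draft grammar:
# HT, LF, VT, FF, CR (0x09..0x0D) and space (0x20).
POSE_WHITESPACE = frozenset(range(0x09, 0x0E)) | {0x20}


def is_pose_whitespace(byte: int) -> bool:
    """Return True only for the six ASCII whitespace bytes."""
    return byte in POSE_WHITESPACE


# Contrast with Python's Unicode-aware str.isspace(), which also accepts
# codepoints such as NO-BREAK SPACE (U+00A0) and EM SPACE (U+2003).
for ch in ["\t", " ", "\u00a0", "\u2003"]:
    print(f"U+{ord(ch):04X}", ch.isspace(), is_pose_whitespace(ord(ch)))
```

The two predicates agree on tab and space but disagree on U+00A0 and U+2003, which is the kind of cross-language inconsistency the explicit check avoids.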