Parsing should be less Unicode-aware #8

Open
lassik opened this issue May 13, 2021 · 8 comments

@lassik
Collaborator

lassik commented May 13, 2021

In the F# code (and possibly others) we're using the host language's native char functions, which I assume are Unicode-aware.

The grammar in the current draft has whitespace = HT | LF | VT | FF | CR | space. Expressed as ASCII codepoints, that is 0x09..0x0D (HT..CR) and 0x20 (space). To get consistent parsing across languages, we should detect these bytes explicitly.

Here's an example of ASCII-only character detection from the SML code:

fun charIsWhitespace char =
    let val cc = Char.ord char in
        (cc = 0x20) orelse (cc >= 0x09 andalso cc <= 0x0D)
    end;

fun charIsAlphabetic char =
    ((char >= #"A") andalso (char <= #"Z")) orelse
    ((char >= #"a") andalso (char <= #"z"));

fun charIsNumeric char =
    ((char >= #"0") andalso (char <= #"9"));
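
For comparison, here is a rough Python sketch of the same three predicates, next to Python's built-in `str.isspace`, `str.isalpha` and `str.isdigit`, which are Unicode-aware. The function names are made up for illustration and are not part of any POSE implementation; each function assumes a one-character string.

```python
def is_pose_whitespace(ch: str) -> bool:
    """ASCII-only whitespace: HT..CR (0x09..0x0D) or space (0x20)."""
    cc = ord(ch)
    return cc == 0x20 or 0x09 <= cc <= 0x0D


def is_ascii_alphabetic(ch: str) -> bool:
    """ASCII letters only: A..Z or a..z."""
    cc = ord(ch)
    return ord("A") <= cc <= ord("Z") or ord("a") <= cc <= ord("z")


def is_ascii_numeric(ch: str) -> bool:
    """ASCII digits only: 0..9."""
    return ord("0") <= ord(ch) <= ord("9")


# Characters that the Unicode-aware built-ins accept
# but the ASCII-only predicates reject.
nbsp = "\u00a0"          # NO-BREAK SPACE
e_acute = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

print(nbsp.isspace(), is_pose_whitespace(nbsp))
print(e_acute.isalpha(), is_ascii_alphabetic(e_acute))
print(arabic_three.isdigit(), is_ascii_numeric(arabic_three))
```

Each printed line shows the two answers disagreeing, which is the kind of cross-language inconsistency this issue is about.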
@wallymathieu
Member

It would be nice to have a consistent spec that can deal with non-ASCII characters.

@lassik
Collaborator Author

lassik commented May 14, 2021

The format should definitely permit non-ASCII characters, but IMHO they should not be syntactically significant. Whitespace has syntactic meaning, and it's non-trivial to keep track of all Unicode whitespace codepoints. We don't want to force all POSE implementations to ship with big Unicode tables.

@lassik
Collaborator Author

lassik commented May 14, 2021

@johnwcowan is a Unicode expert; any advice?

@wallymathieu
Member

I've had some pains in dealing with different encodings. I agree on "should not be syntactically significant".

@lassik
Collaborator Author

lassik commented May 14, 2021

The Go spec says:

White space, formed from spaces (U+0020), horizontal tabs (U+0009), carriage returns (U+000D), and newlines (U+000A), is ignored [...]

So you can write code like fmt.Println("Hello, 世界") but the non-ASCII characters are inside string literals.

I think Go also allows non-ASCII identifiers. If we have vertical bar symbols in POSE, IMHO we should permit non-ASCII in them.

@lassik
Collaborator Author

lassik commented May 14, 2021

So POSE would permit non-ASCII in:

  • comments
  • double-quoted strings
  • vertical-bar symbols

and nowhere else. Is this reasonable?
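
A minimal Python sketch of what checking that rule could look like at the byte level. The lexical details here are assumptions for illustration, not taken from the draft: comments run from `;` to the next LF, strings are delimited by `"`, vertical-bar symbols by `|`, and a backslash inside a string or symbol escapes the next byte. The function name is hypothetical.

```python
from typing import List


def find_disallowed_non_ascii(data: bytes) -> List[int]:
    """Return offsets of bytes >= 0x80 that sit outside comments,
    double-quoted strings, and vertical-bar symbols.

    Assumed syntax (not from the draft): ';' comments to end of line,
    '"' strings and '|' symbols with backslash escapes.
    """
    offsets = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b == ord(";"):
            # Comment: skip everything up to (not including) the LF.
            while i < n and data[i] != 0x0A:
                i += 1
        elif b == ord('"') or b == ord("|"):
            closer = b
            i += 1  # step past the opening delimiter
            while i < n and data[i] != closer:
                if data[i] == ord("\\"):
                    i += 1  # skip the escaped byte
                i += 1
            i += 1  # step past the closing delimiter
        else:
            if b >= 0x80:
                offsets.append(i)
            i += 1
    return offsets


source = '(foo "世界") ; 世界\n|世界| é'.encode("utf-8")
bad = find_disallowed_non_ascii(source)
print(bad, [hex(source[i]) for i in bad])
```

In this input only the trailing `é` falls outside the three permitted places, so only its two UTF-8 bytes are reported.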

@johnwcowan
Collaborator

Comments and quoted strings, yes, provided the only encoding is UTF-8. For my view on symbols, see #3.
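
One reason UTF-8 as the only encoding fits well here: every byte of a multi-byte UTF-8 sequence is 0x80 or above, so a parser can look for the six ASCII whitespace bytes in the raw input without decoding it and without ever hitting the middle of a character. A rough Python sketch of that idea follows; the splitter is deliberately naive (it ignores strings and comments) and its name is made up for illustration.

```python
# The six whitespace bytes from the draft grammar: HT, LF, VT, FF, CR, space.
ASCII_WHITESPACE = frozenset(b"\t\n\x0b\x0c\r ")


def split_on_ascii_whitespace(data: bytes) -> list:
    """Split raw UTF-8 bytes on ASCII whitespace only, without decoding."""
    tokens = []
    current = bytearray()
    for b in data:
        if b in ASCII_WHITESPACE:
            if current:
                tokens.append(bytes(current))
                current = bytearray()
        else:
            current.append(b)
    if current:
        tokens.append(bytes(current))
    return tokens


text = "héllo\u00a0wörld  foo"
byte_tokens = split_on_ascii_whitespace(text.encode("utf-8"))

# NO-BREAK SPACE (U+00A0) is not a separator for the byte-level splitter,
# but it is for the Unicode-aware str.split().
print([t.decode("utf-8") for t in byte_tokens])
print(text.split())
```

The byte-level version yields two tokens, while `str.split()` yields three because it treats U+00A0 as whitespace.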

@lassik
Collaborator Author

lassik commented May 20, 2021

Agreed.

3 participants