Parsing should be less Unicode-aware #8

lassik · 2021-05-13T06:28:19Z

In the F# code (and possibly others) we're using the host language's native char functions, which I assume are Unicode-aware.

The grammar in the current draft has whitespace = HT | LF | VT | FF | CR | space. Expressed as ASCII codepoints, that is 0x09..0x0D (HT..CR) and 0x20 (space). To get consistent parsing across languages, we should detect these bytes explicitly.

Here's an example of ASCII-only character detection from the SML code:

fun charIsWhitespace char =
    let val cc = Char.ord char in
        (cc = 0x20) orelse (cc >= 0x09 andalso cc <= 0x0D)
    end;

fun charIsAlphabetic char =
    ((char >= #"A") andalso (char <= #"Z")) orelse
    ((char >= #"a") andalso (char <= #"z"));

fun charIsNumeric char =
    ((char >= #"0") andalso (char <= #"9"));
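
For comparison, here is a rough Python sketch of the same ASCII-only checks, working on byte values. The function names and the `UNICODE_ONLY_SPACES` list are made up for illustration and are not part of any POSE implementation. The list holds a few codepoints that Python's Unicode-aware `str.isspace()` accepts but the draft grammar does not, which is the kind of mismatch a host language's native char functions can introduce.

```python
def is_ascii_whitespace(byte: int) -> bool:
    """True only for HT..CR (0x09..0x0D) and space (0x20)."""
    return byte == 0x20 or 0x09 <= byte <= 0x0D


def is_ascii_alphabetic(byte: int) -> bool:
    """True only for A..Z and a..z."""
    return 0x41 <= byte <= 0x5A or 0x61 <= byte <= 0x7A


def is_ascii_numeric(byte: int) -> bool:
    """True only for 0..9."""
    return 0x30 <= byte <= 0x39


# Illustrative sample: no-break space, em space, ideographic space.
# str.isspace() treats all three as whitespace; the draft grammar does not.
UNICODE_ONLY_SPACES = ["\u00a0", "\u2003", "\u3000"]

if __name__ == "__main__":
    for ch in UNICODE_ONLY_SPACES:
        print(f"U+{ord(ch):04X}", ch.isspace(), is_ascii_whitespace(ord(ch)))
```

Taking an integer byte value rather than a character keeps the checks independent of whatever character type the host language provides.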

wallymathieu · 2021-05-14T10:16:41Z

It would be nice to have a consistent spec that can deal with non-ASCII characters.

lassik · 2021-05-14T11:04:11Z

The format should definitely permit non-ASCII characters, but IMHO not as syntactically significant. Whitespace has syntactic meaning, and it's non-trivial to keep track of all Unicode whitespace codepoints. We don't want to force all POSE implementations to ship with big Unicode tables.

lassik · 2021-05-14T11:04:38Z

@johnwcowan is a Unicode expert; any advice?

wallymathieu · 2021-05-14T11:42:09Z

I've had some pains in dealing with different encodings. I agree on "should not be syntactically significant".

lassik · 2021-05-14T12:02:45Z

The Go spec says:

White space, formed from spaces (U+0020), horizontal tabs (U+0009), carriage returns (U+000D), and newlines (U+000A), is ignored [...]

So you can write code like fmt.Println("Hello, 世界") but the non-ASCII characters are inside string literals.

I think Go also allows non-ASCII identifiers. If we have vertical bar symbols in POSE, IMHO we should permit non-ASCII in them.

lassik · 2021-05-14T12:03:43Z

So POSE would permit non-ASCII in:

comments
double-quoted strings
vertical-bar symbols

and nowhere else. Is this reasonable?
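
To make the proposed rule concrete, here is a minimal Python sketch that reports non-ASCII bytes appearing outside those three contexts. It rests on assumptions this thread does not settle: that `;` starts a comment running to the end of the line, and that backslash escapes the next byte inside double-quoted strings and vertical-bar symbols. The name `find_misplaced_non_ascii` is hypothetical.

```python
from typing import List


def find_misplaced_non_ascii(data: bytes) -> List[int]:
    """Return offsets of bytes >= 0x80 outside comments, strings, and |symbols|.

    Assumed syntax (not taken from the POSE draft): ';' comments run to LF,
    and backslash escapes one byte inside "..." and |...|.
    """
    offsets = []
    state = "code"  # one of: code, comment, string, symbol
    i = 0
    while i < len(data):
        b = data[i]
        if state == "code":
            if b == ord(";"):
                state = "comment"
            elif b == ord('"'):
                state = "string"
            elif b == ord("|"):
                state = "symbol"
            elif b >= 0x80:
                offsets.append(i)
        elif state == "comment":
            if b == 0x0A:  # LF ends the comment
                state = "code"
        else:  # inside a string or a vertical-bar symbol
            closer = ord('"') if state == "string" else ord("|")
            if b == ord("\\"):
                i += 1  # skip the escaped byte
            elif b == closer:
                state = "code"
        i += 1
    return offsets


if __name__ == "__main__":
    print(find_misplaced_non_ascii('(greet "世界") ; 世界'.encode("utf-8")))
    print(find_misplaced_non_ascii("(é)".encode("utf-8")))
```

Because every syntactically significant byte is below 0x80, the scan never needs to decode UTF-8 or consult a Unicode table.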

johnwcowan · 2021-05-20T18:04:26Z

Comments and quoted strings, yes, provided the only encoding is UTF-8. For my view on symbols, see #3.
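
As a rough illustration of the "only encoding is UTF-8" condition, a reader could reject anything that does not decode as strict UTF-8 before parsing. This is a Python sketch under that assumption; `decode_pose_source` is a hypothetical helper name, not an existing API.

```python
def decode_pose_source(data: bytes) -> str:
    """Decode input as strict UTF-8, raising ValueError for any other encoding."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"input is not valid UTF-8 at byte {exc.start}") from exc


if __name__ == "__main__":
    print(decode_pose_source('"世界"'.encode("utf-8")))
    try:
        decode_pose_source("café".encode("latin-1"))
    except ValueError as err:
        print(err)
```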

lassik · 2021-05-20T18:12:46Z

Agreed.
