RFC: Revise `char` primitive #7028

cometkim · 2024-09-08T16:52:01Z

State of `char` type

ReScript has the char primitive type, which is rarely used. (I was one of the few who used char, to handle ASCII keycodes.)

https://rescript-lang.org/docs/manual/latest/primitive-types#char

Note: Char doesn't support Unicode or UTF-8 and is therefore not recommended.

char doesn't fully support Unicode; it only holds the code point of the first character, as JavaScript reads it from a UTF-16 string.

let a = '👋'

compiles to

let a = 128075;

Its value is the same as the result of '👋'.codePointAt(0) in JavaScript, which means that in the value representation, char is equivalent to int (16-bit subset).
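The equivalence can be checked directly in JavaScript. This is a small sketch; the variable names are illustrative and not part of any ReScript API.

```javascript
// The waving-hand emoji is U+1F44B, which lies outside the BMP.
const wave = "👋";

// codePointAt reads the full code point, matching the compiled char literal.
const codePoint = wave.codePointAt(0); // 128075 (0x1F44B)

// charCodeAt reads a single UTF-16 code unit, here the high surrogate.
const firstUnit = wave.charCodeAt(0); // 55357 (0xD83D)

// As a JS string the character occupies two UTF-16 code units.
const units = wave.length; // 2

console.log(codePoint, firstUnit, units);
```

Note that 128075 does not fit in 16 bits, so a literal like this one already falls outside a 16-bit value range.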

Then, why don't we use just int instead of char?

char literals are automatically compiled to codepoints. This is much more efficient than string representation when dealing with the Unicode data table.
char supports range pattern (e.g. 'a' .. 'z') in pattern matching. This is very useful when writing parsers.

However, a char literal is not really useful to represent a Unicode character because it doesn't cover the entire Unicode sequence. It only returns the first codepoint value and discards the rest of the character segment.
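The same loss can be seen at the JavaScript level. The sketch below uses "é" written as a base letter plus a combining accent; it only illustrates what reading the first code point discards, and does not claim to reproduce the ReScript compiler's handling of such a literal.

```javascript
// "é" written as "e" followed by U+0301 (combining acute accent):
// one visible character, two code points.
const decomposed = "e\u0301";

// Iterating a string yields one element per code point.
const codePoints = Array.from(decomposed, (s) => s.codePointAt(0));
console.log(codePoints); // [ 101, 769 ]

// Reading only the first code point keeps the "e" and drops the accent.
const first = decomposed.codePointAt(0);
console.log(first); // 101
```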

To avoid problems, we should limit the value range of char literals to the BMP (Basic Multilingual Plane, U+0000 to U+FFFF).

Suggestion

I suggest some changes that would keep the useful parts of char but remove its confusion.

Get rid of char type or make it an alias of int
Keep char literal syntax, but with internal representation as regular integers
Limit the char literal range to BMP in the syntax level.
Support range patterns for regular integers
Remove the Char module.


zth · 2024-09-08T19:26:09Z

I like it! "Support range patterns for regular integers" especially. But I'm not very read up on these things.

cristianoc · 2024-09-08T19:33:15Z

Looks good in principle. Just wondering about what kind of functionality one would want.

Any way to use former chars, now int, to build things, as opposed to just taking things apart?
And how do we guarantee they are in the BMP range at that point?

cristianoc · 2024-09-08T19:34:15Z

If ints, formerly char, are not used for building things, then what else should be used? Always strings?

cometkim · 2024-09-08T23:05:49Z

The char type was originally used to interact with the String module from OCaml, but this becomes problematic when supporting a wider range than OCaml's.

In the current version:

// '𐀀' = 0x10000

let _ = String.get("𐀀", n) // behaves like JS's String.prototype.codePointAt

let _ = String.make(10, '𐀀')
//  ^ This has unexpected value "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
//    Wrong on both Js and OCaml side

The proposal "fixes" current behavior while not breaking it. After removing the OCaml modules in #6984, I expect it to be unused, but it shouldn't cause any problems even in the code that is still using it.

The BMP is the boundary beyond which a character is encoded in UTF-16 as a surrogate pair. If the codepoint is in the BMP, it can be represented in JS as a string with length=1 and it's safe to treat a string as a sequence of chars.
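The "\x00" result from String.make is consistent with a 16-bit conversion such as JavaScript's String.fromCharCode. Whether String.make actually goes through that function is an assumption here; the sketch only shows the truncation and the length=1 property for BMP values.

```javascript
// 0x10000 is the first code point above the BMP.
const beyondBmp = 0x10000;

// String.fromCharCode keeps only the low 16 bits, so 0x10000 becomes U+0000.
const truncated = String.fromCharCode(beyondBmp);
console.log(truncated === "\u0000"); // true

// String.fromCodePoint encodes the value as a surrogate pair instead.
const pair = String.fromCodePoint(beyondBmp);
console.log(pair.length); // 2

// A BMP code point round-trips through a string of length 1.
const bmpChar = String.fromCharCode(0x3042); // HIRAGANA LETTER A
console.log(bmpChar.length, bmpChar.charCodeAt(0) === 0x3042); // 1 true
```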

I think it can be easily checked in the scanner.
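A rough idea of what such a check could look like, written in JavaScript for illustration only. `scanCharLiteral` is a hypothetical helper, not the ReScript scanner's actual code, and it ignores escape sequences such as '\n'.

```javascript
// Hypothetical helper: given the text between the quotes of a char literal,
// return its integer value, or throw if it is not a single BMP code point.
function scanCharLiteral(body) {
  // Iterating a string yields one element per code point.
  const codePoints = Array.from(body, (s) => s.codePointAt(0));
  if (codePoints.length !== 1) {
    throw new SyntaxError("char literal must contain exactly one code point");
  }
  const [value] = codePoints;
  if (value > 0xffff) {
    throw new SyntaxError("char literal must be in the BMP (U+0000..U+FFFF)");
  }
  return value;
}

console.log(scanCharLiteral("a")); // 97
```

With this rule, 'a' is accepted while '👋' and a base letter followed by a combining mark are both rejected at scan time.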

cometkim · 2024-09-08T23:15:31Z

When writing tokenizers in ReScript, something like this would still be useful

switch String.codePointAt(input, cursor) {
  | Some(ch) => switch ch {
    | '\r' | '\n' => Whitespace
    | 'a' .. 'z' | 'A' .. 'Z' => ...
  }
  | None => ...
}
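For comparison, here is roughly how the same classification reads in JavaScript with plain integers. This is a hand-written sketch, not actual compiler output; the `classify` function and the "EndOfInput", "Letter", and "Other" token names are made up for the example.

```javascript
// Hypothetical token classifier mirroring the switch above with plain integers.
function classify(input, cursor) {
  const ch = input.codePointAt(cursor);
  if (ch === undefined) return "EndOfInput"; // the None branch
  if (ch === 0x0d || ch === 0x0a) return "Whitespace"; // '\r' | '\n'
  if ((ch >= 0x61 && ch <= 0x7a) || (ch >= 0x41 && ch <= 0x5a)) {
    return "Letter"; // 'a' .. 'z' | 'A' .. 'Z'
  }
  return "Other";
}

console.log(classify("a\n", 0)); // Letter
console.log(classify("a\n", 1)); // Whitespace
```

The range patterns become simple pairs of integer comparisons, which is why they would work equally well on int.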

cristianoc · 2024-09-08T23:42:01Z

When writing tokenizers in ReScript, something like this would still be useful

switch String.codePointAt(input, cursor) {
  | Some(ch) => switch ch {
    | '\r' | '\n' => Whitespace
    | 'a' .. 'z' | 'A' .. 'Z' => ...
  }
  | None => ...
}

That's how one takes things apart.
I guess to put them back together, one will also use functions from String, correct?
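At the JavaScript level, putting integers back together into a string could look like the sketch below. It uses the standard String.fromCodePoint; which functions ReScript's String module would expose for this is not settled in this thread.

```javascript
// Take a string apart into integer code points...
const codes = Array.from("Hello", (s) => s.codePointAt(0));

// ...transform them as integers (here: ASCII lowercase to uppercase)...
const upper = codes.map((c) => (c >= 0x61 && c <= 0x7a ? c - 0x20 : c));

// ...and put them back together with String.fromCodePoint.
const rebuilt = String.fromCodePoint(...upper);
console.log(rebuilt); // HELLO
```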

cristianoc · 2024-09-08T23:53:22Z

One observation is that the internal representation as integers might have to wait until after parsing, as pretty printing will restore the original.
Unless pretty printing itself will do some sort of normalization (when converting back from int).

cometkim mentioned this issue Sep 22, 2024

Drop Caml runtimes and primitives #6984

Merged
