Change bare key characters to Letter and Digit #990

Open
Commits on Sep 23, 2023

  1. Change bare key characters to Letter and Digit

    I believe this would greatly improve things and solve most of the
    issues. It's a bit more complex, but not overly so, and can be
    implemented without a Unicode library without too much effort. It offers
    a good middle ground, IMHO.
    
    I don't think there are ANY perfect solutions here and that *anything*
    will be a trade-off. That said, I do believe some trade-offs are better
    than others, and I've made it no secret that I feel the current
    trade-off is a bad one. After looking at a bunch of different options I
    believe this is by far the best path for TOML.
    
    Advantages:
    
    - This is what I would consider the "minimal set" of characters we need
      to add for reasonable international support, meaning we can't really
      make a mistake with this by accidentally allowing too much.
    
      We can add new ranges in TOML 1.2 (or even change the entire approach,
      although I'd be very surprised if we need to), based on actual
      real-world feedback, but any approach we will take will need to
      include letters and digits from all scripts.
    
      This is a strong argument in favour of this and a huge improvement: we
      can't really do anything wrong here in a way that we can't correct
      later, unlike what we have now, which is "well I think it probably
      won't cause any problems, based on what these 5 European/American guys
      think, but if it does: we won't be able to correct it".
    
      Being conservative with these kinds of things is good!
    
    - This solves the normalisation issues, since combining characters are
      no longer allowed in bare keys, so it becomes a moot point.
    
      For quoted keys normalisation is mostly a non-issue because few people
      use them, which is why this has gone largely unnoticed and undiscussed
      before the "Unicode in bare keys" PR was merged.[1]
    
    - It's consistent in what we allow: no "this character is allowed, but
      this very similar other thing isn't, what gives?!"
    
      Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
      "this character works fine, but this very similar one doesn't". This
      shows up in a number of things aside from emojis:
    
          a.toml:
                  Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
                  Error:   line 1: expected '.' or '=', but got ';' instead
    
          b.toml:
                  Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
                  Error:   (none)
    
          c.toml:
                  Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
                  Error:   line 1: expected '.' or '=', but got '–' instead
    
          d.toml:
                  Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
                  Error:   (none)
    
          e.toml:
                  Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
                  Error:   (none)
    
      "Some punctuation is allowed but some isn't" is hard to explain, and
      also not what the specification says: "Punctuation, spaces, arrows,
      box drawing and private use characters are not allowed." In reality, a
      lot of punctuation IS allowed, but not all (especially outside of the
      Latin character range by the way, which shows the Euro/US bias in how
      it's written).
    
      People don't read specifications in great detail, nor should they.
      People try something and see if it works. Now it seems to work at
      first approximation, and then (possibly months or years later) it
      seems to "suddenly break". From the user's perspective this seems like
      a bug in the TOML parser, but it's not: it's a bug in the
      specification. It should either allow everything or nothing. This
      in-between is confusing and horrible.
    
      There is no good way to communicate this other than "these codepoints,
      which cover most of what you'd write in a sentence, except when it
      doesn't".
    
      In contrast, "we allow letters and digits" is simple to spec, simple
      to communicate, and should have minimal potential for confusion. The
      current spec disallows some things seemingly almost arbitrarily while
      allowing other very similar characters.
    
    - This avoids a long list of confusable special TOML characters; some
      were mentioned above but there are many more:
    
          '＃' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
          '＂' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
          '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
          '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
          '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
          '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
          '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
          '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
          '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
          'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
          '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
          '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
          '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)
    
      Is this a big problem? I guess it depends; I can certainly imagine an
      Armenian speaker accidentally leaving an Armenian apostrophe.
    
      Confusables are also an issue across different scripts (Latin and
      Cyrillic is the well-known case), but this is less of an issue since
      it's not syntax, and also something that's fundamentally unavoidable
      in any multi-script environment.
    
    - Maps closer to identifiers in more (though not all) languages. We
      discussed whether TOML keys are "strings" or "identifiers" last week
      in toml-lang#966 and while views differ (mostly because they're both) it seems
      to me that making it map *closer* is better. This is a minor issue,
      but it's nice.
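
    To make the consistency argument above concrete, here is a rough
    sketch that classifies the characters from a.toml through e.toml by
    their Unicode General_Category. It assumes "Letter and Digit" means
    categories L* and Nd plus the ASCII '-' and '_' that bare keys already
    allow; the exact set in the spec text may differ. The function name
    is_bare_key_char is made up for illustration.

```python
import unicodedata


def is_bare_key_char(ch):
    """Assumed rule (not the spec text): '-', '_', any L* or Nd character."""
    if ch in "-_":
        return True
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nd"


# The five characters from a.toml .. e.toml, plus one letter and one
# non-ASCII digit for comparison.
samples = ["\u037e", "\u0387", "\u2013", "\u207b", "\uff03", "\u00e9", "\u0663"]

for ch in samples:
    verdict = "allowed" if is_bare_key_char(ch) else "rejected"
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} "
          f"{unicodedata.name(ch)}: {verdict}")
```

    Under that assumption all five punctuation and symbol characters are
    rejected for the same reason (they are neither letters nor digits),
    rather than some passing and some failing.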
    
    That does not mean it's perfect; as I mentioned all solutions come with
    a trade-off. The ones made here are:
    
    - The biggest issue by far is that the check to see if a character is
      valid may become more complex for some languages and environments that
      can't rely on a Unicode database being present.
    
      However, implementing this check is trivial logic-wise: it just needs
      to loop over every character and check if it's in a range table. You
      already need this with TOML 1.0, it's just that the range tables
      become larger.
    
      The downside is it needs a somewhat large-ish "allowed characters"
      table with 716 start/stop ranges, which is not ideal, but entirely
      doable and easily auto-generated. It's ~164 lines hard-wrapped at
      column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
      lines, so that seems within the limits of reason (actually, reading
      through the tomlc99 code adding multibyte support at all will be the
      harder part, with this range table being a minor part).
    
    - There's a new Unicode version roughly every year or so, and the way
      it's written now means it's "locked" to Unicode 9 or, optionally, a
      later version. This is probably fine: Apple's APFS filesystem (which
      does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
      Go is Unicode 8.0, etc. I don't think this is really much of an issue
      in practice.
    
      I chose Unicode 9 as everyone supports it; I went back and forth on
      this for a long time, and we could also use a more recent version. I
      feel this gives us a nice balance between reasonable interoperability
      and future-proofing.
    
    - ABNF doesn't support Unicode. This is a tooling issue, and in my
      opinion the tooling should adjust to how we want TOML to look,
      rather than adjusting TOML to what tooling supports. AFAIK no one uses
      the ABNF directly in code, and it's merely "informational".
    
      I'm not happy with this, but personally I think this should be a
      non-issue when considering what to do here. We're not the only people
      running into this limitation, and it is really something that the IETF
      should address in a new RFC or something ("Extra Augmented BNF"?)
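
    For the first trade-off, the "loop over every character and check if
    it's in a range table" logic could look roughly like the sketch below.
    The table here is a tiny illustrative subset (ASCII plus the Latin-1
    letters), not the real 716-range table, and the names RANGES,
    in_ranges and is_bare_key are made up for this example.

```python
import bisect

# Sorted, non-overlapping, inclusive (start, end) code point ranges.
# Illustrative subset only; a real table would be auto-generated.
RANGES = [
    (0x2D, 0x2D),  # '-'
    (0x30, 0x39),  # 0-9
    (0x41, 0x5A),  # A-Z
    (0x5F, 0x5F),  # '_'
    (0x61, 0x7A),  # a-z
    (0xC0, 0xD6),  # Latin-1 letters, skipping U+00D7 MULTIPLICATION SIGN
    (0xD8, 0xF6),  # Latin-1 letters, skipping U+00F7 DIVISION SIGN
    (0xF8, 0xFF),
]
STARTS = [start for start, _ in RANGES]


def in_ranges(cp):
    """Binary search: find the last range starting at or before cp."""
    i = bisect.bisect_right(STARTS, cp) - 1
    return i >= 0 and cp <= RANGES[i][1]


def is_bare_key(key):
    """A bare key is non-empty and every code point is in the table."""
    return bool(key) and all(in_ranges(ord(ch)) for ch in key)


for key in ["Sinn_Fein", "Sinn_F\u00e9in", "a;b", "\uff03x"]:
    print(repr(key), is_bare_key(key))
```

    A linear scan over the table would work just as well; the binary
    search only matters once the table grows to hundreds of ranges.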
    
    Another solution I tried is restricting the code ranges; I twice tried
    to do this (with some months in-between) and spent a long time looking
    at Unicode blocks and ranges, and I found this impractical: we'll end up
    with a long list which isn't all that different from what this proposal
    adds.
    
    Fixes toml-lang#954
    Fixes toml-lang#966
    Fixes toml-lang#979
    Ref toml-lang#687
    Ref toml-lang#891
    Ref toml-lang#941
    
    ---
    
    [1]:
    Aside: I encountered this just the other day as I created a TOML file
    with all UK election results since 1945, which looks like:
    
         [1950]
         Labour       = [13_266_176, 315, 617]
         Conservative = [12_492_404, 298, 619]
         Liberal      = [ 2_621_487,   9, 475]
         Sinn_Fein    = [    23_362,   0,   2]
    
    That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
    it as Sinn_Fein. This is what most people seem to do.
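
    On the normalisation point, using the same key: the sketch below
    compares the NFC and NFD forms of "Sinn_Féin". It assumes "letters
    and digits" means General_Category L* and Nd only, and ignores the
    Other_Alphabetic property (which Python's unicodedata does not
    expose), so treat it as an illustration rather than the exact rule.

```python
import unicodedata


def is_letter_or_digit(ch):
    # Assumed approximation of the rule: General_Category L* or Nd.
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nd"


composed = unicodedata.normalize("NFC", "Sinn_F\u00e9in")
decomposed = unicodedata.normalize("NFD", composed)

for label, key in (("NFC", composed), ("NFD", decomposed)):
    rejected = [f"U+{ord(c):04X}" for c in key
                if c != "_" and not is_letter_or_digit(c)]
    print(label, len(key), "code points, rejected:", rejected)
```

    The decomposed form contains U+0301 COMBINING ACUTE ACCENT, which is
    a mark rather than a letter, so under this assumption only the
    composed spelling can appear as a bare key.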
    arp242 committed Sep 23, 2023
    Commit 2543ac3
  2. Include all codepoints in the ABNF; add Other_Alphabetic

    Per comments; also commit the script used to generate the ABNF ranges.
    Probably want to replace that with something less, ehm, crap, so it's
    easier for people to modify and run ... This was just quick and easy to
    write for me now.
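
    The committed script is not reproduced here. As a rough idea of what
    generating such ranges involves, the sketch below walks every code
    point, keeps those with General_Category L* or Nd (an assumption; it
    leaves out Other_Alphabetic), merges neighbours into start/stop
    ranges, and prints them in ABNF notation. It uses whatever Unicode
    version ships with the running Python, not necessarily Unicode 9, so
    the number of ranges will vary. All function names are made up.

```python
import sys
import unicodedata


def allowed(cp):
    # Assumed rule: General_Category L* or Nd.
    cat = unicodedata.category(chr(cp))
    return cat.startswith("L") or cat == "Nd"


def build_ranges():
    """Merge consecutive allowed code points into inclusive ranges."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if allowed(cp):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


def to_abnf(ranges):
    """Format ranges as ABNF hex values, e.g. %x41-5A or %x5F."""
    return [f"%x{lo:X}" if lo == hi else f"%x{lo:X}-{hi:X}"
            for lo, hi in ranges]


ranges = build_ranges()
print(len(ranges), "ranges for Unicode", unicodedata.unidata_version)
print(" / ".join(to_abnf(ranges)[:5]))
```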
    arp242 committed Sep 23, 2023
    Commit 611af82