String start/end delimiters: square brackets or something else? #5

Open
robla opened this issue Jun 12, 2021 · 6 comments

Comments

@robla
Contributor

robla commented Jun 12, 2021

@nealmcb raised the issue of UTF-8 support in ABIF. I fully agree that it should be UTF-8, and I think you'd be hard pressed to find someone who objects. I've been referring to it as ASCII, but ASCII is the subset of UTF-8 that most English-speaking developers know how to deal with. That said, the test cases on the electowiki ABIF page already use several names that imply UTF-8 support is needed:

  • Doña García Márquez
  • Sue Ye (蘇業)
  • Adam Muñoz

Thank you, @nealmcb, for updating the ABIF electowiki page (I'm assuming you're the same "nealmcb" in both places). We should assume that any ABIF document may contain characters outside of lower ASCII (i.e. code points above U+007F, which will be interpreted according to the UTF-8 spec).

One important note: I also suggest that bare words be much more limited (e.g. the ASCII characters for [A-Z] and [a-z]), and that we have a mapping mechanism from full strings to bare tokens, such as an optional header like this:

[Doña García Márquez]: DGM
[Sue Ye (蘇業)]: SY
[Adam Muñoz]: AM

...which would allow for mapping full UTF-8 strings to bare tokens. For example, the header above would map the string "Sue Ye (蘇業)" to the characters "SY". Pretty much any UTF-8 string should be allowed between square brackets ("[" and "]"), except for square brackets themselves. It seems wise to leave the presence of an opening square bracket after the first opening square bracket unspecified, and wait until we have some implementations that need a specification in order to interoperate.
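As a rough illustration of the header idea above, here is a minimal Python sketch that reads mapping lines of the form `[full name]: TOKEN`. The regular expression and the function name `parse_mapping_header` are assumptions made for this example, not part of any agreed ABIF grammar; it simply rejects any line where a square bracket appears inside the brackets.

```python
import re

# Assumed line shape: "[any text without square brackets]: BARE"
# where BARE is limited to the ASCII letters A-Z and a-z.
HEADER_RE = re.compile(r"\[([^\[\]]+)\]:\s*([A-Za-z]+)\s*")


def parse_mapping_header(lines):
    """Return a dict mapping each full name to its bare token."""
    mapping = {}
    for line in lines:
        match = HEADER_RE.fullmatch(line)
        if match is None:
            raise ValueError(f"not a mapping line: {line!r}")
        fullname, bare = match.groups()
        mapping[fullname] = bare
    return mapping


header = [
    "[Doña García Márquez]: DGM",
    "[Sue Ye (蘇業)]: SY",
    "[Adam Muñoz]: AM",
]
mapping = parse_mapping_header(header)
```

With the three header lines from this comment, `mapping["Sue Ye (蘇業)"]` is `"SY"`.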

It also seems wise to limit everything outside of square brackets to 7-bit ASCII characters, much like many popular programming languages and data formats used internationally, and only allow non-ASCII (multi-byte) UTF-8 characters in quoted strings.

An alternative to square brackets could be to use quotation marks (much like JSON, YAML and others) and then use the backslash escaping mechanism. Given the balancing issues that I've seen over the years with quotation marks, as well as the weird ambiguity and varying use of single quote (') and double quote ("), and given that many candidate names seem more likely to include quoted nicknames in English-speaking countries (e.g. [Richard "Dick" Nixon]), I think square brackets make a better quoting mechanism. But I also think now is the time to make the case for an alternative.

Thoughts?

@brainbuz
Contributor

brainbuz commented Jun 15, 2021

I think square brackets are reasonable at this point; however, they might not even be necessary. If a choices list is a required component of a file (and possibly an optional divisions list, if the format is going to support preservation of precinct or other subdivisions), it could look something like this:

=choices
DGM: Doña García Márquez
SY: Sue Ye (蘇業)
AM: Adam Muñoz

# The spec: everything from the colon to the newline (\n), with leading and trailing white space stripped, is the description.
=divisions
WARD1_DIV1
WARD1_DIV2

# when the description is missing it defaults to the identifier.

This is more readable, requires less typing, and makes the lines easy to parse.

Also, since these identifiers will likely become the keys in key-value lookups in code, restricting the character set further (for example to A-Z, 0-9 and limited punctuation) is a good idea and helps prevent humans from making typos (Ward1_Div2 has more opportunity for an entry error on character case than WARD1_DIV2).

To word this for a specification:

  A data block begins with the name of the block preceded by an equal sign on a line.
  A data block ends with a blank line.
  For Key value pairs the key may contain the characters A-Z and 0-9 plus 
  underscores and possibly some other punctuation characters.
  A Key Value Pair is the key followed by a colon. The value is everything on the line after the colon, 
  with any leading or trailing white-space stripped. If a key is given and the value is omitted,
  the value is defaulted to the key. 
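The wording above can be sketched as a small Python parser. This is only one possible reading of the proposal: the key alphabet (`A-Z`, `0-9`, underscore), the treatment of `#` lines as comments, and the name `parse_blocks` are assumptions for the example.

```python
import re

# Assumed key alphabet: A-Z, 0-9 and underscore only.
KEY_RE = re.compile(r"[A-Z0-9_]+")


def parse_blocks(text):
    """Parse '=name' data blocks into {block name: {key: value}}."""
    blocks = {}
    current = None
    for raw in text.split("\n"):
        line = raw.strip()
        if line == "":
            current = None  # a blank line ends the current block
        elif line.startswith("#"):
            continue  # treating '#' lines as comments is an assumption
        elif line.startswith("="):
            current = blocks.setdefault(line[1:].strip(), {})
        elif current is not None:
            key, _, value = line.partition(":")
            key = key.strip()
            if KEY_RE.fullmatch(key) is None:
                raise ValueError(f"invalid key: {key!r}")
            value = value.strip()
            # An omitted value defaults to the key itself.
            current[key] = value if value else key
        else:
            raise ValueError(f"line outside any block: {line!r}")
    return blocks


sample = """=choices
DGM: Doña García Márquez
SY: Sue Ye (蘇業)
AM: Adam Muñoz

=divisions
WARD1_DIV1
WARD1_DIV2
"""
blocks = parse_blocks(sample)
```

Here `blocks["choices"]["SY"]` is the full name, and `blocks["divisions"]["WARD1_DIV2"]` falls back to the key because no description was given.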

Probably worth a separate issue, but the mention of newline made me think of it: \r and BOM should be discouraged, possibly by requiring a validator to reject them as invalid characters.
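A validator check along those lines could be as small as the following Python sketch. The function name and the `(line number, label)` report format are made up for illustration; whether such characters should be fatal or merely reported is still an open question here.

```python
BOM = "\ufeff"  # U+FEFF, the byte order mark once decoded from UTF-8


def find_discouraged_characters(text):
    """Return (line number, label) pairs for each BOM or carriage return."""
    problems = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if BOM in line:
            problems.append((lineno, "BOM"))
        if "\r" in line:
            problems.append((lineno, "CR"))
    return problems


report = find_discouraged_characters("\ufeff=choices\r\nDGM\n")
```

A validator could then reject the file whenever the returned list is non-empty.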

@nealmcb

nealmcb commented Jun 15, 2021 via email

@robla
Contributor Author

robla commented Jun 15, 2021

I think this issue (issue #5) has become the place where we answer the question "what is a valid candidate token?", and I think we're coming to rough consensus on this point. Here's the taxonomy I'm seeing for candidate tokens:

  • Bare token - This is a continuous string of uppercase English letters A through Z and lowercase English letters a through z, with no whitespace or other intervening characters.
  • Fullname token - This is an arbitrary UTF-8 string, surrounded by square brackets
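The two token kinds in this taxonomy can be told apart with two regular expressions. The sketch below is illustrative only: `classify_token` is a hypothetical helper, and it assumes a fullname token must be non-empty and contain no square brackets.

```python
import re

BARE_RE = re.compile(r"[A-Za-z]+")          # letters only, no whitespace
FULLNAME_RE = re.compile(r"\[([^\[\]]+)\]")  # anything but brackets inside


def classify_token(token):
    """Return ('bare', text) or ('fullname', text) for a candidate token."""
    if BARE_RE.fullmatch(token):
        return ("bare", token)
    match = FULLNAME_RE.fullmatch(token)
    if match:
        return ("fullname", match.group(1))
    raise ValueError(f"not a valid candidate token: {token!r}")


kinds = [classify_token(t) for t in ["DGM", "[Sue Ye (蘇業)]"]]
```

An unbracketed string with spaces, such as `Sue Ye`, matches neither pattern and raises `ValueError`.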

There's a separate issue I wish to file regarding the mapping of bare tokens to fullname tokens. We also still have issue #6, which is where we're discussing the structure of an overall ABIF file, and how much line-by-line state is necessary for interpreting ABIF. There are a number of issues that one of us could file as separate issues, but I'll just enumerate them here in a bulleted list in hopes that others who care about speedy and clear resolution of the issues will file issues and elaborate/clarify their thoughts:

  • Should fullname tokens be allowed everywhere that bare tokens are allowed? (I'm leaning toward "yes")
  • Are bare tokens and fullname tokens equivalent when they resolve to the same string? For example: is "Adams" equal to "[Adams]"? (I'm leaning toward "yes")
  • Are candidate tokens case sensitive? Are uppercase English letters and lowercase English letters semantically equivalent, or are they distinct? For example, are "Adams" and "ADAMS" equivalent or distinct? (I'm leaning toward answering these questions "case sensitive" and "distinct")
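To make the leanings above concrete, here is a tiny Python sketch of what "equivalent when they resolve to the same string" and "case sensitive" would mean together. Both function names are hypothetical, and none of the three questions has been decided.

```python
def candidate_key(token):
    """Strip one pair of surrounding square brackets, if present."""
    if len(token) >= 2 and token.startswith("[") and token.endswith("]"):
        return token[1:-1]
    return token


def same_candidate(a, b):
    """Compare tokens by resolved string; the comparison is case sensitive."""
    return candidate_key(a) == candidate_key(b)


bracket_equal = same_candidate("Adams", "[Adams]")
case_equal = same_candidate("Adams", "ADAMS")
```

Under these assumptions `bracket_equal` is `True` and `case_equal` is `False`.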

I'm almost certainly not inclined to file any of the three bullets above as issues in the next 24 hours, so I would be delighted if someone filed one (or all) of these as separate issues.

@simberaj

I'm leaning towards the same options as @robla.

@brainbuz
Contributor

brainbuz commented Jan 2, 2022

Since I started work on writing a parser I would like to resolve more about fullname tokens.

  • Because fullname tokens at this point are only restricted from including [], the possibility of including characters that have other meaning on the ballot makes it difficult to read the line with regexes.
  • A dictionary of valid choices is good structure and provides a validation mechanism for the ballot files.

So I would prefer to not allow fullname tokens in the ballot lines.

While I think there would be support for optional metadata flags for this, the decision will probably be to not enforce the choices list by default and to allow fullname tokens in the ballot lines. I would prefer not to write code for all possible options.

It appears resolved at this point that the fullname token delimiter will be [] and that [] will not be allowed in them. Beyond that:

  1. Can fullname tokens appear in a ballot line?
  2. Are there any character restrictions on fullname tokens, other than excluding [] and requiring visible characters?
  3. If a choices list is defined, can the fullname and bare tokens be used interchangeably in ballots?
  4. If a choices list is defined, does that restrict valid choices to it?
  5. If choices are restricted to the defined list, are violations fatal to parsing, ignored, or reported as warnings?

My votes are 1: no, 2: no, 3: no, 4: yes, 5: fatal or warn.
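Questions 4 and 5 could be handled by a single check with a configurable policy, roughly as in this Python sketch. The function name `check_choices` and the `on_violation` parameter with its three values are assumptions for illustration, not an agreed interface.

```python
import warnings


def check_choices(ballot_tokens, choices, on_violation="warn"):
    """Return tokens missing from the choices list, applying the policy."""
    if on_violation not in ("fatal", "warn", "ignore"):
        raise ValueError(f"unknown policy: {on_violation!r}")
    unknown = [token for token in ballot_tokens if token not in choices]
    if unknown:
        message = f"tokens not in choices list: {unknown}"
        if on_violation == "fatal":
            raise ValueError(message)
        if on_violation == "warn":
            warnings.warn(message)
        # "ignore" reports nothing and just returns the list
    return unknown


choices = {"DGM", "SY", "AM"}
unknown = check_choices(["DGM", "XX", "SY"], choices, on_violation="ignore")
```

Returning the list of unknown tokens in every non-fatal mode lets the caller decide what to do with them.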

@simberaj

simberaj commented Feb 2, 2022

My votes: 1-yes, 2-no, 3-no, 4-yes, 5-warn.

My preferences here are probably influenced by my native language, which contains many non-ASCII characters, so I would approach them leniently; I can easily imagine the case where all ballot lines contain only fullname tokens.

Otherwise, while I think the option of having fullname tokens mixed with bare ones on a ballot line could also be useful for write-in candidates, disallowing it will probably help catch many unwanted errors (typos), so I agree with your view; I would stick to warnings in the case of candidate list violation since the parser can still resolve the situation unambiguously.

Ad 2: would this rather be a non-control character?
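One way to express a "non-control character" rule is through Unicode general categories, as in this Python sketch. It assumes "control" means category `Cc` (which includes tab, newline and DEL) and uses a hypothetical helper name; other categories such as `Cf` (format characters) are left untouched here.

```python
import unicodedata


def is_valid_fullname(text):
    """True if text is non-empty, has no brackets and no Cc characters."""
    if not text:
        return False
    for ch in text:
        if ch in "[]":
            return False
        if unicodedata.category(ch) == "Cc":  # control characters
            return False
    return True


results = [
    is_valid_fullname("Sue Ye (蘇業)"),
    is_valid_fullname("Tab\there"),
    is_valid_fullname("a]b"),
]
```

The non-ASCII name passes, while the string with a tab and the string with a closing bracket are rejected.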


4 participants