String start/end delimiters: square brackets or something else? #5

robla · 2021-06-12T08:57:54Z

@nealmcb raised the issue of UTF-8 support in ABIF. I fully agree that it should be UTF-8, and I think you'd be hard pressed to find someone who objects. I've been referring to it as ASCII, but ASCII is the subset of UTF-8 that most English-speaking developers know how to deal with. That said, the test cases on the electowiki ABIF page already use several names that imply UTF-8 support is needed:

Doña García Márquez
Sue Ye (蘇業)
Adam Muñoz

Thank you, @nealmcb, for updating the ABIF electowiki page (I'm assuming you're the same "nealmcb" in both places). We should assume that any ABIF document may contain characters outside of ASCII (i.e. code points above U+007F), which will be interpreted according to the UTF-8 spec.

One important note: I also suggest that bare words be much more limited (e.g. the ASCII characters for [A-Z] and [a-z]), and that we have a mapping mechanism from full strings to bare tokens, such as an optional header like this:

[Doña García Márquez]: DGM
[Sue Ye (蘇業)]: SY
[Adam Muñoz]: AM

...which would allow for mapping full UTF-8 strings to bare tokens. For example, the header above would map the string "Sue Ye (蘇業)" to the characters "SY". Pretty much any UTF-8 string should be allowed between square brackets ("[" and "]"), except for square brackets themselves. It seems wise to leave the presence of an opening square bracket after the first opening square bracket unspecified, and wait until we have some implementations that need a specification in order to interoperate.
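To make the mapping idea concrete, here is a minimal Python sketch of how such a header might be read. The function name `parse_mapping_header` and the `HEADER_LINE` pattern are hypothetical, not part of any ABIF spec. The sketch also assumes that a second opening bracket inside the brackets is rejected, which the proposal above deliberately leaves unspecified.

```python
import re

# Hypothetical header-line pattern: "[full name]: TOKEN".
# The full name may hold any characters except square brackets;
# the bare token is limited to ASCII letters, as suggested above.
HEADER_LINE = re.compile(r"^\[([^\[\]]+)\]:\s*([A-Za-z]+)\s*$")


def parse_mapping_header(lines):
    """Return a dict mapping bare tokens to full names."""
    mapping = {}
    for line in lines:
        match = HEADER_LINE.match(line)
        if match is None:
            raise ValueError(f"not a valid mapping line: {line!r}")
        full_name, token = match.groups()
        mapping[token] = full_name
    return mapping


header = [
    "[Doña García Márquez]: DGM",
    "[Sue Ye (蘇業)]: SY",
    "[Adam Muñoz]: AM",
]
print(parse_mapping_header(header)["SY"])  # Sue Ye (蘇業)
```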

It also seems wise to limit everything outside of square brackets to 7-bit ASCII characters, much like many popular programming languages and data formats used internationally, and to allow non-ASCII (multi-byte) UTF-8 characters only inside quoted (square-bracketed) strings.
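A validator could check the "ASCII outside brackets" rule roughly as follows. This is only a sketch: `outside_brackets_is_ascii` and `BRACKETED` are made-up names, and it assumes bracketed strings never nest.

```python
import re

# Matches one bracketed string: "[", any non-bracket characters, "]".
BRACKETED = re.compile(r"\[[^\[\]]*\]")


def outside_brackets_is_ascii(line):
    """True if everything outside [...] spans is 7-bit ASCII."""
    remainder = BRACKETED.sub("", line)
    return remainder.isascii()


print(outside_brackets_is_ascii("[Doña García Márquez]: DGM"))  # True
print(outside_brackets_is_ascii("Doña García Márquez: DGM"))    # False
```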

An alternative to square brackets could be to use quotation marks (much like JSON, YAML and others) and then use the backslash escaping mechanism. Given the balancing issues that I've seen over the years with quotation marks, as well as the weird ambiguity and varying use of single quote (') and double quote ("), and given that many candidate names seem more likely to include quoted nicknames in English-speaking countries (e.g. [Richard "Dick" Nixon]), I think square brackets make a better quoting mechanism. But I also think now is the time to make the case for an alternative.

Thoughts?


brainbuz · 2021-06-15T02:38:27Z

I think square brackets are reasonable at this point; however, they might not even be necessary. If a choices list is a required component of a file (and possibly an optional divisions list if the format is going to support preservation of precinct or other subdivisions), it could look something like this:

=choices
DGM: Doña García Márquez
SY: Sue Ye (蘇業)
AM: Adam Muñoz

# Everything from the colon to the newline (\n), with leading and trailing whitespace stripped, is the description.
=divisions
WARD1_DIV1
WARD1_DIV2

# when the description is missing it defaults to the identifier.

This is more readable, requires less typing, and makes the lines easy to parse.

Also, since these identifiers will likely become the keys in key-value lookups in code, restricting the character set further (for example to A-Z, 0-9, and limited punctuation) is a good idea and helps prevent typos: Ward1_Div2 offers more opportunity for a letter-case entry error than WARD1_DIV2.

To word this for a specification:

  A data block begins with a line containing the name of the block preceded by an equal sign.
  A data block ends with a blank line.
  For key-value pairs, the key may contain the characters A-Z and 0-9 plus
  underscores and possibly some other punctuation characters.
  A key-value pair is the key followed by a colon. The value is everything on the line after the colon,
  with any leading or trailing whitespace stripped. If a key is given and the value is omitted,
  the value defaults to the key.
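The wording above is simple enough to sketch as a small Python parser. Everything here is illustrative: `parse_blocks` and `KEY` are hypothetical names, lines starting with `#` are assumed to be comments, and the key pattern also accepts lowercase letters, which the wording above does not settle.

```python
import re

# Assumed key alphabet: letters, digits and underscore.
KEY = re.compile(r"[A-Za-z0-9_]+")


def parse_blocks(text):
    """Parse '=name' blocks into {block_name: {key: value}}."""
    blocks = {}
    current = None
    for raw in text.split("\n"):
        line = raw.strip()
        if not line:
            current = None  # a blank line ends the block
        elif line.startswith("#"):
            continue  # assumption: '#' starts a comment line
        elif line.startswith("="):
            current = blocks.setdefault(line[1:].strip(), {})
        elif current is not None:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            if not KEY.fullmatch(key):
                raise ValueError(f"invalid key: {key!r}")
            # an omitted value defaults to the key
            current[key] = value if value else key
        else:
            raise ValueError(f"line outside any block: {raw!r}")
    return blocks


sample = """=choices
DGM: Doña García Márquez
SY: Sue Ye (蘇業)
AM: Adam Muñoz

=divisions
WARD1_DIV1
WARD1_DIV2
"""
result = parse_blocks(sample)
print(result["choices"]["SY"])            # Sue Ye (蘇業)
print(result["divisions"]["WARD1_DIV1"])  # WARD1_DIV1
```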

This is probably worth a separate issue, but the mention of newline made me think of it: \r and BOM should be discouraged, possibly by requiring a validator to reject them as invalid characters.
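A validator check for this could be as small as the following Python sketch. The function name `find_discouraged_characters` is hypothetical, and the sketch assumes the file was decoded with the plain "utf-8" codec, so that a leading BOM survives as U+FEFF (Python's "utf-8-sig" codec would strip it silently).

```python
BOM = "\ufeff"


def find_discouraged_characters(text):
    """Return (line_number, description) pairs for each BOM or \\r found."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if BOM in line:
            problems.append((number, "byte order mark (U+FEFF)"))
        if "\r" in line:
            problems.append((number, "carriage return (\\r)"))
    return problems


# A file saved with a BOM and Windows line endings on its first line.
for number, description in find_discouraged_characters("\ufeff=choices\r\nA: Alice\n"):
    print(f"line {number}: {description}")
```

Whether a hit should be fatal or only a warning is a policy choice the validator would still need to make.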

nealmcb · 2021-06-15T13:02:36Z

Just a quick note to say that lowercase should be allowed.


robla · 2021-06-15T22:15:36Z

I think this issue (issue #5) has become the place where we answer the question "what is a valid candidate token?", and I think we're coming to rough consensus on this point. Here's the taxonomy I'm seeing for candidate tokens:

Bare token - This is a contiguous string of uppercase English letters A through Z and lowercase English letters a through z, with no whitespace or other intervening characters.
Fullname token - This is an arbitrary UTF-8 string, surrounded by square brackets.
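As a rough illustration, the two token kinds could be recognized with regular expressions like these. `BARE_TOKEN`, `FULLNAME_TOKEN` and `classify_token` are hypothetical names, and the sketch assumes a fullname token may not be empty or contain a newline, neither of which has been decided here.

```python
import re

# Bare token: one or more ASCII letters, nothing else.
BARE_TOKEN = re.compile(r"[A-Za-z]+")
# Fullname token: square brackets around any characters except
# brackets and newlines (the non-empty requirement is an assumption).
FULLNAME_TOKEN = re.compile(r"\[[^\[\]\n]+\]")


def classify_token(token):
    """Return 'bare', 'fullname' or 'invalid' for a candidate token."""
    if BARE_TOKEN.fullmatch(token):
        return "bare"
    if FULLNAME_TOKEN.fullmatch(token):
        return "fullname"
    return "invalid"


print(classify_token("DGM"))            # bare
print(classify_token("[Sue Ye (蘇業)]"))  # fullname
print(classify_token("Sue Ye"))         # invalid
```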

There's a separate issue I wish to file regarding the mapping of bare tokens to fullname tokens. We also still have issue #6, which is where we're discussing the structure of an overall ABIF file, and how much line-by-line state is necessary for interpreting ABIF. There are a number of questions that one of us could file as separate issues, but I'll just enumerate them here in a bulleted list in hopes that others who care about speedy and clear resolution will file issues and elaborate/clarify their thoughts:

Should fullname tokens be allowed everywhere that bare tokens are allowed? (I'm leaning toward "yes")
Are bare tokens and fullname tokens equivalent when they resolve to the same string? For example: is "Adams" equal to "[Adams]"? (I'm leaning toward "yes")
Are candidate tokens case sensitive? Are uppercase English letters and lowercase English letters semantically equivalent, or are they distinct? For example, are "Adams" and "ADAMS" equivalent or distinct? (I'm leaning toward answering these questions "case sensitive" and "distinct")
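If the answers I'm leaning toward were adopted, token comparison might look like this Python sketch. The helper names `token_value` and `same_candidate` are made up for illustration, and nothing here is settled.

```python
def token_value(token):
    """Resolve a token to the string it names, dropping [] delimiters."""
    if token.startswith("[") and token.endswith("]"):
        return token[1:-1]
    return token


def same_candidate(a, b):
    """Compare two tokens by the strings they resolve to."""
    # Plain string comparison, so the match is case sensitive.
    return token_value(a) == token_value(b)


print(same_candidate("Adams", "[Adams]"))  # True
print(same_candidate("Adams", "ADAMS"))    # False
```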

I'm almost certainly not inclined to file any of the three bullets above as issues in the next 24 hours, so I would be delighted if someone filed one (or all) of these as separate issues.

simberaj · 2021-06-16T20:02:59Z

I'm leaning towards the same options as @robla.

brainbuz · 2022-01-02T20:01:04Z

Since I started work on writing a parser I would like to resolve more about fullname tokens.

Because fullname tokens at this point are only restricted from including [], the possibility of including characters that have other meaning on the ballot makes it difficult to read the line with regexes.
A dictionary of valid choices is good structure and provides a validation mechanism for the ballot files.

So I would prefer to not allow fullname tokens in the ballot lines.

While I think there would be support for an optional metadata flag for this, the decision will probably be to not enforce the choices list by default and to allow fullname tokens in the ballot lines. I would prefer not to write code for all possible options.

It appears resolved at this point that the fullname token delimiter will be [] and that [] will not be allowed in them. Beyond that:

1. Can fullname tokens appear in a ballot line?
2. Are there any character restrictions on fullname tokens other than "not []" and "must be a visible character"?
3. If a choices list is defined, can the fullname and bare tokens be used interchangeably in ballots?
4. If a choices list is defined, does that restrict valid choices to it?
5. If choices are restricted to the defined list, are violations fatal to parsing, ignored, or warned about?

My votes are 1: no, 2: no, 3: no, 4: yes, 5: fatal or warn.
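For questions 4 and 5, the three possible behaviours could hang off a single setting, roughly as in this Python sketch. `check_choices` and its `on_violation` parameter are hypothetical names, not an existing parser API.

```python
import warnings


def check_choices(ballot_tokens, choices, on_violation="warn"):
    """Return the ballot tokens that are not in the declared choices list."""
    unknown = [t for t in ballot_tokens if t not in choices]
    if unknown:
        message = f"tokens not in choices list: {unknown}"
        if on_violation == "fatal":
            raise ValueError(message)
        if on_violation == "warn":
            warnings.warn(message)
        # on_violation == "ignore" falls through silently
    return unknown


choices = {"DGM", "SY", "AM"}
print(check_choices(["DGM", "XX"], choices, on_violation="ignore"))  # ['XX']
```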

simberaj · 2022-02-02T20:18:15Z

My votes: 1-yes, 2-no, 3-no, 4-yes, 5-warn.

My preferences here are probably influenced by my language, which contains many non-ASCII characters, so I would approach them leniently; I can easily imagine the case where all ballot lines contain only fullname tokens.

Otherwise, while I think the option of having fullname tokens mixed with bare ones on a ballot line could also be useful for write-in candidates, disallowing it will probably help catch many unwanted errors (typos), so I agree with your view; I would stick to warnings in the case of candidate list violation since the parser can still resolve the situation unambiguously.

Ad 2: would this rather be a non-control character?
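A "non-control character" rule is easy to state with Unicode general categories. The following Python sketch uses the standard `unicodedata` module; the function name `is_valid_fullname_body` and the non-empty requirement are assumptions for illustration only.

```python
import unicodedata


def is_valid_fullname_body(text):
    """True if text is non-empty, has no [ or ], and no control characters."""
    if not text:
        return False
    for ch in text:
        if ch in "[]":
            return False
        # General category "Cc" covers control characters such as \n, \t and \x7f.
        if unicodedata.category(ch) == "Cc":
            return False
    return True


print(is_valid_fullname_body("Sue Ye (蘇業)"))  # True
print(is_valid_fullname_body("Sue\tYe"))       # False
```

Unlike a "visible character" rule, this still allows ordinary spaces inside a name.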

robla mentioned this issue Jun 12, 2021

Decide what the letter "I" in "ABIF" stands for, or if "ABIF" is the correct acronym #2

Closed

robla mentioned this issue Jun 16, 2021

Candidate tokens: mechanism for bare token to fullname token mapping? #8

Closed
