String start/end delimiters: square brackets or something else? #5
Comments
I think square brackets are reasonable at this point; however, they might not even be necessary. If a choices list is a required component of a file (and possibly an optional divisions list, if the format is going to support preservation of precinct or other subdivisions), it could look something like this:

    =choices
    DGM: Doña García Márquez
    SY: Sue Ye (蘇業)
    AM: Adam Muñoz

The spec would be that everything from the colon to the newline (\n), with leading and trailing white space stripped, is the description.

    =divisions
    WARD1_DIV1
    WARD1_DIV2

When the description is missing, it defaults to the identifier.

This is more readable, takes less typing, and the lines are easy to parse. Also, since these identifiers will likely become the keys in key-value lookups in code, restricting the character set further, such as to A-Z, 0-9 and limited punctuation, is a good idea and helps prevent humans from making typos (Ward1_Div2 has more opportunity for a character-case entry error than WARD1_DIV2).

To word this for a specification:

- A data block begins with the name of the block preceded by an equal sign on a line.
- A data block ends with a blank line.
- For key-value pairs, the key may contain the characters A-Z and 0-9 plus underscores and possibly some other punctuation characters.
- A key-value pair is the key followed by a colon. The value is everything on the line after the colon, with any leading or trailing white space stripped.

Probably worth a separate issue, but the mention of newline made me think of it: \r and BOM should be discouraged, possibly by requiring a validator to reject them as invalid characters. |
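The data-block rules proposed in this comment can be sketched in Python. This is only an illustration under assumptions, not an agreed ABIF parser: the function name `parse_blocks` and the pattern `KEY_RE` are hypothetical, the key alphabet is taken as A-Z, 0-9 and underscore (lowercase and other punctuation are still under discussion), and lines outside any block are simply skipped.

```python
import re

# Assumed key alphabet: A-Z, 0-9 and underscore, per the proposal in this thread.
# Whether lowercase or further punctuation is allowed is still undecided.
KEY_RE = re.compile(r"[A-Z0-9_]+")


def parse_blocks(text):
    """Parse '=name' data blocks into {block_name: {key: description}}."""
    blocks = {}
    current = None  # dict for the block being read, or None outside a block
    for line in text.split("\n"):
        stripped = line.strip()
        if not stripped:
            current = None  # a blank line ends the current block
            continue
        if stripped.startswith("="):
            current = blocks.setdefault(stripped[1:].strip(), {})
            continue
        if current is None:
            continue  # not inside a data block; ignored by this sketch
        key, sep, value = stripped.partition(":")
        key = key.strip()
        if not KEY_RE.fullmatch(key):
            raise ValueError(f"invalid key: {key!r}")
        # A missing description defaults to the identifier itself.
        current[key] = value.strip() if sep and value.strip() else key
    return blocks


sample = """=choices
DGM: Doña García Márquez
SY: Sue Ye (蘇業)
AM: Adam Muñoz

=divisions
WARD1_DIV1
WARD1_DIV2
"""

blocks = parse_blocks(sample)
```

With the sample input, `blocks["choices"]` maps each short identifier to its full name, and each entry in `blocks["divisions"]` maps to itself because no description was given. A key such as `Ward1_Div2` raises `ValueError` under the uppercase-only assumption; widening `KEY_RE` would be the one-line change if lowercase is allowed.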
Just a quick note to say that lowercase should be allowed.
…On Mon, Jun 14, 2021, 19:38 John Karr ***@***.***> wrote:
I think square brackets are reasonable at this point, however, they might
not even be necessary. If a choices list is a required component of a file
(and possibly an optional divisions list if the format is going to support
preservation of precinct or other subdivisions) it could be for something
like this:
=choices
DGM: Doña García Márquez
SY: Sue Ye (蘇業)
AM: Adam Muñoz
If the spec is everything from the colon to the newline (\n) with leading
and trailing white space stripped is the description.
=divisions
WARD1_DIV1
WARD1_DIV2
when the description is missing it defaults to the identifier.
This is more readable and less typing, and easy to parse the lines.
Also since these identifiers will likely become the Keys in Key Value
lookups in code, restricting the character sets more such as to A-Z 0-9 and
limited punctuation is a good idea and helps prevent humans from making
typos (Ward1_Div2 has more opportunity to have an entry error on character
case than WARD1_DIV2).
To word this for a specification:
A data block begins with the name of the block preceded by an equal sign
on a line.
A data block ends with a blank line.
For Key value pairs the key may contain the characters A-Z and 0-9 plus
underscores and possibly some other punctuation characters.
A Key Value Pair is the key followed by a colon. The value is everything
on the line after the colon, with any leading or trailing white-space
stripped.
Probably worth a separate issue but the mention of newline made me think
of it, \r and BOM should be discouraged, possibly by requiring a validator
to reject them as invalid characters.
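A validator along those lines could be sketched as follows. This is a minimal illustration, not part of any agreed spec: the names `find_discouraged_chars` and `validate` are hypothetical, and treating every CR or BOM as a hard error (rather than a warning) is an assumption.

```python
BOM = "\ufeff"  # U+FEFF, which is what a UTF-8 byte order mark decodes to


def find_discouraged_chars(text):
    """Return (line_number, column, name) for every CR or BOM in text."""
    problems = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\r":
                problems.append((line_no, col, "CR"))
            elif ch == BOM:
                problems.append((line_no, col, "BOM"))
    return problems


def validate(raw_bytes):
    """Decode as UTF-8 and raise ValueError if a CR or BOM is present."""
    # The plain "utf-8" codec keeps a leading BOM as U+FEFF, so it can be detected.
    text = raw_bytes.decode("utf-8")
    problems = find_discouraged_chars(text)
    if problems:
        raise ValueError(f"discouraged characters found: {problems}")
    return text
```

Reporting line and column positions, rather than only failing, would make it easier for a person to repair a file saved by an editor that adds a BOM or Windows line endings.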
|
I think this issue (issue #5) has become the place where we answer the question "what is a valid candidate token?", and I think we're coming to rough consensus on this point. Here's the taxonomy I'm seeing for candidate tokens:
There's a separate issue I wish to file regarding the mapping of bare tokens to fullname tokens. We also still have issue #6, which is where we're discussing the structure of an overall ABIF file and how much line-by-line state is necessary for interpreting ABIF. There are a number of topics that one of us could file as separate issues, but I'll just enumerate them here in a bulleted list, in hopes that others who care about speedy and clear resolution will file them and elaborate or clarify their thoughts:
I'm almost certainly not inclined to file any of the three bullets above as issues in the next 24 hours, so I would be delighted if someone filed one (or all) of these as separate issues. |
I'm leaning towards the same options as @robla. |
Since I started work on writing a parser, I would like to resolve more about fullname tokens.
So I would prefer not to allow fullname tokens in the ballot lines. While I think there would be support for an optional metadata flag for this, the decision will probably be not to enforce the choices list by default and to allow fullname tokens in the ballot lines. I would prefer not to write code for all possible options. It appears resolved at this point that the fullname token delimiter will be [] and that [] will not be allowed inside them. Beyond that:
My votes are 1: no, 2: no, 3: no, 4: yes, 5: fatal or warn. |
My votes: 1-yes, 2-no, 3-no, 4-yes, 5-warn. My preferences here are probably influenced by my language that contains many non-ASCII characters, so I would approach them leniently; I can easily imagine the case where all ballot lines contain only fullname tokens. Otherwise, while I think the option of having fullname tokens mixed with bare ones on a ballot line could also be useful for write-in candidates, disallowing it will probably help catch many unwanted errors (typos), so I agree with your view; I would stick to warnings in the case of candidate list violation since the parser can still resolve the situation unambiguously. Ad 2: would this rather be a non-control character? |
@nealmcb raised the issue of UTF-8 support in ABIF. I fully agree that it should be UTF-8, and I think you'd be hard pressed to find someone who objects. I've been referring to it as ASCII, but ASCII is the subset of UTF-8 that most English-speaking developers know how to deal with. That said, the test cases on the electowiki ABIF page already use several names that imply UTF-8 support is needed:
Thank you, @nealmcb, for updating the ABIF electowiki page (I'm assuming you're the same "nealmcb" in both places). We should assume that all ABIF documents will have UTF-8 characters outside of lower ASCII (i.e. it may contain characters above U+007E, which will be interpreted according to the UTF-8 spec).

One important note: I also suggest that bare words be much more limited (e.g. the ASCII characters for [A-Z] and [a-z]), and that we have a mapping mechanism from full strings to bare tokens, such as an optional header like this:

...which would allow for mapping full UTF-8 strings to bare tokens. For example, the header above would map the string "Sue Ye (蘇業)" to the characters "SY". Pretty much any UTF-8 string should be allowed between square brackets ("[" and "]"), except for square brackets themselves. It seems wise to leave the presence of an opening square bracket after the first opening square bracket unspecified, and wait until we have some implementations that need a specification in order to interoperate.

It also seems wise to limit everything outside of square brackets to 7-bit ASCII characters, much like many popular programming languages and data formats used internationally, and only allow full 8-bit UTF-8 characters in quoted strings.
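The "ASCII outside brackets" rule can be checked mechanically. The sketch below is one possible reading, with hypothetical names (`BRACKETED_RE`, `outside_brackets_is_ascii`) and made-up input lines that are not meant to be valid ABIF. It assumes a bracketed string never contains square brackets, and it treats an unclosed bracket as ordinary text, since that case is left unspecified.

```python
import re

# A bracketed string: "[", then anything except square brackets, then "]".
BRACKETED_RE = re.compile(r"\[[^\[\]]*\]")


def outside_brackets_is_ascii(line):
    """True if every character outside [...] spans is 7-bit ASCII."""
    remainder = BRACKETED_RE.sub("", line)  # drop the bracketed spans
    return remainder.isascii()
```

For example, `outside_brackets_is_ascii("SY [Sue Ye (蘇業)]")` holds because the non-ASCII characters sit inside the brackets, while the same characters placed outside the brackets would fail the check.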
An alternative to square brackets could be to use quotation marks (much like JSON, YAML and others) and then use the backslash escaping mechanism. Given the balancing issues that I've seen over the years with quotation marks, as well as the weird ambiguity and varying use of single quote (') and double quote ("), and given that many candidate names seem more likely to include quoted nicknames in English-speaking countries (e.g. [Richard "Dick" Nixon]), I think square brackets make a better quoting mechanism. But I also think now is the time to make the case for an alternative.

Thoughts?
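To illustrate why square brackets avoid escaping for names with quoted nicknames, here is a small token classifier. It is a sketch under assumptions: the names `FULLNAME_RE`, `BARE_RE` and `classify_token` are hypothetical, bare tokens are taken to be ASCII letters only as suggested in this post (others in the thread propose digits and underscores as well), and a fullname token is anything between one "[" and one "]" with no square brackets inside.

```python
import re

FULLNAME_RE = re.compile(r"\[([^\[\]]+)\]")  # [any text without square brackets]
BARE_RE = re.compile(r"[A-Za-z]+")           # assumed bare-token alphabet


def classify_token(token):
    """Return ('fullname', name), ('bare', token) or ('invalid', token)."""
    match = FULLNAME_RE.fullmatch(token)
    if match:
        return ("fullname", match.group(1))
    if BARE_RE.fullmatch(token):
        return ("bare", token)
    return ("invalid", token)
```

A name such as `[Richard "Dick" Nixon]` is accepted as a fullname token with its quotation marks intact and no backslash escaping, whereas a quote-delimited syntax would need the inner quotes escaped.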