electorama / abif

The _Aggregated Ballot Information Format_ provides a concise, aggregated, text-based document to describe the ballots cast in range-based or ranked elections, as well as approval-based and choose-one balloting systems.

License: Other

Python 100.00%

abif electorama election-methods voting-methods

abif's Introduction

ABIF - Aggregated Ballot Information Format

The Aggregated Ballot Information Format (or "ABIF") provides a concise, aggregated, text-based document to describe the ballots cast in range-based or ranked elections, as well as approval-based and choose-one balloting systems. See the following resources to learn more:

electowiki.org/wiki/ABIF - The primary wiki page to learn about this project.
electorama.com/abiftool - A Python-based tool that implements the ABIF format specified in this repository.

abif's Issues

How to handle bare candidate tokens without a mapping when other bare tokens have a mapping?

Consider this ABIF file

[Joe Biden]: JB
[Donald Trump]: DT

125: JB>DT
113: DT>JB
110: JB>DT>JS

What should the parser do when encountering JS?
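One way to frame the question is to first detect the situation. The following is a minimal sketch, assuming the `[Full Name]: TOKEN` mapping syntax and `qty: A>B` ballot lines from the example above; the function name `undeclared_tokens` is hypothetical, and the policy decision (reject, warn, or implicitly declare) is deliberately left open.

```python
import re

def undeclared_tokens(abif_text):
    """Return bare tokens used on ballot lines but never declared in a mapping line."""
    declared, used = set(), []
    for line in abif_text.splitlines():
        line = line.strip()
        # Mapping line, e.g. "[Joe Biden]: JB"
        mapping = re.match(r"\[[^\]]*\]\s*:\s*(\w+)$", line)
        if mapping:
            declared.add(mapping.group(1))
            continue
        # Ballot line, e.g. "110: JB>DT>JS"
        ballot = re.match(r"(\d+)\s*:\s*(.+)$", line)
        if ballot:
            used.extend(re.split(r"[>=,]", ballot.group(2)))
    missing = []
    for tok in (t.strip() for t in used):
        if tok and tok not in declared and tok not in missing:
            missing.append(tok)
    return missing

example = """[Joe Biden]: JB
[Donald Trump]: DT

125: JB>DT
113: DT>JB
110: JB>DT>JS
"""
print(undeclared_tokens(example))  # prints ['JS']
```

A parser could call a check like this after reading the file and then apply whatever rule the specification eventually settles on.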

FourCCs, UTF-8, and ABIF files

The BLUF: I would like to impose the following requirements on ABIF files.

All ABIF files are valid UTF-8 byte streams
It should be possible to identify the line type of each line of an ABIF file by inspecting the first four bytes of the line (the "One2FourBC", as I'd like to call it)
Each ABIF line type will have a structure that MUST be defined by a self-contained BNF
All ABIF line types can be identified by the first byte (or maybe two, or MAYBE even four, but absolutely no more than four). We should strive to keep it that way.
Identifier tokens (e.g. "bare candidate tokens" in the current nomenclature) should be first-class citizens inside an ABIF file.

Sorry for all of the acronyms and jargon, but I'll try to explain the rationale for all five of these below.

The details:

It seems to me that ABIF is taking on a structure that has general applicability to text-based file formats. It seems unwise to try inventing a new generalized text-based data structure format, since there are already so many (RFC 822, XML, JSON, YAML, TOML, etc). However, it also seems to me that fear of "reinventing the wheel" (or rather, for the metaphor I'm making: "reinventing the hammer" as a tool) has led to misapplication of existing tools when a new tool may be more appropriate.

It may be that a text-based data format similar to (or the same as) what I'm suggesting here has been more adequately defined elsewhere. I am not hoping that what I'm defining here is new and unique, but as of this writing (in June 2021), I am not aware of a generalized text-based data structure format similar to this.

ABIF is encoded using UTF-8, so let me interject with a few essential bits of Unicode important to this conversation. "UTF-8" is a character-encoding format that encodes each character of a text document in a sequence of up to four bytes, but typically (for English) only one byte per character. This has become less true for English as UTF-8 has slowly replaced ASCII as the baseline character encoding format. The transition has been slow because UTF-8 is a strictly compatible superset of lower ASCII. By "lower ASCII", I'm referring to the Unicode characters "U+0000" through "U+007F", which are codepoints in the "Basic Latin Unicode Block". UTF-8's Basic Latin encoding is compatible with ASCII's encoding in that range of characters. The transition from ASCII-only to full UTF-8 has accelerated in recent years in no small part because "plain text" editors have gained support for characters beyond Basic Latin (like “fancy” quotation marks and emojis like “🐕” and “🧔”).

Prior to the broad acceptance of UTF-8 as a method for encoding text, and prior to the broad acceptance of XML and JSON (and other text-based formats for data structures), it was common to use the "FourCC" byte sequence as a technique for defining the structure of the data following the sequence. "FourCC" stands for "four character code", but it is really a "four byte code" rather than four characters, and FourCCs were typically restricted to ASCII bytes. I believe that FourCCs are still in common use today in binary formats (e.g. .mp4 files and .webm files), but it's been a while since I've looked at binary file formats very closely. Regardless, four bytes is the same as 32 bits, which is capable of expressing 4,294,967,296 values. I do not anticipate needing more than a dozen line types with ABIF, but if anyone else wishes to create a text format using these ideas, it's something to keep in mind.

I've never been much of a C programmer or assembly-language programmer, but I believe I'd be able to write an efficient byte-level tokenizer for ABIF files where each line conformed to the following quasi-BNF (where "BNF" is "Backus-Naur Form"):

expression	meaning
`<line>`	the full line, terminated by a line feed character ("`<LF>`") (or optionally "`<CR><LF>`"). I believe the BNF production looks something like this: "`<One2FourBC> <LSD> (<CR>)? <LF>`"
`<One2FourBC>`	Short for "one to four byte code" (similar to "FourCC"s in other formats). This code comprises between one and four bytes, which identify the line type. We may decide to abbreviate this "`<124BC>`", but let's not make that change yet. The "`<One2FourBC>`" code may contain line-specific data to prepend to the following "`<LSD>`"
`<LSD>`	Whoa, man! That's trippy! FAR OUT!!!!1!!1! 🤪 Okay, just kidding. This refers to "line-specific data". The structure of the line-specific data depends on the contents of the `<One2FourBC>`.
`<CR>`	"`U+000D`" -- The "carriage return" character in the Basic Latin Unicode Block.
`<LF>`	"`U+000A`" - The "line feed" character from the Basic Latin Unicode Block. The minimum number of bytes for newline in a modern text file.

For each "One2FourBC" we define, we are going to need to create a BNF specification for that line. Creating a BNF is not that hard, and in fact, we should be able to test our BNFs using BNF parsers like the Python-based SimpleParse. But we also shouldn't relish the idea of creating a lot of line formats, because we need to keep ABIF simple enough to be readable by non-developers (as well as developers who don't want to implement overly-complicated text-formats).
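To make the idea concrete, here is a rough sketch of first-byte line classification over raw bytes. The table of first bytes is an assumption pieced together from characters proposed in these issues (`#`, `=`, `[`, `{`, and leading digits); none of it is a finalized part of ABIF, and the names `LINE_TYPES` and `classify_line` are hypothetical.

```python
# Hypothetical first-byte table; these characters come from proposals in the
# issues, not from a finalized ABIF specification.
LINE_TYPES = {
    ord("#"): "comment",
    ord("="): "candidate_declaration",
    ord("["): "candidate_mapping",
    ord("{"): "json_metadata",
}

def classify_line(raw: bytes) -> str:
    """Classify one line by inspecting only its first byte."""
    raw.decode("utf-8")  # raises UnicodeDecodeError if the line is not valid UTF-8
    stripped = raw.rstrip(b"\r\n")  # accept <LF> or <CR><LF> terminators
    if not stripped:
        return "blank"
    first = stripped[0]  # indexing bytes yields an int in Python 3
    if ord("0") <= first <= ord("9"):
        return "ballot_bundle"
    return LINE_TYPES.get(first, "unknown")

print(classify_line(b"27: DGM>SBJ\n"))
print(classify_line("=SY:[Sue Ye (蘇業)]\r\n".encode("utf-8")))
```

Because every line type here is decided by a single ASCII byte, the classifier never needs to decode multi-byte characters to route a line; the `decode` call is only there to enforce the "valid UTF-8" requirement.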

The way that I see ABIF evolving is that we will have different tiers of data that people will want to pull out of the file:

Structural data - this is structural information that all ABIF implementations will need to deal with (e.g. identifiers, delimiters, newlines). It should be possible to know exactly how many ballots are in an ABIF file using an implementation that only has implemented support for the structural data. Identifiers should count as "structural data" (more on this in a bit).
Broadly-applicable application data - this is data that almost all ABIF implementations will be interested in. For example, it should be possible for applications to generate a list of UTF-8-formatted candidate names and the quantity of ballots that the candidate appears on by having implemented support for reading (and understanding) broadly-applicable application data.
Domain-specific essential application data - this is data that a significant number of implementors require (e.g. half of all ABIF implementors), and may also be of little interest to implementors outside the domain of use for the ABIF file. For example, a "STAR voting" implementation may only be interested in the ratings given to each candidate, and may not care about the order that the candidates are presented in each ballot bundle. An "Instant-Runoff Voting (IRV)" implementation may only care about the ordering of the candidates in the ballot bundles, but cannot include a rating (because the voter may not have provided one). For interoperability and readability purposes, we may want to strongly encourage implementors to use the ">" and "=" delimiters between candidates in ballot bundles (rather than ",") and strongly encourage implementors to list candidates in order of most preferred to least preferred within each ballot bundle. Regardless, for ABIF to be successful, we will need to determine which domains are the most important ones to serve for ABIFv1.0.
Niche-implementation application data - this is application data that is essential to a small number of implementations, but is not important to the vast majority of applications that support the format. For example, a voting machine may support images associated with the candidates, and may wish to use ABIF as the format to display the candidate list for voters. We could add a "CandidateIMG" field and associate it with the candidate identifier.

Anyway, that's a lot to consider, but I still have one other thing to discuss. One thing that I've come to realize about many popular text-based serialization formats: the identifiers didn't start out as first-class citizens. Within XML and JSON, it seems that identifiers were bolted on at the end of the specification process. The mechanism that I proposed for candidate identifiers in ABIF issue #8 seems like a general-purpose mechanism for all ABIF-like formats. Here's an example of the markup:

=DGM:[Doña García Márquez] # see marquez2024.com for candidate website
=SBJ:[Steven B. Jensen]    # dropped out of race three days prior to election
=SY:[Sue Ye (蘇業)]         #  see sueye.org/2024 for more
=AM:[Adam Muñoz]           #  see munozftw.org for more

It seems to me that the format should treat a line of this format as an "identifier" for all sorts of purposes. I don't know of anything other than politicians that would need identifiers in ABIF, but it seems to me that the BNF production for identifiers should be similar to (or perhaps the same as) that of XML identifiers (like the "Name" production out of the original XML specification from 1998):

[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[5] Name ::= (Letter | '_' | ':') (NameChar)*

I think all bare, unquoted identifiers in ABIF should start with an ASCII letter (or maybe an underscore, but probably not a colon). What happens after the first character can be more flexible, but probably not as flexible as XML.
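As an illustration only, the rule "start with an ASCII letter or underscore, then something less flexible than XML" could be expressed as a regular expression. The set of characters allowed after the first one (ASCII letters, digits, `_`, `.`, `-`) is an assumption made for this sketch, not something decided in the issue.

```python
import re

# Assumed rule: first character is an ASCII letter or underscore (no colon);
# later characters are limited to ASCII letters, digits, '_', '.', and '-'.
BARE_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*\Z")

def is_bare_identifier(token: str) -> bool:
    """Return True if token is acceptable as a bare, unquoted identifier."""
    return BARE_ID.match(token) is not None

for candidate in ["DGM", "_write_in", "1st", ":DGM", "蘇業"]:
    print(candidate, is_bare_identifier(candidate))
```

Under this sketch, names such as "蘇業" would need to be quoted (for example in square brackets) rather than used as bare tokens.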

Anyway, that's a lot of words to get to my "BLUF" above. Restating the bullet points I led with:

I believe that ensuring that all ABIF files are valid UTF-8 byte streams is more difficult than it appears, but we need to do it so that we can play cards with ABIF files (🂡🂵🃘🃙🃟...😝). More importantly, so that Doña García Márquez can run for office using her real name.
The FourCC at the beginning of each line (a.k.a. the "<One2FourBC>") should be enough to identify which BNF production is being processed. Speaking of BNFs...
Each ABIF line type will have a structure that MUST be defined by a self-contained BNF. This will make it so that each line can be parsed independently (e.g. so that an NDJSON line can be passed to a JSON parser)
As identified in the second bullet, all ABIF line types need to be identified by the first byte (or maybe two, or MAYBE even four, but absolutely no more than four). This makes it easier for implementations that are handling ABIF streams byte-by-byte, such as implementations written in C for performance reasons.
Identifier tokens should be first-class citizens inside an ABIF file or any text-based data format. They are part of the "structural data" of the file.

Are these five requirements good requirements for ABIF? Please let me know!

Describe conversion options, implications with other formats like BFF, PrefLib, BLT, NIST CDF

Often when specifying and developing data formats and related tools, it is helpful to consider how conversions between formats work.

Ideally we would have at least proof-of-concept tools to convert between

ABIF (text and jabmod (json abif model))
PrefLib Election data format
BFF
STV.pm from Lobitos
BLT
NIST Cast Vote Records Common Data Format Specification, SP 1500-103
OpenSTV/OpaVote formats
CIVS/Civitas formats
Excel/CSV format used by RCVRC tabulator (along with NIST CDF)
ES&S iVotronic EL155
Dominion JSON CVR format
Dominion converted to .xlsx oval-by-oval mark data in Colorado (i.e. show a 1 (mark) or 0 for each possible rank for each possible choice on a ballot)
debtally, e.g. from 2022 vote
widj - JSON representation used by Electowidget
Trueballot format
others - what's out there?....

(Updated 2023 to fix broken link and add jabmod, debtally, widj and Dominion mark data)

That would inform documentation that could note which conversions sometimes or always preserve all information, which are lossy, etc.

Ballot grouping data?

As John Karr pointed out in the EM-list, it is yet to be decided whether ABIF will support any notation for ballot grouping, to delimit precincts, constituencies, etc., and if so, what the format should be. He proposed the following notation:

!division: BRONX_PRECINCT_41 # there's been no discussion of this yet, I just picked ! for this example.
... lines from BRONX_PRECINCT_41
!division: QUEENS_PRECINCT_6
... lines from QUEENS_PRECINCT_6

I don't feel I have enough clarity about the benefits of in-file ballot grouping vs. using many ABIF files (that might need to be evaluated together) to have a clear opinion on this, but if ABIF is to support in-file ballot grouping, I'm in favor of using a special line start character to delimit it, per @robla's specs.
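If the `!division:` notation from the example were adopted, grouping could be handled in a single pass. This is a sketch under that assumption; the `!` prefix is explicitly undecided above, and the function name and the `"(none)"` default label are hypothetical.

```python
def group_by_division(lines, default="(none)"):
    """Collect ballot lines under the most recent '!division:' marker."""
    groups = {}
    current = default
    for line in lines:
        # Naive comment stripping: this would break on a '#' inside [brackets].
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("!division:"):
            current = line[len("!division:"):].strip()
            groups.setdefault(current, [])
        else:
            groups.setdefault(current, []).append(line)
    return groups

sample = [
    "!division: BRONX_PRECINCT_41 # example marker",
    "12: A>B",
    "!division: QUEENS_PRECINCT_6",
    "7: B>A",
]
print(group_by_division(sample))
```

Lines that appear before any marker land in the default group, which is one possible way to keep files without grouping valid.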

Meta Data Dictionary for ABIF

Part of the discussion in #6 has been about what might be included in metadata. The consensus there looks like metadata will be an optional but probably important part of the format. One issue raised is the NIST CVR vote record format. The purpose of the metadata dictionary would be to establish the key names and data types for the metadata, and also to cover mapping values from CVR and possibly other formats. The discussion in #6 is quite lengthy and the dictionary will require considerable discussion on its own.

Should ABIF be text or serialized?

Robla's cases proposed on electowiki are in a text format.

While this style of format may be more comfortable for human reading and hand editing, a serialized (JSON/YAML) format is easier for programmers to implement, because these formats import directly to data structures using tools available in every programming language. YAML and pretty JSON can be as readable as the text format.

The downside is that different structures are suited to different ballot types, which then makes the specification larger. Standard RCV is easily represented with an array, while range needs key-value pairs. RCV can also be done with key-value pairs by inverting (best is 1, etc.), and this would support equal ranking in RCV as well.

Another option is to create a text and a serial version of the format within the spec. The files could be differentiated either by file extension or by testing the first line: YAML files should begin with ---, JSON would begin with {, and the text format can be specified to begin with ABIF as the first line. Parsers would be required to check the first line.

The dual spec would allow users to decide which they prefer. It would also allow programmers on all platforms to take advantage of external format converters to bring the data from the text format to a serial format that they can load without writing an importer.
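The first-line test could look roughly like the following. It is only a sketch: it assumes the text format really does start with a literal `ABIF` line (which is a suggestion here, not a settled rule), and the function name and return labels are hypothetical.

```python
def sniff_format(text: str) -> str:
    """Guess the serialization from the first non-blank line."""
    for line in text.splitlines():
        first = line.strip()
        if first:
            break
    else:
        return "unknown"  # empty input or only blank lines
    if first.startswith("---"):
        return "yaml"
    if first.startswith("{"):
        return "json"
    if first.startswith("ABIF"):
        return "abif-text"
    return "unknown"

print(sniff_format("---\nballots: []\n"))
print(sniff_format('{"ballots": []}'))
print(sniff_format("ABIF\n27: DGM>SBJ\n"))
```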

Leading Spaces

I don't think it was ever clearly decided in issue #6 whether an extra space at the beginning of a line would be allowed or not.

Options:

1. Ignored [Parsers strip leading white space from lines.]
2. Skipped [Parsers treat a leading white space as an empty line, which effectively makes a leading space equivalent to #.]

I'm fine with either choice; Robla favored 1 in #6. If 2 is chosen, a leading space should be made an official comment marker.
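The difference between the two options is small in code. Here is a sketch with a hypothetical `leading_space` switch, where `"ignore"` corresponds to option 1 and `"skip"` to option 2:

```python
def read_lines(text: str, leading_space: str = "ignore"):
    """Return non-empty lines, handling leading whitespace per the chosen option."""
    out = []
    for line in text.splitlines():
        if line[:1] in (" ", "\t"):
            if leading_space == "skip":
                continue  # option 2: the whole line is treated like a comment
            line = line.lstrip(" \t")  # option 1: strip and keep the content
        if line:
            out.append(line)
    return out

text = " 5: A>B\n3: B>A"
print(read_lines(text))                        # prints ['5: A>B', '3: B>A']
print(read_lines(text, leading_space="skip"))  # prints ['3: B>A']
```

Note that the two options give different ballot totals for the same file, which is an argument for deciding this explicitly.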

Define a core data model for ABIF

Many text formats aspire to simplicity, with the belief that data models are an "implementation detail". My inclination is to err in that direction, because I fear that trying to start discussion by agreeing on a serialized data model leads to this series of unfortunate reasoning:

Let's agree on a data model before we agree on syntax
Great, we have a data model, how do we serialize it?
Why invent another data serialization format; why don't we use something like JSON or XML?
Result: a large, complicated data hierarchy that is difficult/impossible to author with a text editor, and difficult to spot errors with human inspection.

Having seen the development of many "Document Object Models (DOMs)" over the years (including working closely with the folks defining a document object model for MediaWiki markup), I've been hesitant to tackle such a complicated issue so early in the development of a new format that seems so clear in my mind. However, I've come to realize that my ideas about the data that is "important" (or "interesting" to me) and the data that is "unimportant" (or "uninteresting" to me) may be very important to others, and I want to build consensus around my idea of what ABIF can be. After mulling over the discussions in several issues here (particularly issues #6 and #14 regarding the metadata format), it occurs to me that a core data model may be helpful.

Here's my take on a core data structure that ABIF files should resolve to, expressed as a partial JSON file (NOTE: this comment is subject to revision):

{
    "metadata":
    [
        {
            <key-1>: <value-1>,
            <key-2>: <value-2>,
            <key-3>: <value-3>,
            ...
            <key-n>: <value-n>
        }
    ],
    "candidates":
    [
        {
            <candidate-id-1>: <candidate-information-1>,
            <candidate-id-2>: <candidate-information-2>,
            <candidate-id-3>: <candidate-information-3>,
            ...
            <candidate-id-n>: <candidate-information-n>
        }
    ],
    "ballot_bundles":
    [
        {
            <ballot-bundle-id-1>: <ballot-bundle-1>,
            <ballot-bundle-id-2>: <ballot-bundle-2>,
            <ballot-bundle-id-3>: <ballot-bundle-3>,
            ...
            <ballot-bundle-id-n>: <ballot-bundle-n>
        }
    ]
}

Expressing this as JSON is tricky, because JSON dictionaries are unordered key-value pairs, and there's not a great way to stipulate "order matters!". Moreover, I would like to make sure it's possible to build the data structure above using a single-pass parser. That's going to have all sorts of really tricky implications. I think we can pull it off if we keep a shared data model in mind, but we're going to have to do things that make people who love beautiful context-free grammars (CFGs) cringe.
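For what it's worth, here is a single-pass sketch that builds a structure with the same three top-level keys. It simplifies the model above (plain dictionaries instead of lists wrapping one dictionary, and ballot bundles as an ordered list), assumes the `=TOKEN:[Full Name]` declaration syntax from issue #8, and leaves metadata empty because no metadata syntax is settled. In Python, `dict` preserves insertion order (guaranteed since 3.7), so declaration order survives in memory even though JSON itself does not promise it.

```python
import re

def parse_abif(text: str) -> dict:
    """Build a core data model in one pass over the lines (illustrative only)."""
    model = {"metadata": {}, "candidates": {}, "ballot_bundles": []}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no model data here
        decl = re.match(r"=(\w+):\[([^\]]*)\]", line)
        if decl:
            model["candidates"][decl.group(1)] = decl.group(2)
            continue
        bundle = re.match(r"(\d+):(.*)$", line)
        if bundle:
            model["ballot_bundles"].append(
                {"qty": int(bundle.group(1)), "prefs": bundle.group(2).strip()}
            )
    return model

sample = """# comment
=DGM:[Doña García Márquez]
=SY:[Sue Ye (蘇業)]
27: DGM>SY
26: SY>DGM
"""
model = parse_abif(sample)
print(list(model["candidates"]))  # prints ['DGM', 'SY']
```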

Decide what the letter "I" in "ABIF" stands for, or if "ABIF" is the correct acronym

In June 2021, Neal McBurnett suggested that "image" shouldn't be part of the name (see "[EM] Re: Ballot Data Format"). Since many places use "ballot image" to refer to the ASCII representation of ballots, I'm not sure I want to try to change the standard naming by the election machine industry, but I agree that when I think of "image" formats, I usually think of JPEG, GIF, PNG, and other image formats that are common on the web. Moreover, I like the name "ABIF", and don't want to change it at this point.

So the question: which of the following should the format be named?
a. Aggregated Ballot Image Format
b. Aggregated Ballot Information Format
c. Aggregated Ballot Inventory Format
d. Something else that still spells "ABIF" as an acronym
e. Something else that doesn't spell "ABIF" as an acronym

Quantity delimiter: Asterisk or colon?

We need to pick a preferred delimiter between quantities and ballot ordering/rating in each line of an ABIF file. Jan Šimbera suggested on the EM list that we should consider allowing asterisk ("*") as an optional replacement for colon (":") for Pivot compatibility. See [EM] Ballot Data Format for Jan's original message.

In my mind, we may possibly also include an optional delimiter. My current preference for what a compliant implementation needs to support will be expressed in IETF terms:

...	reader/parser	writer
colon	MUST	SHOULD
asterisk	MAY	SHOULD CONSIDER

My concerns with using asterisk:

I really want to be able to strip whitespace from this format for everything outside of square brackets (i.e. "[Jan Šimbera]" should be okay, but outside of square brackets, Jan as a candidate should not have spaces, so "JanŠimbera" might be an acceptable candidate token, and "J" almost certainly will be). For colon, it's easy to cram a line together ("27:DGM/5,SBJ/2,SY/1,AM/0") but for asterisk, it gets difficult to see what is happening ("27*DGM/5,SBJ/2,SY/1,AM/0").
It's difficult (for me) not to automatically try to apply the order of operations when I see asterisk in software. An asterisk needs to be surrounded by spaces to make it clear that it's not intended to be used as a footnote (like the dagger † and double-dagger ‡ frequently are as well). Many fonts display asterisk as superscripted, which makes it a difficult-to-read replacement for the multiplication symbol ("×") for non-programmers.
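The reader/parser column of the table could be sketched as follows: colon is always accepted, and asterisk is accepted only when a (hypothetical) `allow_asterisk` flag is set. The regular expression and function name are assumptions for illustration.

```python
import re

# quantity, optional whitespace, ':' or '*', then the rest of the line
BUNDLE = re.compile(r"\s*(\d+)\s*([:*])\s*(.+?)\s*\Z")

def split_quantity(line: str, allow_asterisk: bool = True):
    """Split a ballot line into (quantity, preferences)."""
    m = BUNDLE.match(line)
    if m is None:
        raise ValueError(f"not a ballot line: {line!r}")
    qty, delim, rest = m.groups()
    if delim == "*" and not allow_asterisk:
        raise ValueError("asterisk delimiter is not enabled")
    return int(qty), rest

print(split_quantity("27:DGM/5,SBJ/2,SY/1,AM/0"))  # prints (27, 'DGM/5,SBJ/2,SY/1,AM/0')
print(split_quantity("27 * DGM/5,SBJ/2,SY/1,AM/0"))  # prints (27, 'DGM/5,SBJ/2,SY/1,AM/0')
```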

Case Sensitivity

The issue has been mentioned in a few other places.

What will the ABIF rules on case sensitivity be?

My preference is that Full Name Tokens be case sensitive. For everything else I would prefer either everything be case sensitive or everything be case insensitive.

I think conventions are good (for example, bare tokens should be uppercase and metadata keys should use JavaScript-style camelCase), but style linting should be optional.

String start/end delimiters: square brackets or something else?

@nealmcb raised the issue of UTF-8 support in ABIF. I fully agree that it should be UTF-8, and I think you'd be hard pressed to find someone who objects. I've been referring to it as ASCII, but ASCII is the subset of UTF-8 that most English-speaking developers know how to deal with. That said, the test cases on the electowiki ABIF page already use several names that imply UTF-8 support is needed:

Doña García Márquez
Sue Ye (蘇業)
Adam Muñoz

Thank you, @nealmcb for updating the ABIF electowiki page (I'm assuming you're the same "nealmcb" in both places). We should assume that all ABIF documents will have UTF-8 characters outside of lower ASCII (i.e. it may contain characters above U+007F, which will be interpreted according to the UTF-8 spec).

One important note: I also suggest that bare words be much more limited (e.g. the ASCII characters for [A-Z] and [a-z]), and that we have a mapping mechanism from full strings to bare tokens, such as an optional header like this:

[Doña García Márquez]: DGM
[Sue Ye (蘇業)]: SY
[Adam Muñoz]: AM

...which would allow for mapping full UTF-8 strings to bare tokens. For example, the header above would map the string "Sue Ye (蘇業)" to the characters "SY". Pretty much any UTF-8 string should be allowed between square brackets ("[" and "]"), except for square brackets themselves. It seems wise to leave the presence of an opening square bracket after the first opening square bracket unspecified, and wait until we have some implementations that need a specification in order to interoperate.

It also seems wise to limit everything outside of square brackets to 7-bit ASCII characters, much like many popular programming languages and data formats used internationally, and only allow non-ASCII (multi-byte) UTF-8 characters in quoted strings.
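A small sketch of both ideas: reading the optional mapping header, and checking that everything outside square brackets is plain ASCII. The regular expression restricts bare tokens to `[A-Za-z]` as suggested above; the function names are hypothetical, and nested or repeated opening brackets are simply rejected here because their meaning is left unspecified.

```python
import re

# "[any text without brackets]: TOKEN", where TOKEN is ASCII letters only
MAPPING = re.compile(r"\[([^\[\]]+)\]\s*:\s*([A-Za-z]+)\s*\Z")

def parse_mapping_header(lines):
    """Return {bare_token: full_name} for every line that looks like a mapping."""
    mapping = {}
    for line in lines:
        m = MAPPING.match(line.strip())
        if m:
            mapping[m.group(2)] = m.group(1)
    return mapping

def outside_brackets_is_ascii(line: str) -> bool:
    """True if the line is pure ASCII once bracketed strings are removed."""
    return re.sub(r"\[[^\[\]]*\]", "", line).isascii()

header = ["[Doña García Márquez]: DGM", "[Sue Ye (蘇業)]: SY", "[Adam Muñoz]: AM"]
print(parse_mapping_header(header))
print(outside_brackets_is_ascii("[Sue Ye (蘇業)]: SY"))  # prints True
print(outside_brackets_is_ascii("蘇業: SY"))             # prints False
```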

An alternative to square brackets could be to use quotation marks (much like JSON, YAML and others) and then use the backslash escaping mechanism. Given the balancing issues that I've seen over the years with quotation marks, as well as the weird ambiguity and varying use of single quote (') and double quote ("), and given that many candidate names seem more likely to include quoted nicknames in English-speaking countries (e.g. [Richard "Dick" Nixon]), I think square brackets make a better quoting mechanism. But I also think now is the time to make the case for an alternative.

Thoughts?

ABIF implementations need to handle skipped rankings

I'm glad ABIF can handle equal rankings.
But what about skipped rankings?
When voters mark ballots, e.g. on a Dominion ballot, they are presented with a grid

   first    second    third  fourth
c1  _       _         _      _
c2  _       _         _      _
c3  _       _         _      _
c4  _       _         _      _

Different jurisdictions invalidate choices with different rules. E.g. in Alaska, two skipped rankings means further rankings are ignored. In Colorado, that happens with a single skipped ranking.
I think we should support exact representations of such ballots so people can write software to process them according to the different rules.

I guess all that requires for ABIF is making it clear that the ">" sign can appear any number of times at the beginning of a list of preferences, or between preferences.

e.g. 42:>Memphis>>Nashville=Chattanooga for 42 voters who skipped the first preference and the third, and ranked Nashville and Chattanooga both in fourth place.

Is that already handled?
Can it be?
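To show that the proposed reading is at least mechanically simple, here is a sketch in which every `>` advances the rank by one, so an empty slot consumes a rank position. The function name `parse_ranks` is hypothetical, and this is not a claim about how any existing ABIF tool behaves.

```python
def parse_ranks(prefs: str) -> dict:
    """Map each candidate to its rank position; empty slots still consume a rank."""
    ranks = {}
    for position, slot in enumerate(prefs.split(">"), start=1):
        for name in slot.split("="):
            name = name.strip()
            if name:
                ranks[name] = position
    return ranks

print(parse_ranks(">Memphis>>Nashville=Chattanooga"))
# prints {'Memphis': 2, 'Nashville': 4, 'Chattanooga': 4}
```

Software applying the Alaska or Colorado rules could then look for gaps in the rank positions and truncate the ballot accordingly.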

Candidate tokens: mechanism for bare token to fullname token mapping?

As of this writing (in June 2021) the testcases all propose the following way to map between fullname tokens and bare tokens (proposal "a"):

[Doña García Márquez]: DGM
[Steven B. Jensen]:    SBJ
[Sue Ye (蘇業)]:        SY
[Adam Muñoz]:          AM

I chose that ordering because it makes the square bracket the first character of the line, which makes it possible to determine the line type from the first character. However, over in issue #5, @brainbuz suggested reversing the order, and including an explicit section header at the top (proposal "b"):

=choices
DGM: Doña García Márquez
SY: Sue Ye (蘇業)
AM: Adam Muñoz

I believe that we should make it possible to infer the section that a line is in from the first character of the line, and have a convention (rather than a requirement) of using comments to delimit sections for readability for now. We may want to make sections more explicit in the near future, but my hunch is that having line-based section identification will force us to make the line formats we design more robust (and human readable) and will also encourage more robust implementations without too much burden. We should discuss the generalities of my hunch over in issue #6.

On the subject of mapping bare tokens to fullname tokens, I would like to propose a hybrid of the two proposals above (proposal "c"):

=DGM:[Doña García Márquez]
=SBJ:[Steven B. Jensen]
=SY:[Sue Ye (蘇業)]
=AM:[Adam Muñoz]

My new proposal "c" has a distinct character ("=") at the beginning of each line, and implies a sort of prefix notation for the "=" operator (that is, using the word "operator" really loosely). It still seems best to require that arbitrary strings are enclosed in square brackets, so that we have the option to add things to the line without too much hassle, and so that it's possible to put comments on the end of each line:

=DGM:[Doña García Márquez] # see marquez2024.com for candidate website
=SBJ:[Steven B. Jensen]    # dropped out of race three days prior to election
=SY:[Sue Ye (蘇業)]         #  see sueye.org/2024 for more
=AM:[Adam Muñoz]           #  see munozftw.org for more

My question: what should our mechanism be for mapping bare candidate tokens to fullname candidate tokens?

Proposal "a": the current mechanism implied by the testcases (e.g. test case 5)
Proposal "b": an explicit "=candidates" section which allows for bare token followed by bare freeform string on each line
Proposal "c": equal-prefix ("="), followed by bare token, then colon (":"), then fullname token
Option "d": remove bare token to fullname token mapping from ABIF to keep it simple
Option "e": something else

I'm leaning toward my new proposal "c", but I'm open to new suggestions and/or defense of the suggestions outlined above.
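For comparison purposes, a proposal "c" line with an optional trailing comment can be matched with one regular expression. This is a sketch; the pattern and the name `parse_declaration` are assumptions, and a `]` inside the full name is not handled.

```python
import re

# "=" TOKEN ":" "[" full name "]" with an optional "# comment" at the end
DECL = re.compile(r"=(\w+):\[([^\]]*)\]\s*(?:#\s*(.*))?\Z")

def parse_declaration(line: str):
    """Return (token, fullname, comment) or None if the line is not a declaration."""
    m = DECL.match(line.strip())
    if m is None:
        return None
    return m.groups()

print(parse_declaration(
    "=SBJ:[Steven B. Jensen]    # dropped out of race three days prior to election"
))
print(parse_declaration("=AM:[Adam Muñoz]"))  # prints ('AM', 'Adam Muñoz', None)
```

Because the full name is bracketed, the `#` that starts the comment cannot be confused with a `#` in the candidate's name, which is one practical benefit of requiring brackets.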

Decide on metadata header (or line format) for ABIF

A couple of weeks ago (in May 2021), @cpsolver wrote a message to the EM-list that I'm only now getting around to responding to (indirectly). See "Re: [EM] Ballot Data Format" by VoteFair on 2021-06-06 for more.

In the email, he suggests the following:

A case number allows the ballot data to be processed through separate
vote-counting software while the metadata -- such as precinct number,
political-party affiliations, etc. -- can follow a different path and be
re-joined to produce the published results.

In particular, my vote-counting software focuses on the numbers/counts,
and I use different software (written in my Dashrep programming
language) to process the text info.

The use of a case number also has other benefits.

I think it's inevitable that we're going to need to figure out how to allow for custom metadata outside of comments. One thing that I love about the old email standards (and in particular, RFC 822) is how simple the rules were for distinguishing between the header (with the metadata about the email) and the body (which contained the message, which could be pretty much ANYTHING).

The following message is vaguely compatible with RFC 822:

Hyphen-separate-field-1: Random-ish characters, terminated by CRLF
Hyphen-separate-field-3: Even more random-ish characters, terminated by CRLF
Hyphen-separate-field-2: More random-ish characters, terminated by another CRLF
From: Random name with random characters <[email protected]>
Subject: Does anyone remember RFC 822?
To: The world <[email protected]>
Date: Today-ish
Hyphen-separate-field-4: Oh, yeah, here's another header, terminated by another CRLF

This is my email ode to RFC 822!  業業業業whee業業業業wheeee!!!!!!!

Did I mention this: whee!  Oh, yeah, and 業!  ña, ña, ña!

I suspect my example above has a few problems of non-compliance with RFC 822, and probably also has problems with the updated specs (RFC 5322 and RFC 6854). Still, the format hasn't changed much; in fact, it still uses US-ASCII rather than UTF-8, and most developers who have done much with email will recognize the example as something vaguely compatible with RFC 822.

Note that there are many arbitrary headers in the top portion of the example, and that the order seems a bit random. My hope for ABIF is that we would do something very similar. I realize now that my proposed headers on some of the test cases for ABIF (as I write this on June 13) don't seem to allow a lot of room for expansion.

There are many ways I can see for solving this problem:

a. create a way of having a mandatory body, and an optional header in all ABIF files
- a1. Create a way of expressing the header as valid JSON (allowing for newlines), and a way of delimiting between JSON and an ABIF-body section
- a2. Create a way of expressing the header as valid YAML (allowing for newlines and following YAML whitespace rules), and create a way of delimiting between YAML and an ABIF-body section
- a3. Create a way of attaching a valid RFC 5322 header to the top of the file, with a blank newline as the delimiter between the RFC-5322-formatted header and the ABIF body
- a4. Create some other header format
b. Create rules for having a variety of line types in ABIF which can be recognized and routed according to their first character. The following sub-options are NOT mutually exclusive
- b1. Have [0-9] as the first line character correspond to a ballot grouping
- b2. Have "#" as the first line character correspond to a comment
- b3. Have open square bracket ([) correspond to an ABIF mapping line (like "[Sue Ye (蘇業)]: SY")
- b4. Have open squirrelly bracket ({) correspond to a valid NDJSON line. Arbitrary metadata can be placed inside of JSON dictionaries, which most parsers MAY ignore.
- b5. Allow all b1 through b4 to occur in any order in a valid ABIF file
c. Some combination of the "a" and "b" above

My current preference is option "c", because I think writing parsers will be easier if all of the metadata is declared at the top of the file, but I also want to keep the option to have metadata and comments down in the body of the document. I also think that it should be safe for authors to add spaces and tabs at the beginning of the line, and have those stripped out by parsers. I'd also like to make it reasonably easy to write a single-pass parser for ABIF files, which becomes much easier if the candidate mappings (described in "b3." above) are handled as part of "header" handling, so that there are no surprise candidate token declarations in the body.
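The routing in option "b" could be sketched like this: strip leading spaces and tabs, then dispatch on the first character, accepting the line types in any order (b5). The bucket names are hypothetical, and `json.loads` from the standard library stands in for an NDJSON reader, since each NDJSON line is a complete JSON value.

```python
import json

def route_lines(text: str) -> dict:
    """Route each line to a bucket according to its first character."""
    buckets = {"ballots": [], "comments": [], "mappings": [], "metadata": []}
    for raw in text.splitlines():
        line = raw.lstrip(" \t")  # leading spaces and tabs are stripped
        if not line:
            continue
        first = line[0]
        if first in "0123456789":      # b1: ballot grouping
            buckets["ballots"].append(line)
        elif first == "#":             # b2: comment
            buckets["comments"].append(line)
        elif first == "[":             # b3: mapping line
            buckets["mappings"].append(line)
        elif first == "{":             # b4: one JSON object per line
            buckets["metadata"].append(json.loads(line))
        else:
            raise ValueError(f"unrecognized line type: {line!r}")
    return buckets

sample = """{"title": "Example election"}
[Sue Ye (蘇業)]: SY
# ballots follow
  27: SY
"""
print(route_lines(sample))
```

A single-pass parser that wants mappings declared before use could additionally refuse any `[` line that appears after the first ballot line.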

Thoughts?

Quoted tokens: allow double quotes, or always square brackets?

As I've been writing https://github.com/electorama/abiftool , it has occurred to me that double quotes make parsing a bit harder. The current test016.abif file looks like this:

# Case 16 - Bracketed candidate tokens (declared).  Ranked and scored.
#
# As of 2023-04-02, this test should pass, now that abif uses lark's builtin
# token "ESCAPED_STRING".

=DGM:[Doña García Márquez]
=SBJ:[Steven B. Jensen]
="蘇業":[Sue Ye (蘇業)]
=AM:[Adam Muñoz]

27: DGM/5 > SBJ/2 >  "蘇業"/1 > AM/0
26: SBJ/5 > DGM/3 =  "蘇業"/3 > AM/1
24:  "蘇業"/5 > DGM/2 =  AM/2 > SBJ/1
23:  AM/5 >  "蘇業"/3 > DGM/1 > SBJ/0

However, I was really tempted to change the file to this:

# Case 16 - Bracketed candidate tokens (declared).  Ranked and scored.
#
# As of 2023-04-02, this test should pass, now that abif uses lark's builtin
# token "ESCAPED_STRING".

=DGM:[Doña García Márquez]
=SBJ:[Steven B. Jensen]
=[蘇業]:[Sue Ye (蘇業)]
=AM:[Adam Muñoz]

27: DGM/5 > SBJ/2 >  [蘇業]/1 > AM/0
26: SBJ/5 > DGM/3 =  [蘇業]/3 > AM/1
24: [蘇業]/5 > DGM/2 =  AM/2 > SBJ/1
23:  AM/5 >  [蘇業]/3 > DGM/1 > SBJ/0

Note how the Chinese characters for our fictional "Sue Ye" candidate are surrounded by square brackets rather than double quote marks in the latter example. I like square brackets as a quoting mechanism because they make the beginning and end of a token much clearer. However, I imagine it would be easier to take a CSV file and convert it to ABIF if double quotes are allowed, since many CSV files already use double quotes in addition to commas as column delimiters.
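
To make the tradeoff concrete, here is a rough Python sketch of a tokenizer that accepts both quoting styles. It is not abiftool's actual lark grammar; the `TOKEN` pattern, the `candidate_tokens` name, and the rule that bare tokens are limited to ASCII letters, digits, and underscores are all assumptions made for illustration.

```python
import re

TOKEN = re.compile(r'''
    \[(?P<bracket>[^\]]*)\]              # [bracketed token]: ends at the first "]"
  | "(?P<quoted>(?:[^"\\]|\\.)*)"        # "quoted token": must cope with \" escapes
  | (?<![/\w])(?P<bare>[A-Za-z0-9_]+)    # bare token, not part of a "/score" suffix
''', re.VERBOSE)


def candidate_tokens(ballot_body: str) -> list[str]:
    """Return candidate tokens from the part of a ballot line after the colon."""
    # m.lastgroup names whichever alternative matched (bracket, quoted, or bare).
    return [m.group(m.lastgroup) for m in TOKEN.finditer(ballot_body)]
```

The bracket branch only needs to look for the next `]`, while the double-quote branch has to reason about escape sequences, which is roughly the extra parsing burden described above.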

Thoughts? (my apologies if I've forgotten/misplaced a discussion of this point)

The `CONTRIBUTING.md` file for electorama/abif is not an MVP

As stated in the subject: the "CONTRIBUTING.md" file for electorama/abif is still kind of a disaster (as of 2021-08-19). A minimum viable product (MVP) for that file would make it so that @robla feels more comfortable accepting pull requests.

Separate standards for ranked vs. scored ballots?

Scores and ranks are different. Should different standards exist for each?

What types of scores to allow?

Alongside integer scores, what kinds of scores do we want ABIF to support?

A/5    # a no-brainer
A/4.5    # decimal non-integer
A/1/3    # rational fractions, the notation gets very unreadable with this combination though
A/A    # used e.g. in standard notations of Majority Judgment

Votelib currently supports the first two alternatives.
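
For discussion, a small Python sketch of how a parser might accept all four notations at once. It is not how Votelib or abiftool handle scores; `split_candidate_score` is a hypothetical helper, and treating any non-numeric score as a grade label is an assumption.

```python
from fractions import Fraction
from typing import Tuple, Union


def split_candidate_score(token: str) -> Tuple[str, Union[Fraction, str]]:
    """Split "NAME/SCORE" at the first slash and interpret SCORE."""
    name, _, score_text = token.partition("/")
    score_text = score_text.strip()
    try:
        # Fraction() accepts "5", "4.5", and "1/3" alike, with exact arithmetic.
        score: Union[Fraction, str] = Fraction(score_text)
    except ValueError:
        score = score_text  # assumed to be a grade label, e.g. "A"
    return name.strip(), score
```

Splitting at the first slash is what makes `A/1/3` parseable, but it does nothing for the readability problem noted above.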

Write a proper specification

Folks who have been around the electoral reform community for a while seem pretty interested in this format, but as of 2021-06-06, there isn't really a specification. There are only a few wiki pages, some test cases, and a few online discussions about the format. It's really similar to ad hoc formats that have been around for 25 years or so, but we should actually have a written record of what we're up to.

Allow fractional vote counts?

Do we want to allow non-integer vote counts in ABIF? E.g. this

1.5: A>B=C

would be a valid ABIF line.

If so, do we want decimal numbers, or even fractions such as

1/3: A>B=C
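
One way to see what each choice would mean for a parser is the following Python sketch. The `split_count` helper is hypothetical, and it assumes the count is everything before the first colon.

```python
from fractions import Fraction
from typing import Tuple


def split_count(line: str) -> Tuple[Fraction, str]:
    """Split a ballot line into an exact vote count and the ballot body."""
    count_text, _, body = line.partition(":")
    # Fraction() parses "27", "1.5", and "1/3" without rounding.
    return Fraction(count_text.strip()), body.strip()


# Three lines of "1/3" add up to exactly one ballot, which floats cannot promise.
total = sum(split_count("1/3: A>B=C")[0] for _ in range(3))
```

Storing counts as rationals keeps tallies exact whether the file uses decimals or fractions, so the notation question can be decided on readability alone.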

Range/Scoring/STAR Voting: simplified format (fixed order of candidates, separated by comma)

New format requested.

Instead of this:
=DGM:[Doña García Márquez]
=SBJ:[Steven B. Jensen]
=SY:[Sue Ye (蘇業)]
=AM:[Adam Muñoz]

27: DGM/5, SBJ/2, SY/1, AM/0
26: DGM/3, SBJ/5, SY/3, AM/1
24: DGM/2, SBJ/1, SY/5, AM/2
23: DGM/1, SBJ/0, SY/3, AM/5

use a double equals sign (or maybe a tilde character ~ instead of two equals signs):
==DGM:[Doña García Márquez]
==SBJ:[Steven B. Jensen]
==SY:[Sue Ye (蘇業)]
==AM:[Adam Muñoz]

27: 5, 2, 1, 0
26: 3, 5, 3, 1
24: 2, 1, 5, 2
23: 1, 0, 3, 5

Idea: a double equals sign (or an alternative character, say ~) marks a scoring ballot with a fixed candidate order, in which every column is required.
This is OK:
23: 1, 0, 3, 5
and this is OK
23: 1, , 3, 5

However, this is not OK - should fail with error:
23: 1, 3, 5
Error: number of candidates (4) and number of scores (3) must match.

This is also not OK - should fail with error:
23: 1, 0, 3, 5, 4
Error: number of candidates (4) and number of scores (5) must match.
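
The validation rule above could be sketched in Python as follows. The function name `parse_fixed_order_line`, the use of `None` for an empty column, and the restriction to integer counts and scores are assumptions for illustration only.

```python
from typing import Dict, List, Optional, Tuple


def parse_fixed_order_line(
    line: str, candidates: List[str]
) -> Tuple[int, Dict[str, Optional[int]]]:
    """Parse e.g. "23: 1, , 3, 5" against candidates declared in fixed order."""
    count_text, _, body = line.partition(":")
    cells = [cell.strip() for cell in body.split(",")]
    if len(cells) != len(candidates):
        raise ValueError(
            f"number of candidates ({len(candidates)}) and "
            f"number of scores ({len(cells)}) must match."
        )
    # An empty column means the voter left that candidate unscored.
    scores = {
        cand: (int(cell) if cell else None)
        for cand, cell in zip(candidates, cells)
    }
    return int(count_text), scores
```

With the candidates declared as `["DGM", "SBJ", "SY", "AM"]`, the line `23: 1, , 3, 5` parses with `SBJ` left unscored, while `23: 1, 3, 5` raises the column-count error.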