Multiformats Considered Harmful #2

Open
selfissued opened this issue Sep 6, 2023 · 4 comments

Comments

@selfissued

selfissued commented Sep 6, 2023

While I usually reserve my time and energy for advancing good ideas, I’m making an exception to publicly state the reasons why I believe “multiformats” should not be considered for standardization by the IETF.

  1. Multiformats institutionalize the failure to make a choice, which is the opposite of what good standards do. Good standards make choices about representations of data structures resulting in interoperability, since every conforming implementation uses the same representation. In contrast, multiformats enable different implementations to use a multiplicity of different representations for the same data, harming interoperability. https://datatracker.ietf.org/doc/html/draft-multiformats-multibase-03#appendix-D.1 defines 23 equivalent and non-interoperable representations for the same data!
  2. The stated purpose of “multibase” is “Unfortunately, it’s not always clear what base encoding is used; that’s where this specification comes in. It answers the question: Given data ‘d’ encoded into text ‘s’, what base is it encoded with?”, which is wholly unnecessary. Successful standards DEFINE what encoding is used where. For instance, https://www.rfc-editor.org/rfc/rfc7518.html#section-6.2.1.2 defines that “x” is base64url encoded. No guesswork or prefixing is necessary or useful.
  3. Standardization of multiformats would result in unnecessary and unhelpful duplication of functionality – especially of key representations. The primary use of multiformats is for “publicKeyMultibase” – a representation of public keys that are byte arrays. For instance, the only use of multiformats by the W3C DID spec is for publicKeyMultibase. The IETF already has several perfectly good key representations, including X.509, JSON Web Key (JWK), and COSE_Key. There’s not a compelling case for another one.
  4. publicKeyMultibase can only represent a subset of the key types used in practice. Representing many kinds of keys requires multiple values – for instance, RSA keys require both an exponent and a modulus. By comparison, the X.509, JWK, and COSE_Key formats are flexible enough to represent all kinds of keys. It makes little to no sense to standardize a key format that limits implementations to only certain kinds of keys.
  5. The “multihash” specification relies on a non-standard representation of integers called “Dwarf”. Indeed, the referenced Dwarf document lists itself as being at http://dwarf.freestandards.org/ – a URL that no longer exists!
  6. The “Multihash Identifier Registry” at https://www.ietf.org/archive/id/draft-multiformats-multihash-07.html#mh-registry duplicates the functionality of the IANA “Named Information Hash Algorithm Registry” at https://www.iana.org/assignments/named-information/named-information.xhtml#hash-alg, in that both assign (different) numeric identifiers for hash functions. If multihash goes forward, it should use the existing registry.
  7. It’s concerning that the draft charter states that “Changing current Multiformat header assignments in a way that breaks backward compatibility with production deployments” is out of scope. Normally IETF working groups are given free rein to make improvements during the standardization process.
  8. Finally, as a member of the W3C DID and W3C Verifiable Credentials working groups, I will state that it is misleading for the draft charter to say that “The outputs from this Working Group are currently being used by … the W3C Verifiable Credentials Working Group, W3C Decentralized Identifiers Working Group…”. The documents produced by these working groups intentionally contain no normative references to multiformats or any data structures derived from them. Where they are referenced, it is explicitly stated that the references are non-normative.
@zamicol

zamicol commented Sep 6, 2023

I respect the first-principles engineering work that multiformats demonstrates.

I didn't realize that a public key specification was included. Where is publicKeyMultibase defined?

EDIT: msporny pointed me in the right direction:

publicKeyMultibase is an encoding of Multikey. The Multikey format is described for each primitive in its respective W3C specification, for example, eddsa and ecdsa.

@BigBlueHat

@zamicol publicKeyMultibase is defined in the DID-CORE spec.

@zamicol

zamicol commented Sep 14, 2023

In that document I see publicKeyMultibase referred to, but I don't see a definition. publicKeyJwk is also only referred to there, but JWK is defined by its own specification, and the DID-CORE spec links to that specification. I don't see any such link for Multibase's "public key". Where is Multibase's public key defined?

@AaronGoldman

1.

> Multiformats institutionalize the failure to make a choice, which is the opposite of what good standards do. Good
> standards make choices about representations of data structures resulting in interoperability, since every
> conforming implementation uses the same representation. In contrast, Multiformats enable different implementations
> to use a multiplicity of different representations for the same data, harming interoperability.
> https://datatracker.ietf.org/doc/html/draft-multiformats-multibase-03#appendix-D.1
> defines 23 equivalent and non-interoperable representations for the same data!

Multibase specifically, and Multiformats more generally, are standards for decoupling. A good example of a decoupling
standard is IPv4/IPv6 and the IP protocol numbers. IPv4 has the Protocol field and IPv6 has the Next Header field,
but they share the same IANA registry. We could call this a "failure to make a choice", since IP did not choose the
format of the layers above and below it, or we could view it as a deliberate decoupling of the layers of the network
stack. Whether it was a good or bad design, it did enable innovation in the types of content IP can encapsulate.
There are 146 protocols in the registry, and some routers don't implement them all, preferring just ICMP, UDP, and
TCP, yet IPv4/IPv6 have still proved useful.

The Multibase standard solves the problem of representing bytes in text strings with restricted character sets,
without needing to know in advance what the restrictions will be. This is independent and separate from all the
other Multiformat standards.

The Multiformat standard solves the problem of providing a "tag" to specify what the next "value" is, same as IPv4's
Protocol header or HTTP's Content-Type header.

2.

> The stated purpose of "Multibase" is
> "Unfortunately, it's not always clear what base encoding is used; that's where this specification comes in. It
> answers the question: Given data ‘d’ encoded into text ‘s’, what base is it encoded with?", which is wholly
> unnecessary. Successful standards DEFINE what encoding is used where. For instance,
> https://www.rfc-editor.org/rfc/rfc7518.html#section-6.2.1.2
> defines that "x" is base64url encoded. No guesswork or prefixing is necessary or useful.

Some standards do specify a specific encoding. Multibase will not prevent any past or future standard from specifying
that a text field is Base64url, for example. It does enable future standards to specify that bytes are encoded as a
Multibase string.

Multibase is a set of encodings that allow an array of bytes to be encoded as text under character-set restrictions
that may not be known in advance. If we had a protocol that had a 32-byte number, and we needed to represent
those bytes as text, we could represent them as:

| Base            | Literal |
|-----------------|---------|
| b256(bytes)     | (non-ascii bytes not representable here) |
| b85             | <FLd+nEV_Rn)~#~nQyryC$2%{WSf&rq?MT)cv84k |
| b64             | 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= |
| b32             | 4OYMIQUY7QOBJGX36TEJS35ZEQT24QPEMSNZGTFESWMRW6CSXBKQ==== |
| b16             | E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855 |
| integer_literal | 0xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 |
| integer_literal | 102987336249554097029535212322581322789799900648198034993379397001115665086549 |
| integer_literal | 0o16166061041230770160244657576462114557562220475344074431115623231222254621557024534125 |
| integer_literal | 0b1110001110110000110001000100001010011000111111000001110000010100100110101111101111110100110010001001100101101111101110010010010000100111101011100100000111100100011001001001101110010011010011001010010010010101100110010001101101111000010100101011100001010101 |

By using an integer literal, I can describe both the number and the base in which it is written. In this
case, hex needs a text environment that supports only 0123456789abcdefx, binary needs only 01b, and
so on. Multibase takes this further by requiring that the first character (indicating the base) be drawn from the
alphabet of the encoding itself. This way the prefix never forces the environment to support an extra character.
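To make the prefix idea concrete, here is a minimal sketch in Python. It assumes the prefix characters `f` (base16), `b` (base32, lowercase, unpadded), `m` (base64, unpadded) and `u` (base64url, unpadded) from the Multibase table; the helper names `multibase_encode` and `multibase_decode` are made up for illustration and are not taken from any particular library.

```python
import base64


def _unpadded(text: str) -> str:
    """Drop trailing '=' padding."""
    return text.rstrip("=")


def _repad(text: str, block: int) -> str:
    """Restore '=' padding so the standard-library decoders accept the text."""
    return text + "=" * (-len(text) % block)


# Multibase prefix -> (encoder, decoder). Only four bases are sketched here.
CODECS = {
    "f": (lambda b: b.hex(), bytes.fromhex),
    "b": (lambda b: _unpadded(base64.b32encode(b).decode("ascii")).lower(),
          lambda s: base64.b32decode(_repad(s, 8), casefold=True)),
    "m": (lambda b: _unpadded(base64.b64encode(b).decode("ascii")),
          lambda s: base64.b64decode(_repad(s, 4))),
    "u": (lambda b: _unpadded(base64.urlsafe_b64encode(b).decode("ascii")),
          lambda s: base64.urlsafe_b64decode(_repad(s, 4))),
}


def multibase_encode(prefix: str, data: bytes) -> str:
    """Encode data in the base named by prefix and prepend that prefix."""
    encode, _ = CODECS[prefix]
    return prefix + encode(data)


def multibase_decode(text: str) -> bytes:
    """Read the base from the first character, then decode the rest."""
    _, decode = CODECS[text[0]]
    return decode(text[1:])


# The 32-byte value from the table above (SHA-256 of the empty string).
digest = bytes.fromhex(
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
)
for prefix in CODECS:
    encoded = multibase_encode(prefix, digest)
    assert multibase_decode(encoded) == digest
    print(encoded)
```

The decoder never needs to be told the base out of band: the first character selects it, and in each of these four cases that character is itself a member of the alphabet it announces.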

An example of this adding value is when Multibase was chosen for IPFS CIDs. The CIDs were traditionally in base58btc,
which is case-sensitive. This worked well for representing bytes in the restricted text environment of file paths and
URI paths. This could have easily been specified as a base58btc string, but fortunately they chose Multibase to
decouple the bytes of the CID from the string representation. When the time came that they wanted to put CIDs into
subdomains, the case-insensitive subdomains were a more restricted text environment that they had not anticipated. They
switched to base32 which was not case-sensitive and thus able to represent the same bytes in a more restricted
environment.
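The case-sensitivity problem can be sketched in a few lines of Python. This is only an illustration: the base58btc functions below are a small hand-written implementation of the Bitcoin alphabet, `survives_lowercasing` is a hypothetical helper, and lowercasing stands in for whatever case folding a case-insensitive environment might apply.

```python
import base64

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58btc_encode(data: bytes) -> str:
    """Encode bytes as base58btc (leading zero bytes become '1')."""
    n = int.from_bytes(data, "big")
    digits = ""
    while n:
        n, rem = divmod(n, 58)
        digits = B58_ALPHABET[rem] + digits
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + digits


def b58btc_decode(text: str) -> bytes:
    """Decode base58btc; raises ValueError on characters outside the alphabet."""
    n = 0
    for ch in text:
        n = n * 58 + B58_ALPHABET.index(ch)
    leading_ones = len(text) - len(text.lstrip("1"))
    return b"\x00" * leading_ones + n.to_bytes((n.bit_length() + 7) // 8, "big")


def b32_encode(data: bytes) -> str:
    """Lowercase, unpadded base32."""
    return base64.b32encode(data).decode("ascii").rstrip("=").lower()


def b32_decode(text: str) -> bytes:
    """Decode unpadded base32, accepting either case."""
    padded = text + "=" * (-len(text) % 8)
    return base64.b32decode(padded, casefold=True)


def survives_lowercasing(encode, decode, data: bytes) -> bool:
    """True if the bytes can still be recovered after the text is lowercased."""
    try:
        return decode(encode(data).lower()) == data
    except ValueError:
        return False


print(survives_lowercasing(b58btc_encode, b58btc_decode, b"\x09"))        # False
print(survives_lowercasing(b32_encode, b32_decode, bytes(range(32))))      # True
```

In base58btc the single byte 0x09 encodes as `A`, and the lowercased `a` decodes to 0x21 instead, so the bytes are silently changed. Base32 text decodes to the same bytes in either case.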

Multibase is orthogonal to Multiformats and should be standardized as a way to represent bytes in a restricted text
environment that is restricted in ways that are irrelevant to the bytes being represented. If we don't know whether
our data will need to be represented as compact arbitrary bytes, 7-bit safe ascii, JSON non-escaped ascii, CSV
non-escaped ascii, TSV non-escaped ascii, URL path-safe ascii, domain-name-safe ascii, decimal numbers only, some
not yet known but soon to be important environment, etc. then encoding the bytes as Multibase has decoupling value.

3.

> Standardization of Multiformats would result in unnecessary and unhelpful duplication of functionality – especially
> of key representations. The primary use of Multiformats is for "publicKeyMultibase" – a representation of public
> keys that are byte arrays. For instance, the only use of Multiformats by the W3C DID spec
> is for publicKeyMultibase. The IETF already has several perfectly good key representations, including X.509, JSON
> Web Key (JWK), and COSE_Key. There's not a compelling case for another one.

The standardization of Multiformats is independent of whether IETF chooses to standardize publicKeyMultibase.

For example, the IANA protocol-numbers registry used by the IPv4 Protocol field assigns number 70 to VISA (VISA
Protocol). This does not imply that IETF needs to specify VISA Protocol. In fact, as far as we can tell, it is the
IVI Foundation that maintains that standard. In the exact same way, the only interaction between
Multiformats standardization and publicKeyMultibase is that publicKeyMultibase could use the Multiformats
registry to map numbers to key representations. Any flaws in publicKeyMultibase are no better an argument against
standardization of Multiformats than the flaws in VISA Protocol are against standardization of IPv4 and the IANA
protocol-numbers registry.

If X.509, JSON Web Key (JWK), or COSE_Key become the standard way to represent keys for the web then publicKeyMultibase
could just add a Multiformats registry entry for X.509 or JWK, and publicKeyMultibase would just be a wrapper around
those representations. COSE is already present in the registry.

4.

> publicKeyMultibase can only represent a subset of the key types used in practice. Representing many kinds of keys
> requires multiple values – for instance, RSA keys require both an exponent and a modulus. By comparison, the X.509,
> JWK, and COSE_Key formats are flexible enough to represent all kinds of keys. It makes little to no sense to
> standardize a key format that limits implementations to only certain kinds of keys.

Please see above. publicKeyMultibase is outside the scope of this working group, which is tasked
with producing the following artifacts:

  1. An RFC specifying multibase usage
  2. An RFC defining an independent multibase registry and populating it with today's already-implemented stable and final
    values
  3. An RFC defining a registry-group for all the multicodecs, empty at inception, with registration process and group-wide
    constraints on registration values
  4. An RFC specifying multihash usage
  5. An RFC defining a multihash registry within the multicodecs registry group and populating it with today's
    already-implemented stable and final values

The Multiformat-varint spec is also pulled in as it is needed to specify the length in Multihash and Multiformat with
sized payloads.

5.

> The "multihash" specification relies on a
> non-standard representation of integers called "Dwarf". Indeed, the referenced Dwarf document lists itself as being
> at http://dwarf.freestandards.org – a URL that no longer exists!

We agree here: the Multiformats-varint is close to, but not exactly, Dwarf. This is because the
Multiformats-varint is limited to 9 bytes. It is a 1-to-9 byte representation of an unsigned 63-bit integer, from
0x00 (0) to 0x7FFFFFFF_FFFFFFFF (9223372036854775807), which means the decoded value will always fit in either a
signed int64 or an unsigned int64. If the most significant bit of a byte is 0, this is the last byte of the
Multiformats-varint. If it is 1, there is at least one more byte present. The 7 remaining bits are the payload bits.
You can shift the payload bits left by 7 * (byte number, counting from 0) and | (bitwise-OR) them together to get
the decoded number.

| length in bytes | Encoded bits | Bits                                                                             |
|-----------------|--------------|----------------------------------------------------------------------------------|
| 1               | 7            | 0xxxxxxx                                                                         |
| 2               | 14           | 1xxxxxxx 0xxxxxxx                                                                |
| 3               | 21           | 1xxxxxxx 1xxxxxxx 0xxxxxxx                                                       |
| 4               | 28           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx                                              |
| 5               | 35           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx                                     |
| 6               | 42           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx                            |
| 7               | 49           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx                   |
| 8               | 56           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx          |
| 9               | 63           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx |
|                 |              |  7     1, 14    8, 21   15, 28   22, 35   29, 42   36, 49   43, 56   50, 63   57 |
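The encoding described above is small enough to sketch directly. The following Python is an illustration under the stated rules (7 payload bits per byte, least-significant group first, continuation bit in the top bit, at most 9 bytes); the function names are hypothetical, and the sketch does not reject non-minimal encodings.

```python
from typing import Tuple

MAX_VARINT_BYTES = 9
MAX_VARINT_VALUE = (1 << 63) - 1  # 0x7FFFFFFF_FFFFFFFF


def varint_encode(value: int) -> bytes:
    """Encode an unsigned integer of at most 63 bits as 1 to 9 bytes."""
    if not 0 <= value <= MAX_VARINT_VALUE:
        raise ValueError("value does not fit in 63 bits")
    out = bytearray()
    while True:
        payload = value & 0x7F
        value >>= 7
        if value:
            out.append(payload | 0x80)  # more bytes follow
        else:
            out.append(payload)         # top bit 0 marks the last byte
            return bytes(out)


def varint_decode(buf: bytes) -> Tuple[int, int]:
    """Decode a varint from the start of buf; return (value, bytes consumed)."""
    value = 0
    for index, byte in enumerate(buf[:MAX_VARINT_BYTES]):
        value |= (byte & 0x7F) << (7 * index)
        if byte & 0x80 == 0:
            return value, index + 1
    raise ValueError("varint is truncated or longer than 9 bytes")


print(varint_encode(300).hex())        # ac02
print(varint_decode(b"\xac\x02rest"))  # (300, 2)
```

For 300 (binary 100101100), the low seven bits 0101100 go into the first byte with the continuation bit set (0xac), and the remaining bits 10 go into the second byte (0x02).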

Multiformats-varint is such a simple varint that there is no reason to point anywhere else. The Multiformats-varint
should be specified by this working group alongside Multibase and Multihash. Any reference to Dwarf is simply
unnecessary as it is clearer to specify Multiformats-varint rather than trying to describe it relative to a similar
but non-identical varint.

6.

> The "Multihash Identifier Registry" at https://www.ietf.org/archive/id/draft-multiformats-multihash-07.html#mh-registry
> duplicates the functionality of the IANA "Named Information Hash Algorithm Registry" at
> https://www.iana.org/assignments/named-information/named-information.xhtml#hash-alg,
> in that both assign (different) numeric identifiers for hash functions. If multihash goes forward, it should use
> the existing registry.

"Not all uses of these names require use of the full hash output -- truncated hashes can be safely used in some
environments. For this reason, we define a new IANA registry for hash functions to be used with this specification so
as not to mix strong and weak (truncated) hash algorithms in other protocol registries."
-- rfc6920: Naming Things with Hashes

The goal of the named-information registry is to identify a hash function and prefix (truncation) length for the
binary encoding of an ni:// or nih:// name. That identifier is limited to a 6-bit field, but the Multiformats
registry intends to support more than 64 algorithm/size pairs.

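| hash      | sizes |
|-----------|-------|
| identity  | 1     |
| sha1      | 1     |
| sha2      | 9     |
| sha2a     | 1     |
| sha3      | 4     |
| keccak    | 5     |
| blake3    | 1     |
| md4       | 1     |
| md5       | 1     |
| blake2b   | 64    |
| blake2s   | 32    |
| skein256  | 32    |
| skein512  | 64    |
| skein1024 | 128   |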

We can't fit hundreds of hash-function/length pairs in a 64-entry registry. Reusing that registry would also break
backwards compatibility, because it would change which numbers map to which hash functions, and it would pollute the
registry for RFC 6920 implementors by including hash functions that are not cryptographically secure. Lastly, the
Multiformats registry already contains more than 64 hash functions, so it would not fit in the Named Information
Hash Algorithm Registry.

It is better to have hash function and length as two different fields as in Multihash.
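As a rough illustration of the two-field layout, the sketch below builds and parses a multihash for sha2-256 in Python. It assumes the layout is a varint hash-function code, then a varint digest length, then the digest bytes, and that sha2-256 has code 0x12 in the Multiformats table; the function names are hypothetical.

```python
import hashlib
from typing import Tuple

SHA2_256_CODE = 0x12  # assumed identifier for sha2-256 in the Multiformats table


def encode_varint(value: int) -> bytes:
    """Unsigned varint: 7 payload bits per byte, least-significant group first."""
    out = bytearray()
    while True:
        payload = value & 0x7F
        value >>= 7
        if value:
            out.append(payload | 0x80)
        else:
            out.append(payload)
            return bytes(out)


def decode_varint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, offset of the byte after the varint)."""
    value = 0
    for index in range(9):
        if offset + index >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset + index]
        value |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            return value, offset + index + 1
    raise ValueError("varint longer than 9 bytes")


def multihash_sha2_256(data: bytes, length: int = 32) -> bytes:
    """Hash data with SHA-256 and keep the first `length` bytes of the digest."""
    if not 1 <= length <= 32:
        raise ValueError("length must be between 1 and 32 bytes")
    digest = hashlib.sha256(data).digest()[:length]
    return encode_varint(SHA2_256_CODE) + encode_varint(length) + digest


def parse_multihash(mh: bytes) -> Tuple[int, int, bytes]:
    """Split a multihash into (function code, digest length, digest)."""
    code, pos = decode_varint(mh, 0)
    length, pos = decode_varint(mh, pos)
    digest = mh[pos:]
    if len(digest) != length:
        raise ValueError("digest length does not match the length field")
    return code, length, digest


print(multihash_sha2_256(b"").hex())
print(parse_multihash(multihash_sha2_256(b"", 16))[:2])  # (18, 16)
```

Because the length is its own field, a truncated digest reuses the same function code (0x12) with a smaller length byte, rather than needing a separate registry entry for each function/size pair.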

7.

> It's concerning that the draft charter states that
> "Changing current Multiformat header assignments in a way that breaks backward compatibility with production
> deployments" is out of scope. Normally IETF working groups are given free rein to make improvements during the
> standardization process.

This may be a distinction without a difference. We certainly could empower the working group to make backwards
incompatible changes, but they will try not to have any unnecessary breaking changes.

8.

> Finally, as a member of the W3C DID and W3C Verifiable Credentials working groups, I will state that it is
> misleading for the draft charter to say that "The outputs from this Working Group are currently being used by … the
> W3C Verifiable Credentials Working Group, W3C Decentralized Identifiers Working Group…". The documents produced by
> these working groups intentionally contain no normative references to Multiformats or any data structures derived
> from them. Where they are referenced, it is explicitly stated that the references are non-normative.

This is a good note. The draft charter should probably be clear that Multiformats are being used in Verifiable
Credentials and Decentralized Identifiers in production. There are multiple existing independent implementations
of this technology enabling Verifiable Credentials and Decentralized Identifiers to be useful. While these specs
contain no normative references, this registry provides the ability to make Verifiable Credentials and Decentralized
Identifiers that are better decoupled from the data structures that they contain, and will therefore be flexible in
the face of future evolution.
