Binary strings vs text strings #20

streamich · 2023-12-10T08:14:37Z

It seems RESP v2 and RESP3 do not clearly distinguish between text strings (strings for short) and binary strings, clobs, blobs (binary for short). I might be missing something; if so, please pardon me.

Other binary serialization formats like MessagePack and CBOR clearly distinguish between strings (UTF-8) and binary data type (a list of any octets).

RESP seems to be running into the same problem that MessagePack v1 had: there was no clear distinction between strings and binary data.

This is a serious problem and what's preventing me from implementing MessagePack in my ObjC framework. I have to know whether I should create a string object or a data object. Creating a string object for everything will fail if it is not UTF-8 and always creating a data object will be very impractical.

Right now RESP has 4 string kinds: (1) simple string; (2) bulk string; (3) verbatim string; (4) streaming string. However, none of them clearly indicates whether the string can be safely treated as text (say, a UTF-8 string) or should be left as binary.

The verbatim string has an encoding field, which can be set to txt:, but the specification does not clearly state what txt: means. Does it mean it is UTF-8 text?

In most programming languages keys in built-in map types are strings, so one could try to parse the RESP Map type keys as strings, but, again, it is not guaranteed that those will be valid strings, and the text encoding is unknown.

Error messages in all programming languages are text strings, but the RESP Error types (simple and bulk) are not guaranteed to be valid text, and the text encoding is not specified.

Text strings, such as the object keys and values returned by the INFO command, are returned as the "bulk string" $ type, which the documentation defines as a binary string, i.e. a binary blob. It is not text: (1) it could potentially contain octets which would be illegal in text; (2) it is not clear which text encoding should be used.

CLUSTER MYID returns a binary string (Bulk string, $), but the actual data is a text string holding a hex sequence.

This forces me to have a special opt-in parsing mode for RESP responses in which every Bulk string $ is first parsed as a text string; if UTF-8 validation of that string fails, the parser bails and returns the Bulk string as an array of bytes instead. This is very inefficient and hacky.

This also adds the complexity that, when parsing a Bulk string, the parser can return 3 different types: (1) a binary blob, e.g. Uint8Array; (2) a native string (if it is valid UTF-8); (3) a null value (if it is the null Bulk string $-1\r\n).
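To make the fallback concrete, here is a minimal sketch of that opt-in mode, assuming the parser has already extracted the Bulk string payload as a `Uint8Array` (or `null` for the null Bulk string). The names `decodeBulk` and `BulkValue` are hypothetical and do not come from any RESP library.

```typescript
// Hypothetical helper: names and return shape are illustrative only.
type BulkValue = string | Uint8Array | null;

// `fatal: true` makes decode() throw a TypeError on malformed UTF-8
// instead of silently inserting U+FFFD replacement characters.
const strictUtf8Decoder = new TextDecoder("utf-8", { fatal: true });

function decodeBulk(payload: Uint8Array | null): BulkValue {
  // (3) null Bulk string ($-1\r\n) maps to a null value.
  if (payload === null) return null;
  try {
    // (2) valid UTF-8 becomes a native string.
    return strictUtf8Decoder.decode(payload);
  } catch {
    // (1) anything else stays a binary blob.
    return payload;
  }
}
```

Every caller then has to handle all three cases, and every payload is validated as UTF-8 even when the command is known to return binary data, which is where the inefficiency comes from.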

Proposal

Add the ability to clearly discriminate between binary data and UTF-8 text.

A new type could be added which is always UTF-8 valid string.

Alternatively, Verbatim strings could specify txt: and bin: encoding formats, where txt: would be explicitly reserved for UTF-8 strings and bin: for binary data.
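A rough sketch of how a client could act on that Verbatim string proposal follows. It assumes the payload has already been extracted and starts with a three-byte format followed by a colon. The `bin` format is the one proposed here and is not part of the current specification, and treating `txt` as UTF-8 is likewise an assumption of this proposal. `decodeVerbatim` and `VerbatimValue` are hypothetical names.

```typescript
// Hypothetical types and helper illustrating the proposal; not an existing API.
type VerbatimValue =
  | { kind: "text"; format: string; value: string }
  | { kind: "binary"; format: string; value: Uint8Array };

const verbatimUtf8Decoder = new TextDecoder("utf-8", { fatal: true });

function decodeVerbatim(payload: Uint8Array): VerbatimValue {
  // Expected layout: 3-byte format, ':' (0x3a), then the data.
  if (payload.length < 4 || payload[3] !== 0x3a) {
    throw new Error("malformed verbatim string: missing format prefix");
  }
  const format = new TextDecoder().decode(payload.subarray(0, 3));
  const body = payload.subarray(4);
  if (format === "txt") {
    // Assumption from the proposal: txt: is reserved for UTF-8 text,
    // so invalid UTF-8 here is an error rather than a fallback case.
    return { kind: "text", format, value: verbatimUtf8Decoder.decode(body) };
  }
  // bin: (proposed) and any format this sketch does not know stay as raw bytes.
  return { kind: "binary", format, value: body };
}
```

With an explicit format, the parser decides the result type from the prefix alone and no longer needs speculative UTF-8 validation.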

Clearly state in the specification that the Simple string format is UTF-8 text (or ASCII or Latin1).

Alternatively, add a UTF-8 text string tag.

414owen · 2024-01-21T11:56:39Z

I do think this needs addressing.

Currently, strings that contain newlines have to be encoded as blob strings, so parsers/clients that care about correctness have to return a buffer of bytes, rather than a string. This seems like a huge problem to me. The two are completely different, semantically, and the blobness will percolate up to the user, who will have to decipher "( \865\176 \860\662 \865\176)\n", instead of

( ͡° ͜ʖ ͡°)

streamich changed the title ~~Binary strings from text strings~~ Binary strings vs text strings Dec 11, 2023

414owen mentioned this issue Jan 21, 2024

Tag type to prevent explosion of formats #21

Open
