Binary strings vs text strings #20

Open
streamich opened this issue Dec 10, 2023 · 1 comment

streamich commented Dec 10, 2023

It seems RESP v2 and RESP3 do not clearly distinguish between text strings (strings for short) and binary strings, clobs, blobs (binary for short). I might be missing something; if so, please pardon me.

Other binary serialization formats like MessagePack and CBOR clearly distinguish between strings (UTF-8) and binary data type (a list of any octets).

RESP seems to be running into the same problem that MessagePack v1 had: there was no clear distinction between strings and binary data.

This is a serious problem and what's preventing me from implementing MessagePack in my ObjC framework. I have to know whether I should create a string object or a data object. Creating a string object for everything will fail if it is not UTF-8 and always creating a data object will be very impractical.

Right now RESP has 4 string kinds: (1) simple string; (2) bulk string; (3) verbatim string; (4) streaming string. However, none of them clearly indicates whether the string can safely be treated as text, say a UTF-8 string, or should be left as binary.
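To make the four kinds concrete, here is a rough sketch of how each one looks on the wire when it carries the payload "hello", as far as I understand the RESP3 framing. The constant names and the `stringKind` helper are made up for illustration; they are not part of any client library.

```typescript
// Wire forms of the four RESP3 string kinds, each carrying "hello".
// None of the type prefixes says whether the payload is text or raw bytes.
const CRLF = "\r\n";

// (1) simple string: '+' prefix, terminated by CRLF
const simpleString = "+hello" + CRLF;

// (2) bulk string: '$' prefix, byte length, then the payload
const bulkString = "$5" + CRLF + "hello" + CRLF;

// (3) verbatim string: '=' prefix, byte length, 3-byte format tag, ':', payload
const verbatimString = "=9" + CRLF + "txt:hello" + CRLF;

// (4) streamed string: '$?' header, ';'-prefixed chunks, ';0' terminator
const streamedString =
  "$?" + CRLF + ";3" + CRLF + "hel" + CRLF + ";2" + CRLF + "lo" + CRLF + ";0" + CRLF;

// Hypothetical helper: name the string kind announced by the start of a reply.
function stringKind(reply: string): string {
  if (reply.startsWith("$?")) return "streamed";
  switch (reply[0]) {
    case "+":
      return "simple";
    case "$":
      return "bulk";
    case "=":
      return "verbatim";
    default:
      return "not a string";
  }
}
```

The prefix tells a parser how to frame the payload, but in every case the payload is just a run of octets with no declared text encoding.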

The verbatim string has an encoding field, which can be specified as txt:, but the specification does not clearly state what txt: means. Does it mean UTF-8 text?

In most programming languages keys in built-in map types are strings, so one could try to parse the RESP Map type keys as strings, but, again, it is not guaranteed that those will be valid strings, and the text encoding is unknown.

Error messages in all programming languages are text strings, but the RESP Error types (simple and bulk) are not guaranteed to be valid text, and the text encoding is not specified.

Text strings, such as the keys and values returned by the INFO command, come back as the "bulk string" $ type, which the documentation defines as a binary string, i.e. a binary blob. It is not text: (1) it could potentially contain octets which would be illegal in text; (2) it is not clear which text encoding should be used.

CLUSTER MYID returns a binary string (Bulk string, $), but the actual data is a text string holding a hex sequence.

This forces me to have a special opt-in parsing mode for RESP responses, where the parser attempts to parse every Bulk string $ as a text string; if UTF-8 validation of that string fails, it bails out and parses the Bulk string as an array of bytes instead. This is very inefficient and hacky.

This also results in the complexity that, when parsing a Bulk string, the parser can return 3 different types: (1) binary blob, e.g. Uint8Array; (2) a native string (if it is valid UTF-8); (3) a null value (if it is the null Bulk string $-1\r\n).
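The fallback mode described above can be sketched roughly like this. It is a minimal illustration, not the code of any actual parser: `BulkValue`, `strictUtf8` and `decodeBulk` are hypothetical names, and it assumes the Bulk string body has already been framed into a `Uint8Array` (or `null` for the null Bulk string).

```typescript
// The three possible results of decoding one Bulk string.
type BulkValue = string | Uint8Array | null;

// fatal: true makes decode() throw a TypeError on invalid UTF-8
// instead of silently inserting U+FFFD replacement characters.
const strictUtf8 = new TextDecoder("utf-8", { fatal: true });

// `payload` is the body of a Bulk string, or null for the null Bulk string ($-1\r\n).
function decodeBulk(payload: Uint8Array | null): BulkValue {
  if (payload === null) return null;
  try {
    // Valid UTF-8: hand back a native string.
    return strictUtf8.decode(payload);
  } catch {
    // Invalid UTF-8: fall back to the raw bytes.
    return payload;
  }
}
```

Every payload is validated as UTF-8 before it can be returned, which is where the inefficiency comes from. A binary blob that happens to be valid UTF-8 also comes back as a string, so the caller still cannot rely on the returned type.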

Proposal

Add the ability to clearly discriminate between binary data and UTF-8 text.

A new type could be added which is always UTF-8 valid string.

Alternatively, the Verbatim strings could specify txt: and bin: encoding formats, where txt: would be explicitly reserved for UTF-8 strings and bin: for binary data.

Clearly state in the specification that the Simple string format is UTF-8 text (or ASCII or Latin1).

Alternatively add a UTF-8 text string tag.
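As a rough sketch of how the txt:/bin: variant of the proposal could look on the client side: the `bin:` tag is hypothetical and not part of the current RESP3 specification, and `Decoded`, `proposalUtf8` and `decodeVerbatim` are names invented for this example. The function assumes it receives the body of a Verbatim string, i.e. the 3-byte format tag, a colon, and the data.

```typescript
// Result type: the caller always knows whether it got text or bytes.
type Decoded =
  | { kind: "text"; value: string }
  | { kind: "binary"; value: Uint8Array };

const proposalUtf8 = new TextDecoder("utf-8", { fatal: true });

// Decode the body of a Verbatim string ("xxx:" + data) under the PROPOSED
// convention. "bin:" is hypothetical and not defined by RESP3 today.
function decodeVerbatim(body: Uint8Array): Decoded {
  if (body.length < 4 || body[3] !== 0x3a /* ':' */) {
    throw new Error("malformed verbatim string: missing 3-byte format tag");
  }
  const tag = String.fromCharCode(body[0], body[1], body[2]);
  const data = body.subarray(4);
  if (tag === "txt") {
    // Under the proposal txt: is reserved for UTF-8, so invalid input is an error.
    return { kind: "text", value: proposalUtf8.decode(data) };
  }
  if (tag === "bin") {
    return { kind: "binary", value: data };
  }
  throw new Error("unsupported format tag: " + tag);
}
```

With an explicit tag the parser no longer has to guess: text is decoded once, binary data is never validated as UTF-8, and the return type is determined by the wire data rather than by its contents.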

streamich changed the title from "Binary strings from text strings" to "Binary strings vs text strings" on Dec 11, 2023

414owen commented Jan 21, 2024

I do think this needs addressing.

Currently, strings that contain newlines have to be encoded as blob strings, so parsers/clients that care about correctness have to return a buffer of bytes, rather than a string. This seems like a huge problem to me. The two are completely different, semantically, and the blobness will percolate up to the user, who will have to decipher "( \865\176 \860\662 \865\176)\n", instead of

( ͡° ͜ʖ ͡°)
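To illustrate why a newline forces the blob form: a simple string is terminated by CRLF, so it cannot carry '\r' or '\n' in its payload. A minimal sketch of the choice an encoder has to make, with a hypothetical `encodeText` helper and the wire data shown as a JavaScript string for readability (a real encoder would write bytes):

```typescript
const wireEncoder = new TextEncoder();

// Encode a text value as a RESP reply.
// Simple strings ('+') end at the first CRLF, so they cannot contain '\r' or '\n';
// any text with a line break has to be sent as a length-prefixed bulk string ('$').
function encodeText(text: string): string {
  if (!/[\r\n]/.test(text)) {
    return "+" + text + "\r\n";
  }
  // The length prefix counts UTF-8 bytes, not UTF-16 code units.
  const byteLength = wireEncoder.encode(text).length;
  return "$" + byteLength + "\r\n" + text + "\r\n";
}
```

For example, `encodeText("OK")` stays a simple string, while `encodeText("héllo\n")` becomes a bulk string, and on the receiving side nothing marks it as text any more.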
