Lightweight and robust data encoding library for Go
Schemer provides an API to construct schemata that describe data structures; a schema is then used to encode and decode values into sequences of bytes to be sent over the network or written to a file.
Schemer seeks to be an alternative to protobuf or Avro, but it can also be used as a substitute for JSON.
- Compact binary data format
- High-speed encoding and decoding
- Forward and backward compatibility
- No code generation and no new language to learn
- Simple and lightweight library with no external dependencies
- Supports custom encoding for user-defined data types
- JavaScript library for web browser interoperability (coming soon!)
Schemer is an attempt to further simplify data encoding. Unlike other encoding libraries that use interface description languages (e.g. protobuf), schemer allows developers to construct schemata programmatically with an API. Rather than generating code from a schema, a schema can be constructed from code. In Go, schemata can be generated from Go types using the reflection library. This adds a surprising amount of flexibility and extensibility to the encoding library.
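To make the "schema from code" idea concrete, here is a minimal sketch that walks a Go type with the standard `reflect` package and produces a JSON description of it. It does not use schemer's API: the `node`, `person`, `describe`, and `describeValue` names are hypothetical, and only a few kinds are handled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"strings"
)

// node is a hypothetical schema description used only for this sketch.
type node struct {
	Name   string `json:"name,omitempty"`
	Type   string `json:"type"`
	Signed *bool  `json:"signed,omitempty"`
	Bits   int    `json:"bits,omitempty"`
	Fields []node `json:"fields,omitempty"`
}

// person is an example type to derive a description from.
type person struct {
	FirstName string
	LastName  string
	Age       uint8
}

// describe walks a Go type with reflection and builds a node for it.
// Only a handful of kinds are handled to keep the sketch short.
func describe(t reflect.Type) (node, error) {
	switch t.Kind() {
	case reflect.String:
		return node{Type: "string"}, nil
	case reflect.Bool:
		return node{Type: "bool"}, nil
	case reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
		signed := true
		return node{Type: "int", Signed: &signed, Bits: t.Bits()}, nil
	case reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64:
		signed := false
		return node{Type: "int", Signed: &signed, Bits: t.Bits()}, nil
	case reflect.Float32, reflect.Float64:
		return node{Type: "float", Bits: t.Bits()}, nil
	case reflect.Struct:
		n := node{Type: "object"}
		for i := 0; i < t.NumField(); i++ {
			f := t.Field(i)
			if !f.IsExported() {
				continue // unexported fields are skipped
			}
			child, err := describe(f.Type)
			if err != nil {
				return node{}, err
			}
			// Lower-case the first letter to get camelCase names (ASCII only).
			child.Name = strings.ToLower(f.Name[:1]) + f.Name[1:]
			n.Fields = append(n.Fields, child)
		}
		return n, nil
	}
	return node{}, fmt.Errorf("unsupported kind: %s", t.Kind())
}

// describeValue is a convenience wrapper around describe.
func describeValue(v any) (node, error) {
	return describe(reflect.TypeOf(v))
}

func main() {
	n, err := describeValue(person{})
	if err != nil {
		panic(err)
	}
	out, _ := json.MarshalIndent(n, "", "  ")
	fmt.Println(string(out))
}
```

Because the description is derived from the type itself, adding a field to `person` changes the output without any separate schema file or code generation step.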
Here's how schemer stacks up against other encoding formats:
Property | JSON | XML | MessagePack | Protobuf | Thrift | Avro | Gob | Schemer |
---|---|---|---|---|---|---|---|---|
Human-Readable | ✔️ | 😐 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
Support for Many Programming Languages | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ |
Widely Adopted | ✔️ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ |
Precise Encoding of Numbers | 😐 | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Binary Strings | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Compact Encoded Payload | ❌❌ | ❌❌ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Fast Encoding / Decoding | ❌ | ❌ | ✔️ | ✔️ | ❔ | 😐 | 😐 | ❔ |
Backward Compatibility | ✔️ | ✔️ | ✔️ | 😐 | 😐 | ✔️ | 😐 | ✔️ |
Forward Compatibility | ✔️ | ✔️ | ✔️ | 😐 | 😐 | ✔️ | 😐 | ✔️ |
No Language To Learn | ✔️ | ✔️ | ✔️ | ❌ | ❌ | 😐 | ✔️ | ✔️ |
Schema Support | 😐 | 😐 | ❔ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ |
Supports Fixed-field Objects | ❌ | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Works on Web Browser | ✔️ | ✔️ | ✔️ | ✔️ | 😐 | ✔️ | ❌ | 📆 soon… |
The table above is intended to guide the reader toward an encoding format based on their requirements, but the evaluations of these encoding formats are, of course, rather subjective. Please feel free to open an issue if you feel something should be adjusted and/or corrected.
Schemer uses type information provided by the schema to encode values. The following are all of the types that are supported:
- Integer
  - Can be signed or unsigned
  - Fixed-size or variable-size (see footnote 1)
  - Fixed-size integers can be 8, 16, 32, or 64 bits
- Floating-point number (32 or 64-bit)
- Complex number (64 or 128-bit)
- Boolean
- Enumeration
- String
  - Can support any encoding, including UTF-8 and binary
  - Fixed-size or variable-size (see footnote 2)
- Array
  - Fixed-size or variable-size
- Object w/fixed fields (i.e. struct)
- Object w/variable fields (i.e. map)
- Schema (i.e. a schemer schema)
- Dynamically-typed value (i.e. variant)
- User-defined types
  - A few common types are provided for representing timestamps, time durations, IP addresses, UUIDs, regular expressions, etc.
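The trade-off between fixed-size and variable-size integers can be illustrated with Go's standard `encoding/binary` package. This is only a sketch: it assumes Go's unsigned varint format and big-endian byte order, and schemer's actual wire format may differ. The helper names `encodeFixed64` and `encodeVarint` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeFixed64 writes v as a fixed-size 64-bit big-endian integer.
func encodeFixed64(v uint64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, v)
	return buf
}

// encodeVarint writes v using Go's unsigned varint format,
// which stores 7 bits of the value per byte.
func encodeVarint(v uint64) []byte {
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, v)
	return buf[:n]
}

func main() {
	for _, v := range []uint64{1, 300, 1 << 40, ^uint64(0)} {
		fmt.Printf("%d: fixed=%d bytes, varint=%d bytes\n",
			v, len(encodeFixed64(v)), len(encodeVarint(v)))
	}
}
```

In this format, values below 128 take a single byte, while the largest 64-bit values take ten bytes, which is more than the fixed-size form. Variable-size encoding therefore pays off when most values are small.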
Type | JSON Type Name | Additional Options |
---|---|---|
Fixed-size Integer | int | `signed`: boolean indicating whether the integer is signed or unsigned; `bits`: the size of the integer, one of 8, 16, 32, 64, 128, 256, 512, 1024. Note: integers larger than 64 bits are not fully supported |
Variable-size Integer | int | `signed`: boolean indicating whether the integer is signed or unsigned; `bits`: must be null or omitted |
Floating-point Number | float | `bits`: the size of the floating-point number, either 32 or 64 |
Complex Number | complex | `bits`: the size of the complex number, either 64 or 128 |
Boolean | bool | |
Enum | enum | `values`: an object mapping strings to integer values |
Fixed-Length String | string | `length`: the length of the string in bytes |
Variable-Length String | string | `length`: must be null or omitted |
Fixed-Length Array | array | `length`: the length of the array |
Variable-Length Array | array | `length`: must be null or omitted |
Object w/fixed fields | object | `fields`: an array of fields. Each field is a type object with keys `name` (see footnote 3), `type`, and any additional options for the type |
Object w/variable fields | object | `fields`: must be null or omitted |
Variant | variant | |
Here's a struct with three fields:
- firstName (string)
- lastName (string)
- age (uint8 - unsigned integer requiring a single byte)
{
  "type": "object",
  "fields": [
    {
      "name": "firstName",
      "type": "string"
    }, {
      "name": "lastName",
      "type": "string"
    }, {
      "name": "age",
      "type": "int",
      "signed": false,
      "bits": 8
    }
  ]
}
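As a rough illustration of how the JSON keys above map onto the type table, the following sketch reads the example schema with the standard `encoding/json` package and reports whether each field is fixed-size or variable-size. The `schemaNode` type and the `parseSchema` and `summary` helpers are hypothetical and are not part of schemer's API. Treating an omitted `signed` key as unsigned is also an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// schemaNode mirrors the JSON keys shown above. The Go type itself is
// hypothetical and only exists for this sketch.
type schemaNode struct {
	Name   string       `json:"name"`
	Type   string       `json:"type"`
	Signed bool         `json:"signed"` // assumption: omitted means unsigned
	Bits   *int         `json:"bits"`   // nil (null or omitted) means variable-size
	Length *int         `json:"length"` // nil (null or omitted) means variable-length
	Fields []schemaNode `json:"fields"`
}

const personSchema = `{
  "type": "object",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int", "signed": false, "bits": 8}
  ]
}`

// parseSchema decodes a JSON schema document into a schemaNode tree.
func parseSchema(data string) (schemaNode, error) {
	var n schemaNode
	err := json.Unmarshal([]byte(data), &n)
	return n, err
}

// summary describes an integer or string node in words.
func summary(n schemaNode) string {
	switch n.Type {
	case "int":
		sign := "unsigned"
		if n.Signed {
			sign = "signed"
		}
		if n.Bits == nil {
			return "variable-size " + sign + " integer"
		}
		return fmt.Sprintf("fixed-size %d-bit %s integer", *n.Bits, sign)
	case "string":
		if n.Length == nil {
			return "variable-length string"
		}
		return fmt.Sprintf("fixed-length string (%d bytes)", *n.Length)
	}
	return n.Type
}

func main() {
	n, err := parseSchema(personSchema)
	if err != nil {
		panic(err)
	}
	for _, f := range n.Fields {
		fmt.Printf("%s: %s\n", f.Name, summary(f))
	}
}
```

Pointer fields are used for `bits` and `length` so that "null or omitted" can be told apart from an explicit value.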
When decoding values from one type to another, schemer employs the following compatibility rules. These rules, while rather opinionated, provide safe defaults when decoding values. Users who want to carefully craft how values are decoded from one type to another can simply create a custom type.
As a general rule, types are only compatible with themselves (e.g. boolean values can only be decoded to boolean values). The table below outlines a few notable exceptions and describes how using "weak" decoding mode can increase type compatibility by sacrificing type safety and by making a few assumptions.
Source ↓ / Destination → | int | float | complex | bool | enum | string | array (see #12) | object |
---|---|---|---|---|---|---|---|---|
int | ✔️ #1 | ✔️ #1 | ✔️ #1 | ❕ #6 | ❕ #7 | ❕ #9 | ❌ | ❌ |
float | ✔️ #1 | ✔️ #1 | ✔️ #1 | ❌ | ❌ | ❕ #9 | ❌ | ❌ |
complex | ✔️ #1 | ✔️ #1 | ✔️ #1 | ❌ | ❌ | ❕ #9 | ❕ #11 | ❌ |
bool | ❕ #6 | ❌ | ❌ | ✔️ | ❌ | ❕ #10 | ❌ | ❌ |
enum | ❕ #7 | ❌ | ❌ | ❌ | ✔️ #2 | ✔️ #2 | ❌ | ❌ |
string | ❕ #8 | ❕ #8 | ❕ #8 | ❕ #10 | ✔️ #2 | ✔️ | ❌ | ❌ |
array (see #12) | ❌ | ❌ | ❕ #11 | ❌ | ❌ | ❌ | ✔️ #3 | ❌ |
object | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ #4 |
Legend:
- ✔️ - indicates compatibility according to the specified rule
- ❕ - indicates compatibility according to the specified rule only if weak decoding is used
- ❌ - indicates that the source type cannot be decoded to the destination (excepting rule #12)
1. Any number can be decoded to any other number, provided the decoded value can be stored into the destination without losing any precision. If weak decoding is specified, this restriction is loosened slightly: floating-point and complex number conversions are allowed to lose precision.

   For example, if the number `3.14` is decoded, it can be stored as a float or complex number, but it cannot be stored as an integer. Similarly, the number `500` can be stored into a `uint16` but not a `uint8`, since `uint8` can only store values between 0 and 255.
2. Enumerations are decoded to other enumerations by performing a case-insensitive match on the named value, not a match on the numeric value. If multiple matches occur, a case-sensitive match is then performed. Decoding fails if the decoded named value does not match a named value in the destination enumeration. Enumerations can also be converted to strings and vice-versa by matching on the enumeration's named value.
3. Arrays can be decoded to arrays if the element type and array length are compatible. Specifically, when the destination array is of fixed size and does not support null values, the decoded array must match exactly in length.
4. Objects are decoded to other objects by performing a case-insensitive match on the key or field name. If multiple matches occur, a case-sensitive match is then performed. When the destination is an object with fixed fields and the decoded value does not have a matching key or field name, the key / field is simply skipped and will remain unchanged.
5. Null values can only be decoded to destinations that support null values (e.g. pointers), but a non-null value can be decoded even if the destination does not support null values.
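The numeric rule (#1) and the name-matching rule (#2 and #4) can be sketched in plain Go as follows. This is not schemer's implementation. The function names are hypothetical, and failing when several case-insensitive matches exist but none matches exactly is an assumption, since the rules above do not cover that case.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// floatToInt64 converts f only if no precision is lost (rule #1).
func floatToInt64(f float64) (int64, bool) {
	if f != math.Trunc(f) || f < -(1<<63) || f >= 1<<63 {
		return 0, false // fractional, NaN, infinite, or out of range
	}
	return int64(f), true
}

// toUint8 narrows v only if it fits in 8 bits (rule #1).
func toUint8(v uint64) (uint8, bool) {
	if v > math.MaxUint8 {
		return 0, false
	}
	return uint8(v), true
}

// matchName looks up key among names (rules #2 and #4): a
// case-insensitive match is tried first, and a case-sensitive
// match breaks ties when several names match.
func matchName(names []string, key string) (string, bool) {
	var folded []string
	for _, n := range names {
		if strings.EqualFold(n, key) {
			folded = append(folded, n)
		}
	}
	switch len(folded) {
	case 0:
		return "", false
	case 1:
		return folded[0], true
	}
	for _, n := range folded {
		if n == key {
			return n, true
		}
	}
	return "", false // assumption: ambiguous without an exact match
}

func main() {
	_, ok := floatToInt64(3.14)
	fmt.Println("3.14 fits an integer:", ok)
	_, ok = toUint8(500)
	fmt.Println("500 fits a uint8:", ok)
	name, ok := matchName([]string{"firstName", "lastName"}, "FIRSTNAME")
	fmt.Println(name, ok)
}
```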
The following compatibility rules apply for weak decoding only:
6. The boolean value `true` can be converted to the integer value `1`, and the boolean value `false` can be converted to the integer value `0`. Similarly, the integer `0` will be decoded as `false`, and all other integers are decoded as `true`.
7. Enumerations can be converted to integer values and vice-versa, and they are matched on the enumeration's numeric value.
8. Strings can be decoded to numeric values by considering the string format according to the table below. The resulting numeric value is compatible with the destination according to the relevant compatibility rules.
9. Numbers are always encoded to strings in base 10.
10. Boolean values `true` and `false` are converted to string values `"true"` and `"false"` respectively. Strings `"1"`, `"t"`, `"T"`, `"TRUE"`, `"true"`, and `"True"` can be converted to the boolean value `true`. Strings `"0"`, `"f"`, `"F"`, `"FALSE"`, `"false"`, and `"False"` can be converted to the boolean value `false`.
11. Complex numbers may be converted into 2-element arrays of floating-point numbers and vice-versa. The real part of the complex number will be matched with array element 0, and the imaginary part will be matched with array element 1.
12. Single-element arrays can be decoded to a destination that is compatible with the array element and vice-versa.
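Several of the weak decoding rules above map directly onto small Go conversions. The sketch below is illustrative only and the function names are hypothetical. The set of strings accepted for booleans happens to be exactly the set accepted by Go's `strconv.ParseBool`; whether schemer relies on that function internally is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// boolToInt and intToBool follow the bool/int rule (#6).
func boolToInt(b bool) int64 {
	if b {
		return 1
	}
	return 0
}

func intToBool(i int64) bool { return i != 0 }

// intToString renders a number in base 10 (#9).
func intToString(i int64) string { return strconv.FormatInt(i, 10) }

// stringToBool and boolToString follow the bool/string rule (#10).
// strconv.ParseBool accepts 1, t, T, TRUE, true, True,
// 0, f, F, FALSE, false, False and rejects everything else.
func stringToBool(s string) (bool, error) { return strconv.ParseBool(s) }

func boolToString(b bool) string { return strconv.FormatBool(b) }

// complexToPair and pairToComplex follow the complex/array rule (#11):
// element 0 is the real part, element 1 is the imaginary part.
func complexToPair(c complex128) [2]float64 {
	return [2]float64{real(c), imag(c)}
}

func pairToComplex(p [2]float64) complex128 { return complex(p[0], p[1]) }

func main() {
	fmt.Println(boolToInt(true), intToBool(-7), intToString(500))
	b, err := stringToBool("T")
	fmt.Println(b, err)
	fmt.Println(complexToPair(complex(2.34, 2)))
}
```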
String Example | Decoded As | Regular Expression |
---|---|---|
`"-3.14"` | Number, base 10 | `^[-+]?(0`… |
`"0b1101"` | Integer, base 2 | `^[-+]?0[bB][01]+$` |
`"0775"` | Integer, base 8 | `^[-+]?0[oO]?[0-7]+$` |
`"0x2020"` | Number, base 16 | `^[-+]?0[xX][0-9A-Fa-f]+(\.[0-9A-Fa-f]*)?([pP][+-]?[0-9A-Fa-f]+)?$` |
`"2.34 + 2i"` | Complex number, base 10 | You don't want to see it, but here's the link. |
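Go's standard library recognizes most of the prefixes in the table, so the first four rows can be approximated with `strconv`. This is a sketch under stated assumptions, not schemer's parser: the `parseNumber` helper is hypothetical, and `strconv` is slightly more permissive than the regular expressions above (with base 0, `ParseInt` also accepts underscores as digit separators) and slightly stricter in one place (hexadecimal floating-point strings require a `p` exponent). Complex numbers are left out because `strconv.ParseComplex` does not accept the spaces used in the example string.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseNumber returns an int64 when s is an integer literal in base
// 2, 8, 10, or 16, and a float64 otherwise. Passing base 0 to
// ParseInt makes it infer the base from the 0b, 0o, 0x, or leading-0 prefix.
func parseNumber(s string) (any, error) {
	if i, err := strconv.ParseInt(s, 0, 64); err == nil {
		return i, nil
	}
	f, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return nil, err
	}
	return f, nil
}

func main() {
	for _, s := range []string{"-3.14", "0b1101", "0775", "0x2020"} {
		v, err := parseNumber(s)
		fmt.Printf("%q -> %v (%T), err=%v\n", s, v, v, err)
	}
}
```

The parsed value would then still have to pass the numeric compatibility rule before being stored in the destination.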
This library was created on April 14, 2021, the day of Bernie Madoff's death. What a schemer! May he rest in peace.
Special thanks to Benjamin Pritchard for his significant contributions to this library and for making it a reality.
Footnotes
1. By default, integer types are encoded as variable integers, as this format will most likely generate the smallest encoded values.
2. By default, string types are encoded as variable-size strings. Fixed-size strings are padded with trailing null bytes / zeros.
3. It is strongly encouraged to use camelCase for object field names.