Skip to content

Commit

Permalink
Merge pull request #2 from OpenJData/master
Browse files Browse the repository at this point in the history
sync to draft 1
  • Loading branch information
fangq authored May 13, 2020
2 parents a418aaf + 2da127d commit d04ab48
Show file tree
Hide file tree
Showing 2 changed files with 132 additions and 49 deletions.
175 changes: 128 additions & 47 deletions Binary_JData_Specification.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
Binary JData: An efficient interchange format for complex binary data
Binary JData: A portable interchange format for complex binary data
============================================================

- **Status of this document**: This document is current under development.
- **Status of this document**: Request for comments.
- **Maintainer**: Qianqian Fang <q.fang at neu.edu>
- **License**: Apache License, Version 2.0
- **Version**: 0.4 (Draft 1.pre)
- **Version**: 0.5 (Draft 1)
- **Abstract**:

> The Binary JData (BJData) Specification defines an efficient serialization
Expand Down Expand Up @@ -32,7 +32,63 @@ extended binary data types.
Introduction
------------------------------

`To be added`
The Javascript Object Notation (JSON) format, formally known as the ECMA-404
or ISO21778:2017 standard, is ubiquitously used in today's web and native
applications. JSON presents numerous advantages, such as human readability,
generality for accommodating complex hierarchical data, self-documentation, and
abundant existing free and commercial libraries. However, its utility is largely
restricted to the storage of lightweight textural data, and has very limited presence
in many data-intensive and performance-demanding applications, such as databases,
medical imaging, and scientific data storage.

The lack of support for strongly-typed and binary data has been one of the main
barriers towards widespread adoption of JSON in these domains. In recent years,
efforts to address these limitation have resulted in an array of versatile binary
JSON formats, such as BSON (Binary JSON, http://bson.org), UBJSON (Universal Binary
JSON, http://ubjson.org), MessagePack (https://msgpack.org), CBOR (Concise Binary
Object Representation, [RFC 7049], https://cbor.io) etc. These binary JSON
counterparts are broadly used in speed-sensitive data processing applications and
address various needs from a diverse range of applications.

To better champion findable, accessible, interoperable, and reusable
([FAIR principle](https://www.nature.com/articles/sdata201618)) data in
scientific data storage and management, we have created the **OpenJData Initiative**
(http://openjdata.org) to develop a set of open-standards for portable, human-readable
and high-performance data annotation and serialization aimed towards enabling
scientific researchers, IT engineers, as well as general data users to efficiently
annotate and store complex data structures arising from diverse applications.

The OpenJData framework first converts complex data structures, such as N-D
arrays, trees, tables and graphs, into easy-to-serialize, portable data annotations
via the **JData Specification** (https://github.com/fangq/jdata) and then serializes
and stores the annotated JData constructs using widely-supported data formats.
To balance data portability, readability and efficiency, OpenJData defines a
**dual-interface**: a text-based format **syntactically compatible with JSON**,
and a binary-JSON format to achieve significantly smaller file sizes and faster
encoding/decoding.

The Binary JData (BJData) format is the **official binary interface** for the JData
specification. It is derived from the widely supported UBJSON Specification
Draft 12 (https://github.com/ubjson/universal-binary-json), and adds native
support for **N-dimensional packed arrays** - an essential data structure for
scientific applications - as well as extended binary data types, including unsigned
integer types and half-precision floating-point numbers. The new data constructs
also allow a BJData file to store binary arrays larger than 4 GB in size, which
is not currently possible with MessagePack (maximum data record size is limited
to 4 GB) and BSON (maximum total file size is 4 GB).

A key rationale for basing the BJData format upon UBJSON as opposed to
other more popular binary JSON-like formats, such as BSON, CBOR and MessagePack,
is UBJSON's **quasi-human-readability** - a unique characteristic that is
absent from almost all other binary formats. This is because all data semantic
elements in a UBJSON/BJData file, e.g. the "name" fields and data-type markers,
are defined in human-readable strings. The resulting binary files are not only
capable of storing complex and hierarchical binary data structures, but also
directly readable using an editor with minimal or no processing. We anticipate that
such a unique capability, in combination with the highly portable JData annotation
keywords, makes a data file self-explanatory, easy to reuse, and easy to
inter-operate in complex applications.


License
------------------------------
Expand Down Expand Up @@ -64,18 +120,32 @@ data representing the actual binary data for this type of value.

### Notes
- Some values are simple enough that just writing the 1-byte ASCII marker into
the stream is enough to represent the value (e.g. null) while others have a
the stream is enough to represent the value (e.g. `null`), while others have a
type that is specific enough that no length is needed as the length is implied
by the type (e.g. `int32`) while others still require both a type and a length to
communicate their value (e.g. string). Additionally some values (e.g. `array`)
by the type (e.g. `int32`). Yet others still require both a type and a length to
communicate their value (e.g. `string`). In addition, some values (e.g. `array`)
have additional (_optional_) parameters to improve decoding efficiency and/or
to reduce size of the encoded value even further.
to reduce the size of the encoded value even further.

- The BJData specification requires that all numeric values be written in
Big-Endian order.
**Big-Endian order**.

- The `array` and `object` data types are **container** types, similar to JSON
arrays and objects. They help partition and organize data records of all types,
including container types, into composite and complex records.

- Using a [strongly-typed](#container_optimized) container construct can further
reduce data file sizes as it extracts common data type markers to the header. It
also helps a parser to pre-allocate necessary memory buffers before reading the
data payload.

In the following sections, we use a **block-notation** to illustrate the layout
of the encoded data. In this notation, the data type markers and individual
data payloads are enclosed by a pair of `[]`, strictly for illustration purposes.
Both illustration markers `[` and `]` as well as the whitespaces between these
data elements, if present, shall be ignored when performing the actual data storage.


- To store binary data, use a [strongly-typed](#container_optimized) array of
`uint8` values.

## <a name="type_summary"/>Type summary

Expand Down Expand Up @@ -108,7 +178,7 @@ Type | Total size | ASCII Marker(s) | Length required | Data (payload)
## <a name="value_types"/>Value types

### <a name="value_null"/>Null
The `null` value in is equivalent to the null value from the JSON specification.
The `null` value is equivalent to the `null` value from the JSON specification.

#### Example
In JSON:
Expand All @@ -127,9 +197,14 @@ In BJData (using block-notation):

---
### <a name="value_noop"/>No-Op
There is no equivalent to `no-op` value in the original JSON specification. When
There is no equivalent to the `no-op` value in the original JSON specification. When
decoding, No-Op values should be skipped. Also, they can only occur as elements
of a container.
of an `array` construct.

The intended usage of the `no-op` value is as a valueless signal between a
producer (most likely a server) and a consumer (most likely a client) to indicate
activity, for example, as a keep-alive signal so that a client knows a server is
still working and hasn't hung or timed out.

---
### <a name="value_bool"/>Boolean
Expand All @@ -155,7 +230,7 @@ In BJData (using block-notation):

---
### <a name="value_numeric"/>Numeric
Unlike in JSON whith has a single _Number_ type (used for both integers and
Unlike in JSON, which has a single _Number_ type (used for both integers and
floating point numbers), BJData defines multiple types for integers. The
minimum/maximum of values (inclusive) for each integer type are as follows:

Expand All @@ -169,7 +244,7 @@ int32 | No | -2,147,483,648 | 2,147,483,647
uint32| Yes | 0 | 4,294,967,295
int64 | No | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807
uint64| Yes | 0 | 18,446,744,073,709,551,615
float16/half | Yes | See [IEEE 754 Spec](http://en.wikipedia.org/wiki/IEEE_754-1985) | See [IEEE 754 Spec](https://en.wikipedia.org/wiki/IEEE_754-1985)
float16/half | Yes | See [IEEE 754 Spec](https://en.wikipedia.org/wiki/IEEE_754-2008_revision) | See [IEEE 754 Spec](https://en.wikipedia.org/wiki/IEEE_754-2008_revision)
float32/single | Yes | See [IEEE 754 Spec](http://en.wikipedia.org/wiki/IEEE_754-1985) | See [IEEE 754 Spec](https://en.wikipedia.org/wiki/IEEE_754-1985)
float64/double | Yes | See [IEEE 754 Spec](http://en.wikipedia.org/wiki/IEEE_754-1985) | See [IEEE 754 Spec](https://en.wikipedia.org/wiki/IEEE_754-1985)
high-precision number | Yes | Infinite | Infinite
Expand All @@ -181,11 +256,12 @@ and infinity are converted to [`null`](#value_null).
- It is advisable to use the smallest applicable type when encoding a number.

#### Integer
All integer types are written in **Big-Endian order**.
All integer types (`uint8`, `int8`, `uint16`, `int16`, `uint32`, `int32`, `uint64` and
`int64`) are written in **Big-Endian order**.

#### Float
- `float16` or half-precision values are written in [IEEE 754 single precision floating point
format](http://en.wikipedia.org/wiki/IEEE_754-1985), which has the following
- `float16` or half-precision values are written in [IEEE 754 half precision floating point
format](https://en.wikipedia.org/wiki/IEEE_754-2008_revision), which has the following
structure:
- Bit 15 (1 bit) - sign
- Bit 14-10 (5 bits) - exponent
Expand Down Expand Up @@ -270,7 +346,7 @@ BJData (using block-notation):
---
### <a name="value_string"/>String
The `string` type in BJData is equivalent to the `string` type from the JSON
specification apart from that the BJData string value **requires** UTF-8
specification apart from the fact that BJData string value **requires** UTF-8
encoding.

#### Example
Expand All @@ -296,7 +372,9 @@ See also [optimized format](#container_optimized) below.

### <a name="container_array"/>Array
The `array` type in BJData is equivalent to the `array` type from the JSON
specification.
specification.

The child elements of an array are ordered and can be accessed by their indices.

#### Example
Array in JSON:
Expand Down Expand Up @@ -329,6 +407,8 @@ The `object` type in BJData is equivalent to the `object` type from the JSON
specification. Since value names can only be strings, the *S* (string) marker
**must not** be included since it is redundant.

The child elements of an object are ordered and can be accessed by their names.

#### Example

Object in JSON:
Expand Down Expand Up @@ -361,13 +441,14 @@ help optimize the container for better parsing performance and smaller size.

### Type - *$*
When a _type_ is specified, all value types stored in the container (either
array or object) are considered to be of that singular _type_ and as a result,
array or object) are considered to be of that singular _type_ and, as a result,
_type_ markers are omitted for each value within the container. This can be
thought of providing the ability to create a strongly-typed container in BJData.
thought of as providing the ability to create a strongly-typed container in BJData.

- If a _type_ is specified, it **must** be done so before a _count_.
- If a _type_ is specified, a _count_ **must** be specified as well. (Otherwise
it is impossible to tell when a container is ending, e.g. did you just parse
*]* or the int8 value of 93?)
*]* or the `int8` value of 93?)

#### Example (string type):
```
Expand All @@ -376,42 +457,42 @@ it is impossible to tell when a container is ending, e.g. did you just parse

---
### Count - *\#*
When a _count_ is followed by a single non-nagative integer record, i.e. one of
When a _count_ is followed by a single non-negative integer record, i.e. one of
`i,U,I,u,l,m,L,M`, it specifies the total child element count. This allows the
parser to pre-size any internal construct used for parsing, verify that the
promised number of child values were found and avoid scanning for any terminating
promised number of child values were found, and avoid scanning for any terminating
bytes while parsing.

- A _count_ can be specified without a type.
- A _count_ can be specified without a _type_.

#### Example (count of 64):
```
[#][i][64]
```

### Optimized N-dimensional array
### Optimized N-dimensional array of uniform type
When both _type_ and _count_ are specified and the _count_ marker `#` is followed
by `[`, the parser should expect the following sequence to be a 1-D `array` with
zero or more (`ndim`) integer elements (`nx, ny, nz, ...`). This specifies an
`ndim`-dimensional array of uniform type specified by the _type_ marker after `$`.
zero or more (`Ndim`) integer elements (`Nx, Ny, Nz, ...`). This specifies an
`Ndim`-dimensional array of uniform type specified by the _type_ marker after `$`.
The array data are serialized in the **row-major format**.

For example, the below two block sequences both represent a `nx*ny*nz*...` array of
For example, the below two block sequences both represent an `Nx*Ny*Nz*...` array of
uniform numeric type:

```
[[] [$] [type] [#] [[] [$] [nx type] [#] [ndim type] [ndim] [nx ny nz ...] [nx*ny*nz*...*sizeof(type)]
[[] [$] [type] [#] [[] [$] [Nx type] [#] [Ndim type] [Ndim] [Nx Ny Nz ...] [Nx*Ny*Nz*...*sizeof(type)]
or
[[] [$] [type] [#] [[] [nx type] [nx] [ny type] [ny] [nz type] [nz] ... []] [nx*ny*nz*...*sizeof(type)]
[[] [$] [type] [#] [[] [Nx type] [nx] [Ny type] [Ny] [Nz type] [Nz] ... []] [Nx*Ny*Nz*...*sizeof(type)]
```
where `ndim` is the number of dimensions, and `nx`, `ny`, and `nz` ... are
where `Ndim` is the number of dimensions, and `Nx`, `Ny`, and `Nz` ... are
all non-negative numbers specifying the dimensions of the N-dimensional array.
`nz/ny/nz/ndim` types must be one of the integer types (`i,U,I,u,l,m,L,M`).
The binary data of the N-dimensional array is then serialized in the **row-major**
format (similar to C, C++, Javascript or Python) order.
`Nz/Ny/Nz/Ndim` types must be one of the integer types (`i,U,I,u,l,m,L,M`).
The binary data of the N-dimensional array is then serialized into a 1-D vector
in the **row-major** element order (similar to C, C++, Javascript or Python) .


#### Example (a 2x3x4 uint8 array):
#### Example (a 2x3x4 `uint8` array):
The following 2x3x4 3-D `uint8` array
```
[
Expand Down Expand Up @@ -440,7 +521,7 @@ shall be stored as
- If a _count_ is specified the container **must not** specify an end-marker.
- A container that specifies a _count_ **must** contain the specified number of
child elements.
- If a _type_ is specified, it **must** be done so before count.
- If a _type_ is specified, it **must** be done so before _count_.
- If a _type_ is specified, a _count_ **must** also be specified. A _type_
cannot be specified by itself.
- A container that specifies a _type_ **must not** contain any additional
Expand Down Expand Up @@ -489,12 +570,12 @@ Optimized with both _type_ and _count_
```

---
### Special case: Marker-only types (null, no-op & boolean)
If using both _count_ and _type_ optimisations, the marker itself represent the
value thus saving repetition (since these types to not have a payload).
Additional requirements are:
### Special case: Marker-only types (`null`, `no-op` & Boolean)
If using both _count_ and _type_ optimizations, the marker itself represents the
value, thus saving repetition (since these types do not have a payload).
Additional examples are:

Strongly typed array of type `true` (boolean) and with a _count_ of 512:
Strongly typed array of type `true` (Boolean) and with a _count_ of 512:
```
[[][$][T][#][I][512]
```
Expand All @@ -510,8 +591,8 @@ Strongly typed object of type `null` and with a _count_ of 3:
Recommended File Specifiers
------------------------------

For binary JData files, the recommended file suffix is **`".bjd"`**.
The MIME type for a binary JData document is **`"application/jdata-binary"`**
For Binary JData files, the recommended file suffix is **`".bjd"`**.
The MIME type for a Binary JData document is **`"application/jdata-binary"`**

Acknowledgement
------------------------------
Expand All @@ -520,5 +601,5 @@ The BJData spec is derived from the Universal Binary JSON (UBJSON, http://ubjson
specification (Draft 12) developed by Riyad Kalla and other UBJSON contributors.

The initial version of this MarkDown-formatted specification was derived from the
documentation included in the [Py-ubjson](https://github.com/Iotic-Labs/py-ubjson/blob/dev-contrib/UBJSON-Specification.md)
documentation included in the [Py-UBJSON](https://github.com/Iotic-Labs/py-ubjson/blob/dev-contrib/UBJSON-Specification.md)
repository (Commit 5ce1fe7).
6 changes: 4 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ for convenient editing and version control.
This specification was based upon the Universal Binary JSON (UBJSON) Specification
[Draft 12](https://github.com/ubjson/universal-binary-json/tree/master/spec12)
developed by Riyad Kalla and other UBJSON contributors. The MarkDown version
of this document was derived from the documentation included in the
of this specification was derived from the documentation included in the
[Py-ubjson](https://github.com/Iotic-Labs/py-ubjson/blob/dev-contrib/UBJSON-Specification.md)
repository (Commit 5ce1fe7).

Expand All @@ -30,11 +30,13 @@ draft of this file specification, including
question, report a bug, provide a feature request, or simply propose
general discussions. Please use URLs or keywords to link your discussion
to a specific line/section/topic in the document.
- Write short comments on Request for Comments (RFC) commits
- [Write short comments on Request for Comments (RFC) commits](https://github.com/OpenJData/bjdata/commit/2f352210f928af9efd89f47c845d8e68c5b9bbc2)
- A milestone version of the specification will be associated with an
RFC (Request for comments) commit (where the entire file is removed
and re-added so that every line appears in such commit). One can
write short comments as well as post replies on this RFC page.
- The latest RFC commit is based on **Draft 1 (version v0.5)**. Please use
[this link](https://github.com/OpenJData/bjdata/commit/2f352210f928af9efd89f47c845d8e68c5b9bbc2) to comment.
- To add a comment, you need to first register a github account, and then
browse the above RFC page. When hovering your cursor over each line, a
"plus" icon is displayed, clicking it will allow one to comment on a
Expand Down

0 comments on commit d04ab48

Please sign in to comment.