Merge pull request #2 from OpenJData/master

sync to draft 1
NeuroJSON · May 13, 2020 · d04ab48 · d04ab48
2 parents a418aaf + 2da127d
commit d04ab48
Show file tree

Hide file tree

Showing 2 changed files with 132 additions and 49 deletions.
diff --git a/Binary_JData_Specification.md b/Binary_JData_Specification.md
@@ -1,10 +1,10 @@
-Binary JData: An efficient interchange format for complex binary data
+Binary JData: A portable interchange format for complex binary data
 ============================================================
 
-- **Status of this document**: This document is current under development.
+- **Status of this document**: Request for comments.
 - **Maintainer**: Qianqian Fang <q.fang at neu.edu>
 - **License**: Apache License, Version 2.0
-- **Version**: 0.4 (Draft 1.pre)
+- **Version**: 0.5 (Draft 1)
 - **Abstract**:
 
 > The Binary JData (BJData) Specification defines an efficient serialization 
@@ -32,7 +32,63 @@ extended binary data types.
 Introduction
 ------------------------------
 
-`To be added`
+The Javascript Object Notation (JSON) format, formally known as the ECMA-404 
+or ISO21778:2017 standard, is ubiquitously used in today's web and native 
+applications. JSON presents numerous advantages, such as human readability, 
+generality for accommodating complex hierarchical data, self-documentation, and 
+abundant existing free and commercial libraries. However, its utility is largely 
+restricted to the storage of lightweight textural data, and has very limited presence 
+in many data-intensive and performance-demanding applications, such as databases,
+medical imaging, and scientific data storage.
+
+The lack of support for strongly-typed and binary data has been one of the main 
+barriers towards widespread adoption of JSON in these domains. In recent years, 
+efforts to address these limitation have resulted in an array of versatile binary 
+JSON formats, such as BSON (Binary JSON, http://bson.org), UBJSON (Universal Binary 
+JSON, http://ubjson.org), MessagePack (https://msgpack.org), CBOR (Concise Binary 
+Object Representation, [RFC 7049], https://cbor.io) etc. These binary JSON 
+counterparts are broadly used in speed-sensitive data processing applications and
+address various needs from a diverse range of applications.
+
+To better champion findable, accessible, interoperable, and reusable 
+([FAIR principle](https://www.nature.com/articles/sdata201618)) data in 
+scientific data storage and management, we have created the **OpenJData Initiative**
+(http://openjdata.org) to develop a set of open-standards for portable, human-readable 
+and high-performance data annotation and serialization aimed towards enabling
+scientific researchers, IT engineers, as well as general data users to efficiently 
+annotate and store complex data structures arising from diverse applications.
+
+The OpenJData framework first converts complex data structures, such as N-D
+arrays, trees, tables and graphs, into easy-to-serialize, portable data annotations
+via the **JData Specification** (https://github.com/fangq/jdata) and then serializes 
+and stores the annotated JData constructs using widely-supported data formats. 
+To balance data portability, readability and efficiency, OpenJData defines a 
+**dual-interface**: a text-based format **syntactically compatible with JSON**,
+and a binary-JSON format to achieve significantly smaller file sizes and faster 
+encoding/decoding.
+
+The Binary JData (BJData) format is the **official binary interface** for the JData 
+specification. It is derived from the widely supported UBJSON Specification 
+Draft 12 (https://github.com/ubjson/universal-binary-json), and adds native
+support for **N-dimensional packed arrays** - an essential data structure for
+scientific applications - as well as extended binary data types, including unsigned
+integer types and half-precision floating-point numbers. The new data constructs
+also allow a BJData file to store binary arrays larger than 4 GB in size, which
+is not currently possible with MessagePack (maximum data record size is limited
+to 4 GB) and BSON (maximum total file size is 4 GB).
+
+A key rationale for basing  the BJData format upon UBJSON as opposed to 
+other more popular binary JSON-like formats, such as BSON, CBOR and MessagePack, 
+is UBJSON's **quasi-human-readability** - a unique characteristic that is 
+absent from almost all other binary formats. This is because all data semantic 
+elements in a UBJSON/BJData file, e.g. the "name" fields and data-type markers, 
+are defined in human-readable strings. The resulting binary files are not only
+capable of storing complex and hierarchical binary data structures, but also 
+directly readable using an editor with minimal or no processing. We anticipate that
+such a unique capability, in combination with the highly portable JData annotation 
+keywords, makes a data file self-explanatory, easy to reuse, and easy to 
+inter-operate in complex applications.
+
 
 License
 ------------------------------
@@ -64,18 +120,32 @@ data representing the actual binary data for this type of value.
 
 ### Notes
 - Some values are simple enough that just writing the 1-byte ASCII marker into 
-the stream is enough to represent the value (e.g. null) while others have a 
+the stream is enough to represent the value (e.g. `null`), while others have a 
 type that is specific enough that no length is needed as the length is implied 
-by the type (e.g. `int32`) while others still require both a type and a length to 
-communicate their value (e.g. string). Additionally some values (e.g. `array`) 
+by the type (e.g. `int32`). Yet others still require both a type and a length to 
+communicate their value (e.g. `string`). In addition, some values (e.g. `array`) 
 have additional (_optional_) parameters to improve decoding efficiency and/or 
-to reduce size of the encoded value even further.
+to reduce the size of the encoded value even further.
 
 - The BJData specification requires that all numeric values be written in 
-Big-Endian order.
+**Big-Endian order**.
+
+- The `array` and `object` data types are **container** types, similar to JSON
+arrays and objects. They help partition and organize data records of all types, 
+including container types, into composite and complex records.
+
+- Using a [strongly-typed](#container_optimized) container construct can further 
+reduce data file sizes as it extracts common data type markers to the header. It
+also helps a parser to pre-allocate necessary memory buffers before reading the 
+data payload.
+
+In the following sections, we use a **block-notation** to illustrate the layout
+of the encoded data. In this notation, the data type markers and individual 
+data payloads are enclosed by a pair of `[]`, strictly for illustration purposes.
+Both illustration markers `[` and `]` as well as the whitespaces between these 
+data elements, if present, shall be ignored when performing the actual data storage.
+
 
-- To store binary data, use a [strongly-typed](#container_optimized) array of 
-`uint8` values.
 
 ## <a name="type_summary"/>Type summary
 
@@ -108,7 +178,7 @@ Type | Total size | ASCII Marker(s) | Length required | Data (payload)
 ## <a name="value_types"/>Value types
 
 ### <a name="value_null"/>Null
-The `null` value in is equivalent to the null value from the JSON specification.
+The `null` value is equivalent to the `null` value from the JSON specification.
 
 #### Example
 In JSON:
@@ -127,9 +197,14 @@ In BJData (using block-notation):
 
 ---
 ### <a name="value_noop"/>No-Op
-There is no equivalent to `no-op` value in the original JSON specification. When 
+There is no equivalent to the `no-op` value in the original JSON specification. When 
 decoding, No-Op values should be skipped. Also, they can only occur as elements 
-of a container.
+of an `array` construct.
+
+The intended usage of the `no-op` value is as a valueless signal between a 
+producer (most likely a server) and a consumer (most likely a client) to indicate 
+activity, for example, as a keep-alive signal so that a client knows a server is 
+still working and hasn't hung or timed out.
 
 ---
 ### <a name="value_bool"/>Boolean
@@ -155,7 +230,7 @@ In BJData (using block-notation):
 
 ---
 ### <a name="value_numeric"/>Numeric
-Unlike in JSON whith has a single _Number_ type (used for both integers and 
+Unlike in JSON, which has a single _Number_ type (used for both integers and 
 floating point numbers), BJData defines multiple types for integers. The 
 minimum/maximum of values (inclusive) for each integer type are as follows:
 
@@ -169,7 +244,7 @@ int32 | No | -2,147,483,648 | 2,147,483,647
 uint32| Yes | 0 | 4,294,967,295
 int64 | No | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807
 uint64| Yes | 0 | 18,446,744,073,709,551,615
-float16/half | Yes | See [IEEE 754 Spec](http://en.wikipedia.org/wiki/IEEE_754-1985) | See [IEEE 754 Spec](https://en.wikipedia.org/wiki/IEEE_754-1985)
+float16/half | Yes | See [IEEE 754 Spec](https://en.wikipedia.org/wiki/IEEE_754-2008_revision) | See [IEEE 754 Spec](https://en.wikipedia.org/wiki/IEEE_754-2008_revision)
 float32/single | Yes | See [IEEE 754 Spec](http://en.wikipedia.org/wiki/IEEE_754-1985) | See [IEEE 754 Spec](https://en.wikipedia.org/wiki/IEEE_754-1985)
 float64/double | Yes | See [IEEE 754 Spec](http://en.wikipedia.org/wiki/IEEE_754-1985) | See [IEEE 754 Spec](https://en.wikipedia.org/wiki/IEEE_754-1985)
 high-precision number | Yes | Infinite | Infinite
@@ -181,11 +256,12 @@ and infinity are converted to [`null`](#value_null).
 - It is advisable to use the smallest applicable type when encoding a number.
 
 #### Integer
-All integer types are written in **Big-Endian order**.
+All integer types (`uint8`, `int8`, `uint16`, `int16`, `uint32`, `int32`, `uint64` and 
+`int64`) are written in **Big-Endian order**.
 
 #### Float
-- `float16` or half-precision values are written in [IEEE 754 single precision floating point 
-format](http://en.wikipedia.org/wiki/IEEE_754-1985), which has the following 
+- `float16` or half-precision values are written in [IEEE 754 half precision floating point 
+format](https://en.wikipedia.org/wiki/IEEE_754-2008_revision), which has the following 
 structure:
   - Bit 15 (1 bit) - sign
   - Bit 14-10 (5 bits) - exponent
@@ -270,7 +346,7 @@ BJData (using block-notation):
 ---
 ### <a name="value_string"/>String
 The `string` type in BJData is equivalent to the `string` type from the JSON 
-specification apart from that the BJData string value **requires** UTF-8 
+specification apart from the fact that BJData string value **requires** UTF-8 
 encoding.
 
 #### Example
@@ -296,7 +372,9 @@ See also [optimized format](#container_optimized) below.
 
 ### <a name="container_array"/>Array
 The `array` type in BJData is equivalent to the `array` type from the JSON 
-specification.
+specification. 
+
+The child elements of an array are ordered and can be accessed by their indices.
 
 #### Example
 Array in JSON:
@@ -329,6 +407,8 @@ The `object` type in BJData is equivalent to the `object` type from the JSON
 specification. Since value names can only be strings, the *S* (string) marker 
 **must not** be included since it is redundant.
 
+The child elements of an object are ordered and can be accessed by their names.
+
 #### Example
 
 Object in JSON:
@@ -361,13 +441,14 @@ help optimize the container for better parsing performance and smaller size.
 
 ### Type - *$*
 When a _type_ is specified, all value types stored in the container (either 
-array or object) are considered to be of that singular _type_ and as a result, 
+array or object) are considered to be of that singular _type_ and, as a result, 
 _type_ markers are omitted for each value within the container. This can be 
-thought of providing the ability to create a strongly-typed container in BJData.
+thought of as providing the ability to create a strongly-typed container in BJData.
+
 - If a _type_ is specified, it **must** be done so before a _count_.
 - If a _type_ is specified, a _count_ **must** be specified as well. (Otherwise 
 it is impossible to tell when a container is ending, e.g. did you just parse 
-*]* or the int8 value of 93?)
+*]* or the `int8` value of 93?)
 
 #### Example (string type):
 ```
@@ -376,42 +457,42 @@ it is impossible to tell when a container is ending, e.g. did you just parse
 
 ---
 ### Count - *\#*
-When a _count_ is followed by a single non-nagative integer record, i.e. one of
+When a _count_ is followed by a single non-negative integer record, i.e. one of
 `i,U,I,u,l,m,L,M`, it specifies the total child element count. This allows the 
 parser to pre-size any internal construct used for parsing, verify that the 
-promised number of child values were found and avoid scanning for any terminating 
+promised number of child values were found, and avoid scanning for any terminating 
 bytes while parsing. 
 
-- A _count_ can be specified without a type.
+- A _count_ can be specified without a _type_.
 
 #### Example (count of 64):
 ```
 [#][i][64]
 ```
 
-### Optimized N-dimensional array
+### Optimized N-dimensional array of uniform type
 When both _type_ and _count_ are specified and the _count_ marker `#` is followed 
 by `[`, the parser should expect the following sequence to be a 1-D `array` with 
-zero or more (`ndim`) integer elements (`nx, ny, nz, ...`). This specifies an 
-`ndim`-dimensional array of uniform type specified by the _type_ marker after `$`. 
+zero or more (`Ndim`) integer elements (`Nx, Ny, Nz, ...`). This specifies an 
+`Ndim`-dimensional array of uniform type specified by the _type_ marker after `$`. 
 The array data are serialized in the **row-major format**.
 
-For example, the below two block sequences both represent a `nx*ny*nz*...` array of
+For example, the below two block sequences both represent an `Nx*Ny*Nz*...` array of
 uniform numeric type:
 
 ```
-[[] [$] [type] [#] [[] [$] [nx type] [#] [ndim type] [ndim] [nx ny nz ...]  [nx*ny*nz*...*sizeof(type)]
+[[] [$] [type] [#] [[] [$] [Nx type] [#] [Ndim type] [Ndim] [Nx Ny Nz ...]  [Nx*Ny*Nz*...*sizeof(type)]
   or
-[[] [$] [type] [#] [[] [nx type] [nx] [ny type] [ny] [nz type] [nz] ... []] [nx*ny*nz*...*sizeof(type)]
+[[] [$] [type] [#] [[] [Nx type] [nx] [Ny type] [Ny] [Nz type] [Nz] ... []] [Nx*Ny*Nz*...*sizeof(type)]
 ```
-where `ndim` is the number of dimensions, and `nx`, `ny`, and `nz` ... are 
+where `Ndim` is the number of dimensions, and `Nx`, `Ny`, and `Nz` ... are 
 all non-negative numbers specifying the dimensions of the N-dimensional array.
-`nz/ny/nz/ndim` types must be one of the integer types (`i,U,I,u,l,m,L,M`). 
-The binary data of the N-dimensional array is then serialized in the **row-major** 
-format (similar to C, C++, Javascript or Python) order.
+`Nz/Ny/Nz/Ndim` types must be one of the integer types (`i,U,I,u,l,m,L,M`). 
+The binary data of the N-dimensional array is then serialized into a 1-D vector
+in the **row-major** element order (similar to C, C++, Javascript or Python) .
 
 
-#### Example (a 2x3x4 uint8 array):
+#### Example (a 2x3x4 `uint8` array):
 The following 2x3x4 3-D `uint8` array 
 ```
 [
@@ -440,7 +521,7 @@ shall be stored as
 - If a _count_ is specified the container **must not** specify an end-marker.
 - A container that specifies a _count_ **must** contain the specified number of 
 child elements.
-- If a _type_ is specified, it **must** be done so before count.
+- If a _type_ is specified, it **must** be done so before _count_.
 - If a _type_ is specified, a _count_ **must** also be specified. A _type_ 
 cannot be specified by itself.
 - A container that specifies a _type_ **must not** contain any additional 
@@ -489,12 +570,12 @@ Optimized with both _type_ and _count_
 ```
 
 ---
-### Special case: Marker-only types (null, no-op & boolean)
-If using both _count_ and _type_ optimisations, the marker itself represent the 
-value thus saving repetition (since these types to not have a payload). 
-Additional requirements are:
+### Special case: Marker-only types (`null`, `no-op` & Boolean)
+If using both _count_ and _type_ optimizations, the marker itself represents the 
+value, thus saving repetition (since these types do not have a payload). 
+Additional examples are:
 
-Strongly typed array of type `true` (boolean) and with a _count_ of 512:
+Strongly typed array of type `true` (Boolean) and with a _count_ of 512:
 ```
 [[][$][T][#][I][512]
 ```
@@ -510,8 +591,8 @@ Strongly typed object of type `null` and with a _count_ of 3:
 Recommended File Specifiers
 ------------------------------
 
-For binary JData files, the recommended file suffix is **`".bjd"`**.
-The MIME type for a binary JData document is **`"application/jdata-binary"`**
+For Binary JData files, the recommended file suffix is **`".bjd"`**.
+The MIME type for a Binary JData document is **`"application/jdata-binary"`**
 
 Acknowledgement
 ------------------------------
@@ -520,5 +601,5 @@ The BJData spec is derived from the Universal Binary JSON (UBJSON, http://ubjson
 specification (Draft 12) developed by Riyad Kalla and other UBJSON contributors.
 
 The initial version of this MarkDown-formatted specification was derived from the 
-documentation included in the [Py-ubjson](https://github.com/Iotic-Labs/py-ubjson/blob/dev-contrib/UBJSON-Specification.md) 
+documentation included in the [Py-UBJSON](https://github.com/Iotic-Labs/py-ubjson/blob/dev-contrib/UBJSON-Specification.md) 
 repository (Commit 5ce1fe7).
diff --git a/README.md b/README.md
@@ -15,7 +15,7 @@ for convenient editing and version control.
 This specification was based upon the Universal Binary JSON (UBJSON) Specification
 [Draft 12](https://github.com/ubjson/universal-binary-json/tree/master/spec12)
 developed by Riyad Kalla and other UBJSON contributors. The MarkDown version 
-of this document was derived from the documentation included in the 
+of this specification was derived from the documentation included in the 
 [Py-ubjson](https://github.com/Iotic-Labs/py-ubjson/blob/dev-contrib/UBJSON-Specification.md)
 repository (Commit 5ce1fe7).
 
@@ -30,11 +30,13 @@ draft of this file specification, including
     question, report a bug, provide a feature request, or simply propose
     general discussions. Please use URLs or keywords to link your discussion 
     to a specific line/section/topic in the document.
-- Write short comments on Request for Comments (RFC) commits
+- [Write short comments on Request for Comments (RFC) commits](https://github.com/OpenJData/bjdata/commit/2f352210f928af9efd89f47c845d8e68c5b9bbc2)
   - A milestone version of the specification will be associated with an
     RFC (Request for comments) commit (where the entire file is removed
     and re-added so that every line appears in such commit). One can
     write short comments as well as post replies on this RFC page. 
+  - The latest RFC commit is based on **Draft 1 (version v0.5)**. Please use
+    [this link](https://github.com/OpenJData/bjdata/commit/2f352210f928af9efd89f47c845d8e68c5b9bbc2) to comment.
   - To add a comment, you need to first register a github account, and then 
     browse the above RFC page. When hovering your cursor over each line, a 
     "plus" icon is displayed, clicking it will allow one to comment on a