1.0 (#39)

akshaisarma authored Feb 18, 2021
1 parent 26b3f28 commit c59dc38
Showing 48 changed files with 680 additions and 628 deletions.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-bql/1.0.0/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-bql/1.1.0/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-core/1.0.0/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-core/1.1.0/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-core/1.2.0/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-dsl/1.0.0/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-dsl/1.0.1/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-dsl/1.1.0/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-kafka/1.0.0/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-kafka/1.0.1/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-pulsar/1.0.0/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-record/1.0.0/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-record/1.1.0/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-service/1.0.0/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-spark/1.0.0/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
1 change: 1 addition & 0 deletions docs/apidocs/bullet-storm/1.0.0/index.html
@@ -0,0 +1 @@
Replace me with the real documentation.
82 changes: 44 additions & 38 deletions docs/backend/dsl.md
@@ -5,17 +5,14 @@ Bullet DSL is a configuration-based DSL that allows users to plug their data into the Bullet Backend
To support this, Bullet DSL provides two major components. The first is for reading data from a pluggable data source (the *connectors* for talking to various data sources), and the second is for converting data (the *converters* for understanding your data formats) into [BulletRecords](ingestion.md).
By enabling Bullet DSL in the Backend and configuring Bullet DSL, your backend will use the two components to read from the configured data source and convert the data into BulletRecords, without you having to write any code.

The three interfaces that the DSL uses are:
There is also an optional minor component that acts as the glue between the connectors and the converters. These are the *deserializers*. They exist for when the data coming out of a connector is in a format that a converter cannot understand. Typically, this happens with serialized data that needs to be deserialized before a converter can make sense of it.

1. The **BulletConnector** : Bullet DSL's reading component
2. The **BulletRecordConverter** : Bullet DSL's converting component
3. The **Bullet Backend** : The implementation of Bullet on a Stream Processor

There is also an optional BulletDeserializer component that sits between the Connector and the Converter to deserialize the data.
The four interfaces that the DSL uses are:

!!!note

For the Backend, please refer to the DSL-specific Bullet Storm setup [here](storm-setup.md#using-bullet-dsl). (Currently, only Bullet Storm supports Bullet DSL.)
1. The **BulletConnector** : Bullet DSL's reading component
2. The **BulletDeserializer** : Bullet DSL's optional deserializing component
3. The **BulletRecordConverter** : Bullet DSL's converting component
4. The **Bullet Backend** : The implementation of Bullet on a Stream Processor

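Wired together, a Bullet DSL configuration might look roughly like the sketch below. The class names here are illustrative choices only, and the setting names assume the standard Bullet DSL configuration keys (each component and its actual settings are covered in the sections that follow):

```yaml
# A minimal sketch: read from a data source with a connector, pass the raw
# output through a deserializer (here the identity no-op), and convert the
# result into BulletRecords with a converter. All class names are examples.
bullet.dsl.connector.class.name: "com.yahoo.bullet.dsl.connector.KafkaConnector"
bullet.dsl.deserializer.class.name: "com.yahoo.bullet.dsl.deserializer.IdentityDeserializer"
bullet.dsl.converter.class.name: "com.yahoo.bullet.dsl.converter.MapBulletRecordConverter"
```
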
## BulletConnector

@@ -135,6 +132,10 @@ bullet.dsl.converter.pojo.class.name: "com.your.package.YourPOJO"

The MapBulletRecordConverter is used to convert Java Maps of Objects into BulletRecords. Without a schema, it simply inserts every entry in the Map into a BulletRecord without any type-checking. If the Map contains objects that are not types supported by the BulletRecord, you might have issues when serializing the record.

### JSONBulletRecordConverter

The JSONBulletRecordConverter is used to convert JSON string representations of records into BulletRecords. Without a schema, it simply inserts every entry in the JSON object into a BulletRecord without any type-checking, and it uses only the Double type for all numeric values (since it cannot guess whether records might need a wider type). You should use a schema and specify the appropriate types if you want more specific numeric types for the fields in your record. If the JSON contains objects of types not supported by the BulletRecord, you might have issues when serializing the record.

### AvroBulletRecordConverter

The AvroBulletRecordConverter is used to convert Avro records into BulletRecords. Without a schema, it inserts every field into a BulletRecord without any type-checking. With a schema, you get type-checking, and you can also specify a RECORD field, in which case the converter will accept Avro Records in addition to Maps, flattening them into the BulletRecord.
@@ -146,16 +147,14 @@ The schema consists of a list of fields each described by a name, reference, type
1. `name` : The name of the field in the BulletRecord
2. `reference` : The field to extract from the to-be-converted object
3. `type` : The type of the field
4. `subtype` : The subtype of any nested fields in this field (if any)


When using the schema:

1. The `name` of the field in the schema will be the name of the field in the BulletRecord.
2. The `reference` of the field in the schema is the field/value to be extracted from an object when it is converted to a BulletRecord.
3. If the `reference` is null, it is assumed that the `name` and the `reference` are the same.
4. The `type` must be specified and will be used for type-checking.
5. The `subtype` must be specified for certain `type` values (`LIST`, `LISTOFMAP`, `MAP`, or `MAPOFMAP`). Otherwise, it must be null.
4. The `type` must be specified and can be used for type-checking. If you provide a schema and enable the `bullet.dsl.converter.schema.type.check.enable` setting, the converter will validate that the types in the source data match the types given here. Otherwise, the given types are assumed without checking, which is useful when you are first trying out the DSL and are not yet sure of the types. (A sketch of these settings follows this list.)

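For instance, the schema file and type-checking might be configured as below (a minimal sketch, assuming the standard Bullet DSL setting names; the schema path is a placeholder):

```yaml
# Point the converter at your schema file (placeholder path) and turn on
# type-checking so source data is validated against the declared types.
bullet.dsl.converter.schema.file: "/path/to/your/schema.json"
bullet.dsl.converter.schema.type.check.enable: true
```
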
#### Types

@@ -165,24 +164,34 @@ When using the schema:
4. FLOAT
5. DOUBLE
6. STRING
7. LIST
8. LISTOFMAP
9. MAP
10. MAPOFMAP
11. RECORD

#### Subtypes

1. BOOLEAN
2. INTEGER
3. LONG
4. FLOAT
5. DOUBLE
6. STRING

!!!note "RECORD"

For RECORD type, you should normally reference a Map. For each key-value pair in the Map, a field will be inserted into the BulletRecord. Hence, the name in a RECORD field is left empty.
7. BOOLEAN_MAP
8. INTEGER_MAP
9. LONG_MAP
10. FLOAT_MAP
11. DOUBLE_MAP
12. STRING_MAP
13. BOOLEAN_MAP_MAP
14. INTEGER_MAP_MAP
15. LONG_MAP_MAP
16. FLOAT_MAP_MAP
17. DOUBLE_MAP_MAP
18. STRING_MAP_MAP
19. BOOLEAN_LIST
20. INTEGER_LIST
21. LONG_LIST
22. FLOAT_LIST
23. DOUBLE_LIST
24. STRING_LIST
25. BOOLEAN_MAP_LIST
26. INTEGER_MAP_LIST
27. LONG_MAP_LIST
28. FLOAT_MAP_LIST
29. DOUBLE_MAP_LIST
30. STRING_MAP_LIST

!!!note "Special Type for a RECORD"

There is a special case: if you omit both the `type` and the `name` for an entry in the schema, the reference is assumed to be a map containing arbitrary fields whose types are in the list above. You can use this if you have a map field containing objects of one or more of the types above and want to flatten that map into the target record, using the respective type of each field in the map. The names of the fields in the map will be used as the top-level names in the resulting record.

#### Example Schema

@@ -195,13 +204,11 @@
},
{
"name": "myBoolMap",
"type": "MAP",
"subtype": "BOOLEAN"
"type": "BOOLEAN_MAP"
},
{
"name": "myLongMapMap",
"type": "MAPOFMAP",
"subtype": "LONG"
"type": "LONG_MAP_MAP"
},
{
"name": "myIntFromSomeMap",
@@ -217,18 +224,17 @@
"name": "myIntFromSomeNestedMapsAndLists",
"reference": "someMap.nestedMap.nestedList.0",
"type": "INTEGER"
},
{
"reference" : "someMap",
"type": "RECORD"
"reference" : "someMap"
}
]
}
```

## BulletDeserializer

BulletDeserializer is an abstract Java class that can be implemented to deserialize/transform output from BulletConnector to input for BulletRecordConverter. It is an *optional* component and whether it's necessary or not depends on the output of your data sources. For example, if your KafkaConnector outputs byte arrays that are actually Java-serialized Maps, and you're using a MapBulletRecordConverter, you would use the JavaDeserializer, which would deserialize byte arrays into Java Maps for the converter.
BulletDeserializer is an abstract Java class that can be implemented to deserialize/transform output from BulletConnector to input for BulletRecordConverter. It is an *optional* component and whether it's necessary or not depends on the output of your data sources. If one is not needed, the `IdentityDeserializer` can be used. For example, if your KafkaConnector outputs byte arrays that are actually Java-serialized Maps, and you're using a MapBulletRecordConverter, you would use the JavaDeserializer, which would deserialize byte arrays into Java Maps for the converter.

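For instance, if your connector emits Java-serialized byte arrays and you use a MapBulletRecordConverter, you might plug in the JavaDeserializer like this (a sketch, assuming the standard Bullet DSL setting name):

```yaml
# Deserialize Java-serialized byte arrays from the connector before they
# reach the converter. If no deserialization is needed, the
# IdentityDeserializer can be used instead.
bullet.dsl.deserializer.class.name: "com.yahoo.bullet.dsl.deserializer.JavaDeserializer"
```
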
Currently, we support two BulletDeserializer implementations:

1 change: 1 addition & 0 deletions docs/backend/ingestion.md
@@ -33,6 +33,7 @@ Data placed into a Bullet Record is strongly typed. We support these types currently:

1. Map of Strings to any of the [Primitives](#primitives)
2. Map of Strings to any Map in 1
3. List of any of the [Primitives](#primitives)
4. List of any Map in 1

With these types, it is unlikely you would have data that cannot be represented as a Bullet Record, but if you do, please let us know and we are more than willing to accommodate you.
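
As a purely illustrative example, a record using these types might carry fields like the following (hypothetical data):

```yaml
# Hypothetical record contents, one field per supported shape
name: "alice"                         # a String primitive
stats: {clicks: 4, views: 10}         # a Map of Strings to Longs
flags: {settings: {verified: true}}   # a Map of Strings to Maps
scores: [0.2, 0.7]                    # a List of Doubles
events: [{code: 7}, {code: 9}]        # a List of Maps
```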
21 changes: 19 additions & 2 deletions docs/backend/spark-setup.md
@@ -12,7 +12,7 @@ Download the Bullet Spark standalone jar from [JCenter](http://jcenter.bintray.c

If you are using Bullet Kafka as pluggable PubSub, you can download the fat jar from [JCenter](http://jcenter.bintray.com/com/yahoo/bullet/bullet-kafka/). Otherwise, you need to plug in your own PubSub jar or use the RESTPubSub built-into bullet-core and turned on in the API.

To use Bullet Spark, you need to implement your own [Data Producer Trait](https://github.com/bullet-db/bullet-spark/blob/master/src/main/scala/com/yahoo/bullet/spark/DataProducer.scala) with a JVM based project. You have two ways to implement it as described in the [Spark Architecture](spark-architecture.md#data-processing) section. You include the Bullet artifact and Spark dependencies in your pom.xml or other equivalent build tools. The artifacts are available through JCenter. Here is an example if you use Scala and Maven:
To use Bullet Spark, you need to implement your own [Data Producer Trait](https://github.com/bullet-db/bullet-spark/blob/master/src/main/scala/com/yahoo/bullet/spark/DataProducer.scala) in a JVM-based project, or you can use Bullet DSL (see below). If you choose to implement your own, there are two ways to do so, as described in the [Spark Architecture](spark-architecture.md#data-processing) section. Include the Bullet artifact and Spark dependencies in your pom.xml or the equivalent in your build tool. The artifacts are available through JCenter. Here is an example if you use Scala and Maven:

```xml
<repositories>
@@ -65,9 +65,26 @@ To use Bullet Spark, you need to implement your own [Data Producer Trait](https:

You can also add ```<classifier>sources</classifier>``` or ```<classifier>javadoc</classifier>``` if you want the sources or javadoc.

### Using Bullet DSL

Instead of implementing your own Data Producer, you can also use the provided DSL receiver with [Bullet DSL](dsl.md). To do so, add the following settings to your YAML configuration:

```yaml
# If true, enables the Bullet DSL data producer which can be configured to read from a custom data source. If enabled,
# the DSL data producer is used instead of the producer.
bullet.spark.dsl.data.producer.enable: true

# If true, enables the deserializer between the Bullet DSL connector and converter components. Otherwise, this step is skipped.
bullet.spark.dsl.deserializer.enable: false
```
You may then use the appropriate DSL settings to point to the class names of the Connector and Converter you wish to use to read from your data source and convert it to BulletRecord instances.
There is also a setting to enable [BulletDeserializer](dsl.md#bulletdeserializer), which is an optional component of Bullet DSL for deserializing data between reading and converting.
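
For example, you might configure the DSL producer like this (an illustrative sketch, assuming the standard Bullet DSL setting names; pick the connector and converter classes that match your data source and format):

```yaml
# Illustrative: read from Kafka and convert Avro records into BulletRecords.
bullet.dsl.connector.class.name: "com.yahoo.bullet.dsl.connector.KafkaConnector"
bullet.dsl.converter.class.name: "com.yahoo.bullet.dsl.converter.AvroBulletRecordConverter"
```
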
## Launch
After you have implemented your own data producer and built a jar, you could launch your Bullet Spark application. Here is an example command for a [YARN cluster](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html).
After you have implemented your own data producer or used Bullet DSL and built a jar, you can launch your Bullet Spark application. Here is an example command for a [YARN cluster](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html).
```bash
./bin/spark-submit \
```
