Update ws api (#7)

bullet-db · Jun 20, 2018 · 6a4d4af · 6a4d4af
1 parent 1c6ddc0
commit 6a4d4af
Show file tree

Hide file tree

Showing 11 changed files with 761 additions and 28 deletions.
diff --git a/docs/img/additive-tumbling.png b/docs/img/additive-tumbling.png
diff --git a/docs/img/bullet-icons-line.png b/docs/img/bullet-icons-line.png
diff --git a/docs/img/reactive.png b/docs/img/reactive.png
diff --git a/docs/img/time-based-tumbling.png b/docs/img/time-based-tumbling.png
diff --git a/docs/index.md b/docs/index.md
@@ -1,26 +1,28 @@
-# Overview
+  ![Bullet Icons](../img/bullet-icons-line.png)
 
-## Bullet ...
+# Bullet:
 
-* Is a real-time query engine that lets you run queries on very large data streams
+* **Is a real-time query engine for very large data streams**
 
-* Does not use a **a persistence layer**. This makes it **light-weight, cheap and fast**
+* **Has NO persistence layer**
 
-* Is a **look-forward** query system. Queries are submitted first and they operate on data that arrive after the query is submitted
+* **Is light-weight, cheap and fast**
 
-* Supports rich queries for filtering and getting **Raw data, Counting Distincts, Distincts, Grouping (Sum, Count, Min, Max, Avg), Distributions, and Top K**
+* **Is multi-tenant**
 
-* Is **multi-tenant** and can scale for more queries and/or for more data
+* **Is pluggable to any data source**
 
-* Provides a **UI and Web Service** that are also pluggable for a full end-to-end solution to your querying needs
+* **Provides a UI and Web Service**
 
-* Has an implementation on [Storm](http://storm.apache.org) currently. There are plans to implement it on other Stream Processors.
+* **Can filter raw data or aggregate data**
 
-* Is **pluggable**. Any data source that can be read from Storm can be converted into a standard data container letting you query that data. Data is **typed**
+* **Can be run on storm or spark streaming**
 
-* Is used at scale and in production at Yahoo with running 500+ queries simultaneously on 200,000 rps (records per second) and tested up to 2,000,000 rps
+* **Is a look-forward query system** - operates on data that arrive after the query is submitted
 
-## How is this useful
+* **Is big-data scale-tested** - used in production at Yahoo and tested running 500+ queries simultaneously on up to 2,000,000 rps
+
+# How is this useful
 
 How Bullet is used is largely determined by the data source it consumes. Depending on what kind of data you put Bullet on, the types of queries you run on it and your use-cases will change. As a look-forward query system with no persistence, you will not be able to repeat your queries on the same data. The next time you run your query, it will operate on the different data that arrives after that submission. If this usage pattern is what you need and you are looking for a light-weight system that can tap into your streaming data, then Bullet is for you!
 

diff --git a/docs/pubsub/kafka.md b/docs/pubsub/kafka.md
@@ -1,6 +1,6 @@
 # Kafka PubSub
 
-The Kafka implemented of the Bullet PubSub can be used on any Backend and Web Service. It uses [Apache Kafka](https://kafka.apache.org) as the backing PubSub queue and works on all Backends.
+The Kafka implementation of the Bullet PubSub can be used on any Backend and Web Service. It uses [Apache Kafka](https://kafka.apache.org) as the backing PubSub queue and works on all Backends.
 
 ## How does it work?
 

diff --git a/docs/pubsub/rest.md b/docs/pubsub/rest.md
@@ -0,0 +1,84 @@
+# REST PubSub
+
+The REST PubSub implementation is included in bullet-core, and can be launched along with the Web Service. If it is enabled the Web Service will expose two additional REST endpoints, one for reading/writing Bullet queries, and one
+for reading/writing results.
+
+## How does it work?
+
+When the Web Service receives a query from a user, it will create a PubSubMessage and write the message to the "query" RESTPubSub endpoint. This PubSubMessage will contain not only the query, but also some metadata, including the
+appropriate host/port to which the response should be sent (this is done to allow for multiple Web Services running simultaneously). The query is then stored in memory until the backend does a GET from this endpoint, at which
+time the query will be served to the backend, and dropped from the queue in memory.
+
+Once the backed has generated the results of the query, it will wrap those results in PubSubMessage. The backend extracts the URL to send the results to from the metadata and writes the results PubSubMessage to the
+"results" REST endpoint with a POST. This result will then be stored in memory until the Web Service does a GET to that endpoint, at which time the Web Service will have the results of the query to send back to the user.
+
+## Setup
+
+To enable the RESTPubSub and expose the two additional necessary REST endpoints, you must enable the setting:
+
+```yaml
+bullet.pubsub.builtin.rest.enabled: true
+```
+
+...in the Web Service Application.yaml file. This can also be done from the command line when launching the Web Service jar file by adding the command-line option:
+
+```bash
+--bullet.pubsub.builtin.rest.enabled=true
+```
+
+This will enable the two necessary REST endpoints, the paths for which can be configured in the Application.yaml file with the settings:
+
+```yaml
+bullet.pubsub.builtin.rest.query.path: /pubsub/query
+bullet.pubsub.builtin.rest.result.path: /pubsub/result
+```
+
+### Plug into the Backend
+
+Configure the backend to use the REST PubSub:
+
+```yaml
+bullet.pubsub.context.name: "QUERY_PROCESSING"
+bullet.pubsub.class.name: "com.yahoo.bullet.kafka.KafkaPubSub"
+
+bullet.pubsub.rest.connect.timeout.ms: 5000
+bullet.pubsub.rest.subscriber.max.uncommitted.messages: 100
+bullet.pubsub.rest.result.subscriber.min.wait.ms: 10
+bullet.pubsub.rest.query.subscriber.min.wait.ms: 10
+bullet.pubsub.rest.query.urls:
+    - "http://webServiceHostNameA:9901/api/bullet/pubsub/query"
+    - "http://webServiceHostNameB:9902/api/bullet/pubsub/query"
+```
+
+* __bullet.pubsub.context.name: "QUERY_PROCESSING"__ - tells the PubSub that it is running in the backend
+* __bullet.pubsub.class.name: "com.yahoo.bullet.kafka.KafkaPubSub"__ - tells Bullet to use this class for it's PubSub
+* __bullet.pubsub.rest.connect.timeout.ms: 5000__ - sets the HTTP connect timeout to a half second
+* __bullet.pubsub.rest.subscriber.max.uncommitted.messages: 100__ - this is the maxiumum number of uncommitted messages allowed before blocking
+* __bullet.pubsub.rest.query.subscriber.min.wait.ms: 10__ - this setting is used to avoid making an http request too rapidly and overloading the http endpoint. It will force the backend to poll the query endpoint at most once every 10ms.
+* __bullet.pubsub.rest.query.urls__ - this should be a list of all the query rest enpoint URLs. If you are only running one Web Service this will only contain one url (the url of your Web Service followed by the full path of the query endpoint).
+
+### Plug into the Web Service
+
+Configure the Web Service to use the REST PubSub:
+
+```yaml
+bullet.pubsub.context.name: "QUERY_SUBMISSION"
+bullet.pubsub.class.name: "com.yahoo.bullet.kafka.KafkaPubSub"
+
+bullet.pubsub.rest.connect.timeout.ms: 5000
+bullet.pubsub.rest.subscriber.max.uncommitted.messages: 100
+bullet.pubsub.rest.result.subscriber.min.wait.ms: 10
+bullet.pubsub.rest.query.subscriber.min.wait.ms: 10
+bullet.pubsub.rest.result.url: "http://localhost:9901/api/bullet/pubsub/result"
+bullet.pubsub.rest.query.urls:
+    - "http://localhost:9901/api/bullet/pubsub/query"
+```
+
+* __bullet.pubsub.context.name: "QUERY_SUBMISSION"__ - tells the PubSub that it is running in the Web Service
+* __bullet.pubsub.class.name: "com.yahoo.bullet.kafka.KafkaPubSub"__ - tells Bullet to use this class for it's PubSub
+* __bullet.pubsub.rest.connect.timeout.ms: 5000__ - sets the HTTP connect timeout to a half second
+* __bullet.pubsub.rest.subscriber.max.uncommitted.messages: 100__ - this is the maxiumum number of uncommitted messages allowed before blocking
+* __bullet.pubsub.rest.query.subscriber.min.wait.ms: 10__ - this setting is used to avoid making an http request too rapidly and overloading the http endpoint. It will force the backend to poll the query endpoint at most once every 10ms.
+* __bullet.pubsub.rest.result.url: "http://localhost:9901/api/bullet/pubsub/result"__ - this is the endpoint from which the WebService should read results - it should generally be the hostname of that machine the Web Service is running on (or "localhost").
+* __bullet.pubsub.rest.query.urls__ - in the Web Service this setting should contain __exactly one__ url - the url to which queries should be written  - it should generally be the hostname of that machine the Web Service is running on (or "localhost").
+
diff --git a/docs/ws/api.md b/docs/ws/api.md
@@ -1,21 +1,36 @@
 # API
 
-See the [UI Usage section](../ui/usage.md) for using the UI to build Bullet queries. This section deals with examples of the JSON query format that the API currently exposes (and the UI uses underneath).
+This section gives a comprehensive overview of the Web Service API for launching Bullet queries.
 
-Bullet queries allow you to filter, project and aggregate data. It lets you fetch raw and aggregated data. Fields inside maps can be accessed using the '.' notation in queries. For example, myMap.key will access the key field inside the myMap map. There is no support for accessing fields inside Lists or inside nested Maps as of yet. Only the entire object can be operated on for now.
+* For info on how to use the UI, see the [UI Usage section](../ui/usage.md)
+* For examples of specific queries see the [Examples](examples.md) section
 
-The three main sections of a Bullet query are:
+The main constituents of a Bullet query are:
+
+* __filters__, which determine which records will be consumed by your query
+* __projection__, which determines which fields will be projected in the resulting output from Bullet
+* __aggregation__, which allows users to aggregate data and perform aggregation operations
+* __window__, which can be used to return incremental results on "windowed" data
+* __duration__, which determines the maximum duration of the query in milliseconds
+
+Fields inside maps can be accessed using the '.' notation in queries. For example,
+
+`myMap.key`
+
+will access the "key" field inside the "myMap" map. There is no support for accessing fields inside Lists or inside nested Maps as of yet. Only the entire object can be operated on for now.
+
+The main constituents of a Bullet query listed above create the top level fields of the Bullet query:
 ```javascript
 {
-    "filters": {},
+    "filters": [{}, {}, ...],
     "projection": {},
     "aggregation": {}.
+    "window": {},
     "duration": 20000
 }
 ```
-The duration represents how long the query runs for (a window from when you submit it to that many milliseconds into the future).
 
-See the [Filters](#filters), [Projections](#projections) and [Aggregation](#aggregations) sections for their respective specifications. Each of those sections are objects and you will need to be place the entire object in the respective sections above.
+We will describe how to specify each of these top-level fields below:
 
 ## Filters
 
@@ -36,7 +51,7 @@ The current logical operators allowed in filters are:
 | OR               | Any filter must be true. The first true filter evaluated left to right will short-circuit the computation. |
 | NOT              | Negates the value of the first filter clause. The filter is satisfied iff the value is true. |
 
-The format for a Logical filter is:
+The format for a __single__ Logical filter is:
 
 ```javascript
 {
@@ -52,6 +67,8 @@ The format for a Logical filter is:
 
 Any other type of filter may be provided as a clause in clauses.
 
+Note that the "filter" field in the query is a __list__ of as many filters as you'd like.
+
 ### Relational Filters
 
 Relational filters allow you to specify conditions on a field, using a comparison operator and a list of values.
@@ -68,7 +85,7 @@ The current comparisons allowed in filters are:
 | >          | Greater than any value in values |
 | RLIKE      | Matches using [Java Regex notation](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html), any Regex value in values |
 
-These operators are all typed based on the type of the left hand side from the Bullet record. If the elements on the right hand side cannot be
+Note: These operators are all typed based on the type of the __left hand side__ from the Bullet record. If the elements on the right hand side cannot be
 casted to the types on the LHS, those items will be ignored for the comparison.
 
 The format for a Relational filter is:
@@ -263,6 +280,96 @@ The following attributes are supported for ```TOP K```:
 
 Note that the ```K``` in ```TOP K``` is specified using the ```size``` field in the ```aggregation``` object.
 
+## Window
+
+The "window" field is **optional** and allows you to instruct Bullet to return incremental results. For example you might want to return the COUNT of a field and return that count every 2 seconds. 
+
+If "window" is ommitted Bullet will emit only a single result at the very end of the query.
+
+An example window might look like this:
+
+```javascript
+"window": { "emit": { "type": "TIME/RECORD", "every": 5000 },
+            "include": { "type": "TIME/RECORD/ALL", "first": 5000 } },
+```
+
+* The __emit__ field is used to specify when a window should be emmitted and the current results sent back to the user
+    * The __type__ subfield for "emit" can have two values:
+        * __"TIME"__ specifies that the window will emit after a specific number of milliseconds 
+        * __"RECORD"__ specifies that the window will emit after consuming a specific number of records
+    * The __every__ subfield for "emit" specifies how many records/milliseconds (depending on "type") will be counted before the window is emmitted
+* The __include__ field is used to specify what will be included in the emmitted window
+    * The __type__ subfield for "include" can have three values:
+        * __"TIME"__ specifies that the window will include all records seen in a certain time period in the window
+            * e.g. All records seen in the first 2 seconds of a 10 second window
+        * __"RECORD"__ specifies that the window will include the first n records, where n is specified in the "first" field below
+        * __"ALL"__ specifies that the window will include ALL results accumulated since the very beginning of the __query__ (not just this window)
+    * the __first__ subfield for "include" specifies the number of records/milliseconds at the beginning of this window to include in the emmitted result - it should be ommitted if "type" is "ALL".
+
+**NOTE: Not all windowing types are supported at this time.**
+
+### **Currently Bullet supports the following window types**:
+
+* Time-Based Tumbling Windows
+* Additive Tumbling Windows
+* Reactive Record-Based Windows
+* No Window
+
+Support for more windows will be added in the future.
+
+Each currently supported window type will be described below:
+
+#### **Time-Based Tumbling Windows**
+
+Currently time-based tumbling windows **must** have emit == include. In other words, only the entire window can be emitted, and windows must be adjacent.
+
+![Time-Based Tumbling Windows](../img/time-based-tumbling.png)
+
+The above example windowing would be specified with the window:
+
+```javascript
+"window": { "emit": { "type": "TIME", "every": 3000 },
+            "include": { "type": "TIME", "first": 3000 } },
+```
+
+Any aggregation can be done in each window, or the raw records themselves can be returned as specified in the "aggregation" object.
+
+In this example the first window would include 3 records, the second would include 4 records, the third would include 3 records and the fourth would include 2 records.
+
+#### **Additive Tumbling Windows**
+
+Additive tumbling windows emit with the same logic as time-based tumbling windows, but include ALL results from the beginning of the query:
+
+![Additive Tumbling Windows](../img/additive-tumbling.png)
+
+The above example would be specified with the window:
+
+```javascript
+"window": { "emit": { "type": "TIME", "every": 3000 },
+            "include": { "type": "ALL" } },
+```
+
+In this example the first window would include 3 records, the second would include 7 records, the third would include 10 records and the fourth would include 12 records.
+
+#### **Sliding "Reactive" Windows**
+
+Sliding windows emit based on the arrival of an event, rather than after a certain period of time. In general sliding windows often do some aggregation on the previous X records, or on all records that arrived in the last X seconds.
+Bullet will support this functionality in the future, at this time Bullet only supports **Sliding Windows of size 1**, often referred to as "reactive" windows. It does not support sliding windows with an aggregation at this time.
+Effectively this query will simply return every event that matches the filters instantly to the user.
+
+![Reactive Windows](../img/reactive.png)
+
+The above example would be specified with the window:
+
+```javascript
+"window": { "emit": { "type": "RECORD", "every": 1 },
+            "include": { "type": "RECORD", "last": 1 } },
+```
+
+#### **No Window**
+
+If the "window" field is optional. If it is  ommitted, the query will only emit when the entire query is finished.
+
 ## Results
 
 Bullet results are JSON objects with two fields: