- what is dart
- prerequisites
- bootstrapping
- local development setup
- usage
- dart development
- engine development
- REST api documentation
Dart, short for "data mart à la carte", is a self-service data workflow solution that facilitates the creation, transformation, and movement of data using data engines, primarily in the context of AWS. Put another way, it is a system to facilitate moving data in and out of datastores like S3, RDS, Redshift, EMR, DynamoDB, something custom, etc.
The dart UI is built on top of the dart REST api, is data-driven, and is a combination of AngularJS, Angular Material, and Angular Schema Form. The UI assists with self-service, while the REST api allows dart to be plugged into engineering systems.
Below is a list of dart entities to become familiar with.
An engine advertizes its supported actions using JSON schema, and interacts with dart over the REST api for
- engine registration
- action check-out
- action check-in
- and more...
Engines are able to understand and interact with datasets.
A dataset is a platform agnostic definition of data living in S3, which includes:
- the S3 root location
- the fields and types (or JSON paths for lines of JSON data)
- the load type (e.g. are new files merged, inserted, reloaded, etc.?)
- compression type
- etc.
A datastore is intended to capture attributes of a particular datastore technology instance (e.g. a Redshift cluster, an EMR cluster, etc.) including its host, port, connection url, etc. Engine's can describe input options for their datastore types via JSON schema.
An action entity is an instance of an action type specified by an engine. These contain user supplied arguments that adhere to the engine's JSON schema for that action type. In addition, the action contains state information, timing information, and optional alert email addresses.
A workflow is a collection of actions in the TEMPLATE state to be run on the specified datastore. When a workflow runs, its template actions are copied to a workflow instance and run. If the specified datastore for the workflow is in the TEMPLATE state, then a new datastore will be created based on it for the workflow instance.
An instance of a running workflow with state and timing information.
A subscription is a stateful listing of dataset files on S3. Unprocessed subscription elements are considered "unconsumed". When an engine provides an action with an action type name of "consume_subscription", dart will assign unconsumed subscription elements to that action at run time. Engines can request these subscription elements through a REST api call.
A trigger initiates a workflow. There are several types of triggers available:
- scheduled - triggers a workflow based on a CRON schedule
- workflow_completion - triggering upon successful workflow completion
- subscription_batch - creates subscription element batches based on a file size threshold, and triggers workflows as they are ready
- super - takes other triggers as inputs and fires when ALL have fired or ANY have fired
- event - triggers a workflow based on an "event" entity (essentially an SQS message sent from a source outside of dart)
Additionally, workflows can be triggered manually via the REST api or the UI.
An event is just a named identifier so that SQS messages in the dart trigger call format can be crafted outside of dart (with a particular event id) in order to trigger dart workflows.
- obtain an AWS account
- configure the AWS CLI locally
- create a VPC
- create an EC2 key-pair for dart
- create a Route53 hosted zone
- install python
- install pip
- install libxmlsec1
- install libmagic
- install git
- install npm
- install grunt
- install bower
- install docker
- install docker-machine
- run the bash snippets below
feel free to create and use an isolated python virtualenv
pip install -r src/python/requirements.txt
pip install -e src/python/.
cd tools/docker
./docker-local-setup.sh
cd src/python/dart/web/ui
bower install
Following the instructions below will result in all necessary docker images built and pushed to ECR, an isolated dart environment, as well as a generated config file saved in S3. Dart is primarily driven by a single YAML config file, in combination with a few environment variables.
Copy and rename dart-config-template-EXAMPLE.yaml
outside of the dart project, and modify it to suit
your needs. For the remainder of this example, let's assume this file lives in a sibling directory next
to the dart project: ../my-dart-config-project/dart-config-template.yaml
tips:
- Fields marked as
...TBD...
will be populated during deployment - Fields with
my
ormycompany
should be changed (for urls, your key-pair name, etc.) - If your VPC requires VPN access, change one or both of the
OpenCidrIp1
andOpenCidrIp2
values accordingly. - Most other fields should be self-explanatory, and comments have been added otherwise
Run the following bash snippets to create a "stage" environment:
(AWS_SECRET_ACCESS_KEY
and AWS_ACCESS_KEY_ID
are assumed to be in the current shell environment)
export PYTHONPATH=/full-path-to-your-dart-project-root/src/python:${PYTHONPATH}
read -s -p "Your email server password (used by dart to send alerts):" DART_EMAIL_PASSWORD
read -s -p "Your chosen RDS password:" DART_RDS_PASSWORD
export DART_EMAIL_PASSWORD
export DART_RDS_PASSWORD
python src/python/dart/deploy/full_environment_create.py \
-n stage \
-i ../my-dart-config-project/dart-config-template.yaml \
-o s3://mycompany-dart-stage-config/dart-stage.yaml \
-u [email protected]
When the command above executes, it will use a combination of AWS CloudFormation and boto to provision a new dart
environment, and write the generated config file to the specified location,
s3://mycompany-dart-stage-config/dart-stage.yaml
. See the source of full_environment_create.py
for more
insight.
Dart is primarily driven by a single YAML config file, in combination with a few environment variables. For local development, we will use a config file similar to the one in the bootstrapping step, with a few differences:
- Instead of RDS, it is preferable to use a dockerized postgres instance
- Instead of SQS, it is preferable to use a dockerized elasticmq instance
- Instead of running engine tasks in ECS, it is preferable to have the engine worker fork them locally
Make sure your config file has the "local_setup" section set properly, since this will be used next. Additionally,
it is assumed you have already built the elasticmq docker image and updated your local dart config either buy
running the bootstrapping step, or by manually running this snippet and updating your local
dart config (local_setup.elasticmq_docker_image
) with the matching value from the snippet below:
cd tools/docker
docker build -f tools/docker/Dockerfile-elasticmq -t <my-elasticmq-docker-image-name> .
Run the postgres instance:
cd tools/docker
export DART_CONFIG=../my-dart-config-project/dart-config-local.yaml
./docker-local-run-postgres.sh
Run the elasticmq instance:
cd tools/docker
export DART_CONFIG=../my-dart-config-project/dart-config-local.yaml
./docker-local-run-elasticmq.sh
Run the dart components:
(AWS_SECRET_ACCESS_KEY
and AWS_ACCESS_KEY_ID
are assumed to be in the current shell environment)
cd src/python/dart/web
export PYTHONPATH=/full-path-to-your-dart-project-root/src/python:${PYTHONPATH}
export DART_ROLE=web
export DART_CONFIG=../my-dart-config-project/dart-config-local.yaml
python server.py
cd src/python/dart/worker
export PYTHONPATH=/full-path-to-your-dart-project-root/src/python:${PYTHONPATH}
export DART_ROLE=web
export DART_CONFIG=../my-dart-config-project/dart-config-local.yaml
python trigger.py
cd src/python/dart/worker
export PYTHONPATH=/full-path-to-your-dart-project-root/src/python:${PYTHONPATH}
export DART_ROLE=worker
export DART_CONFIG=../my-dart-config-project/dart-config-local.yaml
python engine.py
cd src/python/dart/worker
export PYTHONPATH=/full-path-to-your-dart-project-root/src/python:${PYTHONPATH}
export DART_ROLE=web
export DART_CONFIG=../my-dart-config-project/dart-config-local.yaml
python subscription.py
If running locally with the default flask configuration, navigate to http://localhost:5000. If you have deployed to AWS, navigate to http://your-dart-host-and-port.
For a demonstration of dart's capabilities, follow these steps:
- navigate to the "datastores" link in the entities section of the navbar (not the managers section)
- create a new "no_op_engine" datastore with default values by clicking on the "+" button and then clicking "save"
- refresh the page
- click on the "more options" vertical elipses on the left side of the row
- click on "graph", wait for the graph to load
- click on "add > no_op_engine > workflow chaining demo"
- click on "file > save"
- click on the top most workflow (blue, diamond shaped) icon
- click on "edit > run"
In short, familiarity with dart entities allows a variety of workflow chains to be created.
For example, an EMR workflow can be created to parse web logs and store them in a delimited format on S3, which is referred to by an existing dataset. Completion of this workflow triggers a Redshift workflow to load the data, perform aggregations, and upload to another dataset location. This workflow completion triggers an S3 workflow to load that copies the data to the front-end web team's AWS bucket. Completion of this workflow triggers a DynamoDB workflow to load the aggregated data into a DynamoDB table, supporting a front-end feature.
Dart uses postgres 9.4+ for its JSONB columns. This allows dart models to be fluid but consistent. That is, most
dart models have the fields: id
, version_id
, created
, updated
, and data
(the JSONB column). Dart has
custom deserialization that relies on parsing the python docstring for types (recursively). This way, the data can
be stored in plan JSON format, rather than relying on modifiers as in jsonpickle. So keep the docstrings up
to date and accurate!
Be sure to keep the black-box integration tests up to date and passing.
Engines are completely decoupled from dart by design. The dart engine source code lives in the same repository out of convenience, but this is by no means a requirement. Engines can be added to dart source itself for general use cases, but they can also be developed outside of dart for custom use cases, so long as they adhere to these conventions:
- engines must make a registration REST api call that informs dart about the ECS task definitions
- engines are instantiated as ECS tasks to process one action at a time
- dart will override the ECS task definition at run time to supply one environment variable that informs the running
engine which action is being run:
DART_ACTION_ID
- a REST api call must be made to check-out this action
- a REST api call must be made to check-in this action upon completion
- if writing a
consume_subscription
action, a REST api call should be made to retrieve the assigned subscription elements
See the draft API specification in Swagger 2.0 format at src/python/dart/web/api/swagger.yaml
or via Swagger-UI
at http://your-dart-host-and-port/apidocs/index.html
See the source code at src/python/dart/web/api
for implementation details.