Spark job that reads OpenTSDB data from an HBase snapshot and generates rollup data points.
This is the accompanying repository to the Skyscanner Engineering blog post on the same topic.
We're running the job on AWS using a data pipeline, which creates a Spark and an HBase cluster for us.
The input for the job is read from HBase snapshots that have been uploaded to S3.
For a detailed description of the infrastructure, see the blog post.
The job takes snapshots of the OpenTSDB tables for raw data points and UIDs as input (as defined in `tsd.storage.hbase.data_table` and `tsd.storage.hbase.uid_table` of the OpenTSDB configuration, respectively). The names used in this script are the defaults, `tsdb` and `tsdb-uid`.
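For orientation, here is a minimal, self-contained sketch (not the code in this repository) of how a Spark job can scan an exported HBase snapshot with `TableSnapshotInputFormat`. The bucket, snapshot name and restore directory are made-up placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SnapshotReadSketch {
    public static void main(String[] args) throws Exception {
        // Point HBase at the directory the snapshot was exported to (hypothetical path).
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.rootdir", "s3a://my-backup-bucket/42/blue/tsdb-2019-01-07/");

        // Restore the snapshot into a scratch directory so it can be scanned.
        Job job = Job.getInstance(conf);
        TableSnapshotInputFormat.setInput(job, "tsdb-2019-01-07",
                new Path("s3a://my-backup-bucket/restore-scratch"));

        // Each row of the tsdb table becomes an (ImmutableBytesWritable, Result) pair.
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("rollup-sketch"));
        JavaPairRDD<ImmutableBytesWritable, Result> rows = sc.newAPIHadoopRDD(
                job.getConfiguration(),
                TableSnapshotInputFormat.class,
                ImmutableBytesWritable.class,
                Result.class);
        System.out.println("Rows read from snapshot: " + rows.count());
    }
}
```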
The following assumptions are made:
- Snapshots follow the naming convention `<table_name>-YYYY-MM-DD`.
- Snapshots from the live cluster are exported using HBase's `ExportSnapshot` tool to `s3a://${BackupBucket}/#{BackupWeekNumber}/${HBaseClusterColour}/<snapshot_name>/`.
- The JAR that is built as part of this repo is published to an S3 bucket, at the path `s3://${JobBucket}/rollups/${BuildId}/opentsdb-rollup-all.jar`.
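As a concrete illustration of these conventions, the following sketch (illustrative only, with made-up parameter values) expands the snapshot names and the input path the job expects:

```java
// Illustrative only: expands the naming and path conventions above with
// hypothetical parameter values.
public class InputPathSketch {
    public static void main(String[] args) {
        String backupBucket = "my-backup-bucket";   // ${BackupBucket}
        String backupWeekNumber = "42";             // #{BackupWeekNumber}
        String clusterColour = "blue";              // ${HBaseClusterColour}
        String snapshotDate = "2019-01-07";         // snapshot date, YYYY-MM-DD

        // Snapshot names follow <table_name>-YYYY-MM-DD.
        String dataSnapshot = "tsdb-" + snapshotDate;
        String uidSnapshot = "tsdb-uid-" + snapshotDate;

        // Exported snapshots are expected under this S3A prefix.
        String inputPath = String.format("s3a://%s/%s/%s/%s/",
                backupBucket, backupWeekNumber, clusterColour, dataSnapshot);
        System.out.println(inputPath); // s3a://my-backup-bucket/42/blue/tsdb-2019-01-07/
    }
}
```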
This describes the input parameters to the CloudFormation script in `cloudformation/cloudformation.yaml` that is used to create the data pipeline described above.
An identifier that is generated by a build pipeline that builds and publishes the JAR file. It's used to generate the path in an S3 bucket (`s3://${JobBucket}/rollups/${BuildId}`) which is used for resources like the JAR itself, config files and other scripts.
The number of the calendar week in which the input snapshots were taken. Used to generate the input path (`s3a://${BackupBucket}/#{BackupWeekNumber}/${HBaseClusterColour}/<snapshot_name>`).
The date of the snapshot in the format `YYYY-MM-DD`.
Name of the bucket that the snapshots are uploaded to. See the assumptions listed above for details on the exact path where the snapshots are expected.
Timestamp of the first data point we want to include in this run of the job. Filters out every point before the given timestamp. Unit: milliseconds. Must be less than `AfterTimestamp`.

Timestamp of the first data point we want to exclude from this run of the job. Filters out every point after the given timestamp. Unit: milliseconds. Must be greater than `BeforeTimestamp`.
`BeforeTimestamp` and `AfterTimestamp` together define the time range of the data points that are to be rolled up.
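To make the boundary semantics concrete, here is a small sketch (not the repo's actual filter) of such a range check, assuming a half-open interval (`BeforeTimestamp` inclusive, `AfterTimestamp` exclusive), which is how the descriptions above read:

```java
// Illustrative sketch of the range semantics described above: keep points
// with beforeTimestampMs <= ts < afterTimestampMs (class and field names are hypothetical).
public final class TimestampRangeFilter {
    private final long beforeTimestampMs; // first timestamp to include
    private final long afterTimestampMs;  // first timestamp to exclude

    public TimestampRangeFilter(long beforeTimestampMs, long afterTimestampMs) {
        if (beforeTimestampMs >= afterTimestampMs) {
            throw new IllegalArgumentException("BeforeTimestamp must be less than AfterTimestamp");
        }
        this.beforeTimestampMs = beforeTimestampMs;
        this.afterTimestampMs = afterTimestampMs;
    }

    // Returns true if a data point with this timestamp should be rolled up.
    public boolean includes(long timestampMs) {
        return timestampMs >= beforeTimestampMs && timestampMs < afterTimestampMs;
    }
}
```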
Colour of the cluster. Useful when running an active/standby cluster setup. Used to construct the full path to the backup (see assumptions).
The latest time at which the created EMR cluster should be terminated. Passed directly through to the EMR cluster config.
The SSH key pair to use for connecting to the EC2 instances that form the EMR cluster. Optional.
Name of an S3 bucket for supporting files.
Boolean value indicating whether or not to send an alert if the job fails.
VictorOps hook for the CloudWatch integration. Used to route the alerts on failure.
The project uses Java 8, but should be compatible with newer Java versions.
To build the job's fat jar, run

```
./gradlew build jar
```

The output can be found in `build/libs`.

Tests can be run with

```
./gradlew test
```
Apart from the main `RollupJob` code, there is a `serializer` subproject. The serializer project is needed to create a shaded JAR for serialising rollup schemas: our schemas use proto3 while HBase still uses proto2, and shading avoids a conflict between the two protobuf versions on the classpath. No additional steps are needed to update this code, as the shaded JAR is automatically included in the main rollup job build.