Spark job that reads OpenTSDB data from an HBase snapshot and generates rollup data points.
This is the accompanying repository to the Skyscanner Engineering blog post on the same topic.
We're running the job on AWS using a data pipeline, which creates a Spark and an HBase cluster for us.
The input for the job is read from HBase snapshots that have been uploaded to S3.
For a detailed description of the infrastructure, see the blog post.
The job takes snapshots of the OpenTSDB tables for raw data points and UIDs as input (as defined in `tsd.storage.hbase.data_table` and `tsd.storage.hbase.uid_table` of the OpenTSDB configuration, respectively). The names used in this script are the defaults, `tsdb` and `tsdb-uid`.
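For orientation, here is a minimal, self-contained sketch (not the code in this repository) of how a Spark job can scan an exported HBase snapshot with `TableSnapshotInputFormat`. The bucket, snapshot name and restore directory are made-up placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SnapshotReadSketch {
    public static void main(String[] args) throws Exception {
        // Point HBase at the directory the snapshot was exported to (hypothetical path).
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.rootdir", "s3a://my-backup-bucket/42/blue/tsdb-2019-01-07/");

        // Restore the snapshot into a scratch directory so it can be scanned.
        Job job = Job.getInstance(conf);
        TableSnapshotInputFormat.setInput(job, "tsdb-2019-01-07",
                new Path("s3a://my-backup-bucket/restore-scratch"));

        // Each row of the tsdb table becomes an (ImmutableBytesWritable, Result) pair.
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("rollup-sketch"));
        JavaPairRDD<ImmutableBytesWritable, Result> rows = sc.newAPIHadoopRDD(
                job.getConfiguration(),
                TableSnapshotInputFormat.class,
                ImmutableBytesWritable.class,
                Result.class);
        System.out.println("Rows read from snapshot: " + rows.count());
    }
}
```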
The following assumptions are made:
- Snapshots follow the naming convention `<table_name>-YYYY-MM-DD`.
- Snapshots from the live cluster are exported using HBase's `ExportSnapshot` tool to `s3a://${BackupBucket}/#{BackupWeekNumber}/${HBaseClusterColour}/<snapshot_name>/`.
- The JAR that is built as part of this repo is published to an S3 bucket, at the path `s3://${JobBucket}/rollups/${BuildId}/opentsdb-rollup-all.jar`.
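As a concrete illustration of these conventions, the following sketch (illustrative only, with made-up parameter values) expands the snapshot names and the input path the job expects:

```java
// Illustrative only: expands the naming and path conventions above with
// hypothetical parameter values.
public class InputPathSketch {
    public static void main(String[] args) {
        String backupBucket = "my-backup-bucket";   // ${BackupBucket}
        String backupWeekNumber = "42";             // #{BackupWeekNumber}
        String clusterColour = "blue";              // ${HBaseClusterColour}
        String snapshotDate = "2019-01-07";         // snapshot date, YYYY-MM-DD

        // Snapshot names follow <table_name>-YYYY-MM-DD.
        String dataSnapshot = "tsdb-" + snapshotDate;
        String uidSnapshot = "tsdb-uid-" + snapshotDate;

        // Exported snapshots are expected under this S3A prefix.
        String inputPath = String.format("s3a://%s/%s/%s/%s/",
                backupBucket, backupWeekNumber, clusterColour, dataSnapshot);
        System.out.println(inputPath); // s3a://my-backup-bucket/42/blue/tsdb-2019-01-07/
    }
}
```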
This describes the input parameters to the CloudFormation script in `cloudformation/cloudformation.yaml` that is used to create the data pipeline described above.
An identifier that is generated by a build pipeline that builds and publishes the JAR file. It's used to generate the path in an S3 bucket (`s3://${JobBucket}/rollups/${BuildId}`) which is used for resources like the JAR itself, config files and other scripts.
The number of the calendar week in which the input snapshots were taken. Used to generate the input path (`s3a://${BackupBucket}/#{BackupWeekNumber}/${HBaseClusterColour}/<snapshot_name>`).
The date of the snapshot in the format `YYYY-MM-DD`.
Name of the bucket that the snapshots are uploaded to. See the assumptions listed above for details on the exact path where the snapshots are expected.
Timestamp of the first data point we want to include in this run of the job. Filters out every point before the given timestamp. Unit: milliseconds. Must be less than `AfterTimestamp`.

Timestamp of the first data point we want to exclude from this run of the job. Filters out every point after the given timestamp. Unit: milliseconds. Must be greater than `BeforeTimestamp`.
`BeforeTimestamp` and `AfterTimestamp` together define the time range of the data points that are to be rolled up.
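To make the boundary semantics concrete, here is a small sketch (not the repo's actual filter) of such a range check, assuming a half-open interval (`BeforeTimestamp` inclusive, `AfterTimestamp` exclusive), which is how the descriptions above read:

```java
// Illustrative sketch of the range semantics described above: keep points
// with beforeTimestampMs <= ts < afterTimestampMs (class and field names are hypothetical).
public final class TimestampRangeFilter {
    private final long beforeTimestampMs; // first timestamp to include
    private final long afterTimestampMs;  // first timestamp to exclude

    public TimestampRangeFilter(long beforeTimestampMs, long afterTimestampMs) {
        if (beforeTimestampMs >= afterTimestampMs) {
            throw new IllegalArgumentException("BeforeTimestamp must be less than AfterTimestamp");
        }
        this.beforeTimestampMs = beforeTimestampMs;
        this.afterTimestampMs = afterTimestampMs;
    }

    // Returns true if a data point with this timestamp should be rolled up.
    public boolean includes(long timestampMs) {
        return timestampMs >= beforeTimestampMs && timestampMs < afterTimestampMs;
    }
}
```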
Colour of the cluster. Useful when running an active/standby cluster setup. Used to construct the full path to the backup (see assumptions).
The latest time at which the created EMR cluster should be terminated. Passed directly through to the EMR cluster config.
The SSH key pair to use for connecting to the EC2 instances that form the EMR cluster. Optional.
Name of an S3 bucket for supporting files.
Boolean value indicating whether or not to send an alert if the job fails.
VictorOps hook for the CloudWatch integration. Used to route the alerts on failure.
The project uses Java 8, but should be compatible with newer Java versions.
To build the job's fat jar, run

```
./gradlew build jar
```

The output can be found in `build/libs`.

Tests can be run with

```
./gradlew test
```
Apart from the main `RollupJob` code, there is a `serializer` subproject. The serializer project is needed to create a shaded JAR for serialising rollup schemas: our schemas use proto3 while HBase still uses proto2, and shading avoids a conflict between the two protobuf versions on the classpath. No additional steps are needed to update this code, as the shaded JAR is automatically included in the main rollup job build.