Provision Spark on Amazon EC2

Darren L. Weber, Ph.D. edited this page Sep 12, 2017 · 23 revisions

An LD4P Spark cluster can be created using:

git clone git@github.com:sul-dlss/spark-ec2.git
cd spark-ec2

TAGS="Group:ld4p_dev_spark,Manager:${USER},Service:spark,Stage:dev"

# NOTE:  the "vpc-id" must be copied from the EC2 console dashboard (top right corner)
# NOTE:  the "ami-6e1a0117" is an Ubuntu 16.04 system

./spark-ec2 --key-pair=ld4p --identity-file="${HOME}/.ssh/ld4p.pem" \
  --vpc-id=vpc-d84467b3 \
  --ami=ami-6e1a0117 --region=us-west-2 --zone=all \
  --master-instance-type=c4.xlarge \
  --instance-type=c4.2xlarge --slaves 3 \
  --no-ganglia --additional-tags="${TAGS}" --tag-volumes \
  --additional-security-group=ld4p_dev_ssh_security_group \
  launch ld4p-pipe
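Rather than copying the `vpc-id` and AMI from the EC2 console by hand, both can be looked up from the command line. This is a sketch, assuming the AWS CLI is installed and configured with credentials for `us-west-2`; the Canonical owner ID and image-name filter are standard values for official Ubuntu AMIs, not something taken from this page:

```shell
# Assumes the AWS CLI is installed and configured for us-west-2.

# List VPC IDs in the region; pick the one matching your account setup:
aws ec2 describe-vpcs --region us-west-2 \
  --query 'Vpcs[].[VpcId,IsDefault,CidrBlock]' --output table

# Find the most recent Canonical Ubuntu 16.04 (xenial) HVM AMI;
# 099720109477 is Canonical's AWS account ID:
aws ec2 describe-images --region us-west-2 --owners 099720109477 \
  --filters 'Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-*' \
  --query 'sort_by(Images,&CreationDate)[-1].[ImageId,Name]' --output text
```

The second command prints a single AMI ID that can be substituted for `ami-6e1a0117` above if a newer image is wanted.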

It should complete with something like:

Connection to ec2-52-41-0-58.us-west-2.compute.amazonaws.com closed.
Spark standalone cluster started at http://ec2-52-41-0-58.us-west-2.compute.amazonaws.com:8080
Ganglia started at http://ec2-52-41-0-58.us-west-2.compute.amazonaws.com:5080/ganglia
Done!
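A quick way to confirm the cluster is actually up is to hit the Spark master web UI on port 8080. A minimal check, assuming the master hostname printed by the launch output (substitute your own):

```shell
# Substitute the master hostname printed in the launch output above.
MASTER=ec2-52-41-0-58.us-west-2.compute.amazonaws.com

# -sf: silent, and fail (non-zero exit) on an HTTP error status.
curl -sf "http://${MASTER}:8080" > /dev/null \
  && echo "Spark master UI is up" \
  || echo "Spark master UI is not responding"
```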

Login

./spark-ec2 -k ld4p -i ld4p.pem -r us-west-2 login ld4p-pipe
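The same `spark-ec2` script manages the rest of the cluster lifecycle. A sketch of the other common subcommands, mirroring the flags used above (stopped instances keep their EBS volumes but still accrue storage charges; `destroy` terminates the cluster entirely):

```shell
# Stop the cluster's instances without terminating them:
./spark-ec2 -k ld4p -i ld4p.pem -r us-west-2 stop ld4p-pipe

# Restart a stopped cluster:
./spark-ec2 -k ld4p -i ld4p.pem -r us-west-2 start ld4p-pipe

# Print the current master hostname:
./spark-ec2 -k ld4p -i ld4p.pem -r us-west-2 get-master ld4p-pipe

# Terminate the cluster and release its resources:
./spark-ec2 -k ld4p -i ld4p.pem -r us-west-2 destroy ld4p-pipe
```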

Additional Notes

The work ticket for this is at

For detailed instructions on provisioning a cluster by hand, see:

We are also looking into Puppet recipes for deployment management, e.g.

A colleague from code4lib-norcal recently recommended Terraform, e.g.