Provision Spark on Amazon EC2
Darren L. Weber, Ph.D. edited this page Sep 12, 2017 · 23 revisions
An LD4P Spark cluster can be created using:
git clone [email protected]:sul-dlss/spark-ec2.git
cd spark-ec2
TAGS="Group:ld4p_dev_spark,Manager:${USER},Service:spark,Stage:dev"
# NOTE: the "vpc-id" must be copied from the EC2 console dashboard (top right corner)
# NOTE: there is an --ami option, but leave it at the default so the scripts stay compatible with the default AMI
# - an Ubuntu 16.04 image did not work
# - the scripts require a Red Hat-style package manager (yum/rpm)
./spark-ec2 --key-pair=ld4p --identity-file="${HOME}/.ssh/ld4p.pem" \
--vpc-id=vpc-d84467b3 \
--region=us-west-2 --zone=all \
--master-instance-type=c4.xlarge \
--instance-type=c4.2xlarge --slaves 3 \
--no-ganglia --additional-tags="${TAGS}" --tag-volumes \
--additional-security-group=ld4p_dev_ssh_security_group \
launch ld4p_dev_spark
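Beyond launch, the same spark-ec2 script also supports stop, start, and destroy actions for managing the cluster lifecycle. A sketch of the teardown commands, using echo so it prints the commands rather than running them (safe to preview without AWS credentials; the option values mirror the launch command above, and the action names can be confirmed with ./spark-ec2 --help):

```shell
# Cluster lifecycle sketch: 'stop' pauses the instances, 'destroy' terminates
# them. echo is used so this prints the commands instead of executing them.
CLUSTER="ld4p_dev_spark"
COMMON_OPTS="--key-pair=ld4p --identity-file=${HOME}/.ssh/ld4p.pem --region=us-west-2"

# Stop the instances (they can be resumed later with the 'start' action):
echo ./spark-ec2 ${COMMON_OPTS} stop "${CLUSTER}"

# Permanently terminate the cluster:
echo ./spark-ec2 ${COMMON_OPTS} destroy "${CLUSTER}"
```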
It should respond with something like:
Setting up security groups...
Creating security group ld4p_dev_spark-master
Creating security group ld4p_dev_spark-slaves
Searching for existing cluster ld4p_dev_spark in region us-west-2...
Launching instances...
Launched 1 slave in us-west-2a, regid = r-0ae9a3c77a4699eb7
Launched 1 slave in us-west-2b, regid = r-0606edc02e95179c0
Launched 1 slave in us-west-2c, regid = r-0473d248b899620e6
Launched master in us-west-2c, regid = r-03f617264d2f7ce11
Waiting for AWS to propagate instance metadata...
Applying tags to master nodes
Applying tags to slave nodes
Applying tags to volumes
# ignore some temporary ssh connection failures
# lots of provisioning details...
Done!
To log in to the cluster master:
./spark-ec2 --key-pair=ld4p --identity-file="${HOME}/.ssh/ld4p.pem" \
--region=us-west-2 login ld4p_dev_spark
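The script's get-master action prints the master's hostname, which is handy for direct ssh or for building the spark:// master URL used below. A sketch, again shown with echo so it can be run without AWS access:

```shell
# Sketch: print (rather than run) the get-master command; its output is the
# master's DNS name, usable as: ssh -i ~/.ssh/ld4p.pem root@<master-dns>
OPTS="--key-pair=ld4p --identity-file=${HOME}/.ssh/ld4p.pem --region=us-west-2"
echo ./spark-ec2 ${OPTS} get-master ld4p_dev_spark
```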
Once logged in, Spark is installed under the /root/spark path. To submit a job, e.g.:
MASTER_PRIV_IP="spark://ip-172-31-42-178.us-west-2.compute.internal:7077"
/root/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master ${MASTER_PRIV_IP} \
/root/spark/examples/jars/spark-examples_2.11-2.2.0.jar 100
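The spark:// URL above is the master's private DNS name plus Spark standalone's default master port, 7077. A small sketch of deriving it on the master node itself (using hostname is an assumption; on EC2, hostname -f typically returns the private DNS name):

```shell
# Sketch: build the standalone master URL from the local hostname.
# Run on the master node; 7077 is Spark standalone's default master port.
MASTER_HOST="$(hostname -f 2>/dev/null || hostname)"
MASTER_URL="spark://${MASTER_HOST}:7077"
echo "${MASTER_URL}"
```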
The work ticket for this is at:
For detailed instructions on doing this by hand:
- https://sparkour.urizone.net/recipes/installing-ec2/
- http://blog.insightdatalabs.com/spark-cluster-step-by-step/
- if we need Hadoop, this might help (Hadoop 2.6 on Ubuntu 14.04):
We are also looking at Puppet recipes and deployment management, e.g.
- https://github.com/adobe-research/spark-cluster-deployment
- for running Spark on Mesos
A contact from code4lib-norcal recently recommended using Terraform, e.g.