This repo houses the example code for a blog post on using a persistent history server to view job history for your Spark / MapReduce jobs and aggregated YARN logs from short-lived clusters stored on GCS.
cluster_templates/
  history-server.yaml
  ephemeral-cluster.yaml
init_actions/
  disable_history_servers.sh
workflow_templates/
  spark_mr_workflow_template.yaml
terraform/
  variables.tf
  network.tf
  history-server.tf
  history-bucket.tf
  long-running-cluster.tf
  firewall.tf
  service-account.tf
The recommended way to run this example is with Terraform, as it creates a VPC network to run the example in, along with the appropriate firewall rules.
- Install Google Cloud SDK
- Enable the following APIs if not already enabled.
gcloud services enable compute.googleapis.com dataproc.googleapis.com
- [Optional] Install Terraform
This is for example purposes only. You should take a much closer look at the firewall rules that make sense for your organization's security requirements.
This repo provides artifacts to spin up the infrastructure for persisting job history and YARN logs with either Terraform or gcloud. The recommended approach is to modify the Terraform code to fit your needs for long-running resources.
However, the cluster templates are included as an example of standardizing cluster creation for ephemeral clusters. You might ask, "Why is there a cluster template for the history server?". The history server is simply a cleaner interface for reading your logs from GCS. For Spark, it is stateless, so you may wish to spin up a history server only when you'll actually be using it. For MapReduce, the history server is only aware of the files that were on GCS when it was created, plus those it moves from the intermediate done directory to the done directory. For this reason, MapReduce workflows should use a persistent history server.
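For context, an ephemeral cluster persists its history to GCS via cluster properties along these lines (a sketch only; the cluster name and bucket paths are placeholders, and the exact property set in cluster_templates/ephemeral-cluster.yaml may differ):

gcloud dataproc clusters create an-ephemeral-cluster \
  --region=us-central1 \
  --properties="\
yarn:yarn.log-aggregation-enable=true,\
yarn:yarn.nodemanager.remote-app-log-dir=gs://your-history-bucket/yarn-logs,\
spark:spark.eventLog.enabled=true,\
spark:spark.eventLog.dir=gs://your-history-bucket/spark-events,\
mapred:mapreduce.jobhistory.done-dir=gs://your-history-bucket/done-dir,\
mapred:mapreduce.jobhistory.intermediate-done-dir=gs://your-history-bucket/intermediate-done-dir"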
Often, it makes sense to leave the history server running, because several teams may use it, and you can configure it to manage cleanup of your logs by setting the following additional properties (an example follows the list):
yarn:yarn.log-aggregation.retain-seconds
spark:spark.history.fs.cleaner.enabled
spark:spark.history.fs.cleaner.maxAge
mapred:mapreduce.jobhistory.cleaner.enable
mapred:mapreduce.jobhistory.cleaner.max-age-ms
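For example, a history server created from the command line with retention configured might look like this (the seven-day values are illustrative; choose retention periods that match your requirements):

gcloud dataproc clusters create history-server \
  --region=us-central1 \
  --single-node \
  --properties="\
yarn:yarn.log-aggregation.retain-seconds=604800,\
spark:spark.history.fs.cleaner.enabled=true,\
spark:spark.history.fs.cleaner.maxAge=7d,\
mapred:mapreduce.jobhistory.cleaner.enable=true,\
mapred:mapreduce.jobhistory.cleaner.max-age-ms=604800000"

Both 604800 seconds and 604800000 milliseconds equal seven days, matching the maxAge of 7d.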
To spin up the whole example, you can simply edit the terraform.tfvars file to set the variables to the desired values and run the following commands.
Note: this assumes you have an existing project and sufficient permissions to spin up the resources for this example.
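For illustration, a filled-in terraform.tfvars might look like the following. The variable names here are assumptions; check variables.tf for the names this repo actually defines.

# Hypothetical variable names for illustration only;
# see variables.tf for the real ones.
project        = "your-gcp-project-id"
region         = "us-central1"
zone           = "us-central1-f"
history_bucket = "your-history-bucket"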
cd terraform
terraform init
terraform apply
This will create:
- A VPC network and subnetwork for your Dataproc clusters.
- Various firewall rules for this network.
- A single-node Dataproc history server cluster.
- A long-running Dataproc cluster.
- A GCS bucket for YARN log aggregation and Spark / MapReduce job history, as well as the initialization actions for your clusters.
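When you're finished with the example, you can tear down everything Terraform created:

cd terraform
terraform destroy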
These instructions detail how to run this entire example with gcloud.
- Replace PROJECT with your GCP project id in each file.
- Replace HISTORY_BUCKET with your GCS bucket for logs in each file.
- Replace HISTORY_SERVER with the name of your Dataproc history server in each file.
- Replace REGION with your desired GCP Compute region.
- Replace ZONE with your desired GCP Compute zone.
- Replace SUBNET with the subnetwork your clusters should use.
cd workflow_templates
sed -i 's/PROJECT/your-gcp-project-id/g' *
sed -i 's/HISTORY_BUCKET/your-history-bucket/g' *
sed -i 's/HISTORY_SERVER/your-history-server/g' *
sed -i 's/REGION/us-central1/g' *
sed -i 's/ZONE/us-central1-f/g' *
sed -i 's/SUBNET/your-subnet-id/g' *
cd ../cluster_templates
sed -i 's/PROJECT/your-gcp-project-id/g' *
sed -i 's/HISTORY_BUCKET/your-history-bucket/g' *
sed -i 's/HISTORY_SERVER/your-history-server/g' *
sed -i 's/REGION/us-central1/g' *
sed -i 's/ZONE/us-central1-f/g' *
sed -i 's/SUBNET/your-subnet-id/g' *
Stage an empty file to create the spark-events path on GCS.
touch .keep
gsutil cp .keep gs://your-history-bucket/spark-events/.keep
rm .keep
Stage the initialization action that disables the history servers on your ephemeral clusters (from the repo root, and using the same destination path that your cluster templates reference).
cd ..
gsutil cp init_actions/disable_history_servers.sh gs://your-history-bucket/init_actions/disable_history_servers.sh
Create the history server.
gcloud beta dataproc clusters import \
history-server \
--source=cluster_templates/history-server.yaml \
--region=us-central1
Create a cluster that you can manually submit jobs to and tear down yourself; an example submission follows the command.
gcloud beta dataproc clusters import \
ephemeral-cluster \
--source=cluster_templates/ephemeral-cluster.yaml \
--region=us-central1
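Once the cluster is up, you can submit a job to it manually. For instance, a sketch that runs the SparkPi example shipped with Dataproc images:

gcloud dataproc jobs submit spark \
  --cluster=ephemeral-cluster \
  --region=us-central1 \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 1000

When you're done, tear the cluster down with gcloud dataproc clusters delete ephemeral-cluster --region=us-central1.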
Import the workflow template to run an example Spark and Hadoop job to verify your setup is working.
gcloud dataproc workflow-templates import spark-mr-example \
  --source=workflow_templates/spark_mr_workflow_template.yaml \
  --region=us-central1
Trigger the workflow template to spin up a cluster, run the example jobs and tear it down.
gcloud dataproc workflow-templates instantiate spark-mr-example \
  --region=us-central1
Follow these instructions to view the UIs by SSH tunneling to the history server (a tunnel sketch follows this list). Ports to visit:
- MapReduce Job History: 19888
- Spark Job History: 18080
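As a sketch, assuming the history server's master node is named history-server-m (the Dataproc naming convention for a cluster named history-server) in zone us-central1-f, you can open a SOCKS proxy tunnel with:

gcloud compute ssh history-server-m \
  --zone=us-central1-f \
  -- -N -D 1080

Then point a browser configured to use the SOCKS proxy at localhost:1080 to http://history-server-m:19888 or http://history-server-m:18080.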
If you're adapting this example for your own use, consider the following:
- Setting an appropriate retention for your logs.
- Setting more appropriate firewall rules for your security requirements.