Who watches the watchers?

`spark-nanny` is a simple app designed to monitor the health of spark apps installed using `spark-operator`, and restart the driver pod in case it's unresponsive.
The main motivation comes from two caveats in the way spark (and `spark-operator`) work when run on kubernetes:

- Spark doesn't have any configuration to provide health and readiness checks for pods
- The `spark-operator` doesn't provide any mechanism that supports this (e.g. in the form of a mutating webhook), see here

While it is possible to use pod templates to launch spark apps (as of spark 3.0.0) and define probes there, this isn't supported by the `spark-operator` (yet, maybe in the future, see this)
The simplest way to install `spark-nanny` is to use the provided chart:

Requires helm3

```sh
# First, add the spark-nanny repo
$ helm repo add bringg-spark-nanny https://bringg.github.io/spark-nanny

$ helm install spark-nanny --namespace spark bringg-spark-nanny/spark-nanny --set sparkApps="spark-app1\,spark-app2"
```

Note that the comma must be escaped (with `\`)

See the chart's readme for more details and possible configuration options
`spark-nanny` is configured using command line flags passed to the executable. The following flags are supported:

| key | default | description |
|---|---|---|
| `apps` | `""` | comma separated list of spark app names to watch, e.g. `spark-app1,spark-app2` (required) |
| `interval` | `30` | time in seconds between checks |
| `timeout` | `10` | timeout in seconds to wait for a response from the driver pod |
| `namespace` | `spark` | spark apps namespace |
| `dry-run` | `false` | performs all the checks and logic, but won't actually delete the pod |
| `debug` | `false` | set to `true` to enable more verbose logging |
| `listen-address` | `:9164` | address to listen on for health checks and metrics |
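As a concrete example, a dry-run invocation watching two apps might look like the sketch below. The binary name and flag values are illustrative; all flags are taken from the table above:

```sh
# Watch two spark apps in the "spark" namespace, checking every 60 seconds,
# but only log what would happen instead of actually deleting pods
$ spark-nanny --apps spark-app1,spark-app2 --namespace spark --interval 60 --dry-run
```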
`spark-nanny` does the following for each spark app passed via the `--apps` flag:

- Get the pod ip from the kubernetes api server
- Make sure the pod isn't in terminating phase, all containers are in `running` state and have been running for at least 60 seconds
- Issue a `GET` request on the driver application endpoint `http://<pod-ip>:4040/api/v1/applications`
- If the request times out, the connection is refused or a non-200 status code is returned, retry 2 more times
- If after 3 retries the driver pod still doesn't return a 200 status code, delete the pod
- Rinse and repeat every `interval` period
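To see what a healthy poke looks like, you can issue the same request by hand from somewhere that can reach the driver pod; a responsive driver should print `200`. The pod ip below is a placeholder:

```sh
# A responsive driver returns a 200 status code from the applications endpoint
$ curl -s -o /dev/null -w "%{http_code}\n" --max-time 10 http://10.0.12.34:4040/api/v1/applications
```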
Once the driver pod is deleted, any executors owned by it will also be deleted, and the spark app will be rescheduled by the operator
Starting from `v0.2.0`, `spark-nanny` exposes some metrics about its operations:

- `spark_nanny_poke_duration_seconds` - histogram metric with the `poke` operation duration for each watched spark app
- `spark_nanny_kill_count` - counter metric with the number of times a spark app was killed by `spark-nanny`

Metrics are available on the `listen-address` (default `:9164`) at the `/metrics` path
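For example, you can check the exposed metrics with a quick `curl` against the default listen address:

```sh
# List the spark-nanny metrics served at the /metrics path
$ curl -s http://localhost:9164/metrics | grep ^spark_nanny
```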
Requires go 1.16+
Clone the repo and run `make install-tools`; this will download the project dependencies and required tools.

Use `make build` to build locally and test. Note that because `spark-nanny` issues `http` requests to the driver pod, the pod needs to be accessible from `spark-nanny`
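For local testing, one way to make a driver pod reachable is a `kubectl` port-forward; the pod name below is a placeholder (driver pods are typically named `<app-name>-driver`):

```sh
# Forward the driver's UI port locally, then poke the applications endpoint
$ kubectl -n spark port-forward pod/spark-app1-driver 4040:4040 &
$ curl -s http://localhost:4040/api/v1/applications
```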
To release a new version of `spark-nanny` do the following:

- Create and push a new version tag (e.g. `0.0.8`), docker hub auto build will build a new version of the image with the same tag
- Increment the `appVersion` and `version` in `charts/spark-nanny/Chart.yaml`
- Merge the changes to the `main` branch
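The tagging step, sketched out (the version number is an example):

```sh
# Create and push a new version tag; docker hub's auto build picks it up
$ git tag 0.0.8
$ git push origin 0.0.8
```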
In case `values` have changed, run `make helm-docs` to regenerate the chart's `README` file