---
description: An example GitOps recipe
---

⌨ Tutorial

This repository contains a PySpark example job (how did you guess it was word count?) that we are going to operationalize using Helm and databricks-kube-operator. You can follow along with a local minikube cluster, or in an environment with ArgoCD or Fleet.

Create a Helm umbrella chart

Begin by creating a Helm umbrella chart. The Helm starter chart includes example resources and values that we do not need, so remove them:

helm create example-job
rm example-job/templates/NOTES.txt
rm -rf example-job/templates/*.yaml
rm -rf example-job/templates/tests
echo > example-job/values.yaml

Your directory structure should look like this:

example-job
├── Chart.yaml
├── charts
├── templates
│   └── _helpers.tpl
└── values.yaml
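
For reference, the Chart.yaml scaffolded by helm create looks roughly like the sketch below (exact comments and defaults vary by Helm version); the dependency in the next step is appended to this same file:

apiVersion: v2
name: example-job
description: A Helm chart for Kubernetes
type: application
# Chart version: bump this when the Databricks templates change
version: 0.1.0
# Upstream application version; not meaningful for this umbrella chart
appVersion: "1.16.0"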

In Chart.yaml, add a dependency on the operator chart:

dependencies:
  - name: databricks-kube-operator
    repository: https://mach-kernel.github.io/databricks-kube-operator
    version: 0.5.0
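
Helm will not render or install the chart until the declared dependency has been downloaded into charts/ (the version shown above may lag behind the latest operator release):

helm dependency update example-job

This resolves the repository URL from Chart.yaml and vendors the operator chart archive under example-job/charts/.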

Populating Databricks resources

We are now going to create our resources in the templates/ directory.

1. Operator configmap and Databricks access token

Begin by creating a Databricks service principal, then use the corresponding API call to create an access token for it. If your new service principal is unable to issue a token, enable token permissions for it by following the instructions in this KB.

{% hint style="info" %} In a production environment, the Databricks API URL and access token can be sourced via External Secrets Operator in combination with a secrets backend (e.g. AWS Secrets Manager). {% endhint %}

Create a secret containing your Databricks API URL and a valid access token. The snippet below is provided for convenience, to run against the cluster for this example. Do not create a template for this secret or check your token into source control.

{% code lineNumbers="true" %}

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: databricks-api-secret
data:
  access_token: $(echo -n 'shhhh' | base64)
  databricks_url: $(echo -n 'https://my-tenant.cloud.databricks.com/api' | base64)
EOF

{% endcode %}
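
If you want to sanity-check the secret before wiring it up to the operator, the values decode back out with kubectl and base64 (names here match the snippet above):

kubectl get secret databricks-api-secret -o jsonpath='{.data.databricks_url}' | base64 --decode
kubectl get secret databricks-api-secret -o jsonpath='{.data.access_token}' | base64 --decode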

Create the file below. The operator reads its REST configuration from the secret named in this configmap.

{% code title="templates/databricks-kube-operator-configmap.yaml" %}

apiVersion: v1
kind: ConfigMap
metadata:
  name: databricks-kube-operator
  namespace: {{ .Release.Namespace }}
data:
  api_secret_name: databricks-api-secret

{% endcode %}
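
Once the chart's dependency has been fetched, you can render just this template to confirm the namespace and secret name come out as expected (the --show-only path must match the filename you chose):

helm template example-job --show-only templates/databricks-kube-operator-configmap.yaml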

2. Git Credentials

Public repositories do not require Git credentials, and the tutorial deploys the job from this public repository. You can skip this step unless you are following along with your own job in a private repo.

Here is another quick snippet for creating the required secret if you are deploying your own job from a private repo. As previously mentioned, do not check this in as a template.

{% code lineNumbers="true" %}

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: my-git-credential-secret
data:
  personal_access_token: $(echo -n 'shhhh' | base64)
EOF

{% endcode %}

Create the file below. According to the API documentation, the following VCS providers are available:

The available Git providers are awsCodeCommit, azureDevOpsServices, bitbucketCloud, bitbucketServer, gitHub, gitHubEnterprise, gitLab, and gitLabEnterpriseEdition.

apiVersion: com.dstancu.databricks/v1
kind: GitCredential
metadata:
  annotations:
    databricks-operator/owner: operator
  name: example-credential
  namespace: {{ .Release.Namespace }}
spec:
  secret_name: my-git-credential-secret
  credential:
    git_username: my-user-name
    git_provider: gitHub

3. DatabricksJob

Create the file below to define a job. There are two possible strategies for running jobs from Git sources. For more configuration options, see the API SDK docs.

{% hint style="info" %} Does your job use an instance profile? You will have to grant your new service principal access to the instance profile, or your job will fail to launch. {% endhint %}

Using the Git provider

With your credentials configured, Databricks job definitions support referencing a Git source directly. Whenever the job is triggered, it uses the latest version from source control without needing to poll the repo for updates.

{% code title="templates/my-word-count.yaml" lineNumbers="true" %}

apiVersion: com.dstancu.databricks/v1
kind: DatabricksJob
metadata:
  name: my-word-count
  namespace: {{ .Release.Namespace }}
spec:
  job:
    settings:
      email_notifications:
        no_alert_for_skipped_runs: false
      format: MULTI_TASK
      job_clusters:
      - job_cluster_key: word-count-cluster
        new_cluster:
          aws_attributes:
            availability: SPOT_WITH_FALLBACK
            ebs_volume_count: 1
            ebs_volume_size: 32
            ebs_volume_type: GENERAL_PURPOSE_SSD
            first_on_demand: 1
            spot_bid_price_percent: 100
            zone_id: us-east-1e
          custom_tags:
            ResourceClass: SingleNode
          driver_node_type_id: m4.large
          enable_elastic_disk: false
          node_type_id: m4.large
          num_workers: 0
          spark_conf:
            spark.databricks.cluster.profile: singleNode
            spark.master: local[*, 4]
          spark_env_vars:
            PYSPARK_PYTHON: /databricks/python3/bin/python3
          spark_version: 10.4.x-scala2.12
      max_concurrent_runs: 1
      name: my-word-count
      git_source:
        git_branch: master
        git_provider: gitHub
        git_url: https://github.com/mach-kernel/databricks-kube-operator
      tasks:
      - email_notifications: {}
        job_cluster_key: word-count-cluster
        notebook_task:
          # NOTE: Do not provide the file extension, or your job will fail
          notebook_path: examples/job
          source: GIT
        task_key: my-word-count
        timeout_seconds: 0
      timeout_seconds: 0

{% endcode %}

Using the repos / workspace integration

Follow the optional Git Repo instructions before proceeding.

This is for use with the Repos API, which clones a repository into your workspace. Tasks are then launched from WORKSPACE paths. You can reuse the CRD from above, removing git_source and changing the task definition to match the example below:

{% code title="templates/my-word-count-job.yaml" lineNumbers="true" %}

apiVersion: com.dstancu.databricks/v1
kind: DatabricksJob
metadata:
  name: my-word-count
  namespace: {{ .Release.Namespace }}
spec:
  job:
    settings:
      tasks:
      - email_notifications: {}
        job_cluster_key: word-count-cluster
        notebook_task:
          notebook_path: /Repos/Test/databricks-kube-operator/examples/job
          source: WORKSPACE
        task_key: my-word-count
        timeout_seconds: 0

{% endcode %}

4. All together now

Awesome! We have templates for our shiny new job. Let's make sure the chart works as expected. Inspect the resulting templates for errors:

helm template example-job

If everything looks good, it's time to install. Unfortunately, this requires a discussion of the dreaded "install CRDs first" problem. Here are suggestions for different readers:

  • Local/minikube: Comment out the dependency key and continue with installation
  • ArgoCD: Use sync waves
  • Fleet/others: Use one chart for your operator deployment, and another for the Databricks resources. On first deploy, the operator chart will sync successfully and example-job will do so on retry.

helm install word-count example-job

If successful, you should see the following Helm releases, as well as your job in Databricks:

NAME                            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
databricks-kube-operator        default         1               2022-11-06 09:54:53.057226 -0500 EST    deployed        databricks-kube-operator-0.5.0  1.16.0
word-count                      default         1               2022-11-06 10:11:42.774865 -0500 EST    deployed        example-job-0.1.0               1.16.0
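
Beyond the Helm releases, you can also confirm that the custom resource landed in the cluster and that the job exists upstream. The commands below assume the CRD uses the default lowercase-plural resource name (databricksjobs) and substitute your own workspace URL and token from the secret created earlier:

kubectl get databricksjobs -n default
curl -s -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  "https://my-tenant.cloud.databricks.com/api/2.1/jobs/list"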

Bump the chart version for your Databricks definitions as they change, and let databricks-kube-operator reconcile them when they are merged to your main branch.
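
If you are following along on minikube without a GitOps controller doing the syncing, the same loop can be driven by hand with a plain Helm upgrade after bumping the version:

helm upgrade word-count example-job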

Optional: Git Repo

We recommend using the Git source for your job definitions, as databricks-kube-operator does not poll Databricks to keep the workspace repository clone up to date. PRs are accepted.

Create the file below to create a repo. Ensure that the /Test directory exists within Repos (docs) on your Databricks instance, or the create request will fail with a 400 (one way to create the folder via the API is sketched after the template below):

{% code title="templates/repo.yaml" lineNumbers="true" %}

apiVersion: com.dstancu.databricks/v1
kind: Repo
metadata:
  annotations:
    databricks-operator/owner: operator
  name: databricks-kube-operator
  namespace: {{ .Release.Namespace }}
spec:
  repository:
    path: /Repos/Test/databricks-kube-operator
    provider: gitHub
    url: https://github.com/mach-kernel/databricks-kube-operator.git

{% endcode %}
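
If the /Test folder does not exist yet, one way to create it without the UI is the Workspace API's mkdirs endpoint; this assumes your workspace allows creating folders under /Repos this way (otherwise, create the folder from the Repos UI):

curl -s -X POST \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  https://my-tenant.cloud.databricks.com/api/2.0/workspace/mkdirs \
  -d '{"path": "/Repos/Test"}'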