Commit
Merge pull request #289 from nvliyuan/main-v2304
merge branch-23.04 to main branch
nvliyuan authored Apr 28, 2023
2 parents 0cc1d0c + 234fdb7 commit 3cff617
Showing 40 changed files with 266 additions and 104 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/auto-merge.yml
@@ -18,7 +18,7 @@ name: auto-merge HEAD to BASE
on:
  pull_request_target:
    branches:
-      - branch-23.02
+      - branch-23.04
    types: [closed]

jobs:
@@ -29,14 +29,14 @@ jobs:
    steps:
      - uses: actions/checkout@v3
        with:
-          ref: branch-23.02 # force to fetch from latest upstream instead of PR ref
+          ref: branch-23.04 # force to fetch from latest upstream instead of PR ref

      - name: auto-merge job
        uses: ./.github/workflows/auto-merge
        env:
          OWNER: NVIDIA
          REPO_NAME: spark-rapids-examples
-          HEAD: branch-23.02
-          BASE: branch-23.04
+          HEAD: branch-23.04
+          BASE: branch-23.06
          AUTOMERGE_TOKEN: ${{ secrets.AUTOMERGE_TOKEN }} # use to merge PR

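The `HEAD`/`BASE` pair above sets the direction of the forward merge: when a pull request into `HEAD` closes, the job merges `HEAD` into `BASE`. A minimal bash sketch of that flow, assuming the `gh` CLI and a token with push rights (the real logic lives in the `./.github/workflows/auto-merge` action and may differ):

```bash
#!/bin/bash
# Hypothetical sketch of a HEAD -> BASE forward merge; not the actual action code.
set -euo pipefail
HEAD=branch-23.04   # source branch, from the workflow env above
BASE=branch-23.06   # target branch, from the workflow env above

git fetch origin "$HEAD" "$BASE"
git checkout -B "auto-merge-$HEAD" "origin/$BASE"

if git merge --no-edit "origin/$HEAD"; then
  # Clean merge: push the intermediate branch and open a PR against BASE.
  git push --force origin "auto-merge-$HEAD"
  gh pr create --base "$BASE" --head "auto-merge-$HEAD" \
    --title "[auto-merge] $HEAD to $BASE" --body "Automated forward merge."
else
  echo "Merge conflict between $HEAD and $BASE; resolve manually." >&2
  exit 1
fi
```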
76 changes: 36 additions & 40 deletions docs/get-started/xgboost-examples/csp/databricks/databricks.md
@@ -6,27 +6,26 @@ This is a getting started guide to XGBoost4J-Spark on Databricks. At the end of
Prerequisites
-------------

-* Apache Spark 3.1+ running in Databricks Runtime 9.1 ML or 10.4 ML with GPU
-  * AWS: 9.1 LTS ML (GPU, Scala 2.12, Spark 3.1.2) or 10.4 LTS ML (GPU, Scala 2.12, Spark 3.2.1)
-  * Azure: 9.1 LTS ML (GPU, Scala 2.12, Spark 3.1.2) or 10.4 LTS ML (GPU, Scala 2.12, Spark 3.2.1)
+* Apache Spark 3.x running in Databricks Runtime 10.4 ML or 11.3 ML with GPU
+  * AWS: 10.4 LTS ML (GPU, Scala 2.12, Spark 3.2.1) or 11.3 LTS ML (GPU, Scala 2.12, Spark 3.3.0)
+  * Azure: 10.4 LTS ML (GPU, Scala 2.12, Spark 3.2.1) or 11.3 LTS ML (GPU, Scala 2.12, Spark 3.3.0)

The number of GPUs per node dictates the number of Spark executors that can run in that node. Each executor should only be allowed to run 1 task at any given time.

Start A Databricks Cluster
--------------------------

-Create a Databricks cluster by clicking "+ Create -> Cluster" on the left panel. Ensure the
+Create a Databricks cluster by going to "Compute", then clicking `+ Create compute`. Ensure the
cluster meets the prerequisites above by configuring it as follows:
1. Select the Databricks Runtime Version from one of the supported runtimes specified in the
   Prerequisites section.
-2. Under Autopilot Options, disable autoscaling.
-3. Choose the number of workers you want to use.
-4. Select a worker type. On AWS, use nodes with 1 GPU each such as `p3.2xlarge` or `g4dn.xlarge`.
+2. Choose the number of workers that matches the number of GPUs you want to use.
+3. Select a worker type. On AWS, use nodes with 1 GPU each such as `p3.2xlarge` or `g4dn.xlarge`.
   p2 nodes do not meet the architecture requirements (Pascal or higher) for the Spark worker
   (although they can be used for the driver node). For Azure, choose GPU nodes such as
-   Standard_NC6s_v3.
-5. Select the driver type. Generally this can be set to be the same as the worker.
-6. Start the cluster.
+   Standard_NC6s_v3. For GCP, choose N1 or A2 instance types with GPUs.
+4. Select the driver type. Generally this can be set to be the same as the worker.
+5. Start the cluster.

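The same cluster shape can be scripted rather than clicked through. A rough sketch with the legacy Databricks CLI; the runtime key, node types, and worker count here are illustrative assumptions to adapt to your workspace:

```bash
# Hypothetical: a 2-worker GPU cluster mirroring the UI steps above (AWS node types assumed).
databricks clusters create --json '{
  "cluster_name": "xgboost-gpu-example",
  "spark_version": "11.3.x-gpu-ml-scala2.12",
  "node_type_id": "g4dn.xlarge",
  "driver_node_type_id": "g4dn.xlarge",
  "num_workers": 2
}'
```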
Advanced Cluster Configuration
--------------------------
@@ -38,20 +38,18 @@ cluster.
your workspace. See [Managing
Notebooks](https://docs.databricks.com/notebooks/notebooks-manage.html#id2) for instructions on
how to import a notebook.
-Select the initialization script based on the Databricks runtime
+Select the version of the RAPIDS Accelerator for Apache Spark based on the Databricks runtime
version:

-- [Databricks 9.1 LTS
-  ML](https://docs.databricks.com/release-notes/runtime/9.1ml.html#system-environment) has CUDA 11
-  installed. Users will need to use 21.12.0 or later on Databricks 9.1 LTS ML. In this case use
-  [generate-init-script.ipynb](generate-init-script.ipynb) which will install
-  the RAPIDS Spark plugin.
-
-- [Databricks 10.4 LTS
-  ML](https://docs.databricks.com/release-notes/runtime/9.1ml.html#system-environment) has CUDA 11
-  installed. Users will need to use 22.04.0 or later on Databricks 10.4 LTS ML. In this case use
-  [generate-init-script-10.4.ipynb](generate-init-script-10.4.ipynb) which will install
-  the RAPIDS Spark plugin.
+- [Databricks 10.4 LTS
+  ML](https://docs.databricks.com/release-notes/runtime/10.4ml.html#system-environment) has CUDA 11
+  installed. Users will need to use 22.04.0 or later on Databricks 10.4 LTS ML.
+- [Databricks 11.3 LTS
+  ML](https://docs.databricks.com/release-notes/runtime/11.3ml.html#system-environment) has CUDA 11
+  installed. Users will need to use 23.04.0 or later on Databricks 11.3 LTS ML.
+
+In both cases use
+[generate-init-script.ipynb](./generate-init-script.ipynb) which will install
+the RAPIDS Spark plugin.

2. Once you are in the notebook, click the “Run All” button.
3. Ensure that the newly created init.sh script is present in the output from cell 2 and that the
@@ -72,23 +69,17 @@
The
[`spark.task.resource.gpu.amount`](https://spark.apache.org/docs/latest/configuration.html#scheduling)
configuration is defaulted to 1 by Databricks. That means that only 1 task can run on an
executor with 1 GPU, which is limiting, especially on the reads and writes from Parquet. Set
this to 1/(number of cores per executor) which will allow multiple tasks to run in parallel just
like the CPU side. Having the value smaller is fine as well.
-
-There is an incompatibility between the Databricks specific implementation of adaptive query
-execution (AQE) and the spark-rapids plugin. In order to mitigate this,
-`spark.sql.adaptive.enabled` should be set to false. In addition, the plugin does not work with
-the Databricks `spark.databricks.delta.optimizeWrite` option.
+Note: Please remove the `spark.task.resource.gpu.amount` config for a single-node Databricks
+cluster because Spark local mode does not support GPU scheduling.

```bash
spark.plugins com.nvidia.spark.SQLPlugin
spark.task.resource.gpu.amount 0.1
spark.rapids.memory.pinnedPool.size 2G
-spark.locality.wait 0s
-spark.databricks.delta.optimizeWrite.enabled false
-spark.sql.adaptive.enabled false
spark.rapids.sql.concurrentGpuTasks 2
```

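To make the 1/(number of cores per executor) rule concrete, a hedged example assuming workers that give each executor 8 cores:

```bash
# Assumed: 8 cores per executor, 1 GPU per node.
# 1 / 8 = 0.125, so up to 8 tasks can share the executor's single GPU:
spark.task.resource.gpu.amount 0.125
```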
![Spark Config](../../../../img/databricks/sparkconfig.png)
@@ -186,6 +177,11 @@ Limitations
4. Databricks makes changes to the runtime without notification.
   Databricks makes changes to existing runtimes, applying patches, without notification.
   [Issue-3098](https://github.com/NVIDIA/spark-rapids/issues/3098) is one example of this. We run
   regular integration tests on the Databricks environment to catch these issues and fix them once
   detected.
+5. In Databricks 11.3, an incorrect result is returned for window frames defined by a range in case
+   of DecimalTypes with precision greater than 38. There is a bug filed in Apache Spark for it
+   [here](https://issues.apache.org/jira/browse/SPARK-41793), whereas when using the plugin the
+   correct result will be returned.
@@ -24,7 +24,7 @@
"source": [
"%sh\n",
"cd ../../dbfs/FileStore/jars/\n",
"sudo wget -O rapids-4-spark_2.12-22.12.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.12.0/rapids-4-spark_2.12-22.12.0.jar\n",
"sudo wget -O rapids-4-spark_2.12-23.04.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.04.0/rapids-4-spark_2.12-23.04.0.jar\n",
"sudo wget -O xgboost4j-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.7.1/xgboost4j-gpu_2.12-1.7.1.jar\n",
"sudo wget -O xgboost4j-spark-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.7.1/xgboost4j-spark-gpu_2.12-1.7.1.jar\n",
"ls -ltr\n",
@@ -60,7 +60,7 @@
"sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.5.2.jar\n",
"\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-gpu_2.12-1.7.1.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.12.0.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-23.04.0.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar /databricks/jars/\"\"\", True)"
]
},
@@ -133,7 +133,7 @@
"1. Edit your cluster, adding an initialization script from `dbfs:/databricks/init_scripts/init.sh` in the \"Advanced Options\" under \"Init Scripts\" tab\n",
"2. Reboot the cluster\n",
"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.02/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.04/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
"5. Inside the mortgage example notebook, update the data paths\n",
" `train_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-train.csv')`\n",
" `trans_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-trans.csv')`"
@@ -0,0 +1,166 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download latest Jars"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"dbutils.fs.mkdirs(\"dbfs:/FileStore/jars/\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"%sh\n",
"cd ../../dbfs/FileStore/jars/\n",
"sudo wget -O rapids-4-spark_2.12-23.04.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.04.0/rapids-4-spark_2.12-23.04.0.jar\n",
"sudo wget -O xgboost4j-gpu_2.12-1.7.3.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.7.3/xgboost4j-gpu_2.12-1.7.3.jar\n",
"sudo wget -O xgboost4j-spark-gpu_2.12-1.7.3.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.7.3/xgboost4j-spark-gpu_2.12-1.7.3.jar\n",
"ls -ltr\n",
"\n",
"# Your Jars are downloaded in dbfs:/FileStore/jars directory"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a Directory for your init script"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"dbutils.fs.mkdirs(\"dbfs:/databricks/init_scripts/\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"dbutils.fs.put(\"/databricks/init_scripts/init.sh\",\"\"\"\n",
"#!/bin/bash\n",
"sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-gpu_2.12--ml.dmlc__xgboost4j-gpu_2.12__1.5.2.jar\n",
"sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.5.2.jar\n",
"\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-gpu_2.12-1.7.3.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-23.04.0.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.3.jar /databricks/jars/\"\"\", True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Confirm your init script is in the new directory"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"%sh\n",
"cd ../../dbfs/databricks/init_scripts\n",
"pwd\n",
"ls -ltr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download the Mortgage Dataset into your local machine and upload Data using import Data"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"dbutils.fs.mkdirs(\"dbfs:/FileStore/tables/\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"%sh\n",
"cd /dbfs/FileStore/tables/\n",
"wget -O mortgage.zip https://rapidsai-data.s3.us-east-2.amazonaws.com/spark/mortgage.zip\n",
"ls\n",
"unzip mortgage.zip"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"%sh\n",
"pwd\n",
"cd ../../dbfs/FileStore/tables\n",
"ls -ltr mortgage/csv/*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Next steps\n",
"\n",
"1. Edit your cluster, adding an initialization script from `dbfs:/databricks/init_scripts/init.sh` in the \"Advanced Options\" under \"Init Scripts\" tab\n",
"2. Reboot the cluster\n",
"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.3.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.04/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
"5. Inside the mortgage example notebook, update the data paths\n",
" `train_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-train.csv')`\n",
" `trans_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-trans.csv')`"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
},
"name": "Init Scripts_demo",
"notebookId": 2585487876834616
},
"nbformat": 4,
"nbformat_minor": 1
}
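After the cluster restarts with this init script attached, it is worth confirming that the jars actually landed on the Databricks classpath. A small check to run in a `%sh` notebook cell, using the destination paths the script writes to:

```bash
# Expect the three jars copied by init.sh to appear after the restart.
ls /databricks/jars/ | grep -E 'rapids-4-spark|xgboost4j'
```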
@@ -24,7 +24,7 @@
"source": [
"%sh\n",
"cd ../../dbfs/FileStore/jars/\n",
"sudo wget -O rapids-4-spark_2.12-22.12.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.12.0/rapids-4-spark_2.12-22.12.0.jar\n",
"sudo wget -O rapids-4-spark_2.12-23.04.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.04.0/rapids-4-spark_2.12-23.04.0.jar\n",
"sudo wget -O xgboost4j-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.7.1/xgboost4j-gpu_2.12-1.7.1.jar\n",
"sudo wget -O xgboost4j-spark-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.7.1/xgboost4j-spark-gpu_2.12-1.7.1.jar\n",
"ls -ltr\n",
@@ -60,7 +60,7 @@
"sudo rm -f /databricks/jars/spark--maven-trees--ml--9.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.4.1.jar\n",
"\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-gpu_2.12-1.7.1.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.12.0.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-23.04.0.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar /databricks/jars/\"\"\", True)"
]
},
@@ -133,7 +133,7 @@
"1. Edit your cluster, adding an initialization script from `dbfs:/databricks/init_scripts/init.sh` in the \"Advanced Options\" under \"Init Scripts\" tab\n",
"2. Reboot the cluster\n",
"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.02/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.04/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
"5. Inside the mortgage example notebook, update the data paths\n",
" `train_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-train.csv')`\n",
" `trans_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-trans.csv')`"
2 changes: 1 addition & 1 deletion docs/get-started/xgboost-examples/csp/dataproc/gcp.md
@@ -17,7 +17,7 @@
gcloud dataproc clusters create $CLUSTER_NAME \
--region=$REGION \
--image-version=2.0-ubuntu18 \
-    --master-machine-type=n1-standard-16 \
+    --master-machine-type=n2-standard-16 \
--num-workers=$NUM_WORKERS \
--worker-accelerator=type=nvidia-tesla-t4,count=$NUM_GPUS \
--worker-machine-type=n1-highmem-32\
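The cluster-create command continues beyond the lines shown. Since the master machine type moves from the n1 to the n2 family here, a quick hedged check that the new type is offered where you deploy (the zone suffix is an assumption; derive it from `$REGION`):

```bash
# Hypothetical availability check for the new master machine type.
gcloud compute machine-types describe n2-standard-16 --zone="${REGION}-b"
```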
@@ -11,11 +11,11 @@ Prerequisites
* NVIDIA Pascal™ GPU architecture or better
* Multi-node clusters with homogenous GPU configuration
* Software Requirements
-  * Ubuntu 18.04, 20.04/CentOS7, CentOS8
+  * Ubuntu 18.04, 20.04/CentOS7, Rocky Linux 8
* CUDA 11.0+
* NVIDIA driver compatible with your CUDA
* NCCL 2.7.8+
-* [Kubernetes 1.6+ cluster with NVIDIA GPUs](https://docs.nvidia.com/datacenter/kubernetes/index.html)
+* [Kubernetes cluster with NVIDIA GPUs](https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html)
* See official [Spark on Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#prerequisites)
instructions for detailed spark-specific cluster requirements
* kubectl installed and configured in the job submission environment
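A quick way to confirm the GPU prerequisite from the job submission environment, assuming the NVIDIA device plugin is already deployed on the cluster:

```bash
# Nodes that can schedule GPU pods advertise an nvidia.com/gpu resource.
kubectl describe nodes | grep "nvidia.com/gpu"
```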
@@ -40,7 +40,7 @@ export SPARK_DOCKER_IMAGE=<gpu spark docker image repo and name>
export SPARK_DOCKER_TAG=<spark docker image tag>

pushd ${SPARK_HOME}
-wget https://github.com/NVIDIA/spark-rapids-examples/raw/branch-23.02/dockerfile/Dockerfile
+wget https://github.com/NVIDIA/spark-rapids-examples/raw/branch-23.04/dockerfile/Dockerfile

# Optionally install additional jars into ${SPARK_HOME}/jars/

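The snippet continues beyond this hunk; the usual next step is to build and push the image from the downloaded Dockerfile. A hedged sketch under that assumption:

```bash
# Hypothetical continuation: build and push the GPU-enabled Spark image.
docker build . -t "${SPARK_DOCKER_IMAGE}:${SPARK_DOCKER_TAG}"
docker push "${SPARK_DOCKER_IMAGE}:${SPARK_DOCKER_TAG}"
popd
```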