Commit
Merge pull request #289 from nvliyuan/main-v2304
merge branch-23.04 to main branch
nvliyuan authored Apr 28, 2023
2 parents 0cc1d0c + 234fdb7 commit 3cff617
Showing 40 changed files with 266 additions and 104 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/auto-merge.yml
@@ -18,7 +18,7 @@ name: auto-merge HEAD to BASE
on:
  pull_request_target:
    branches:
-      - branch-23.02
+      - branch-23.04
    types: [closed]

jobs:
@@ -29,14 +29,14 @@ jobs:
    steps:
      - uses: actions/checkout@v3
        with:
-          ref: branch-23.02 # force to fetch from latest upstream instead of PR ref
+          ref: branch-23.04 # force to fetch from latest upstream instead of PR ref

      - name: auto-merge job
        uses: ./.github/workflows/auto-merge
        env:
          OWNER: NVIDIA
          REPO_NAME: spark-rapids-examples
-          HEAD: branch-23.02
-          BASE: branch-23.04
+          HEAD: branch-23.04
+          BASE: branch-23.06
          AUTOMERGE_TOKEN: ${{ secrets.AUTOMERGE_TOKEN }} # use to merge PR

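The `HEAD`/`BASE` pair above sets the direction of the forward merge: when a pull request into `HEAD` closes, the job merges `HEAD` into `BASE`. A minimal bash sketch of that flow, assuming the `gh` CLI and a token with push rights (the real logic lives in the `./.github/workflows/auto-merge` action and may differ):

```bash
#!/bin/bash
# Hypothetical sketch of a HEAD -> BASE forward merge; not the actual action code.
set -euo pipefail
HEAD=branch-23.04   # source branch, from the workflow env above
BASE=branch-23.06   # target branch, from the workflow env above

git fetch origin "$HEAD" "$BASE"
git checkout -B "auto-merge-$HEAD" "origin/$BASE"

if git merge --no-edit "origin/$HEAD"; then
  # Clean merge: push the intermediate branch and open a PR against BASE.
  git push --force origin "auto-merge-$HEAD"
  gh pr create --base "$BASE" --head "auto-merge-$HEAD" \
    --title "[auto-merge] $HEAD to $BASE" --body "Automated forward merge."
else
  echo "Merge conflict between $HEAD and $BASE; resolve manually." >&2
  exit 1
fi
```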
76 changes: 36 additions & 40 deletions docs/get-started/xgboost-examples/csp/databricks/databricks.md
@@ -6,27 +6,26 @@ This is a getting started guide to XGBoost4J-Spark on Databricks. At the end of
Prerequisites
-------------

-* Apache Spark 3.1+ running in Databricks Runtime 9.1 ML or 10.4 ML with GPU
-  * AWS: 9.1 LTS ML (GPU, Scala 2.12, Spark 3.1.2) or 10.4 LTS ML (GPU, Scala 2.12, Spark 3.2.1)
-  * Azure: 9.1 LTS ML (GPU, Scala 2.12, Spark 3.1.2) or 10.4 LTS ML (GPU, Scala 2.12, Spark 3.2.1)
+* Apache Spark 3.x running in Databricks Runtime 10.4 ML or 11.3 ML with GPU
+  * AWS: 10.4 LTS ML (GPU, Scala 2.12, Spark 3.2.1) or 11.3 LTS ML (GPU, Scala 2.12, Spark 3.3.0)
+  * Azure: 10.4 LTS ML (GPU, Scala 2.12, Spark 3.2.1) or 11.3 LTS ML (GPU, Scala 2.12, Spark 3.3.0)

The number of GPUs per node dictates the number of Spark executors that can run in that node. Each executor should only be allowed to run 1 task at any given time.

Start A Databricks Cluster
--------------------------

-Create a Databricks cluster by clicking "+ Create -> Cluster" on the left panel. Ensure the
+Create a Databricks cluster by going to "Compute", then clicking `+ Create compute`. Ensure the
cluster meets the prerequisites above by configuring it as follows:
1. Select the Databricks Runtime Version from one of the supported runtimes specified in the
   Prerequisites section.
-2. Under Autopilot Options, disable autoscaling.
-3. Choose the number of workers you want to use.
-4. Select a worker type. On AWS, use nodes with 1 GPU each such as `p3.2xlarge` or `g4dn.xlarge`.
+2. Choose the number of workers that matches the number of GPUs you want to use.
+3. Select a worker type. On AWS, use nodes with 1 GPU each such as `p3.2xlarge` or `g4dn.xlarge`.
   p2 nodes do not meet the architecture requirements (Pascal or higher) for the Spark worker
   (although they can be used for the driver node). For Azure, choose GPU nodes such as
-   Standard_NC6s_v3.
-5. Select the driver type. Generally this can be set to be the same as the worker.
-6. Start the cluster.
+   Standard_NC6s_v3. For GCP, choose N1 or A2 instance types with GPUs.
+4. Select the driver type. Generally this can be set to be the same as the worker.
+5. Start the cluster.

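The same cluster shape can be scripted rather than clicked through. A rough sketch with the legacy Databricks CLI; the runtime key, node types, and worker count here are illustrative assumptions to adapt to your workspace:

```bash
# Hypothetical: a 2-worker GPU cluster mirroring the UI steps above (AWS node types assumed).
databricks clusters create --json '{
  "cluster_name": "xgboost-gpu-example",
  "spark_version": "11.3.x-gpu-ml-scala2.12",
  "node_type_id": "g4dn.xlarge",
  "driver_node_type_id": "g4dn.xlarge",
  "num_workers": 2
}'
```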
Advanced Cluster Configuration
--------------------------
@@ -38,20 +38,18 @@ cluster.
your workspace. See [Managing
Notebooks](https://docs.databricks.com/notebooks/notebooks-manage.html#id2) for instructions on
how to import a notebook.
-Select the initialization script based on the Databricks runtime
+Select the version of the RAPIDS Accelerator for Apache Spark based on the Databricks runtime
version:

-- [Databricks 9.1 LTS
-  ML](https://docs.databricks.com/release-notes/runtime/9.1ml.html#system-environment) has CUDA 11
-  installed. Users will need to use 21.12.0 or later on Databricks 9.1 LTS ML. In this case use
-  [generate-init-script.ipynb](generate-init-script.ipynb) which will install
-  the RAPIDS Spark plugin.
-
-- [Databricks 10.4 LTS
-  ML](https://docs.databricks.com/release-notes/runtime/9.1ml.html#system-environment) has CUDA 11
-  installed. Users will need to use 22.04.0 or later on Databricks 10.4 LTS ML. In this case use
-  [generate-init-script-10.4.ipynb](generate-init-script-10.4.ipynb) which will install
-  the RAPIDS Spark plugin.
+- [Databricks 10.4 LTS
+  ML](https://docs.databricks.com/release-notes/runtime/10.4ml.html#system-environment) has CUDA 11
+  installed. Users will need to use 22.04.0 or later on Databricks 10.4 LTS ML.
+- [Databricks 11.3 LTS
+  ML](https://docs.databricks.com/release-notes/runtime/11.3ml.html#system-environment) has CUDA 11
+  installed. Users will need to use 23.04.0 or later on Databricks 11.3 LTS ML.
+
+In both cases use
+[generate-init-script.ipynb](./generate-init-script.ipynb) which will install
+the RAPIDS Spark plugin.

2. Once you are in the notebook, click the “Run All” button.
3. Ensure that the newly created init.sh script is present in the output from cell 2 and that the
@@ -72,23 +69,17 @@
The
[`spark.task.resource.gpu.amount`](https://spark.apache.org/docs/latest/configuration.html#scheduling)
configuration is defaulted to 1 by Databricks. That means that only 1 task can run on an
executor with 1 GPU, which is limiting, especially on the reads and writes from Parquet. Set
this to 1/(number of cores per executor) which will allow multiple tasks to run in parallel just
like the CPU side. Having the value smaller is fine as well.
-
-There is an incompatibility between the Databricks specific implementation of adaptive query
-execution (AQE) and the spark-rapids plugin. In order to mitigate this,
-`spark.sql.adaptive.enabled` should be set to false. In addition, the plugin does not work with
-the Databricks `spark.databricks.delta.optimizeWrite` option.
+Note: Please remove the `spark.task.resource.gpu.amount` config for a single-node Databricks
+cluster because Spark local mode does not support GPU scheduling.

```bash
spark.plugins com.nvidia.spark.SQLPlugin
spark.task.resource.gpu.amount 0.1
spark.rapids.memory.pinnedPool.size 2G
-spark.locality.wait 0s
-spark.databricks.delta.optimizeWrite.enabled false
-spark.sql.adaptive.enabled false
spark.rapids.sql.concurrentGpuTasks 2
```

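To make the 1/(number of cores per executor) rule concrete, a hedged example assuming workers that give each executor 8 cores:

```bash
# Assumed: 8 cores per executor, 1 GPU per node.
# 1 / 8 = 0.125, so up to 8 tasks can share the executor's single GPU:
spark.task.resource.gpu.amount 0.125
```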
![Spark Config](../../../../img/databricks/sparkconfig.png)
@@ -186,6 +177,11 @@ Limitations
4. Databricks makes changes to the runtime without notification.
   Databricks makes changes to existing runtimes, applying patches, without notification.
   [Issue-3098](https://github.com/NVIDIA/spark-rapids/issues/3098) is one example of this. We run
   regular integration tests on the Databricks environment to catch these issues and fix them once
   detected.
+5. In Databricks 11.3, an incorrect result is returned for window frames defined by a range in case
+   of DecimalTypes with precision greater than 38. There is a bug filed in Apache Spark for it
+   [here](https://issues.apache.org/jira/browse/SPARK-41793), whereas when using the plugin the
+   correct result will be returned.
@@ -24,7 +24,7 @@
"source": [
"%sh\n",
"cd ../../dbfs/FileStore/jars/\n",
"sudo wget -O rapids-4-spark_2.12-22.12.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.12.0/rapids-4-spark_2.12-22.12.0.jar\n",
"sudo wget -O rapids-4-spark_2.12-23.04.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.04.0/rapids-4-spark_2.12-23.04.0.jar\n",
"sudo wget -O xgboost4j-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.7.1/xgboost4j-gpu_2.12-1.7.1.jar\n",
"sudo wget -O xgboost4j-spark-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.7.1/xgboost4j-spark-gpu_2.12-1.7.1.jar\n",
"ls -ltr\n",
@@ -60,7 +60,7 @@
"sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.5.2.jar\n",
"\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-gpu_2.12-1.7.1.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.12.0.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-23.04.0.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar /databricks/jars/\"\"\", True)"
]
},
@@ -133,7 +133,7 @@
"1. Edit your cluster, adding an initialization script from `dbfs:/databricks/init_scripts/init.sh` in the \"Advanced Options\" under \"Init Scripts\" tab\n",
"2. Reboot the cluster\n",
"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.02/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.04/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
"5. Inside the mortgage example notebook, update the data paths\n",
" `train_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-train.csv')`\n",
" `trans_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-trans.csv')`"
@@ -0,0 +1,166 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download latest Jars"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"dbutils.fs.mkdirs(\"dbfs:/FileStore/jars/\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"%sh\n",
"cd ../../dbfs/FileStore/jars/\n",
"sudo wget -O rapids-4-spark_2.12-23.04.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.04.0/rapids-4-spark_2.12-23.04.0.jar\n",
"sudo wget -O xgboost4j-gpu_2.12-1.7.3.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.7.3/xgboost4j-gpu_2.12-1.7.3.jar\n",
"sudo wget -O xgboost4j-spark-gpu_2.12-1.7.3.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.7.3/xgboost4j-spark-gpu_2.12-1.7.3.jar\n",
"ls -ltr\n",
"\n",
"# Your Jars are downloaded in dbfs:/FileStore/jars directory"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a Directory for your init script"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"dbutils.fs.mkdirs(\"dbfs:/databricks/init_scripts/\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"dbutils.fs.put(\"/databricks/init_scripts/init.sh\",\"\"\"\n",
"#!/bin/bash\n",
"sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-gpu_2.12--ml.dmlc__xgboost4j-gpu_2.12__1.5.2.jar\n",
"sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.5.2.jar\n",
"\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-gpu_2.12-1.7.3.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-23.04.0.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.3.jar /databricks/jars/\"\"\", True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Confirm your init script is in the new directory"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"%sh\n",
"cd ../../dbfs/databricks/init_scripts\n",
"pwd\n",
"ls -ltr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download the Mortgage Dataset into your local machine and upload Data using import Data"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"dbutils.fs.mkdirs(\"dbfs:/FileStore/tables/\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"%sh\n",
"cd /dbfs/FileStore/tables/\n",
"wget -O mortgage.zip https://rapidsai-data.s3.us-east-2.amazonaws.com/spark/mortgage.zip\n",
"ls\n",
"unzip mortgage.zip"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"%sh\n",
"pwd\n",
"cd ../../dbfs/FileStore/tables\n",
"ls -ltr mortgage/csv/*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Next steps\n",
"\n",
"1. Edit your cluster, adding an initialization script from `dbfs:/databricks/init_scripts/init.sh` in the \"Advanced Options\" under \"Init Scripts\" tab\n",
"2. Reboot the cluster\n",
"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.3.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.04/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
"5. Inside the mortgage example notebook, update the data paths\n",
" `train_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-train.csv')`\n",
" `trans_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-trans.csv')`"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
},
"name": "Init Scripts_demo",
"notebookId": 2585487876834616
},
"nbformat": 4,
"nbformat_minor": 1
}
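After the cluster restarts with this init script attached, it is worth confirming that the jars actually landed on the Databricks classpath. A small check to run in a `%sh` notebook cell, using the destination paths the script writes to:

```bash
# Expect the three jars copied by init.sh to appear after the restart.
ls /databricks/jars/ | grep -E 'rapids-4-spark|xgboost4j'
```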
@@ -24,7 +24,7 @@
"source": [
"%sh\n",
"cd ../../dbfs/FileStore/jars/\n",
"sudo wget -O rapids-4-spark_2.12-22.12.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.12.0/rapids-4-spark_2.12-22.12.0.jar\n",
"sudo wget -O rapids-4-spark_2.12-23.04.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.04.0/rapids-4-spark_2.12-23.04.0.jar\n",
"sudo wget -O xgboost4j-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.7.1/xgboost4j-gpu_2.12-1.7.1.jar\n",
"sudo wget -O xgboost4j-spark-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.7.1/xgboost4j-spark-gpu_2.12-1.7.1.jar\n",
"ls -ltr\n",
@@ -60,7 +60,7 @@
"sudo rm -f /databricks/jars/spark--maven-trees--ml--9.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.4.1.jar\n",
"\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-gpu_2.12-1.7.1.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.12.0.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-23.04.0.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar /databricks/jars/\"\"\", True)"
]
},
@@ -133,7 +133,7 @@
"1. Edit your cluster, adding an initialization script from `dbfs:/databricks/init_scripts/init.sh` in the \"Advanced Options\" under \"Init Scripts\" tab\n",
"2. Reboot the cluster\n",
"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.02/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.04/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
"5. Inside the mortgage example notebook, update the data paths\n",
" `train_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-train.csv')`\n",
" `trans_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-trans.csv')`"
2 changes: 1 addition & 1 deletion docs/get-started/xgboost-examples/csp/dataproc/gcp.md
@@ -17,7 +17,7 @@
gcloud dataproc clusters create $CLUSTER_NAME \
--region=$REGION \
--image-version=2.0-ubuntu18 \
-    --master-machine-type=n1-standard-16 \
+    --master-machine-type=n2-standard-16 \
--num-workers=$NUM_WORKERS \
--worker-accelerator=type=nvidia-tesla-t4,count=$NUM_GPUS \
--worker-machine-type=n1-highmem-32\
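The cluster-create command continues beyond the lines shown. Since the master machine type moves from the n1 to the n2 family here, a quick hedged check that the new type is offered where you deploy (the zone suffix is an assumption; derive it from `$REGION`):

```bash
# Hypothetical availability check for the new master machine type.
gcloud compute machine-types describe n2-standard-16 --zone="${REGION}-b"
```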
@@ -11,11 +11,11 @@ Prerequisites
* NVIDIA Pascal™ GPU architecture or better
* Multi-node clusters with homogenous GPU configuration
* Software Requirements
-  * Ubuntu 18.04, 20.04/CentOS7, CentOS8
+  * Ubuntu 18.04, 20.04/CentOS7, Rocky Linux 8
* CUDA 11.0+
* NVIDIA driver compatible with your CUDA
* NCCL 2.7.8+
-* [Kubernetes 1.6+ cluster with NVIDIA GPUs](https://docs.nvidia.com/datacenter/kubernetes/index.html)
+* [Kubernetes cluster with NVIDIA GPUs](https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html)
* See official [Spark on Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#prerequisites)
instructions for detailed spark-specific cluster requirements
* kubectl installed and configured in the job submission environment
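A quick way to confirm the GPU prerequisite from the job submission environment, assuming the NVIDIA device plugin is already deployed on the cluster:

```bash
# Nodes that can schedule GPU pods advertise an nvidia.com/gpu resource.
kubectl describe nodes | grep "nvidia.com/gpu"
```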
@@ -40,7 +40,7 @@ export SPARK_DOCKER_IMAGE=<gpu spark docker image repo and name>
export SPARK_DOCKER_TAG=<spark docker image tag>

pushd ${SPARK_HOME}
-wget https://github.com/NVIDIA/spark-rapids-examples/raw/branch-23.02/dockerfile/Dockerfile
+wget https://github.com/NVIDIA/spark-rapids-examples/raw/branch-23.04/dockerfile/Dockerfile

# Optionally install additional jars into ${SPARK_HOME}/jars/

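The snippet continues beyond this hunk; the usual next step is to build and push the image from the downloaded Dockerfile. A hedged sketch under that assumption:

```bash
# Hypothetical continuation: build and push the GPU-enabled Spark image.
docker build . -t "${SPARK_DOCKER_IMAGE}:${SPARK_DOCKER_TAG}"
docker push "${SPARK_DOCKER_IMAGE}:${SPARK_DOCKER_TAG}"
popd
```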