
Merge pull request #306 from nvliyuan/main-v2306-release
merge branch-23.06 to main branch
nvliyuan authored Jun 29, 2023
2 parents 3cff617 + 5a69221 commit 0cb527c
Showing 70 changed files with 17,633 additions and 613 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/auto-merge.yml
@@ -18,7 +18,7 @@ name: auto-merge HEAD to BASE
on:
pull_request_target:
branches:
- branch-23.06
types: [closed]

jobs:
@@ -29,14 +29,14 @@ jobs:
steps:
- uses: actions/checkout@v3
with:
ref: branch-23.06 # force to fetch from latest upstream instead of PR ref

- name: auto-merge job
uses: ./.github/workflows/auto-merge
env:
OWNER: NVIDIA
REPO_NAME: spark-rapids-examples
HEAD: branch-23.06
BASE: branch-23.08
AUTOMERGE_TOKEN: ${{ secrets.AUTOMERGE_TOKEN }} # use to merge PR
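For illustration only: the `HEAD` and `BASE` values describe a merge-forward pull request from the current release branch into the next one. The bundled auto-merge action performs this automatically; the sketch below shows an equivalent request against the GitHub REST API (a hypothetical, assuming a token with repo access in `GITHUB_TOKEN`), not the action's actual implementation.

```python
# Hypothetical sketch of the merge-forward PR that HEAD/BASE describe; the
# repository's auto-merge action does this for real and may differ in detail.
import os

import requests

OWNER, REPO_NAME = "NVIDIA", "spark-rapids-examples"
HEAD, BASE = "branch-23.06", "branch-23.08"

resp = requests.post(
    f"https://api.github.com/repos/{OWNER}/{REPO_NAME}/pulls",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={
        "title": f"[auto-merge] {HEAD} to {BASE}",
        "head": HEAD,
        "base": BASE,
        "body": f"Auto-merge changes from {HEAD} into {BASE}.",
    },
)
resp.raise_for_status()  # fails loudly if the PR could not be created
print(resp.json()["html_url"])
```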

89 changes: 39 additions & 50 deletions docs/get-started/xgboost-examples/csp/databricks/databricks.md
@@ -14,55 +14,26 @@ The number of GPUs per node dictates the number of Spark executors that can run

Start A Databricks Cluster
--------------------------

Before creating the cluster, we will need to create an [initialization script](https://docs.databricks.com/clusters/init-scripts.html) for the
cluster to install the RAPIDS jars. Databricks recommends storing all cluster-scoped init scripts using workspace files.
Each user has a Home directory configured under the /Users directory in the workspace.
Navigate to your home directory in the UI, select **Create** > **File** from the menu, and
create an `init.sh` script with the following contents:
```bash
#!/bin/bash
sudo wget -O /databricks/jars/rapids-4-spark_2.12-23.06.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.06.0/rapids-4-spark_2.12-23.06.0.jar
```
Next, create a Databricks cluster by going to "Compute", then clicking `+ Create compute`. Ensure the
cluster meets the prerequisites above by configuring it as follows:
1. Select the Databricks Runtime Version from one of the supported runtimes specified in the
Prerequisites section.
2. Choose the number of workers that matches the number of GPUs you want to use.
3. Select a worker type. On AWS, use nodes with 1 GPU each such as `p3.2xlarge` or `g4dn.xlarge`.
p2 nodes do not meet the architecture requirements (Pascal or higher) for the Spark worker
(although they can be used for the driver node). For Azure, choose GPU nodes such as
Standard_NC6s_v3. For GCP, choose N1 or A2 instance types with GPUs.
4. Select the driver type. Generally this can be set to be the same as the worker.
5. Click the “Edit” button, then navigate down to the “Advanced Options” section. Select the “Init Scripts” tab
and paste the workspace path to the initialization script: `/Users/user@domain/init.sh`, then click “Add”.

![Init Script](../../../../img/databricks/initscript.png)
6. Now select the “Spark” tab, and paste the following config options into the Spark Config section.
Change the config values based on the workers you choose. See Apache Spark
[configuration](https://spark.apache.org/docs/latest/configuration.html) and RAPIDS Accelerator
for Apache Spark [descriptions](https://nvidia.github.io/spark-rapids/docs/configs.html) for each config.

@@ -74,18 +45,36 @@ cluster.
like the CPU side. Having the value smaller is fine as well.
Note: Please remove the `spark.task.resource.gpu.amount` config for a single-node Databricks
cluster because Spark local mode does not support GPU scheduling.

```bash
spark.plugins com.nvidia.spark.SQLPlugin
spark.task.resource.gpu.amount 0.1
spark.rapids.memory.pinnedPool.size 2G
spark.rapids.sql.concurrentGpuTasks 2
```

![Spark Config](../../../../img/databricks/sparkconfig.png)

If running Pandas UDFs with GPU support from the plugin, at least three additional options,
shown below, are required (a minimal Pandas UDF example follows the numbered steps). The
`spark.python.daemon.module` option selects the correct Python daemon module for Databricks.
On Databricks, the Python runtime requires different parameters than the Spark one, so a
dedicated Python daemon module `rapids.daemon_databricks` is created and should be specified
here. Set the config
[`spark.rapids.sql.python.gpu.enabled`](https://nvidia.github.io/spark-rapids/docs/configs.html#sql.python.gpu.enabled) to `true` to
enable GPU support for Python. Add the path of the plugin jar (assuming it is placed under
`/databricks/jars/`) to the `spark.executorEnv.PYTHONPATH` option. For more details, see
[GPU Scheduling For Pandas UDF](https://nvidia.github.io/spark-rapids/docs/additional-functionality/rapids-udfs.html#gpu-support-for-pandas-udf).

```bash
spark.rapids.sql.python.gpu.enabled true
spark.python.daemon.module rapids.daemon_databricks
spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-23.06.0.jar:/databricks/spark/python
```
Note that the Python memory pool requires the cuDF library, so you need to install cuDF on
each worker node (`pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com`) or disable the Python memory pool
with `spark.rapids.python.memory.gpu.pooling.enabled=false`.

7. Click `Create Cluster`. The cluster is now enabled for GPU-accelerated Spark.
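For reference, here is a minimal sketch of a Pandas UDF that such a cluster can run with the GPU settings above. It is not part of the original guide: it assumes a Databricks notebook where the `spark` session already exists and the configs from steps 5–7 are applied; the column and function names (`value`, `add_one`) are illustrative only.

```python
# Minimal, hypothetical example of a Pandas UDF on a cluster configured as above.
# Assumes a Databricks notebook where `spark` is already defined.
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def add_one(v: pd.Series) -> pd.Series:
    # Plain pandas logic; GPU scheduling of the Python workers is handled by the
    # plugin when spark.rapids.sql.python.gpu.enabled is set to true.
    return v + 1.0

df = spark.range(0, 1000).selectExpr("CAST(id AS double) AS value")
df.select(add_one("value")).show(5)
```

The UDF body stays plain pandas; the plugin handles GPU scheduling of the Python workers. If cuDF is not installed on the workers, disable the Python memory pool as noted above.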

Install the xgboost4j_spark jar in the cluster
---------------------------

