Commit
Merge pull request #280 from nvliyuan/main-2302-release
merge branch-23.02 to main branch
nvliyuan authored Feb 23, 2023
2 parents 8599ece + 6213dad commit 0cc1d0c
Showing 75 changed files with 802 additions and 260 deletions.
11 changes: 6 additions & 5 deletions .github/workflows/auto-merge.yml
@@ -1,4 +1,4 @@
-# Copyright (c) 2022, NVIDIA CORPORATION.
+# Copyright (c) 2022-2023, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -18,7 +18,7 @@ name: auto-merge HEAD to BASE
 on:
   pull_request_target:
     branches:
-      - branch-22.12
+      - branch-23.02
     types: [closed]
 
 jobs:
@@ -29,13 +29,14 @@ jobs:
     steps:
       - uses: actions/checkout@v3
         with:
-          ref: branch-22.12 # force to fetch from latest upstream instead of PR ref
+          ref: branch-23.02 # force to fetch from latest upstream instead of PR ref
 
       - name: auto-merge job
         uses: ./.github/workflows/auto-merge
         env:
          OWNER: NVIDIA
          REPO_NAME: spark-rapids-examples
-         HEAD: branch-22.12
-         BASE: branch-23.02
+         HEAD: branch-23.02
+         BASE: branch-23.04
          AUTOMERGE_TOKEN: ${{ secrets.AUTOMERGE_TOKEN }} # use to merge PR
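The HEAD/BASE pair above is what the workflow forward-merges: when a PR closed against branch-23.02 lands, the job merges that branch into branch-23.04. A self-contained sketch of the equivalent git operation in a throwaway local repo (the repo contents, file names, and committer identity here are placeholders, not taken from the workflow):

```shell
# Simulate the HEAD -> BASE forward-merge the auto-merge job performs.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email ci@example.com   # placeholder identity
git config user.name ci

# Shared history, then two release branches.
echo v1 > VERSION
git add VERSION
git commit -qm "init"
git branch branch-23.02
git branch branch-23.04

# A fix lands on HEAD (branch-23.02)...
git checkout -q branch-23.02
echo v2 > VERSION
git commit -qam "fix on release branch"

# ...and is forward-merged into BASE (branch-23.04).
git checkout -q branch-23.04
git merge -q --no-edit branch-23.02
cat VERSION   # prints v2: BASE now carries the fix
```

The real job runs inside GitHub Actions against NVIDIA/spark-rapids-examples and uses `AUTOMERGE_TOKEN` (per the comment in the workflow) to merge the resulting PR rather than pushing a local merge.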

64 changes: 0 additions & 64 deletions docs/api-docs/xgboost-examples-api-docs/python.md

This file was deleted.

64 changes: 0 additions & 64 deletions docs/api-docs/xgboost-examples-api-docs/scala.md

This file was deleted.

@@ -1,6 +1,6 @@
 # Build XGBoost Scala Examples
 
-The examples rely on [XGBoost](https://github.com/nvidia/spark-xgboost).
+The examples rely on [XGBoost](https://github.com/dmlc/xgboost).
 
 ## Build
 
44 changes: 21 additions & 23 deletions docs/get-started/xgboost-examples/csp/databricks/databricks.md
@@ -159,35 +159,33 @@ Accuracy is 0.9980699597729774
 Limitations
 -------------
 
-1. Adaptive query execution(AQE) and Delta optimization write do not work. These should be disabled
-   when using the plugin. Queries may still see significant speedups even with AQE disabled.
-
-   ```bash
-   spark.databricks.delta.optimizeWrite.enabled false
-   spark.sql.adaptive.enabled false
-   ```
-
-   See [issue-1059](https://github.com/NVIDIA/spark-rapids/issues/1059) for more detail.
-
-2. Dynamic partition pruning(DPP) does not work. This results in poor performance for queries which
-   would normally benefit from DPP. See
-   [issue-3143](https://github.com/NVIDIA/spark-rapids/issues/3143) for more detail.
+1. When selecting GPU nodes, the Databricks UI requires the driver node to be a GPU node. However,
+   you can use the Databricks API to create a cluster with a CPU driver node.
+   Outside of Databricks the plugin can operate with the driver as a CPU node and workers as GPU nodes.
 
-3. When selecting GPU nodes, Databricks requires the driver node to be a GPU node. Outside of
-   Databricks the plugin can operate with the driver as a CPU node and workers as GPU nodes.
+2. Cannot spin off multiple executors on a multi-GPU node.
 
-4. Cannot spin off multiple executors on a multi-GPU node.
+   Even though it is possible to set `spark.executor.resource.gpu.amount=1` in the Spark
+   Configuration tab, Databricks overrides this to `spark.executor.resource.gpu.amount=N`
+   (where N is the number of GPUs per node). This will result in failed executors when starting the
+   cluster.
 
-   Even though it is possible to set `spark.executor.resource.gpu.amount=N` (where N is the number
-   of GPUs per node) in the Spark Configuration tab, Databricks overrides this to
-   `spark.executor.resource.gpu.amount=1`. This will result in failed executors when starting the
-   cluster.
+3. Parquet rebase mode is set to "LEGACY" by default.
 
+   The following Spark configurations are set to `LEGACY` by default on Databricks:
+
+   ```
+   spark.sql.legacy.parquet.datetimeRebaseModeInWrite
+   spark.sql.legacy.parquet.int96RebaseModeInWrite
+   ```
+
+   These settings will cause a CPU fallback for Parquet writes involving dates and timestamps.
+   If you do not need `LEGACY` write semantics, set these configs to `EXCEPTION`, which is
+   the default value in Apache Spark 3.0 and higher.
 
-5. Databricks makes changes to the runtime without notification.
+4. Databricks makes changes to the runtime without notification.
 
    Databricks makes changes to existing runtimes, applying patches, without notification.
    [Issue-3098](https://github.com/NVIDIA/spark-rapids/issues/3098) is one example of this. We run
   regular integration tests on the Databricks environment to catch these issues and fix them once
-   detected.
-
-<sup>*</sup> The timings in this Getting Started guide are only illustrative. Please see our [release announcement](https://medium.com/rapids-ai/nvidia-gpus-and-apache-spark-one-step-closer-2d99e37ac8fd) for official benchmarks.
+   detected.
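If the `LEGACY` defaults described in the added limitation get in the way, the two rebase settings can be switched to `EXCEPTION` in the cluster's Spark config tab (one `key value` pair per line; the keys are the ones named in the diff, and `EXCEPTION` is the Apache Spark 3.0+ default — the exact placement in your cluster UI may vary):

```
spark.sql.legacy.parquet.datetimeRebaseModeInWrite EXCEPTION
spark.sql.legacy.parquet.int96RebaseModeInWrite EXCEPTION
```

With `EXCEPTION` set, writes involving ambiguous pre-1582 dates fail fast instead of silently triggering a CPU fallback.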
@@ -133,7 +133,7 @@
 "1. Edit your cluster, adding an initialization script from `dbfs:/databricks/init_scripts/init.sh` in the \"Advanced Options\" under \"Init Scripts\" tab\n",
 "2. Reboot the cluster\n",
 "3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
-"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.12/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
+"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.02/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
 "5. Inside the mortgage example notebook, update the data paths\n",
 " `train_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-train.csv')`\n",
 " `trans_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-trans.csv')`"
@@ -133,7 +133,7 @@
 "1. Edit your cluster, adding an initialization script from `dbfs:/databricks/init_scripts/init.sh` in the \"Advanced Options\" under \"Init Scripts\" tab\n",
 "2. Reboot the cluster\n",
 "3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
-"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.12/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
+"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.02/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
 "5. Inside the mortgage example notebook, update the data paths\n",
 " `train_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-train.csv')`\n",
 " `trans_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-trans.csv')`"
