[jvm-packages] update doc based on the latest changes #10847

Merged 1 commit on Oct 11, 2024
36 changes: 5 additions & 31 deletions doc/install.rst
@@ -159,7 +159,7 @@ R
JVM
---

* XGBoost4j/XGBoost4j-Spark
* XGBoost4j-Spark

.. code-block:: xml
:caption: Maven
@@ -172,11 +172,6 @@ JVM

<dependencies>
...
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j_${scala.binary.version}</artifactId>
<version>latest_version_num</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark_${scala.binary.version}</artifactId>
@@ -188,11 +183,10 @@ JVM
:caption: sbt

libraryDependencies ++= Seq(
"ml.dmlc" %% "xgboost4j" % "latest_version_num",
"ml.dmlc" %% "xgboost4j-spark" % "latest_version_num"
)

* XGBoost4j-GPU/XGBoost4j-Spark-GPU
* XGBoost4j-Spark-GPU

.. code-block:: xml
:caption: Maven
@@ -205,11 +199,6 @@ JVM

<dependencies>
...
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-gpu_${scala.binary.version}</artifactId>
<version>latest_version_num</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark-gpu_${scala.binary.version}</artifactId>
@@ -221,15 +210,14 @@ JVM
:caption: sbt

libraryDependencies ++= Seq(
"ml.dmlc" %% "xgboost4j-gpu" % "latest_version_num",
"ml.dmlc" %% "xgboost4j-spark-gpu" % "latest_version_num"
)

This will check out the latest stable version from the Maven Central.

For the latest release version number, please check `release page <https://github.com/dmlc/xgboost/releases>`_.

To enable the GPU algorithm (``device='cuda'``), use artifacts ``xgboost4j-gpu_2.12`` and ``xgboost4j-spark-gpu_2.12`` instead (note the ``gpu`` suffix).
To enable the GPU algorithm (``device='cuda'``), use artifacts ``xgboost4j-spark-gpu_2.12`` instead (note the ``gpu`` suffix).


.. note:: Windows not supported in the JVM package
Expand Down Expand Up @@ -292,7 +280,7 @@ JVM

resolvers += "XGBoost4J Snapshot Repo" at "https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/snapshot/"

Then add XGBoost4J as a dependency:
Then add XGBoost4J-Spark as a dependency:

.. code-block:: xml
:caption: maven
@@ -304,12 +292,6 @@ Then add XGBoost4J as a dependency:
</properties>

<dependencies>
...
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j_${scala.binary.version}</artifactId>
<version>latest_version_num-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark_${scala.binary.version}</artifactId>
@@ -321,11 +303,10 @@ Then add XGBoost4J as a dependency:
:caption: sbt

libraryDependencies ++= Seq(
"ml.dmlc" %% "xgboost4j" % "latest_version_num-SNAPSHOT",
"ml.dmlc" %% "xgboost4j-spark" % "latest_version_num-SNAPSHOT"
)

* XGBoost4j-GPU/XGBoost4j-Spark-GPU
* XGBoost4j-Spark-GPU

.. code-block:: xml
:caption: maven
@@ -337,12 +318,6 @@ Then add XGBoost4J as a dependency:
</properties>

<dependencies>
...
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-gpu_${scala.binary.version}</artifactId>
<version>latest_version_num-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark-gpu_${scala.binary.version}</artifactId>
@@ -354,7 +329,6 @@ Then add XGBoost4J as a dependency:
:caption: sbt

libraryDependencies ++= Seq(
"ml.dmlc" %% "xgboost4j-gpu" % "latest_version_num-SNAPSHOT",
"ml.dmlc" %% "xgboost4j-spark-gpu" % "latest_version_num-SNAPSHOT"
)

32 changes: 14 additions & 18 deletions doc/jvm/xgboost4j_spark_gpu_tutorial.rst
@@ -1,6 +1,6 @@
#############################################
XGBoost4J-Spark-GPU Tutorial (version 1.6.1+)
#############################################
############################
XGBoost4J-Spark-GPU Tutorial
############################

**XGBoost4J-Spark-GPU** is an open source library aiming to accelerate distributed XGBoost training on Apache Spark cluster from
end to end with GPUs by leveraging the `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_ product.
@@ -71,7 +71,7 @@ To make the Iris dataset recognizable to XGBoost, we need to encode the String-t
label, i.e. "class", to the Double-typed label.

One way to convert the String-typed label to Double is to use Spark's built-in feature transformer
`StringIndexer <https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer>`_.
`StringIndexer <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/feature/StringIndexer.html>`_.
But this feature is not accelerated in RAPIDS Accelerator, which means it will fall back
to CPU. Instead, we use an alternative way to achieve the same goal with the following code:

@@ -107,10 +107,10 @@ With window operations, we have mapped the string column of labels to label indi
Training
========

The GPU version of XGBoost-Spark supports both regression and classification
XGBoost4J-Spark-GPU supports regression, classification and ranking
models. Although we use the Iris dataset in this tutorial to show how we use
``XGBoost/XGBoost4J-Spark-GPU`` to resolve a multi-classes classification problem, the
usage in Regression is very similar to classification.
``XGBoost4J-Spark-GPU`` to resolve a multi-classes classification problem, the
usage in Regression and Ranking is very similar to classification.

To train a XGBoost model for classification, we need to define a XGBoostClassifier first:

@@ -168,12 +168,13 @@ model can then be used in other tasks like prediction.
Prediction
==========

When we get a model, either a XGBoostClassificationModel or a XGBoostRegressionModel, it takes a DataFrame as an input,
Once we have a model (an XGBoostClassificationModel, an XGBoostRegressionModel, or an XGBoostRankerModel), it takes a DataFrame as input,
reads the column containing feature vectors, predicts for each feature vector, and outputs a new DataFrame
with the following columns by default:

* XGBoostClassificationModel will output margins (``rawPredictionCol``), probabilities (``probabilityCol``) and the eventual prediction labels (``predictionCol``) for each possible label.
* XGBoostRegressionModel will output a predicted label (``predictionCol``).
* XGBoostRankerModel will output a predicted label (``predictionCol``).

.. code-block:: scala

@@ -226,25 +227,20 @@ would be ``"spark.task.resource.gpu.amount=1/spark.executor.cores"``. However, i
using a XGBoost version earlier than 2.1.0 or a Spark standalone cluster version below 3.4.0,
you still need to set ``"spark.task.resource.gpu.amount"`` equal to ``"spark.executor.resource.gpu.amount"``.
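
As a rough illustration of the ``1/spark.executor.cores`` rule, the shell sketch below derives a per-task GPU share from the executor core count. The helper name ``task_gpu_amount`` is hypothetical and is not part of XGBoost or Spark. The value is truncated to four decimals rather than rounded, on the assumption that the share multiplied by the core count should never exceed one GPU.

```shell
# Hypothetical helper (not part of XGBoost or Spark): per-task GPU share
# when a single GPU is divided evenly across all executor cores.
task_gpu_amount() {
  cores="$1"
  # Truncate to 4 decimals so that cores * share never exceeds 1.
  awk -v c="$cores" 'BEGIN { printf "%.4f\n", int(10000 / c) / 10000 }'
}

# Example: 12 executor cores, as in the submit command of this tutorial.
task_gpu_amount 12   # prints 0.0833
```

With 12 executor cores the sketch prints ``0.0833``; with 6 cores it prints ``0.1666`` instead of the rounded ``0.1667``.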

.. note::

As of now, the stage-level scheduling feature in XGBoost is limited to the Spark standalone cluster mode.
However, we have plans to expand its compatibility to YARN and Kubernetes once Spark 3.5.1 is officially released.

Assuming that the application main class is "Iris" and the application jar is "iris-1.0.0.jar",`
Assuming that the application main class is "Iris" and the application jar is "iris-1.0.0.jar",
provided below is an instance demonstrating how to submit the xgboost application to an Apache
Spark Standalone cluster.

.. code-block:: bash

rapids_version=23.10.0
xgboost_version=2.0.1
rapids_version=24.08.0
xgboost_version=$LATEST_VERSION
main_class=Iris
app_jar=iris-1.0.0.jar

spark-submit \
--master $master \
--packages com.nvidia:rapids-4-spark_2.12:${rapids_version},ml.dmlc:xgboost4j-gpu_2.12:${xgboost_version},ml.dmlc:xgboost4j-spark-gpu_2.12:${xgboost_version} \
--packages com.nvidia:rapids-4-spark_2.12:${rapids_version},ml.dmlc:xgboost4j-spark-gpu_2.12:${xgboost_version} \
--conf spark.executor.cores=12 \
--conf spark.task.cpus=1 \
--conf spark.executor.resource.gpu.amount=1 \
@@ -255,7 +251,7 @@ Spark Standalone cluster.
--class ${main_class} \
${app_jar}

* First, we need to specify the ``RAPIDS Accelerator, xgboost4j-gpu, xgboost4j-spark-gpu`` packages by ``--packages``
* First, we need to specify the ``RAPIDS Accelerator, xgboost4j-spark-gpu`` packages by ``--packages``
* Second, ``RAPIDS Accelerator`` is a Spark plugin, so we need to configure it by specifying ``spark.plugins=com.nvidia.spark.SQLPlugin``
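
The ``--packages`` value can also be assembled from the two version numbers, which keeps the Maven coordinates in one place. This is only a sketch: the function name ``xgb_gpu_packages`` and its optional third argument are assumptions made for illustration, while the coordinates themselves are the ones used in the submit command.

```shell
# Hypothetical helper: build the --packages value for spark-submit.
# $1 = RAPIDS Accelerator version, $2 = XGBoost version,
# $3 = Scala binary version (assumed default: 2.12, as in the tutorial).
xgb_gpu_packages() {
  rapids_version="$1"
  xgboost_version="$2"
  scala_version="${3:-2.12}"
  printf '%s,%s\n' \
    "com.nvidia:rapids-4-spark_${scala_version}:${rapids_version}" \
    "ml.dmlc:xgboost4j-spark-gpu_${scala_version}:${xgboost_version}"
}

# Falls back to the placeholder "latest_version_num" when LATEST_VERSION is unset.
xgb_gpu_packages 24.08.0 "${LATEST_VERSION:-latest_version_num}"
```

The result could then be passed as ``--packages "$(xgb_gpu_packages "$rapids_version" "$xgboost_version")"``. Only ``xgboost4j-spark-gpu`` is listed, matching the updated command, which no longer names ``xgboost4j-gpu`` separately.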

For details about other ``RAPIDS Accelerator`` configurations, please refer to the `configuration <https://nvidia.github.io/spark-rapids/docs/configs.html>`_.