diff --git a/doc/install.rst b/doc/install.rst index 532868149b05..579c1ee5e00f 100644 --- a/doc/install.rst +++ b/doc/install.rst @@ -159,7 +159,7 @@ R JVM --- -* XGBoost4j/XGBoost4j-Spark +* XGBoost4j-Spark .. code-block:: xml :caption: Maven @@ -172,11 +172,6 @@ JVM ... - - ml.dmlc - xgboost4j_${scala.binary.version} - latest_version_num - ml.dmlc xgboost4j-spark_${scala.binary.version} @@ -188,11 +183,10 @@ JVM :caption: sbt libraryDependencies ++= Seq( - "ml.dmlc" %% "xgboost4j" % "latest_version_num", "ml.dmlc" %% "xgboost4j-spark" % "latest_version_num" ) -* XGBoost4j-GPU/XGBoost4j-Spark-GPU +* XGBoost4j-Spark-GPU .. code-block:: xml :caption: Maven @@ -205,11 +199,6 @@ JVM ... - - ml.dmlc - xgboost4j-gpu_${scala.binary.version} - latest_version_num - ml.dmlc xgboost4j-spark-gpu_${scala.binary.version} @@ -221,7 +210,6 @@ JVM :caption: sbt libraryDependencies ++= Seq( - "ml.dmlc" %% "xgboost4j-gpu" % "latest_version_num", "ml.dmlc" %% "xgboost4j-spark-gpu" % "latest_version_num" ) @@ -229,7 +217,7 @@ This will check out the latest stable version from the Maven Central. For the latest release version number, please check `release page `_. -To enable the GPU algorithm (``device='cuda'``), use artifacts ``xgboost4j-gpu_2.12`` and ``xgboost4j-spark-gpu_2.12`` instead (note the ``gpu`` suffix). +To enable the GPU algorithm (``device='cuda'``), use artifacts ``xgboost4j-spark-gpu_2.12`` instead (note the ``gpu`` suffix). .. note:: Windows not supported in the JVM package @@ -292,7 +280,7 @@ JVM resolvers += "XGBoost4J Snapshot Repo" at "https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/snapshot/" -Then add XGBoost4J as a dependency: +Then add XGBoost4J-Spark as a dependency: .. code-block:: xml :caption: maven @@ -304,12 +292,6 @@ Then add XGBoost4J as a dependency: - ... 
- - ml.dmlc - xgboost4j_${scala.binary.version} - latest_version_num-SNAPSHOT - ml.dmlc xgboost4j-spark_${scala.binary.version} @@ -321,11 +303,10 @@ Then add XGBoost4J as a dependency: :caption: sbt libraryDependencies ++= Seq( - "ml.dmlc" %% "xgboost4j" % "latest_version_num-SNAPSHOT", "ml.dmlc" %% "xgboost4j-spark" % "latest_version_num-SNAPSHOT" ) -* XGBoost4j-GPU/XGBoost4j-Spark-GPU +* XGBoost4j-Spark-GPU .. code-block:: xml :caption: maven @@ -337,12 +318,6 @@ Then add XGBoost4J as a dependency: - ... - - ml.dmlc - xgboost4j-gpu_${scala.binary.version} - latest_version_num-SNAPSHOT - ml.dmlc xgboost4j-spark-gpu_${scala.binary.version} @@ -354,7 +329,6 @@ Then add XGBoost4J as a dependency: :caption: sbt libraryDependencies ++= Seq( - "ml.dmlc" %% "xgboost4j-gpu" % "latest_version_num-SNAPSHOT", "ml.dmlc" %% "xgboost4j-spark-gpu" % "latest_version_num-SNAPSHOT" ) diff --git a/doc/jvm/xgboost4j_spark_gpu_tutorial.rst b/doc/jvm/xgboost4j_spark_gpu_tutorial.rst index edabe8a92a49..8591c87c1696 100644 --- a/doc/jvm/xgboost4j_spark_gpu_tutorial.rst +++ b/doc/jvm/xgboost4j_spark_gpu_tutorial.rst @@ -1,6 +1,6 @@ -############################################# -XGBoost4J-Spark-GPU Tutorial (version 1.6.1+) -############################################# +############################ +XGBoost4J-Spark-GPU Tutorial +############################ **XGBoost4J-Spark-GPU** is an open source library aiming to accelerate distributed XGBoost training on Apache Spark cluster from end to end with GPUs by leveraging the `RAPIDS Accelerator for Apache Spark `_ product. @@ -71,7 +71,7 @@ To make the Iris dataset recognizable to XGBoost, we need to encode the String-t label, i.e. "class", to the Double-typed label. One way to convert the String-typed label to Double is to use Spark's built-in feature transformer -`StringIndexer `_. +`StringIndexer `_. But this feature is not accelerated in RAPIDS Accelerator, which means it will fall back to CPU. 
Instead, we use an alternative way to achieve the same goal with the following code: @@ -107,10 +107,10 @@ With window operations, we have mapped the string column of labels to label indi Training ======== -The GPU version of XGBoost-Spark supports both regression and classification +XGBoost4J-Spark-GPU supports regression, classification and ranking models. Although we use the Iris dataset in this tutorial to show how we use -``XGBoost/XGBoost4J-Spark-GPU`` to resolve a multi-classes classification problem, the -usage in Regression is very similar to classification. +``XGBoost4J-Spark-GPU`` to resolve a multi-class classification problem, the +usage in Regression and Ranking is very similar to classification. To train a XGBoost model for classification, we need to define a XGBoostClassifier first: @@ -168,12 +168,13 @@ model can then be used in other tasks like prediction. Prediction ========== -When we get a model, either a XGBoostClassificationModel or a XGBoostRegressionModel, it takes a DataFrame as an input, +When we get a model, whether a XGBoostClassificationModel, a XGBoostRegressionModel or a XGBoostRankerModel, it takes a DataFrame as an input, reads the column containing feature vectors, predicts for each feature vector, and outputs a new DataFrame with the following columns by default: * XGBoostClassificationModel will output margins (``rawPredictionCol``), probabilities(``probabilityCol``) and the eventual prediction labels (``predictionCol``) for each possible label. * XGBoostRegressionModel will output prediction a label(``predictionCol``). +* XGBoostRankerModel will output a prediction label (``predictionCol``). .. code-block:: scala @@ -226,25 +227,20 @@ would be ``"spark.task.resource.gpu.amount=1/spark.executor.cores"``. However, i using a XGBoost version earlier than 2.1.0 or a Spark standalone cluster version below 3.4.0, you still need to set ``"spark.task.resource.gpu.amount"`` equal to ``"spark.executor.resource.gpu.amount"``. -..
note:: - - As of now, the stage-level scheduling feature in XGBoost is limited to the Spark standalone cluster mode. - However, we have plans to expand its compatibility to YARN and Kubernetes once Spark 3.5.1 is officially released. - -Assuming that the application main class is "Iris" and the application jar is "iris-1.0.0.jar",` +Assuming that the application main class is "Iris" and the application jar is "iris-1.0.0.jar", provided below is an instance demonstrating how to submit the xgboost application to an Apache Spark Standalone cluster. .. code-block:: bash - rapids_version=23.10.0 - xgboost_version=2.0.1 + rapids_version=24.08.0 + xgboost_version=$LATEST_VERSION main_class=Iris app_jar=iris-1.0.0.jar spark-submit \ --master $master \ - --packages com.nvidia:rapids-4-spark_2.12:${rapids_version},ml.dmlc:xgboost4j-gpu_2.12:${xgboost_version},ml.dmlc:xgboost4j-spark-gpu_2.12:${xgboost_version} \ + --packages com.nvidia:rapids-4-spark_2.12:${rapids_version},ml.dmlc:xgboost4j-spark-gpu_2.12:${xgboost_version} \ --conf spark.executor.cores=12 \ --conf spark.task.cpus=1 \ --conf spark.executor.resource.gpu.amount=1 \ @@ -255,7 +251,7 @@ Spark Standalone cluster. --class ${main_class} \ ${app_jar} -* First, we need to specify the ``RAPIDS Accelerator, xgboost4j-gpu, xgboost4j-spark-gpu`` packages by ``--packages`` +* First, we need to specify the ``RAPIDS Accelerator, xgboost4j-spark-gpu`` packages by ``--packages`` * Second, ``RAPIDS Accelerator`` is a Spark plugin, so we need to configure it by specifying ``spark.plugins=com.nvidia.spark.SQLPlugin`` For details about other ``RAPIDS Accelerator`` other configurations, please refer to the `configuration `_. 
diff --git a/doc/jvm/xgboost4j_spark_tutorial.rst b/doc/jvm/xgboost4j_spark_tutorial.rst index 90859dfba2d9..82aa8133e88c 100644 --- a/doc/jvm/xgboost4j_spark_tutorial.rst +++ b/doc/jvm/xgboost4j_spark_tutorial.rst @@ -1,8 +1,10 @@ -####################################### -XGBoost4J-Spark Tutorial (version 0.9+) -####################################### +######################## +XGBoost4J-Spark Tutorial +######################## -**XGBoost4J-Spark** is a project aiming to seamlessly integrate XGBoost and Apache Spark by fitting XGBoost to Apache Spark's MLLIB framework. With the integration, user can not only uses the high-performant algorithm implementation of XGBoost, but also leverages the powerful data processing engine of Spark for: +**XGBoost4J-Spark** is a project aiming to seamlessly integrate XGBoost and Apache Spark by fitting XGBoost to +Apache Spark's MLLIB framework. With the integration, user can not only uses the high-performant algorithm +implementation of XGBoost, but also leverages the powerful data processing engine of Spark for: * Feature Engineering: feature extraction, transformation, dimensionality reduction, and selection, etc. * Pipelines: constructing, evaluating, and tuning ML Pipelines @@ -10,7 +12,7 @@ XGBoost4J-Spark Tutorial (version 0.9+) This tutorial is to cover the end-to-end process to build a machine learning pipeline with XGBoost4J-Spark. 
We will discuss -* Using Spark to preprocess data to fit to XGBoost/XGBoost4J-Spark's data interface +* Using Spark to preprocess data to fit to XGBoost4J-Spark's data interface * Training a XGBoost model with XGBoost4J-Spark * Serving XGBoost model (prediction) with Spark * Building a Machine Learning Pipeline with XGBoost4J-Spark @@ -27,32 +29,30 @@ Build an ML Application with XGBoost4J-Spark Refer to XGBoost4J-Spark Dependency =================================== -Before we go into the tour of how to use XGBoost4J-Spark, you should first consult :ref:`Installation from Maven repository ` in order to add XGBoost4J-Spark as a dependency for your project. We provide both stable releases and snapshots. +Before we go into the tour of how to use XGBoost4J-Spark, you should first consult :ref:`Installation from Maven repository ` +in order to add XGBoost4J-Spark as a dependency for your project. We provide both stable releases and snapshots. -.. note:: XGBoost4J-Spark requires Apache Spark 2.4+ +.. note:: XGBoost4J-Spark requires Apache Spark 3.0+ - XGBoost4J-Spark now requires **Apache Spark 2.4+**. Latest versions of XGBoost4J-Spark uses facilities of `org.apache.spark.ml.param.shared` extensively to provide for a tight integration with Spark MLLIB framework, and these facilities are not fully available on earlier versions of Spark. + XGBoost4J-Spark now requires **Apache Spark 3.0+**. Latest versions of XGBoost4J-Spark uses facilities of `org.apache.spark.ml.param.shared` + extensively to provide for a tight integration with Spark MLLIB framework, and these facilities are not fully available on earlier versions of Spark. - Also, make sure to install Spark directly from `Apache website `_. **Upstream XGBoost is not guaranteed to work with third-party distributions of Spark, such as Cloudera Spark.** Consult appropriate third parties to obtain their distribution of XGBoost. - -Installation from maven repo - -.. 
note:: Use of Python in XGBoost4J-Spark - - By default, we use the tracker in `Python package `_ to drive the training with XGBoost4J-Spark. It requires Python 3.6+. We also have an experimental Scala version of tracker which can be enabled by passing the parameter ``tracker_conf`` as ``scala``. + Also, make sure to install Spark directly from `Apache website `_. **Upstream XGBoost is not guaranteed to + work with third-party distributions of Spark, such as Cloudera Spark.** Consult appropriate third parties to obtain their distribution of XGBoost. Data Preparation ================ As aforementioned, XGBoost4J-Spark seamlessly integrates Spark and XGBoost. The integration enables users to apply various types of transformation over the training/test datasets with the convenient -and powerful data processing framework, Spark. +and powerful data processing framework: Spark. In this section, we use `Iris `_ dataset as an example to showcase how we use Spark to transform raw dataset and make it fit to the data interface of XGBoost. Iris dataset is shipped in CSV format. Each instance contains 4 features, "sepal length", "sepal width", -"petal length" and "petal width". In addition, it contains the "class" column, which is essentially the label with three possible values: "Iris Setosa", "Iris Versicolour" and "Iris Virginica". +"petal length" and "petal width". In addition, it contains the "class" column, which is essentially the +label with three possible values: "Iris Setosa", "Iris Versicolour" and "Iris Virginica". Read Dataset with Spark's Built-In Reader ----------------------------------------- @@ -73,7 +73,10 @@ The first thing in data transformation is to load the dataset as Spark's structu StructField("class", StringType, true))) val rawInput = spark.read.schema(schema).csv("input_path") -At the first line, we create a instance of `SparkSession `_ which is the entry of any Spark program working with DataFrame. 
The ``schema`` variable defines the schema of DataFrame wrapping Iris data. With this explicitly set schema, we can define the columns' name as well as their types; otherwise the column name would be the default ones derived by Spark, such as ``_col0``, etc. Finally, we can use Spark's built-in csv reader to load Iris csv file as a DataFrame named ``rawInput``. +At the first line, we create a instance of `SparkSession `_ +which is the entry of any Spark program working with DataFrame. The ``schema`` variable defines the schema of DataFrame wrapping Iris data. +With this explicitly set schema, we can define the columns' name as well as their types; otherwise the column name would be the default ones +derived by Spark, such as ``_col0``, etc. Finally, we can use Spark's built-in csv reader to load Iris csv file as a DataFrame named ``rawInput``. Spark also contains many built-in readers for other format. The latest version of Spark supports CSV, JSON, Parquet, and LIBSVM. @@ -85,7 +88,8 @@ To make Iris dataset be recognizable to XGBoost, we need to 1. Transform String-typed label, i.e. "class", to Double-typed label. 2. Assemble the feature columns as a vector to fit to the data interface of Spark ML framework. -To convert String-typed label to Double, we can use Spark's built-in feature transformer `StringIndexer `_. +To convert String-typed label to Double, we can use Spark's built-in feature transformer +`StringIndexer `_. .. code-block:: scala @@ -98,16 +102,21 @@ To convert String-typed label to Double, we can use Spark's built-in feature tra With a newly created StringIndexer instance: -1. we set input column, i.e. the column containing String-typed label -2. we set output column, i.e. the column to contain the Double-typed label. +1. we set input column, i.e. the column containing String-typed label. +2. we set output column, i.e. the column containing the Double-typed label. 3. 
Then we ``fit`` StringIndex with our input DataFrame ``rawInput``, so that Spark internals can get information like total number of distinct values, etc. -Now we have a StringIndexer which is ready to be applied to our input DataFrame. To execute the transformation logic of StringIndexer, we ``transform`` the input DataFrame ``rawInput`` and to keep a concise DataFrame, +Now we have a StringIndexer which is ready to be applied to our input DataFrame. To execute the transformation logic of StringIndexer, +we ``transform`` the input DataFrame ``rawInput`` and to keep a concise DataFrame, we drop the column "class" and only keeps the feature columns and the transformed Double-typed label column (in the last line of the above code snippet). -The ``fit`` and ``transform`` are two key operations in MLLIB. Basically, ``fit`` produces a "transformer", e.g. StringIndexer, and each transformer applies ``transform`` method on DataFrame to add new column(s) containing transformed features/labels or prediction results, etc. To understand more about ``fit`` and ``transform``, You can find more details in `here `_. +The ``fit`` and ``transform`` are two key operations in MLLIB. Basically, ``fit`` produces a "transformer", e.g. StringIndexer, +and each transformer applies ``transform`` method on DataFrame to add new column(s) containing transformed features/labels or +prediction results, etc. To understand more about ``fit`` and ``transform``, You can find more details in +`here `_. -Similarly, we can use another transformer, `VectorAssembler `_, to assemble feature columns "sepal length", "sepal width", "petal length" and "petal width" as a vector. +Similarly, we can use another transformer, `VectorAssembler `_, +to assemble feature columns "sepal length", "sepal width", "petal length" and "petal width" as a vector. .. 
code-block:: scala @@ -121,11 +130,6 @@ Now, we have a DataFrame containing only two columns, "features" which contains "sepal length", "sepal width", "petal length" and "petal width" and "classIndex" which has Double-typed labels. A DataFrame like this (containing vector-represented features and numeric labels) can be fed to XGBoost4J-Spark's training engine directly. -.. note:: - - There is no need to assemble feature columns from version 1.6.1+. Instead, users can specify an array of - feature column names by ``setFeaturesCol(value: Array[String])`` and XGBoost4j-Spark will do it. - Dealing with missing values ~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -148,69 +152,20 @@ Example of setting a missing value (e.g. -999) to the "missing" parameter in XGB setFeaturesCol("features"). setLabelCol("classIndex") -.. note:: Missing values with Spark's VectorAssembler - - If given a Dataset with enough features having a value of 0 Spark's VectorAssembler transformer class will return a - SparseVector where the absent values are meant to indicate a value of 0. This conflicts with XGBoost's default to - treat values absent from the SparseVector as missing. The model would effectively be - treating 0 as missing but not declaring that to be so which can lead to confusion when using the trained model on - other platforms. To avoid this, XGBoost will raise an exception if it receives a SparseVector and the "missing" - parameter has not been explicitly set to 0. To workaround this issue the user has three options: - - 1. Explicitly convert the Vector returned from VectorAssembler to a DenseVector to return the zeros to the dataset. If - doing this with missing values encoded as NaN, you will want to set ``setHandleInvalid = "keep"`` on VectorAssembler - in order to keep the NaN values in the dataset. You would then set the "missing" parameter to whatever you want to be - treated as missing. However this may cause a large amount of memory use if your dataset is very sparse. 
For example: - - .. code-block:: scala - - val assembler = new VectorAssembler().setInputCols(feature_names.toArray).setOutputCol("features").setHandleInvalid("keep") - - // conversion to dense vector using Array() - - val featurePipeline = new Pipeline().setStages(Array(assembler)) - val featureModel = featurePipeline.fit(df_training) - val featureDf = featureModel.transform(df_training) - - val xgbParam = Map("eta" -> 0.1f, - "max_depth" -> 2, - "objective" -> "multi:softprob", - "num_class" -> 3, - "num_round" -> 100, - "num_workers" -> 2, - "allow_non_zero_for_missing" -> "true", - "missing" -> -999) - - val xgb = new XGBoostClassifier(xgbParam) - val xgbclassifier = xgb.fit(featureDf) - - - 2. Before calling VectorAssembler you can transform the values you want to represent missing into an irregular value - that is not 0, NaN, or Null and set the "missing" parameter to 0. The irregular value should ideally be chosen to be - outside the range of values that your features have. - - 3. Do not use the VectorAssembler class and instead use a custom way of constructing a SparseVector that allows for - specifying sparsity to indicate a non-zero value. You can then set the "missing" parameter to whatever sparsity - indicates in your Dataset. If this approach is taken you can pass the parameter - ``"allow_non_zero_for_missing_value" -> true`` to bypass XGBoost's assertion that "missing" must be zero when given a - SparseVector. - - Option 1 is recommended if memory constraints are not an issue. Option 3 requires more work to get set up but is - guaranteed to give you correct results while option 2 will be quicker to set up but may be difficult to find a good - irregular value that does not conflict with your feature values. +.. note:: Missing values -.. note:: Using a non-default missing value when using other bindings of XGBoost. 
- - When XGBoost is saved in native format only the booster itself is saved, the value of the missing parameter is not - saved alongside the model. Thus, if a non-default missing parameter is used to train the model in Spark the user should - take care to use the same missing parameter when using the saved model in another binding. + If the feature is vector type, the single feature instance could be a SparseVector, where "0" will be treated as the missing value. + In order to get the correct model, XGBoost4j-Spark will convert the SparseVector to array by restoring the "0". However, we can't + assume 0 for missing values as it may be meaningful. So in this case, users need to specify the missing value explicitly + even though the missing value is set to ``Float.NaN`` by default in XGBoost4j-Spark. Training ======== -XGBoost supports both regression and classification. While we use Iris dataset in this tutorial to show how we use XGBoost/XGBoost4J-Spark to resolve a multi-classes classification problem, the usage in Regression is very similar to classification. +XGBoost supports regression, classification and ranking. While we use the Iris dataset in this tutorial to show how we +use XGBoost4J-Spark to resolve a multi-class classification problem, the usage in Regression and Ranking is very similar to classification. -To train a XGBoost model for classification, we need to claim a XGBoostClassifier first: +To train a XGBoost model for classification, we need to create a XGBoostClassifier first: .. code-block:: scala @@ -218,17 +173,19 @@ To train a XGBoost model for classification, we need to claim a XGBoostClassifie val xgbParam = Map("eta" -> 0.1f, "max_depth" -> 2, "objective" -> "multi:softprob", - "num_class" -> 3, - "num_round" -> 100, - "num_workers" -> 2) + "num_class" -> 3) val xgbClassifier = new XGBoostClassifier(xgbParam). + setNumRound(100). + setNumWorkers(2). setFeaturesCol("features").
setLabelCol("classIndex") -The available parameters for training a XGBoost model can be found in :doc:`here `. In XGBoost4J-Spark, we support not only the default set of parameters but also the camel-case variant of these parameters to keep consistent with Spark's MLLIB parameters. +The available parameters for training a XGBoost model can be found in :doc:`here `. In XGBoost4J-Spark, we support +not only the default set of parameters but also the camel-case variant of these parameters to keep consistent with Spark's MLLIB parameters. Specifically, each parameter in :doc:`this page ` has its -equivalent form in XGBoost4J-Spark with camel case. For example, to set ``max_depth`` for each tree, you can pass parameter just like what we did in the above code snippet (as ``max_depth`` wrapped in a Map), or you can do it through setters in XGBoostClassifer: +equivalent form in XGBoost4J-Spark with camel case. For example, to set ``max_depth`` for each tree, you can pass the parameter just +like what we did in the above code snippet (as ``max_depth`` wrapped in a Map), or you can do it through setters in XGBoostClassifier: .. code-block:: scala @@ -237,7 +194,9 @@ equivalent form in XGBoost4J-Spark with camel case. For example, to set ``max_de setLabelCol("classIndex") xgbClassifier.setMaxDepth(2) -After we set XGBoostClassifier parameters and feature/label column, we can build a transformer, XGBoostClassificationModel by fitting XGBoostClassifier with the input DataFrame. This ``fit`` operation is essentially the training process and the generated model can then be used in prediction. +After we set XGBoostClassifier parameters and feature/label column, we can build a transformer, XGBoostClassificationModel by +fitting XGBoostClassifier with the input DataFrame. This ``fit`` operation is essentially the training process and the generated +model can then be used in prediction. ..
code-block:: scala @@ -246,16 +205,23 @@ After we set XGBoostClassifier parameters and feature/label column, we can build Early Stopping ---------------- -Early stopping is a feature to prevent the unnecessary training iterations. By specifying ``num_early_stopping_rounds`` or directly call ``setNumEarlyStoppingRounds`` over a XGBoostClassifier or XGBoostRegressor, we can define number of rounds if the evaluation metric going away from the best iteration and early stop training iterations. +Early stopping is a feature that prevents unnecessary training iterations. By specifying ``num_early_stopping_rounds`` or +directly calling ``setNumEarlyStoppingRounds`` on a XGBoostClassifier or XGBoostRegressor, we can define the number of rounds +the evaluation metric may go without improving on the best iteration before training stops early. -When it comes to custom eval metrics, in additional to ``num_early_stopping_rounds``, you also need to define ``maximize_evaluation_metrics`` or call ``setMaximizeEvaluationMetrics`` to specify whether you want to maximize or minimize the metrics in training. For built-in eval metrics, XGBoost4J-Spark will automatically select the direction. +When it comes to custom eval metrics, in addition to ``num_early_stopping_rounds``, you also need to define ``maximize_evaluation_metrics`` +or call ``setMaximizeEvaluationMetrics`` to specify whether you want to maximize or minimize the metrics in training. For built-in eval metrics, +XGBoost4J-Spark will automatically select the direction. -For example, we need to maximize the evaluation metrics (set ``maximize_evaluation_metrics`` with true), and set ``num_early_stopping_rounds`` with 5. The evaluation metric of 10th iteration is the maximum one until now. In the following iterations, if there is no evaluation metric greater than the 10th iteration's (best one), the traning would be early stopped at 15th iteration.
+For example, suppose we need to maximize the evaluation metric (``maximize_evaluation_metrics`` set to true) and set ``num_early_stopping_rounds`` +to 5. The evaluation metric of the 10th iteration is the maximum so far. If no later iteration produces an evaluation metric +greater than the 10th iteration's (the best one), training stops early at the 15th iteration. -Training with Evaluation Sets ------------------------------ +Training with Evaluation Dataset +-------------------------------- -You can also monitor the performance of the model during training with multiple evaluation datasets. By specifying ``eval_sets`` or call ``setEvalSets`` over a XGBoostClassifier or XGBoostRegressor, you can pass in multiple evaluation datasets typed as a Map from String to DataFrame. +You can also monitor the performance of the model during training with an evaluation dataset by calling ``setEvalDataset`` on a +XGBoostClassifier, XGBoostRegressor or XGBoostRanker. Prediction ========== @@ -265,12 +231,15 @@ XGBoost4j-Spark supports two ways for model serving: batch prediction and single Batch Prediction ---------------- -When we get a model, either XGBoostClassificationModel or XGBoostRegressionModel, it takes a DataFrame, read the column containing feature vectors, predict for each feature vector, and output a new DataFrame with the following columns by default: +When we get a model, whether XGBoostClassificationModel, XGBoostRegressionModel or XGBoostRankerModel, it takes a DataFrame, reads the column containing +feature vectors, predicts for each feature vector, and outputs a new DataFrame with the following columns by default: * XGBoostClassificationModel will output margins (``rawPredictionCol``), probabilities(``probabilityCol``) and the eventual prediction labels (``predictionCol``) for each possible label. * XGBoostRegressionModel will output prediction label(``predictionCol``). +* XGBoostRankerModel will output a prediction label (``predictionCol``).
-Batch prediction expects the user to pass the testset in the form of a DataFrame. XGBoost4J-Spark starts a XGBoost worker for each partition of DataFrame for parallel prediction and generates prediction results for the whole DataFrame in a batch. +Batch prediction expects the user to pass the testset in the form of a DataFrame. XGBoost4J-Spark starts a XGBoost worker +for each partition of DataFrame for parallel prediction and generates prediction results for the whole DataFrame in a batch. .. code-block:: scala @@ -309,7 +278,7 @@ With the above code snippet, we get a result DataFrame, result containing margin Single instance prediction -------------------------- -XGBoostClassificationModel or XGBoostRegressionModel support make prediction on single instance as well. +XGBoostClassificationModel, XGBoostRegressionModel or XGBoostRankerModel supports making prediction on single instance as well. It accepts a single Vector as feature, and output the prediction label. However, the overhead of single-instance prediction is high due to the internal overhead of XGBoost, use it carefully! @@ -325,9 +294,13 @@ Model Persistence Model and pipeline persistence ------------------------------ -A data scientist produces an ML model and hands it over to an engineering team for deployment in a production environment. Reversely, a trained model may be used by data scientists, for example as a baseline, across the process of data exploration. So it's important to support model persistence to make the models available across usage scenarios and programming languages. +A data scientist produces an ML model and hands it over to an engineering team for deployment in a production environment. +Reversely, a trained model may be used by data scientists, for example as a baseline, across the process of data exploration. +So it's important to support model persistence to make the models available across usage scenarios and programming languages. 
-XGBoost4j-Spark supports saving and loading XGBoostClassifier/XGBoostClassificationModel and XGBoostRegressor/XGBoostRegressionModel. It also supports saving and loading a ML pipeline which includes these estimators and models. +XGBoost4j-Spark supports saving and loading XGBoostClassifier/XGBoostClassificationModel and XGBoostRegressor/XGBoostRegressionModel +and XGBoostRanker/XGBoostRankerModel to/from file system. It also supports saving and loading a ML pipeline which includes these +estimators and models. We can save the XGBoostClassificationModel to file system: @@ -347,7 +320,7 @@ and then loading the model in another session: .. note:: - Besides dumping the model to raw format, users are able to dump the model to be json or ubj format from ``version 1.7.0+``. + Besides dumping the model to raw format, users are able to dump the model to be json or ubj format. .. code-block:: scala @@ -362,7 +335,7 @@ Interact with Other Bindings of XGBoost After we train a model with XGBoost4j-Spark on massive dataset, sometimes we want to do model serving in single machine or integrate it with other single node libraries for further processing. -After saving the model, we can load this model with single node Python XGBoost directly from ``version 1.7.0+``. +After saving the model, we can load this model with single node Python XGBoost directly. .. code-block:: scala @@ -373,22 +346,7 @@ After saving the model, we can load this model with single node Python XGBoost d import xgboost as xgb bst = xgb.Booster({'nthread': 4}) - bst.load_model("/tmp/xgbClassificationModel/data/XGBoostClassificationModel") - -Before ``version 1.7.0``, XGBoost4j-Spark needs to export model to local manually by: - -.. code-block:: scala - - val nativeModelPath = "/tmp/nativeModel" - xgbClassificationModel.nativeBooster.saveModel(nativeModelPath) - -Then we can load this model with single node Python XGBoost: - -.. 
code-block:: python - - import xgboost as xgb - bst = xgb.Booster({'nthread': 4}) - bst.load_model(nativeModelPath) + bst.load_model("/tmp/xgbClassificationModel/data/model") .. note:: Consistency issue between XGBoost4J-Spark and other bindings @@ -400,7 +358,13 @@ Then we can load this model with single node Python XGBoost: spark.read.format("libsvm").load("trainingset_libsvm") - Spark assumes that the dataset is using 1-based indexing (feature indices staring with 1). However, when you do prediction with other bindings of XGBoost (e.g. Python API of XGBoost), XGBoost assumes that the dataset is using 0-based indexing (feature indices starting with 0) by default. It creates a pitfall for the users who train model with Spark but predict with the dataset in the same format in other bindings of XGBoost. The solution is to transform the dataset to 0-based indexing before you predict with, for example, Python API, or you append ``?indexing_mode=1`` to your file path when loading with DMatirx. For example in Python: + Spark assumes that the dataset is using 1-based indexing (feature indices starting with 1). However, + when you do prediction with other bindings of XGBoost (e.g. Python API of XGBoost), XGBoost assumes + that the dataset is using 0-based indexing (feature indices starting with 0) by default. It creates a + pitfall for the users who train model with Spark but predict with the dataset in the same format in + other bindings of XGBoost. The solution is to transform the dataset to 0-based indexing before you + predict with, for example, Python API, or you append ``?indexing_mode=1`` to your file path when + loading with DMatrix. For example in Python: .. code-block:: python @@ -433,7 +397,8 @@ The we build the ML pipeline which includes 4 stages: * Use XGBoostClassifier to train classification model. * Convert indexed double label back to original string label.
-We have shown the first three steps in the earlier sections, and the last step is finished with a new transformer `IndexToString `_: +We have shown the first three steps in the earlier sections, and the last step is finished with a new +transformer `IndexToString `_: .. code-block:: scala @@ -465,11 +430,14 @@ After we get the PipelineModel, we can make prediction on the test dataset and e Pipeline with Hyper-parameter Tunning ===================================== -The most critical operation to maximize the power of XGBoost is to select the optimal parameters for the model. Tuning parameters manually is a tedious and labor-consuming process. With the latest version of XGBoost4J-Spark, we can utilize the Spark model selecting tool to automate this process. +The most critical operation to maximize the power of XGBoost is to select the optimal parameters for the model. +Tuning parameters manually is a tedious and labor-consuming process. With the latest version of XGBoost4J-Spark, +we can utilize the Spark model selecting tool to automate this process. The following example shows the code snippet utilizing CrossValidation and MulticlassClassificationEvaluator to search the optimal combination of two XGBoost parameters, ``max_depth`` and ``eta``. (See :doc:`/parameter`.) -The model producing the maximum accuracy defined by MulticlassClassificationEvaluator is selected and used to generate the prediction for the test set. +The model producing the maximum accuracy defined by MulticlassClassificationEvaluator is selected and used to +generate the prediction for the test set. .. code-block:: scala @@ -497,21 +465,25 @@ The model producing the maximum accuracy defined by MulticlassClassificationEval Run XGBoost4J-Spark in Production ********************************* -XGBoost4J-Spark is one of the most important steps to bring XGBoost to production environment easier. In this section, we introduce three key features to run XGBoost4J-Spark in production. 
+XGBoost4J-Spark is one of the most important steps to bring XGBoost to production environment easier. In this section, +we introduce three key features to run XGBoost4J-Spark in production. Parallel/Distributed Training ============================= -The massive size of training dataset is one of the most significant characteristics in production environment. To ensure that training in XGBoost scales with the data size, XGBoost4J-Spark bridges the distributed/parallel processing framework of Spark and the parallel/distributed training mechanism of XGBoost. +The massive size of training dataset is one of the most significant characteristics in production environment. To ensure +that training in XGBoost scales with the data size, XGBoost4J-Spark bridges the distributed/parallel processing framework +of Spark and the parallel/distributed training mechanism of XGBoost. -In XGBoost4J-Spark, each XGBoost worker is wrapped by a Spark task and the training dataset in Spark's memory space is fed to XGBoost workers in a transparent approach to the user. +In XGBoost4J-Spark, each XGBoost worker is wrapped by a Spark task and the training dataset in Spark's memory space is +fed to XGBoost workers in a transparent approach to the user. In the code snippet where we build XGBoostClassifier, we set parameter ``num_workers`` (or ``numWorkers``). This parameter controls how many parallel workers we want to have when training a XGBoostClassificationModel. .. note:: Regarding OpenMP optimization - By default, we allocate a core per each XGBoost worker. Therefore, the OpenMP optimization within each XGBoost worker does not take effect and the parallelization of training is achieved - by running multiple workers (i.e. Spark tasks) at the same time. + By default, we allocate a core per each XGBoost worker. Therefore, the OpenMP optimization within each XGBoost worker does + not take effect and the parallelization of training is achieved by running multiple workers (i.e. 
Spark tasks) at the same time. If you do want OpenMP optimization, you have to @@ -521,17 +493,23 @@ This parameter controls how many parallel workers we want to have when training Gang Scheduling =============== XGBoost uses `AllReduce `_. -algorithm to synchronize the stats, e.g. histogram values, of each worker during training. Therefore XGBoost4J-Spark requires that all of ``nthread * numWorkers`` cores should be available before the training runs. +algorithm to synchronize the stats, e.g. histogram values, of each worker during training. Therefore XGBoost4J-Spark requires +that all of ``nthread * numWorkers`` cores should be available before the training runs. -In the production environment where many users share the same cluster, it's hard to guarantee that your XGBoost4J-Spark application can get all requested resources for every run. By default, the communication layer in XGBoost will block the whole application when it requires more resources to be available. This process usually brings unnecessary resource waste as it keeps the ready resources and try to claim more. Additionally, this usually happens silently and does not bring the attention of users. +In the production environment where many users share the same cluster, it's hard to guarantee that your XGBoost4J-Spark application +can get all requested resources for every run. By default, the communication layer in XGBoost will block the whole application when +it requires more resources to be available. This process usually brings unnecessary resource waste as it keeps the ready resources +and try to claim more. Additionally, this usually happens silently and does not bring the attention of users. -XGBoost4J-Spark allows the user to setup a timeout threshold for claiming resources from the cluster. If the application cannot get enough resources within this time period, the application would fail instead of wasting resources for hanging long. 
To enable this feature, you can set with XGBoostClassifier/XGBoostRegressor: +XGBoost4J-Spark allows the user to set up a timeout threshold for claiming resources from the cluster. If the application cannot get +enough resources within this time period, the application fails instead of hanging and wasting resources. To enable this +feature, call the setter on XGBoostClassifier/XGBoostRegressor/XGBoostRanker: .. code-block:: scala - xgbClassifier.setTimeoutRequestWorkers(60000L) + xgbClassifier.setRabitTrackerTimeout(60000L) -or pass in ``timeout_request_workers`` in ``xgbParamMap`` when building XGBoostClassifier: +or pass in ``rabit_tracker_timeout`` in ``xgbParamMap`` when building XGBoostClassifier: .. code-block:: scala @@ -541,20 +519,24 @@ or pass in ``timeout_request_workers`` in ``xgbParamMap`` when building XGBoostC "num_class" -> 3, "num_round" -> 100, "num_workers" -> 2, - "timeout_request_workers" -> 60000L) + "rabit_tracker_timeout" -> 60000L) val xgbClassifier = new XGBoostClassifier(xgbParam). setFeaturesCol("features"). setLabelCol("classIndex") -If XGBoost4J-Spark cannot get enough resources for running two XGBoost workers, the application would fail. Users can have external mechanism to monitor the status of application and get notified for such case. +If XGBoost4J-Spark cannot get enough resources to run two XGBoost workers, the application fails. +Users can set up an external mechanism to monitor the application status and get notified in such cases. Checkpoint During Training ========================== Transient failures are also commonly seen in production environment. To simplify the design of XGBoost, -we stop training if any of the distributed workers fail. However, if the training fails after having been through a long time, it would be a great waste of resources. +we stop training if any of the distributed workers fail. 
However, if training fails after running for +a long time, it would be a great waste of resources. -We support creating checkpoint during training to facilitate more efficient recovery from failure. To enable this feature, you can set how many iterations we build each checkpoint with ``setCheckpointInterval`` and the location of checkpoints with ``setCheckpointPath``: +We support creating checkpoints during training to facilitate more efficient recovery from failure. To enable this feature, +set how many iterations pass between checkpoints with ``setCheckpointInterval`` and the location of checkpoints +with ``setCheckpointPath``: .. code-block:: scala @@ -577,4 +559,5 @@ An equivalent way is to pass in parameters in XGBoostClassifier's constructor: setFeaturesCol("features"). setLabelCol("classIndex") -If the training failed during these 100 rounds, the next run of training would start by reading the latest checkpoint file in ``/checkpoints_path`` and start from the iteration when the checkpoint was built until to next failure or the specified 100 rounds. +If the training fails during these 100 rounds, the next run of training starts by reading the latest checkpoint +file in ``/checkpoints_path`` and resumes from the iteration when the checkpoint was built, continuing until the next failure or the specified 100 rounds. 
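Note on the indexing pitfall described in the tutorial's consistency note above: the first remedy it names (transforming the dataset to 0-based indexing before predicting with another binding) can be sketched as below. This is a minimal sketch, assuming a plain-text LIBSVM file with no trailing comments; the helper name ``libsvm_line_to_zero_based`` is hypothetical and not part of XGBoost. The other remedy from the note, appending ``?indexing_mode=1`` to the file path, needs no preprocessing.

```python
def libsvm_line_to_zero_based(line: str) -> str:
    """Shift the feature indices of one LIBSVM line from 1-based to 0-based.

    Hypothetical helper for illustration only; it is not an XGBoost API.
    """
    tokens = line.split()
    if not tokens:
        return line  # blank line: nothing to shift

    shifted = [tokens[0]]  # the first token is the label and stays unchanged
    for token in tokens[1:]:
        index, sep, value = token.partition(":")
        if not sep:
            raise ValueError(f"malformed feature token: {token!r}")
        if index == "qid":
            # Query-group ids (ranking data) are not feature indices.
            shifted.append(token)
            continue
        new_index = int(index) - 1
        if new_index < 0:
            raise ValueError(f"index {index} is not 1-based: {token!r}")
        shifted.append(f"{new_index}:{value}")
    return " ".join(shifted)


if __name__ == "__main__":
    # A row written by a 1-based producer such as Spark's libsvm format.
    print(libsvm_line_to_zero_based("1 1:0.5 3:2.0"))  # prints "1 0:0.5 2:2.0"
```

Applying the function to every line of the file before loading it in another binding keeps the feature positions aligned with what the Spark-trained model saw. The explicit error on an index of 0 guards against converting a file that is already 0-based.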
diff --git a/doc/jvm/xgboost_spark_migration.rst b/doc/jvm/xgboost_spark_migration.rst index 5d75457ec019..976acd5bb6d4 100644 --- a/doc/jvm/xgboost_spark_migration.rst +++ b/doc/jvm/xgboost_spark_migration.rst @@ -1,19 +1,19 @@ -######################################################## -Migration Guide: How to migrate to XGBoost-Spark jvm 3.x -######################################################## +########################################################## +Migration Guide: How to migrate to XGBoost4j-Spark jvm 3.x +########################################################## -XGBoost-Spark jvm packages underwent significant modifications in version 3.0, +XGBoost4j-Spark jvm packages underwent significant modifications in version 3.0, which may cause compatibility issues with existing user code. This guide will walk you through the process of updating your code to ensure -it's compatible with XGBoost-Spark 3.0 and later versions. +it's compatible with XGBoost4j-Spark 3.0 and later versions. -********************** -XGBoost Spark Packages -********************** +************************ +XGBoost4j Spark Packages +************************ -XGBoost-Spark 3.0 introduced a single uber package named xgboost-spark_2.12-3.0.0.jar, which bundles -both xgboost4j and xgboost4j-spark. This means you can now simply use `xgboost-spark`` for your application. +XGBoost4j-Spark 3.0 has assembled xgboost4j package into xgboost4j-spark_2.12-3.0.0.jar, which means +you can now simply use `xgboost4j-spark` for your application. * For CPU @@ -21,7 +21,7 @@ both xgboost4j and xgboost4j-spark. This means you can now simply use `xgboost-s ml.dmlc - xgboost-spark_${scala.binary.version} + xgboost4j-spark_${scala.binary.version} 3.0.0 @@ -31,19 +31,19 @@ both xgboost4j and xgboost4j-spark. 
This means you can now simply use `xgboost-s ml.dmlc - xgboost-spark-gpu_${scala.binary.version} + xgboost4j-spark-gpu_${scala.binary.version} 3.0.0 -When submitting the XGBoost application to the Spark cluster, you only need to specify the single `xgboost-spark` package. +When submitting the XGBoost application to the Spark cluster, you only need to specify the single `xgboost4j-spark` package. * For CPU .. code-block:: bash spark-submit \ - --jars xgboost-spark_2.12-3.0.0.jar \ + --jars xgboost4j-spark_2.12-3.0.0.jar \ ... \ @@ -52,7 +52,7 @@ When submitting the XGBoost application to the Spark cluster, you only need to s .. code-block:: bash spark-submit \ - --jars xgboost-spark_2.12-3.0.0.jar \ + --jars xgboost4j-spark-gpu_2.12-3.0.0.jar \ ... \ *************** @@ -63,65 +63,18 @@ Learning to rank using XGBoostRegressor has been replaced by a dedicated `XGBoos to support ranking algorithms. .. code-block:: scala - - // before 3.0 + + // before xgboost4j-spark 3.0 val regressor = new XGBoostRegressor().setObjective("rank:ndcg") - // after 3.0 + // after xgboost4j-spark 3.0 val ranker = new XGBoostRanker() -****************************** -XGBoost Constructor Parameters -****************************** - -XGBoost Spark now categorizes parameters into two groups: XGBoost-Spark parameters and XGBoost parameters. -When constructing an XGBoost estimator, only XGBoost-specific parameters are permitted. XGBoost-Spark specific -parameters must be configured using the estimator's setter methods. It's worth noting that -`XGBoost Parameters `_ -can be set both during construction and through the estimator's setter methods. - -.. 
code-block:: scala - - // before 3.0 - val xgboost_paras = Map( - "eta" -> "1", - "max_depth" -> "6", - "objective" -> "binary:logistic", - "num_round" -> 5, - "num_workers" -> 1, - "features" -> "feature_column", - "label" -> "label_column", - ) - val classifier = new XGBoostClassifier(xgboost_paras) - - - // after 3.0 - val xgboost_paras = Map( - "eta" -> "1", - "max_depth" -> "6", - "objective" -> "binary:logistic", - ) - val classifier = new XGBoostClassifier(xgboost_paras) - .setNumRound(5) - .setNumWorkers(1) - .setFeaturesCol("feature_column") - .setLabelCol("label_column") - - // Or you can use setter to set all parameters - val classifier = new XGBoostClassifier() - .setNumRound(5) - .setNumWorkers(1) - .setFeaturesCol("feature_column") - .setLabelCol("label_column") - .setEta(1) - .setMaxDepth(6) - .setObjective("binary:logistic") - ****************** Removed Parameters ****************** -Starting from 3.0, below parameters are removed. +Starting from xgboost4j-spark 3.0, below parameters are removed. - cacheTrainingSet @@ -129,7 +82,7 @@ Starting from 3.0, below parameters are removed. in your code prior to fitting the data to an estimator. .. code-block:: scala - + val df = input.cache() val model = new XGBoostClassifier().fit(df) @@ -138,7 +91,7 @@ Starting from 3.0, below parameters are removed. The following method can be employed to do the evaluation. .. code-block:: scala - + val Array(train, eval) = trainDf.randomSplit(Array(0.7, 0.3)) val classifier = new XGBoostClassifer().setEvalDataset(eval) val model = classifier.fit(train) @@ -148,7 +101,7 @@ Starting from 3.0, below parameters are removed. The following method can be used to configure RabitTracker. .. code-block:: scala - + val classifier = new XGBoostClassifer() .setRabitTrackerTimeout(100) .setRabitTrackerHostIp("192.168.0.2")