Update vertical xgb examples for gpu support #1977

Merged
merged 8 commits on Sep 19, 2023
12 changes: 9 additions & 3 deletions examples/advanced/vertical_xgboost/README.md
@@ -7,13 +7,13 @@ Before starting please make sure you set up a [virtual environment](../../../REA
python3 -m pip install -r requirements.txt
```

-> **_NOTE:_** If vertical federated learning support is not available in the XGBoost PyPI release yet, reinstall XGBoost from a [wheel](https://xgboost.readthedocs.io/en/stable/install.html#nightly-build) with a recent commit.
+> **_NOTE:_** If vertical federated learning support or GPU support is not yet available in the XGBoost PyPI release, either reinstall XGBoost from a [wheel](https://xgboost.readthedocs.io/en/stable/install.html#nightly-build) built from a recent commit on the master branch, or build it from [source](https://github.com/dmlc/xgboost/blob/master/plugin/federated/README.md). When building XGBoost from source, ensure that sufficiently recent versions of gRPC, CUDA, and NCCL are installed, and use the cmake options `-DPLUGIN_FEDERATED -DUSE_CUDA -DUSE_NCCL` (`-DNCCL_LIBRARY` and `-DUSE_NCCL_LIB_PATH` might also be needed depending on the location of NCCL). Lastly, we recommend using a [CUDA image](https://hub.docker.com/r/nvidia/cuda/tags) if you prefer working with Docker.
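A quick way to confirm that the installed XGBoost build has the needed capabilities is to inspect its build information (a minimal sketch; the exact keys reported, such as `USE_FEDERATED` and `USE_CUDA`, vary with the XGBoost version):

```python
import xgboost as xgb

# Print the installed version and the flags XGBoost was compiled with.
# A build suitable for this example should report federated-plugin and
# CUDA/NCCL support among these entries (key names vary by version).
print(xgb.__version__)
for key, value in xgb.build_info().items():
    print(f"{key}: {value}")
```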

## Preparing HIGGS Data
In this example we showcase a binary classification task based on the [HIGGS dataset](https://archive.ics.uci.edu/dataset/280/higgs), which contains 11 million instances, each with 28 features and 1 class label.

### Download and Store Dataset
-We first download the dataset from the HIGGS link above, which is a single zipped `.csv` file.
+First download the dataset from the HIGGS link above, which is a single zipped `.csv` file.
By default, we assume the dataset is downloaded, uncompressed, and stored in `~/dataset/HIGGS.csv`.

### Vertical Data Splits
@@ -47,7 +47,13 @@ Next, we can use `FedXGBHistogramExecutor` and set XGBoost training parameters i

Lastly, we must subclass `XGBDataLoader` and implement the `load_data()` method. For vertical federated learning, it is important when creating the `xgb.Dmatrix` to set `data_split_mode=1` for column mode, and to specify the presence of a label column `?format=csv&label_column=0` for the csv file. To support PSI, the dataloader can also read in the dataset based on the calculated intersection, and split the data into training and validation.
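A minimal sketch of such a loader is shown below (the file layout, constructor argument, and `load_data()` signature are assumptions for illustration; the example's shipped loader also intersects the data via PSI before building the matrices):

```python
import os

import xgboost as xgb
from nvflare.app_opt.xgboost.data_loader import XGBDataLoader


class HiggsVerticalDataLoader(XGBDataLoader):
    def __init__(self, data_split_dir: str):
        # Directory holding this site's vertical slice of HIGGS,
        # pre-split into train.csv and valid.csv (assumed layout).
        self.data_split_dir = data_split_dir

    def load_data(self, client_id: str):
        # "?format=csv&label_column=0" marks the first column as the label
        # (only the label-owning site actually has one), and
        # data_split_mode=1 selects column-wise (vertical) splitting.
        dtrain = xgb.DMatrix(
            os.path.join(self.data_split_dir, client_id, "train.csv") + "?format=csv&label_column=0",
            data_split_mode=1,
        )
        dvalid = xgb.DMatrix(
            os.path.join(self.data_split_dir, client_id, "valid.csv") + "?format=csv&label_column=0",
            data_split_mode=1,
        )
        return dtrain, dvalid
```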

-> **_NOTE:_** For secure mode, make sure to provide the required certificates for the federated communicator. Also as of now, GPUs are not yet supported by vertical federated XGBoost.
+> **_NOTE:_** For secure mode, make sure to provide the required certificates for the federated communicator.

+### GPU Support
+By default, CPU-based training is used.
+
+To enable GPU-accelerated training, first ensure that your machine has CUDA installed and at least one GPU available.
+In `config_fed_client.json`, set `"use_gpus": true` and `"tree_method": "hist"` in `xgb_params`. `FedXGBHistogramExecutor` then uses the `device` parameter in `xgb_params` to map each rank to a GPU device ordinal. With multiple GPUs, each rank can be mapped to a different device; with a single GPU, all ranks can also map to the same device.
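The rank-to-device mapping amounts to the following (a simplified sketch with `assign_device` as a hypothetical helper; the actual assignment happens inside `FedXGBHistogramExecutor.train()`, shown further down):

```python
def assign_device(xgb_params: dict, rank: int, num_gpus: int) -> dict:
    # Each rank trains on its own GPU when enough are available;
    # when simulating several ranks on one GPU, they all share cuda:0.
    gpu_ordinal = rank % num_gpus if num_gpus > 0 else 0
    xgb_params["device"] = f"cuda:{gpu_ordinal}"
    return xgb_params


# e.g. rank 1 in a 2-GPU setup trains on cuda:1
params = assign_device({"tree_method": "hist"}, rank=1, num_gpus=2)
```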

## Run the Example
Run the vertical xgboost job:
config_fed_client.json
@@ -17,11 +17,12 @@
"gamma": 1.0,
"max_depth": 8,
"min_child_weight": 100,
"tree_method": "approx",
"tree_method": "hist",
"grow_policy": "depthwise",
"eval_metric": "auc"
},
"data_loader_id": "dataloader"
"data_loader_id": "dataloader",
"use_gpus": false
}
}
}
21 changes: 18 additions & 3 deletions nvflare/app_opt/xgboost/histogram_based/executor.py
@@ -73,7 +73,15 @@ class FedXGBHistogramExecutor(Executor):
This class implements a basic xgb_train logic, feel free to overwrite the function for custom behavior.
"""

-def __init__(self, num_rounds, early_stopping_rounds, xgb_params: dict, data_loader_id: str, verbose_eval=False):
+def __init__(
+    self,
+    num_rounds,
+    early_stopping_rounds,
+    xgb_params: dict,
+    data_loader_id: str,
+    verbose_eval=False,
+    use_gpus=False,
+):
"""Federated XGBoost Executor for histogram-base collaboration.

This class sets up the training environment for Federated XGBoost.
@@ -88,14 +96,17 @@ def __init__(self, num_rounds, early_stopping_rounds, xgb_params: dict, data_loa
https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.training
data_loader_id: the ID points to XGBDataLoader.
verbose_eval: verbose_eval in xgboost.train
+use_gpus: flag to enable gpu training
"""
super().__init__()
self.app_dir = None

self.num_rounds = num_rounds
self.early_stopping_rounds = early_stopping_rounds
-self.verbose_eval = verbose_eval
self.xgb_params = xgb_params
+self.data_loader_id = data_loader_id
+self.verbose_eval = verbose_eval
+self.use_gpus = use_gpus

self.rank = None
self.world_size = None
@@ -104,7 +115,6 @@ def __init__(self, num_rounds, early_stopping_rounds, xgb_params: dict, data_loa
self._client_key_path = None
self._client_cert_path = None
self._server_address = "localhost"
-self.data_loader_id = data_loader_id
self.train_data = None
self.val_data = None

@@ -236,6 +246,11 @@ def train(self, shareable: Shareable, fl_ctx: FLContext, abort_signal: Signal) -
self.rank = rank_map[client_name]
self.world_size = world_size

+if self.use_gpus:
+    # map each rank to a GPU device ordinal (every rank can use cuda:0 when simulating with a single GPU)
+    self.log_info(fl_ctx, f"Training with GPU {self.rank}")
+    self.xgb_params["device"] = f"cuda:{self.rank}"

self.log_info(fl_ctx, f"Using xgb params: {self.xgb_params}")
params = XGBoostParams(
xgb_params=self.xgb_params,