From 9c27c3beeda6c732c701ef4756febc84f130d81a Mon Sep 17 00:00:00 2001 From: Thomas Capelle Date: Fri, 14 Jun 2024 14:22:48 +0200 Subject: [PATCH] fix it (#531) --- ...edit_Scorecards_with_XGBoost_and_W&B.ipynb | 164 ++++++++---------- 1 file changed, 68 insertions(+), 96 deletions(-) diff --git a/colabs/boosting/Credit_Scorecards_with_XGBoost_and_W&B.ipynb b/colabs/boosting/Credit_Scorecards_with_XGBoost_and_W&B.ipynb index 57b392ef..2af09b7a 100644 --- a/colabs/boosting/Credit_Scorecards_with_XGBoost_and_W&B.ipynb +++ b/colabs/boosting/Credit_Scorecards_with_XGBoost_and_W&B.ipynb @@ -1,7 +1,6 @@ { "cells": [ { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -10,7 +9,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -22,7 +20,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -31,7 +28,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -43,21 +39,19 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# In this notebook\n", "\n", - "In this colab we'll cover how Weights and Biases enables regulated entities to \n", + "In this colab we'll cover how Weights and Biases enables regulated entities to\n", "- **Track and version** their data ETL pipelines (locally or in cloud services such as S3 and GCS)\n", - "- **Track experiment results** and store trained models \n", - "- **Visually inspect** multiple evaluation metrics \n", + "- **Track experiment results** and store trained models\n", + "- **Visually inspect** multiple evaluation metrics\n", "- **Optimize performance** with hyperparameter sweeps" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -67,7 +61,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -75,7 +68,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -85,7 +77,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -93,7 +84,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -107,7 +97,7 @@ "outputs": [], "source": [ "!pip install -qq \"wandb>=0.13.10\" dill\n", - "!pip install -qq \"xgboost>=1.7.4\" \"scikit-learn>=1.2.1\"" + "!pip install -qq \"xgboost>=2.0.0\" \"scikit-learn>=1.2.1\"" ] }, { @@ -137,7 +127,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -145,7 +134,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -153,7 +141,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -161,13 +148,12 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Weights and Biases **Artifacts** enable you to log end-to-end training pipelines to ensure your experiments are always reproducible.\n", "\n", - "Data privacy is critical to Weights & Biases and so we support the creation of Artifacts from reference locations such as your own private cloud such as AWS S3 or Google Cloud Storage. Local, on-premises of W&B are also available upon request. \n", + "Data privacy is critical to Weights & Biases and so we support the creation of Artifacts from reference locations such as your own private cloud such as AWS S3 or Google Cloud Storage. 
Local, on-premises of W&B are also available upon request.\n", "\n", "By default, W&B stores artifact files in a private Google Cloud Storage bucket located in the United States. All files are encrypted at rest and in transit. For sensitive files, we recommend a private W&B installation or the use of reference artifacts.\n", "\n", @@ -196,12 +182,11 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Login to W&B\n", - "Login to Weights and Biases " + "Login to Weights and Biases" ] }, { @@ -225,13 +210,12 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Vehicle Loan Dataset\n", "\n", - "We will be using a simplified version of the [Vehicle Loan Default Prediction dataset](https://www.kaggle.com/sneharshinde/ltfs-av-data) from L&T which has been stored in W&B Artifacts. " + "We will be using a simplified version of the [Vehicle Loan Default Prediction dataset](https://www.kaggle.com/sneharshinde/ltfs-av-data) from L&T which has been stored in W&B Artifacts." ] }, { @@ -250,7 +234,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -264,11 +247,10 @@ "outputs": [], "source": [ "def function_to_string(fn):\n", - " return getsource(detect.code(fn)) " + " return getsource(detect.code(fn))" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -287,7 +269,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -314,12 +295,10 @@ "from data_utils import (\n", " describe_data_g_targ,\n", " one_hot_encode_data,\n", - " load_training_data,\n", ")" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -344,7 +323,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -358,13 +336,13 @@ "outputs": [], "source": [ "# Create a new artifact for the processed data, including the function that created it, to Artifacts\n", - "processed_ds_art = wandb.Artifact(name='vehicle_defaults_processed', \n", + "processed_ds_art = wandb.Artifact(name='vehicle_defaults_processed',\n", " type='processed_dataset',\n", " description='One-hot encoded dataset',\n", " metadata={'preprocessing_fn': function_to_string(one_hot_encode_data)}\n", " )\n", "\n", - "# Attach our processed data to the Artifact \n", + "# Attach our processed data to the Artifact\n", "processed_ds_art.add_file(processed_data_path)\n", "\n", "# Log this Artifact to the current wandb run\n", @@ -374,19 +352,18 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Get Train/Validation Split\n", "\n", - "Here we show an alternative pattern for how to create a wandb run object. In the cell below, the code to split the dataset is wrapped with a call to `wandb.init() as run`. \n", + "Here we show an alternative pattern for how to create a wandb run object. 
In the cell below, the code to split the dataset is wrapped with a call to `wandb.init() as run`.\n", "\n", "Here we will:\n", "\n", "- Start a wandb run\n", "- Download our one-hot-encoded dataset from Artifacts\n", - "- Do the Train/Val split and log the params used in the split \n", + "- Do the Train/Val split and log the params used in the split\n", "- Log the new `trndat` and `valdat` datasets to Artifacts\n", "- Finish the wandb run automatically" ] @@ -398,49 +375,48 @@ "outputs": [], "source": [ "with wandb.init(project=WANDB_PROJECT, job_type='train-val-split') as run: # config is optional here\n", - " \n", + "\n", " # Download the subset of the vehicle loan default data from W&B\n", " dataset_art = run.use_artifact('vehicle_defaults_processed:latest', type='processed_dataset')\n", " dataset_dir = dataset_art.download(data_dir)\n", " dataset = pd.read_csv(processed_data_path)\n", - " \n", + "\n", " # Set Split Params\n", " test_size = 0.25\n", " random_state = 42\n", - " \n", + "\n", " # Log the splilt params\n", " run.config.update({'test_size':test_size, 'random_state': random_state})\n", - " \n", + "\n", " # Do the Train/Val Split\n", - " trndat, valdat = model_selection.train_test_split(dataset, test_size=test_size, \n", + " trndat, valdat = model_selection.train_test_split(dataset, test_size=test_size,\n", " random_state=random_state, stratify=dataset[[targ_var]])\n", "\n", " print(f'Train dataset size: {trndat[targ_var].value_counts()} \\n')\n", " print(f'Validation dataset sizeL {valdat[targ_var].value_counts()}')\n", - " \n", + "\n", " # Save split datasets\n", " train_path = data_dir/'train.csv'\n", " val_path = data_dir/'val.csv'\n", " trndat.to_csv(train_path, index=False)\n", " valdat.to_csv(val_path, index=False)\n", - " \n", + "\n", " # Create a new artifact for the processed data, including the function that created it, to Artifacts\n", - " split_ds_art = wandb.Artifact(name='vehicle_defaults_split', \n", + " split_ds_art = wandb.Artifact(name='vehicle_defaults_split',\n", " type='train-val-dataset',\n", " description='Processed dataset split into train and valiation',\n", " metadata={'test_size': test_size, 'random_state': random_state}\n", " )\n", - " \n", - " # Attach our processed data to the Artifact \n", + "\n", + " # Attach our processed data to the Artifact\n", " split_ds_art.add_file(train_path)\n", " split_ds_art.add_file(val_path)\n", - " \n", + "\n", " # Log the Artifact\n", " run.log_artifact(split_ds_art)" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -459,7 +435,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -488,7 +463,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -496,7 +470,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -513,7 +486,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -530,7 +502,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -543,7 +514,7 @@ "metadata": {}, "outputs": [], "source": [ - "base_rate = round(trndict['base_rate'], 6) \n", + "base_rate = round(trndict['base_rate'], 6)\n", "early_stopping_rounds = 40" ] }, @@ -561,7 +532,7 @@ " , 'max_depth': 3\n", " , 'min_child_weight': 100 ## def: 1\n", " , 'n_estimators': 25\n", - " , 'nthread': 24 \n", + " , 'nthread': 24\n", " , 'random_state': 42\n", " , 'reg_alpha': 0\n", " , 'reg_lambda': 0 ## def: 1\n", @@ -571,11 +542,10 @@ ] }, { - 
"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "Log the xgboost training parameters to the W&B run config " + "Log the xgboost training parameters to the W&B run config" ] }, { @@ -589,11 +559,28 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "#### 3) Load the Training Data from W&B Artifacts" + "#### 3) Let's select the data for train/validation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data_dir" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "valdat" ] }, { @@ -602,17 +589,12 @@ "metadata": {}, "outputs": [], "source": [ - "# Load our training data from Artifacts\n", - "trndat, valdat = load_training_data(run=run, data_dir=data_dir, \n", - " artifact_name='vehicle_defaults_split:latest')\n", - "\n", "## Extract target column as a series\n", "y_trn = trndat.loc[:,targ_var].astype(int)\n", "y_val = valdat.loc[:,targ_var].astype(int)" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -627,20 +609,19 @@ "metadata": {}, "outputs": [], "source": [ - "from wandb.xgboost import WandbCallback\n", + "from wandb.integration.xgboost import WandbCallback\n", "\n", "# Initialize the XGBoostClassifier with the WandbCallback\n", - "xgbmodel = xgb.XGBClassifier(**bst_params, \n", + "xgbmodel = xgb.XGBClassifier(**bst_params,\n", " callbacks=[WandbCallback(log_model=True)],\n", " early_stopping_rounds=run.config['early_stopping_rounds'])\n", "\n", "# Train the model\n", - "xgbmodel.fit(trndat[p_vars], y_trn, \n", + "xgbmodel.fit(trndat[p_vars], y_trn,\n", " eval_set=[(valdat[p_vars], y_val)])" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -657,10 +638,10 @@ "\n", "# Get train and validation predictions\n", "trnYpreds = xgbmodel.predict_proba(trndat[p_vars])[:,1]\n", - "valYpreds = xgbmodel.predict_proba(valdat[p_vars])[:,1] \n", + "valYpreds = xgbmodel.predict_proba(valdat[p_vars])[:,1]\n", "\n", "# Log additional Train metrics\n", - "false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(y_trn, trnYpreds) \n", + "false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(y_trn, trnYpreds)\n", "run.summary['train_ks_stat'] = max(true_positive_rate - false_positive_rate)\n", "run.summary['train_auc'] = metrics.auc(false_positive_rate, true_positive_rate)\n", "run.summary['train_log_loss'] = -(y_trn * np.log(trnYpreds) + (1-y_trn) * np.log(1-trnYpreds)).sum() / len(y_trn)\n", @@ -671,12 +652,11 @@ "run.summary[\"val_ks_pval\"] = ks_pval\n", "run.summary[\"val_auc\"] = metrics.roc_auc_score(y_val, valYpreds)\n", "run.summary[\"val_acc_0.5\"] = metrics.accuracy_score(y_val, np.where(valYpreds >= 0.5, 1, 0))\n", - "run.summary[\"val_log_loss\"] = -(y_val * np.log(valYpreds) \n", + "run.summary[\"val_log_loss\"] = -(y_val * np.log(valYpreds)\n", " + (1-y_val) * np.log(1-valYpreds)).sum() / len(y_val)" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -697,14 +677,13 @@ " d +=1\n", " valYpreds_2d = valYpreds_2d[::1, ::d]\n", " y_val_arr = y_val_arr[::d]\n", - " \n", + "\n", "run.log({\"ROC_Curve\" : wandb.plot.roc_curve(y_val_arr, valYpreds_2d.T,\n", " labels=['no_default','loan_default'],\n", " classes_to_plot=[1])})" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -721,7 +700,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", 
"metadata": {}, "source": [ @@ -735,7 +713,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -743,7 +720,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -782,7 +758,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -790,7 +765,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -803,9 +777,9 @@ "metadata": {}, "outputs": [], "source": [ - "def train(): \n", + "def train():\n", " with wandb.init(job_type=\"sweep\") as run:\n", - " \n", + "\n", " bst_params = {\n", " 'objective': 'binary:logistic'\n", " , 'base_score': base_rate\n", @@ -814,34 +788,34 @@ " , 'max_depth': 3\n", " , 'min_child_weight': run.config['min_child_weight']\n", " , 'n_estimators': 25\n", - " , 'nthread': 24 \n", + " , 'nthread': 24\n", " , 'random_state': 42\n", " , 'reg_alpha': 0\n", " , 'reg_lambda': 0 ## def: 1\n", " , 'eval_metric': ['auc', 'logloss']\n", - " , 'tree_method': 'hist' \n", + " , 'tree_method': 'hist'\n", " }\n", - " \n", + "\n", " # Initialize the XGBoostClassifier with the WandbCallback\n", - " xgbmodel = xgb.XGBClassifier(**bst_params, \n", + " xgbmodel = xgb.XGBClassifier(**bst_params,\n", " callbacks=[WandbCallback()],\n", " early_stopping_rounds=run.config['early_stopping_rounds'])\n", "\n", " # Train the model\n", - " xgbmodel.fit(trndat[p_vars], y_trn, \n", + " xgbmodel.fit(trndat[p_vars], y_trn,\n", " eval_set=[(valdat[p_vars], y_val)])\n", "\n", " bstr = xgbmodel.get_booster()\n", "\n", " # Log booster metrics\n", - " run.summary[\"best_ntree_limit\"] = bstr.best_ntree_limit\n", - " \n", + " run.summary[\"best_iteration\"] = bstr.best_iteration\n", + "\n", " # Get train and validation predictions\n", " trnYpreds = xgbmodel.predict_proba(trndat[p_vars])[:,1]\n", - " valYpreds = xgbmodel.predict_proba(valdat[p_vars])[:,1] \n", + " valYpreds = xgbmodel.predict_proba(valdat[p_vars])[:,1]\n", "\n", " # Log additional Train metrics\n", - " false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(y_trn, trnYpreds) \n", + " false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(y_trn, trnYpreds)\n", " run.summary['train_ks_stat'] = max(true_positive_rate - false_positive_rate)\n", " run.summary['train_auc'] = metrics.auc(false_positive_rate, true_positive_rate)\n", " run.summary['train_log_loss'] = -(y_trn * np.log(trnYpreds) + (1-y_trn) * np.log(1-trnYpreds)).sum() / len(y_trn)\n", @@ -852,12 +826,11 @@ " run.summary[\"val_ks_pval\"] = ks_pval\n", " run.summary[\"val_auc\"] = metrics.roc_auc_score(y_val, valYpreds)\n", " run.summary[\"val_acc_0.5\"] = metrics.accuracy_score(y_val, np.where(valYpreds >= 0.5, 1, 0))\n", - " run.summary[\"val_log_loss\"] = -(y_val * np.log(valYpreds) \n", + " run.summary[\"val_log_loss\"] = -(y_val * np.log(valYpreds)\n", " + (1-y_val) * np.log(1-valYpreds)).sum() / len(y_val)" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -870,12 +843,11 @@ "metadata": {}, "outputs": [], "source": [ - "count = 10 # number of runs to execute\n", + "count = 5 # number of runs to execute\n", "wandb.agent(sweep_id, function=train, count=count)" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -890,9 +862,9 @@ "- Fastai\n", "- XGBoost\n", "- Sci-Kit Learn\n", - "- LightGBM \n", + "- LightGBM\n", "\n", - "**See [W&B integrations for details](https://docs.wandb.ai/guides/integrations)** " + "**See [W&B integrations for 
details](https://docs.wandb.ai/guides/integrations)**" ] } ], @@ -909,5 +881,5 @@ } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 0 }
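
---

The code changes in this patch move the callback import from `wandb.xgboost` to `wandb.integration.xgboost`, pin `xgboost>=2.0.0`, and replace the removed `best_ntree_limit` attribute with `best_iteration`. Below is a minimal, self-contained sketch of that post-patch training pattern; the synthetic data, the project name, and the hyperparameter values are illustrative stand-ins, not the notebook's actual configuration.

```python
# Sketch of the post-patch training pattern, assuming wandb>=0.13.10 and
# xgboost>=2.0.0 (where Booster.best_ntree_limit is gone and best_iteration
# is the replacement). Data is synthetic; hyperparameters are illustrative.
import numpy as np
import pandas as pd
import xgboost as xgb
import wandb
from wandb.integration.xgboost import WandbCallback  # import path used by the patch

# Synthetic stand-ins for the notebook's trndat/valdat splits
rng = np.random.default_rng(42)
X_trn = pd.DataFrame(rng.normal(size=(1000, 10)), columns=[f"f{i}" for i in range(10)])
y_trn = rng.integers(0, 2, size=1000)
X_val = pd.DataFrame(rng.normal(size=(250, 10)), columns=[f"f{i}" for i in range(10)])
y_val = rng.integers(0, 2, size=250)

# Project name is a placeholder; the notebook uses its own WANDB_PROJECT value
run = wandb.init(project="vehicle_default_prediction", job_type="train-model")

# Callbacks and early stopping go to the constructor, not to .fit()
xgbmodel = xgb.XGBClassifier(
    objective="binary:logistic",
    n_estimators=25,
    max_depth=3,
    eval_metric=["auc", "logloss"],
    callbacks=[WandbCallback(log_model=True)],
    early_stopping_rounds=40,
)
xgbmodel.fit(X_trn, y_trn, eval_set=[(X_val, y_val)])

# best_iteration replaces the removed best_ntree_limit in xgboost 2.x
run.summary["best_iteration"] = xgbmodel.get_booster().best_iteration
run.finish()
```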
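
The sweep cells keep their structure and only lower the agent's run count from 10 to 5. The notebook's `sweep_config` is not part of this patch, so the sketch below assumes a plausible configuration — the search method, metric, and parameter ranges are illustrative guesses — and reuses the `train()` function defined earlier in the notebook, which reads `min_child_weight` and `early_stopping_rounds` from `run.config`.

```python
import wandb

# Hypothetical sweep configuration: the actual sweep_config is not shown in
# this patch, so method, metric, and parameter ranges here are assumptions.
sweep_config = {
    "method": "random",
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "min_child_weight": {"values": [10, 50, 100, 200]},
        "early_stopping_rounds": {"values": [10, 20, 40]},
    },
}

# Project name is a placeholder for the notebook's WANDB_PROJECT
sweep_id = wandb.sweep(sweep_config, project="vehicle_default_prediction")

# The patch lowers the number of agent runs from 10 to 5;
# `train` is the sweep training function defined earlier in the notebook.
wandb.agent(sweep_id, function=train, count=5)
```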