From 9c27c3beeda6c732c701ef4756febc84f130d81a Mon Sep 17 00:00:00 2001 From: Thomas Capelle Date: Fri, 14 Jun 2024 14:22:48 +0200 Subject: [PATCH] fix it (#531) --- ...edit_Scorecards_with_XGBoost_and_W&B.ipynb | 164 ++++++++---------- 1 file changed, 68 insertions(+), 96 deletions(-) diff --git a/colabs/boosting/Credit_Scorecards_with_XGBoost_and_W&B.ipynb b/colabs/boosting/Credit_Scorecards_with_XGBoost_and_W&B.ipynb index 57b392ef..2af09b7a 100644 --- a/colabs/boosting/Credit_Scorecards_with_XGBoost_and_W&B.ipynb +++ b/colabs/boosting/Credit_Scorecards_with_XGBoost_and_W&B.ipynb @@ -1,7 +1,6 @@ { "cells": [ { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -10,7 +9,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -22,7 +20,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -31,7 +28,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -43,21 +39,19 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# In this notebook\n", "\n", - "In this colab we'll cover how Weights and Biases enables regulated entities to \n", + "In this colab we'll cover how Weights and Biases enables regulated entities to\n", "- **Track and version** their data ETL pipelines (locally or in cloud services such as S3 and GCS)\n", - "- **Track experiment results** and store trained models \n", - "- **Visually inspect** multiple evaluation metrics \n", + "- **Track experiment results** and store trained models\n", + "- **Visually inspect** multiple evaluation metrics\n", "- **Optimize performance** with hyperparameter sweeps" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -67,7 +61,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -75,7 +68,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -85,7 +77,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -93,7 +84,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -107,7 +97,7 @@ "outputs": [], "source": [ "!pip install -qq \"wandb>=0.13.10\" dill\n", - "!pip install -qq \"xgboost>=1.7.4\" \"scikit-learn>=1.2.1\"" + "!pip install -qq \"xgboost>=2.0.0\" \"scikit-learn>=1.2.1\"" ] }, { @@ -137,7 +127,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -145,7 +134,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -153,7 +141,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -161,13 +148,12 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Weights and Biases **Artifacts** enable you to log end-to-end training pipelines to ensure your experiments are always reproducible.\n", "\n", - "Data privacy is critical to Weights & Biases and so we support the creation of Artifacts from reference locations such as your own private cloud such as AWS S3 or Google Cloud Storage. Local, on-premises of W&B are also available upon request. \n", + "Data privacy is critical to Weights & Biases and so we support the creation of Artifacts from reference locations such as your own private cloud such as AWS S3 or Google Cloud Storage. 
Local, on-premises of W&B are also available upon request.\n", "\n", "By default, W&B stores artifact files in a private Google Cloud Storage bucket located in the United States. All files are encrypted at rest and in transit. For sensitive files, we recommend a private W&B installation or the use of reference artifacts.\n", "\n", @@ -196,12 +182,11 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Login to W&B\n", - "Login to Weights and Biases " + "Login to Weights and Biases" ] }, { @@ -225,13 +210,12 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Vehicle Loan Dataset\n", "\n", - "We will be using a simplified version of the [Vehicle Loan Default Prediction dataset](https://www.kaggle.com/sneharshinde/ltfs-av-data) from L&T which has been stored in W&B Artifacts. " + "We will be using a simplified version of the [Vehicle Loan Default Prediction dataset](https://www.kaggle.com/sneharshinde/ltfs-av-data) from L&T which has been stored in W&B Artifacts." ] }, { @@ -250,7 +234,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -264,11 +247,10 @@ "outputs": [], "source": [ "def function_to_string(fn):\n", - " return getsource(detect.code(fn)) " + " return getsource(detect.code(fn))" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -287,7 +269,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -314,12 +295,10 @@ "from data_utils import (\n", " describe_data_g_targ,\n", " one_hot_encode_data,\n", - " load_training_data,\n", ")" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -344,7 +323,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -358,13 +336,13 @@ "outputs": [], "source": [ "# Create a new artifact for the processed data, including the function that created it, to Artifacts\n", - "processed_ds_art = wandb.Artifact(name='vehicle_defaults_processed', \n", + "processed_ds_art = wandb.Artifact(name='vehicle_defaults_processed',\n", " type='processed_dataset',\n", " description='One-hot encoded dataset',\n", " metadata={'preprocessing_fn': function_to_string(one_hot_encode_data)}\n", " )\n", "\n", - "# Attach our processed data to the Artifact \n", + "# Attach our processed data to the Artifact\n", "processed_ds_art.add_file(processed_data_path)\n", "\n", "# Log this Artifact to the current wandb run\n", @@ -374,19 +352,18 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Get Train/Validation Split\n", "\n", - "Here we show an alternative pattern for how to create a wandb run object. In the cell below, the code to split the dataset is wrapped with a call to `wandb.init() as run`. \n", + "Here we show an alternative pattern for how to create a wandb run object. 
In the cell below, the code to split the dataset is wrapped with a call to `wandb.init() as run`.\n", "\n", "Here we will:\n", "\n", "- Start a wandb run\n", "- Download our one-hot-encoded dataset from Artifacts\n", - "- Do the Train/Val split and log the params used in the split \n", + "- Do the Train/Val split and log the params used in the split\n", "- Log the new `trndat` and `valdat` datasets to Artifacts\n", "- Finish the wandb run automatically" ] @@ -398,49 +375,48 @@ "outputs": [], "source": [ "with wandb.init(project=WANDB_PROJECT, job_type='train-val-split') as run: # config is optional here\n", - " \n", + "\n", " # Download the subset of the vehicle loan default data from W&B\n", " dataset_art = run.use_artifact('vehicle_defaults_processed:latest', type='processed_dataset')\n", " dataset_dir = dataset_art.download(data_dir)\n", " dataset = pd.read_csv(processed_data_path)\n", - " \n", + "\n", " # Set Split Params\n", " test_size = 0.25\n", " random_state = 42\n", - " \n", + "\n", " # Log the splilt params\n", " run.config.update({'test_size':test_size, 'random_state': random_state})\n", - " \n", + "\n", " # Do the Train/Val Split\n", - " trndat, valdat = model_selection.train_test_split(dataset, test_size=test_size, \n", + " trndat, valdat = model_selection.train_test_split(dataset, test_size=test_size,\n", " random_state=random_state, stratify=dataset[[targ_var]])\n", "\n", " print(f'Train dataset size: {trndat[targ_var].value_counts()} \\n')\n", " print(f'Validation dataset sizeL {valdat[targ_var].value_counts()}')\n", - " \n", + "\n", " # Save split datasets\n", " train_path = data_dir/'train.csv'\n", " val_path = data_dir/'val.csv'\n", " trndat.to_csv(train_path, index=False)\n", " valdat.to_csv(val_path, index=False)\n", - " \n", + "\n", " # Create a new artifact for the processed data, including the function that created it, to Artifacts\n", - " split_ds_art = wandb.Artifact(name='vehicle_defaults_split', \n", + " split_ds_art = wandb.Artifact(name='vehicle_defaults_split',\n", " type='train-val-dataset',\n", " description='Processed dataset split into train and valiation',\n", " metadata={'test_size': test_size, 'random_state': random_state}\n", " )\n", - " \n", - " # Attach our processed data to the Artifact \n", + "\n", + " # Attach our processed data to the Artifact\n", " split_ds_art.add_file(train_path)\n", " split_ds_art.add_file(val_path)\n", - " \n", + "\n", " # Log the Artifact\n", " run.log_artifact(split_ds_art)" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -459,7 +435,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -488,7 +463,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -496,7 +470,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -513,7 +486,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -530,7 +502,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -543,7 +514,7 @@ "metadata": {}, "outputs": [], "source": [ - "base_rate = round(trndict['base_rate'], 6) \n", + "base_rate = round(trndict['base_rate'], 6)\n", "early_stopping_rounds = 40" ] }, @@ -561,7 +532,7 @@ " , 'max_depth': 3\n", " , 'min_child_weight': 100 ## def: 1\n", " , 'n_estimators': 25\n", - " , 'nthread': 24 \n", + " , 'nthread': 24\n", " , 'random_state': 42\n", " , 'reg_alpha': 0\n", " , 'reg_lambda': 0 ## def: 1\n", @@ -571,11 +542,10 @@ ] }, { - 
"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "Log the xgboost training parameters to the W&B run config " + "Log the xgboost training parameters to the W&B run config" ] }, { @@ -589,11 +559,28 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "#### 3) Load the Training Data from W&B Artifacts" + "#### 3) Let's select the data for train/validation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data_dir" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "valdat" ] }, { @@ -602,17 +589,12 @@ "metadata": {}, "outputs": [], "source": [ - "# Load our training data from Artifacts\n", - "trndat, valdat = load_training_data(run=run, data_dir=data_dir, \n", - " artifact_name='vehicle_defaults_split:latest')\n", - "\n", "## Extract target column as a series\n", "y_trn = trndat.loc[:,targ_var].astype(int)\n", "y_val = valdat.loc[:,targ_var].astype(int)" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -627,20 +609,19 @@ "metadata": {}, "outputs": [], "source": [ - "from wandb.xgboost import WandbCallback\n", + "from wandb.integration.xgboost import WandbCallback\n", "\n", "# Initialize the XGBoostClassifier with the WandbCallback\n", - "xgbmodel = xgb.XGBClassifier(**bst_params, \n", + "xgbmodel = xgb.XGBClassifier(**bst_params,\n", " callbacks=[WandbCallback(log_model=True)],\n", " early_stopping_rounds=run.config['early_stopping_rounds'])\n", "\n", "# Train the model\n", - "xgbmodel.fit(trndat[p_vars], y_trn, \n", + "xgbmodel.fit(trndat[p_vars], y_trn,\n", " eval_set=[(valdat[p_vars], y_val)])" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -657,10 +638,10 @@ "\n", "# Get train and validation predictions\n", "trnYpreds = xgbmodel.predict_proba(trndat[p_vars])[:,1]\n", - "valYpreds = xgbmodel.predict_proba(valdat[p_vars])[:,1] \n", + "valYpreds = xgbmodel.predict_proba(valdat[p_vars])[:,1]\n", "\n", "# Log additional Train metrics\n", - "false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(y_trn, trnYpreds) \n", + "false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(y_trn, trnYpreds)\n", "run.summary['train_ks_stat'] = max(true_positive_rate - false_positive_rate)\n", "run.summary['train_auc'] = metrics.auc(false_positive_rate, true_positive_rate)\n", "run.summary['train_log_loss'] = -(y_trn * np.log(trnYpreds) + (1-y_trn) * np.log(1-trnYpreds)).sum() / len(y_trn)\n", @@ -671,12 +652,11 @@ "run.summary[\"val_ks_pval\"] = ks_pval\n", "run.summary[\"val_auc\"] = metrics.roc_auc_score(y_val, valYpreds)\n", "run.summary[\"val_acc_0.5\"] = metrics.accuracy_score(y_val, np.where(valYpreds >= 0.5, 1, 0))\n", - "run.summary[\"val_log_loss\"] = -(y_val * np.log(valYpreds) \n", + "run.summary[\"val_log_loss\"] = -(y_val * np.log(valYpreds)\n", " + (1-y_val) * np.log(1-valYpreds)).sum() / len(y_val)" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -697,14 +677,13 @@ " d +=1\n", " valYpreds_2d = valYpreds_2d[::1, ::d]\n", " y_val_arr = y_val_arr[::d]\n", - " \n", + "\n", "run.log({\"ROC_Curve\" : wandb.plot.roc_curve(y_val_arr, valYpreds_2d.T,\n", " labels=['no_default','loan_default'],\n", " classes_to_plot=[1])})" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -721,7 +700,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", 
"metadata": {}, "source": [ @@ -735,7 +713,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -743,7 +720,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -782,7 +758,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -790,7 +765,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -803,9 +777,9 @@ "metadata": {}, "outputs": [], "source": [ - "def train(): \n", + "def train():\n", " with wandb.init(job_type=\"sweep\") as run:\n", - " \n", + "\n", " bst_params = {\n", " 'objective': 'binary:logistic'\n", " , 'base_score': base_rate\n", @@ -814,34 +788,34 @@ " , 'max_depth': 3\n", " , 'min_child_weight': run.config['min_child_weight']\n", " , 'n_estimators': 25\n", - " , 'nthread': 24 \n", + " , 'nthread': 24\n", " , 'random_state': 42\n", " , 'reg_alpha': 0\n", " , 'reg_lambda': 0 ## def: 1\n", " , 'eval_metric': ['auc', 'logloss']\n", - " , 'tree_method': 'hist' \n", + " , 'tree_method': 'hist'\n", " }\n", - " \n", + "\n", " # Initialize the XGBoostClassifier with the WandbCallback\n", - " xgbmodel = xgb.XGBClassifier(**bst_params, \n", + " xgbmodel = xgb.XGBClassifier(**bst_params,\n", " callbacks=[WandbCallback()],\n", " early_stopping_rounds=run.config['early_stopping_rounds'])\n", "\n", " # Train the model\n", - " xgbmodel.fit(trndat[p_vars], y_trn, \n", + " xgbmodel.fit(trndat[p_vars], y_trn,\n", " eval_set=[(valdat[p_vars], y_val)])\n", "\n", " bstr = xgbmodel.get_booster()\n", "\n", " # Log booster metrics\n", - " run.summary[\"best_ntree_limit\"] = bstr.best_ntree_limit\n", - " \n", + " run.summary[\"best_iteration\"] = bstr.best_iteration\n", + "\n", " # Get train and validation predictions\n", " trnYpreds = xgbmodel.predict_proba(trndat[p_vars])[:,1]\n", - " valYpreds = xgbmodel.predict_proba(valdat[p_vars])[:,1] \n", + " valYpreds = xgbmodel.predict_proba(valdat[p_vars])[:,1]\n", "\n", " # Log additional Train metrics\n", - " false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(y_trn, trnYpreds) \n", + " false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(y_trn, trnYpreds)\n", " run.summary['train_ks_stat'] = max(true_positive_rate - false_positive_rate)\n", " run.summary['train_auc'] = metrics.auc(false_positive_rate, true_positive_rate)\n", " run.summary['train_log_loss'] = -(y_trn * np.log(trnYpreds) + (1-y_trn) * np.log(1-trnYpreds)).sum() / len(y_trn)\n", @@ -852,12 +826,11 @@ " run.summary[\"val_ks_pval\"] = ks_pval\n", " run.summary[\"val_auc\"] = metrics.roc_auc_score(y_val, valYpreds)\n", " run.summary[\"val_acc_0.5\"] = metrics.accuracy_score(y_val, np.where(valYpreds >= 0.5, 1, 0))\n", - " run.summary[\"val_log_loss\"] = -(y_val * np.log(valYpreds) \n", + " run.summary[\"val_log_loss\"] = -(y_val * np.log(valYpreds)\n", " + (1-y_val) * np.log(1-valYpreds)).sum() / len(y_val)" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -870,12 +843,11 @@ "metadata": {}, "outputs": [], "source": [ - "count = 10 # number of runs to execute\n", + "count = 5 # number of runs to execute\n", "wandb.agent(sweep_id, function=train, count=count)" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -890,9 +862,9 @@ "- Fastai\n", "- XGBoost\n", "- Sci-Kit Learn\n", - "- LightGBM \n", + "- LightGBM\n", "\n", - "**See [W&B integrations for details](https://docs.wandb.ai/guides/integrations)** " + "**See [W&B integrations for 
details](https://docs.wandb.ai/guides/integrations)**" ] } ], @@ -909,5 +881,5 @@ } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 0 }
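
---

The code changes in this patch move the callback import from `wandb.xgboost` to `wandb.integration.xgboost`, pin `xgboost>=2.0.0`, and replace the removed `best_ntree_limit` attribute with `best_iteration`. Below is a minimal, self-contained sketch of that post-patch training pattern; the synthetic data, the project name, and the hyperparameter values are illustrative stand-ins, not the notebook's actual configuration.

```python
# Sketch of the post-patch training pattern, assuming wandb>=0.13.10 and
# xgboost>=2.0.0 (where Booster.best_ntree_limit is gone and best_iteration
# is the replacement). Data is synthetic; hyperparameters are illustrative.
import numpy as np
import pandas as pd
import xgboost as xgb
import wandb
from wandb.integration.xgboost import WandbCallback  # import path used by the patch

# Synthetic stand-ins for the notebook's trndat/valdat splits
rng = np.random.default_rng(42)
X_trn = pd.DataFrame(rng.normal(size=(1000, 10)), columns=[f"f{i}" for i in range(10)])
y_trn = rng.integers(0, 2, size=1000)
X_val = pd.DataFrame(rng.normal(size=(250, 10)), columns=[f"f{i}" for i in range(10)])
y_val = rng.integers(0, 2, size=250)

# Project name is a placeholder; the notebook uses its own WANDB_PROJECT value
run = wandb.init(project="vehicle_default_prediction", job_type="train-model")

# Callbacks and early stopping go to the constructor, not to .fit()
xgbmodel = xgb.XGBClassifier(
    objective="binary:logistic",
    n_estimators=25,
    max_depth=3,
    eval_metric=["auc", "logloss"],
    callbacks=[WandbCallback(log_model=True)],
    early_stopping_rounds=40,
)
xgbmodel.fit(X_trn, y_trn, eval_set=[(X_val, y_val)])

# best_iteration replaces the removed best_ntree_limit in xgboost 2.x
run.summary["best_iteration"] = xgbmodel.get_booster().best_iteration
run.finish()
```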
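
The sweep cells keep their structure and only lower the agent's run count from 10 to 5. The notebook's `sweep_config` is not part of this patch, so the sketch below assumes a plausible configuration — the search method, metric, and parameter ranges are illustrative guesses — and reuses the `train()` function defined earlier in the notebook, which reads `min_child_weight` and `early_stopping_rounds` from `run.config`.

```python
import wandb

# Hypothetical sweep configuration: the actual sweep_config is not shown in
# this patch, so method, metric, and parameter ranges here are assumptions.
sweep_config = {
    "method": "random",
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "min_child_weight": {"values": [10, 50, 100, 200]},
        "early_stopping_rounds": {"values": [10, 20, 40]},
    },
}

# Project name is a placeholder for the notebook's WANDB_PROJECT
sweep_id = wandb.sweep(sweep_config, project="vehicle_default_prediction")

# The patch lowers the number of agent runs from 10 to 5;
# `train` is the sweep training function defined earlier in the notebook.
wandb.agent(sweep_id, function=train, count=5)
```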