Commit: Generalization
s2t2 committed Sep 17, 2024
1 parent 49a6e6d commit c82b365
Showing 7 changed files with 481 additions and 20 deletions.
3 changes: 3 additions & 0 deletions docs/_quarto.yml
@@ -262,6 +262,9 @@ website:
text: "Machine Learning Foundations"
contents:

- section:
href: notes/predictive-modeling/ml-foundations/generalization.qmd
text: "Generalization"
- section:
href: notes/predictive-modeling/ml-foundations/data-encoding.qmd
text: "Data Encoding"
Binary file added docs/images/aws-underfitting-overfitting.png
20 changes: 15 additions & 5 deletions docs/notes/predictive-modeling/ml-foundations/data-encoding.qmd
@@ -6,22 +6,32 @@ When preparing features (`x` values) for training machine learning models, the m
So if we have categorical or textual data, we will need to use a **data encoding** strategy to represent the data in a different way.


use either an ordinal or one-hot encoding strategy, depending on whether there is a certain ordered relationship.
For categorical data, we'll use either an ordinal or one-hot encoding strategy, depending on whether there is a certain ordered relationship in the data or not.


## Ordinal Encoding
## Ordinal Encoding for Categorical Data

If the data has an inherent order, where one category means more or less than another, then we will convert the categories into a linear range of numbered values.
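
For example, a minimal sketch (assuming a hypothetical `df` with a `"size"` column) might map each category to its position in the ordered scale:

```python
import pandas as pd

# hypothetical example data, for illustration only
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# map each category to its place in the ordered scale
size_order = {"small": 1, "medium": 2, "large": 3}
df["size_encoded"] = df["size"].map(size_order)
df
```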


## Time Step Encoding
## Time Step Encoding for Time-series Data

When dealing with time series data, we need to convert the dates to numbers. We can create an ordered list of time step integers, where 1 represents the first time point, 2 represents the second time point, etc.
When dealing with time-series data, we need to convert the dates to numbers.

We can take advantage of the linear nature of time and model dates as integer time steps. For example, we can start at one for the earliest data point and increment by one with each subsequent data point. This assumes our observations are recorded at uniform time intervals (i.e. daily, monthly, annual frequency, etc.).

To create an ordered list of time step integers, we sort our dataset by date in ascending order, putting the earliest date first. Then we add a column of integers incrementing from one to the length of the dataset:

```python
df.sort_values(by="date", ascending=True, inplace=True)

df["time_step"] = range(1, len(df) + 1)
df
```



## One-hot Encoding
## One-hot Encoding for Categorical Data


If the data is truly categorical, where there is no ordinal relationship present, we will perform "one-hot" encoding.
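
As a quick sketch (assuming a hypothetical `df` with a `"color"` column), one common approach is the [`get_dummies` function](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) from `pandas`, which creates one binary column per category:

```python
import pandas as pd

# hypothetical example data, for illustration only
df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# create one binary (0/1) column per category
pd.get_dummies(df["color"], prefix="color")
```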
189 changes: 189 additions & 0 deletions docs/notes/predictive-modeling/ml-foundations/generalization.ipynb
@@ -0,0 +1,189 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Generalization\n",
"\n",
"\n",
"In machine learning, **generalization** refers to the model's ability to perform well on unseen data, rather than simply memorizing the patterns in the training set. A model that generalizes well captures the underlying structure of the data and performs accurately when exposed to new inputs. Achieving good generalization is one of the primary goals in machine learning.\n",
"\n",
"Let's discuss key concepts related to generalization, including the trade-off between overfitting and underfitting, the importance of splitting datasets into training and testing sets, and the role of cross-validation in evaluating model performance.\n",
"\n",
"## Overfitting vs Underfitting\n",
"\n",
"![Three different models, illustrating trade-offs between overfitting and underfitting. Source: [`sklearn` package](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html)](../../../images/sklearn-underfitting-overfitting.png)\n",
"\n",
"### Overfitting\n",
"\n",
"**Overfitting** occurs when a model is too complex and learns not only the underlying patterns but also the noise and random fluctuations in the training data. An overfitted model performs very well on the training data but fails to generalize to unseen data. This results in poor performance on the test set, as the model struggles to adapt to new inputs that do not perfectly match the training data.\n",
"\n",
"In technical terms, overfitting happens when a model has low bias but high variance. The model fits the training data very closely, but any small changes in input data lead to significant variations in the output predictions.\n",
"\n",
"Common causes of overfitting include:\n",
"\n",
" + Using a model that is too complex for the given data (e.g., deep neural networks on small datasets).\n",
" + Training the model for too long without proper regularization.\n",
" + Using too many features or irrelevant features.\n",
"\n",
"Symptoms of overfitting:\n",
"\n",
" + Very low training error, but significantly higher error on the validation or test set.\n",
" + High variance in performance across different subsets of the data.\n",
"\n",
"\n",
"### Underfitting\n",
"\n",
"**Underfitting** occurs when a model is too simple to capture the underlying structure of the data. An underfitted model performs poorly both on the training data and the test data because it fails to learn the important relationships between input features and output labels.\n",
"\n",
"In technical terms, underfitting happens when a model has high bias but low variance. The model is too rigid, making overly simplistic predictions that do not adequately capture the complexities of the data.\n",
"\n",
"Common causes of underfitting include:\n",
"\n",
" + Using a model that is too simple for the task at hand (e.g., linear regression for non-linear data).\n",
" + Not training the model long enough or with sufficient data.\n",
" + Using too few features or ignoring important features.\n",
"\n",
"Symptoms of underfitting:\n",
"\n",
" + High error on both the training set and the test set.\n",
" + The model makes simplistic predictions that fail to capture the complexity of the data.\n",
"\n",
"### Finding a Balance\n",
"\n",
"The goal in predictive modeling is to find a model that strikes a balance between overfitting and underfitting. This balance is achieved by using appropriate model complexity, proper data preprocessing, and regularization techniques. A model that generalizes well will have low error on both the training and testing datasets.\n",
"\n",
"![Three different models, illustrating trade-offs between overfitting and underfitting. Source: [AWS Machine Learning](https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html)](../../../images/aws-underfitting-overfitting.png)\n",
"\n",
"\n",
"\n",
"\n",
"Additional resources about generalization:\n",
"\n",
" + <https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html>\n",
" + <https://developers.google.com/machine-learning/crash-course/overfitting/generalization>\n",
" + <https://developers.google.com/machine-learning/crash-course/overfitting/overfitting>\n",
"\n",
"\n",
"## Data Splitting Strategies\n",
"\n",
"When building a machine learning model, it is important to evaluate its performance on data that the model has not seen during training. This ensures that the model is not overfitting and can generalize to new data.\n",
"\n",
"### Two-way Splits\n",
"\n",
"To keep some of the data unseen, we split the available data into training and testing datasets:\n",
"\n",
" + **Training set**: This is the portion of the data that the model is trained on. The model learns patterns and relationships in the data using this set.\n",
" + **Testing set**: This is a separate portion of the data that the model has never seen before. After training the model, the test set is used to evaluate its generalization ability.\n",
"\n",
"A common strategy for splitting the data is the train-test split, where a portion of the data (often 70-80%) is reserved for training, and the remaining (20-30%) is used for testing. This approach allows us to estimate the model's performance on unseen data.\n",
"\n",
"![Two-way split (training and test sets). Source: [Google ML Concepts](https://developers.google.com/machine-learning/crash-course/overfitting/dividing-datasets).](../../../images/partition-two-sets.png)\n",
"\n",
"\n",
"### Three-way Splits\n",
"\n",
"In practice, we often use a **validation set** in addition to the training and test sets, particularly when fine-tuning a model's hyperparameters. The validation set allows us to adjust and optimize the model's hyperparameters across multiple runs without ever exposing the model to the test set. This reduces the risk of overfitting to the test data.\n",
"\n",
"![Three-way split (training, validation, and test sets). Source: [Google ML Concepts](https://developers.google.com/machine-learning/crash-course/overfitting/dividing-datasets).](../../../images/partition-three-sets.png)\n",
"\n",
"After training the model on the training data, we evaluate its performance on the validation set. This process can be repeated iteratively, adjusting hyperparameters and retraining the model until the performance is satisfactory.\n",
"\n",
"![Workflow using validation set. Source: [Google ML Concepts](https://developers.google.com/machine-learning/crash-course/overfitting/dividing-datasets).](../../../images/workflow-with-validation-set.svg)\n",
"\n",
"Once we believe the model is well-optimized, we use the test set to evaluate its true generalization ability on unseen data. By limiting the model's exposure to the test set until the final evaluation, we ensure that the test results provide an unbiased estimate of real-world performance.\n",
"\n",
"\n",
"\n",
"### Cross Validation\n",
"\n",
"With **cross validation**, instead of relying on a single training or validation set, we use multiple validation sets to improve the model's robustness and reduce the risk of overfitting.\n",
"\n",
"![K-fold cross validation (k=4). Source: [Google ML Concepts](https://developers.google.com/machine-learning/glossary#k-fold-cross-validation).](../../../images/k-fold-cross-validation.png)\n",
"\n",
"\n",
"The dataset is divided into several folds (commonly called **K-fold cross-validation**), and the model is trained and validated on different subsets of the data in each iteration. This provides a more comprehensive understanding of the model’s performance across various data splits, making it less sensitive to any specific partitioning.\n",
"\n",
"\n",
"Cross validation is especially valuable when fine-tuning model hyperparameters, as it prevents overfitting to a specific validation set or the test set by providing a more generalized evaluation before the final test set assessment.\n",
"\n",
"\n",
"## Data Splitting Methods\n",
"\n",
"This section provides some practical methods for splitting data in Python.\n",
"\n",
"\n",
"### Shuffled Splits\n",
"\n",
"In most machine learning problems, we typically perform a shuffled split, where the order of the data is randomized before partitioning it into training and testing sets. This helps ensure that the distribution in the training set closely resembles that of the test set, which reduces potential biases.\n",
"\n",
"One common way of implementing a shuffled two-way split is to leverage the [`train_test_split` function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from `sklearn`:\n",
"\n",
"\n",
"```python\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=99)\n",
"print(\"TRAIN:\", x_train.shape, y_train.shape)\n",
"print(\"TEST:\", x_test.shape, y_test.shape)\n",
"```\n",
"\n",
"When using the `train_test_split` function, we pass in the features (`x`) and labels (`y`), specify a `test_size` either as a fraction (e.g. 0.2 for 20% of the data) or as an absolute number of samples. We also supply a `random_state` to enable reproducibility. As a result, we obtain four different datasets: features and labels for training and testing, respectively.\n",
"\n",
"\n",
":::{.callout-note title=\"Reproducibility\"}\n",
"The `random_state` parameter ensures that the same random shuffling and splitting occurs every time you run the code. You can choose any integer, but once it's set, subsequent executions will produce the same split. This enables consistent, reproducible results, and allows us to more accurately compare model performance across multiple runs. Without consistency of splits, results may differ slightly due to random variations in the data split, potentially confounding differences between runs and leading to misleading model evaluations.\n",
":::\n",
"\n",
"\n",
"\n",
"\n",
"### Sequential Splits for Time-series Forecasting\n",
"\n",
"Most of the time we want to shuffle the data when splitting, however this may not be the case with time-series data.\n",
"\n",
"If we shuffle the data when performing a train/test split for time-series forecasting, several critical issues arise due to the nature of time-dependent data:\n",
"\n",
" + **Data Leakage**: Shuffling can lead to training on future data points and testing on past ones, which is unrealistic in real-world forecasting. This would allow the model to \"see the future,\" resulting in overly optimistic performance during evaluation. In practice, you'll never have access to future data when making predictions​.\n",
"\n",
" + **Loss of Temporal Structure**: Time series data inherently depends on the order of observations. Shuffling breaks the sequence and removes temporal relationships, leading the model to learn patterns that don't reflect how time-dependent data actually behaves. This can distort predictions and diminish the model's forecasting ability.\n",
"\n",
" + **Unreliable Performance Metrics**: If the model is trained on future data, performance metrics like accuracy or RMSE will be unrealistically high, but once deployed, the model's performance will significantly degrade as it won't have access to future data in a real-time scenario​.\n",
"\n",
"In short, shuffling time series data before splitting leads to unrealistic results and invalidates the model's ability to generalize properly. The correct approach is to split based on time (e.g., using methods like time-based cross-validation or time series splits), ensuring that the training set only contains past data relative to the test set.\n"
],
"id": "b1afba72"
},
{
"cell_type": "code",
"metadata": {},
"source": [
"print(len(df))\n",
"\n",
"training_size = round(len(df) * .8)\n",
"print(training_size)\n",
"\n",
"x_train = x.iloc[:training_size] # slice all before\n",
"y_train = y.iloc[:training_size] # slice all before\n",
"\n",
"x_test = x.iloc[training_size:] # slice all after\n",
"y_test = y.iloc[training_size:] # slice all after\n",
"print(\"TRAIN:\", x_train.shape)\n",
"print(\"TEST:\", x_test.shape)"
],
"id": "f938279b",
"execution_count": null,
"outputs": []
},
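{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a sketch of time-based cross validation, `sklearn` provides a [`TimeSeriesSplit` class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), where each successive fold trains on earlier rows and validates on the rows that immediately follow (assuming `x` is the time-sorted feature set from above):\n"
],
"id": "a7c1e2d9"
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from sklearn.model_selection import TimeSeriesSplit\n",
"\n",
"# each fold trains on an expanding window of past rows and validates on the rows that follow\n",
"tss = TimeSeriesSplit(n_splits=4)\n",
"for fold, (train_index, val_index) in enumerate(tss.split(x), start=1):\n",
"    print(f\"FOLD {fold} -- TRAIN ROWS: 0 to {train_index.max()}, VAL ROWS: {val_index.min()} to {val_index.max()}\")"
],
"id": "b8d2f3ea",
"execution_count": null,
"outputs": []
}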
],
"metadata": {
"kernelspec": {
"name": "python3",
"language": "python",
"display_name": "Python 3 (ipykernel)"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

0 comments on commit c82b365
