#[fit] AI 1
#[fit] Learning a Model
- What is $$x$$, $$f$$, $$y$$, and that damned hat?
- The simplest models and evaluating them
- Frequentist Statistics
- Noise and Sampling
- Bootstrap
- SMALL World vs BIG World
- Approximation
- THE REAL WORLD HAS NOISE
- Complexity amongst Models
- Validation and Cross Validation
##[fit] 1. SMALL World
##[fit] BIG World
- Small World: given a map or model of the world, how do we do things in this map?
- BIG World: compares maps or models. Asks: what's the best map?
(The Behaim Globe is 21 inches (51 cm) in diameter and was fashioned from a type of papier-mâché coated with gypsum. (Wikipedia))
#[fit]RISK: What does it mean to FIT?
Minimize distance from the line?
Minimize squared distance from the line. Empirical Risk Minimization.
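As a sketch of what this minimization means: with squared loss, the empirical risk of a hypothesis $$h$$ on the training sample $${\cal D} = \{(x_i, y_i)\}_{i=1}^{N}$$ is

$$R_{\cal D}(h) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - h(x_i) \right)^2,$$

and Empirical Risk Minimization chooses the slope and intercept that make this smallest.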
##[fit]Get intercept
#[fit] HYPOTHESIS SPACES
For example, all polynomials of a given degree or complexity form a hypothesis space:

$$\cal{H}_1: h_1(x) = \theta_0 + \theta_1 x$$

$$\cal{H}_{20}: h_{20}(x) = \sum_{i=0}^{20} \theta_i x^i$$
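As a minimal numpy sketch (hypothetical data, not from the slides), here is the best-fit hypothesis in each of these two spaces on the same sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)   # 30 noisy data points

# Least-squares (empirical risk minimizing) fits within H_1 and H_20
g1 = np.polynomial.Polynomial.fit(x, y, deg=1)        # best line
g20 = np.polynomial.Polynomial.fit(x, y, deg=20)      # best degree-20 polynomial
print(g1(0.5), g20(0.5))                              # both can be evaluated anywhere
```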
Small World answers the question: given a model class (i.e. a Hypothesis space), what's the best model in it? Thus it's looking for a particular best-fit hypothesis $$g$$.

BIG World compares model spaces. It wants to find the true $$f(x)$$, or at least the best hypothesis space in which to approximate it.
Why not test ALL hypothesis spaces?
#[fit] 2. Approximation
Well, usually you are only given a sample. What is it?

It's a set of data points drawn from the population.
If you had the population you could construct many samples of a smaller size by randomly choosing subsamples of such points.
If not you could bootstrap.
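A minimal sketch of the bootstrap with numpy (hypothetical sample, not from the slides): resample your one sample with replacement to mimic drawing fresh samples from the population:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(5.0, 2.0, 30)          # the one sample we were given

# Draw many bootstrap resamples (same size, with replacement) and track
# how a statistic -- here the mean -- varies across them.
boot_means = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                       for _ in range(1000)])
print(boot_means.mean(), boot_means.std())  # point estimate and its spread
```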
A sample of 30 points of data. Which fit is better? The line or the curve?
- Even on a population, there is a distribution $$p(x)$$. There are more young voters in India.
- Sampling: different samples can have varying $$p(x)$$, so we denote the sample's version $$\hat{p}(x)$$.
- Noise comes from measurement error, missing features, etc., a combination of many small things... thus we have a $$p(y | x)$$ even on the population.
- Because only certain values of $$y$$ may have been chosen for a given $$x$$ bin by the sampling process, we have a $$\hat{p}(y | x)$$.
- Mis-specification: the choice of hypothesis set creates bias, which adds to the noise.
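One compact way to summarize the bullets above (a sketch in the notation of these slides): on the population, noise makes $$y$$ a draw rather than a function of $$x$$,

$$y \sim p(y | x), \qquad \text{e.g. } y = f(x) + \epsilon,$$

while a finite sample only gives us the estimates $$\hat{p}(x)$$ and $$\hat{p}(y | x)$$.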
#[fit] 3. THE REAL WORLD #[fit] HAS NOISE
#Statement of the Learning Problem
The sample must be representative of the population!
A: In-sample risk is small.

B: Population, or out-of-sample, risk is WELL estimated by in-sample risk.

Thus the out-of-sample risk is also small.
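In symbols (a sketch; $$R_{\cal D}$$ is the in-sample risk and $$R_{out}$$ the out-of-sample risk of the learned model $$g$$):

$$\text{(A)}\ R_{\cal D}(g)\ \text{is small} \quad \text{and} \quad \text{(B)}\ R_{out}(g) \approx R_{\cal D}(g) \implies R_{out}(g)\ \text{is small.}$$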
Which fit is better now? The line or the curve?
- look at fits on different "training sets" $${\cal D}$$, in other words, different samples
- in real life we are not so lucky, usually we get only one sample
- but let's pretend, shall we? (see the sketch below)
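A small numpy sketch of this "pretend" experiment (hypothetical sine-plus-noise data): draw several training sets $${\cal D}$$ from the same population and watch how much the fits move around:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_sample(n=30):
    """One training set D: noisy points from an underlying sine curve."""
    x = np.sort(rng.uniform(0, 1, n))
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

grid = np.linspace(0, 1, 100)
for deg in (1, 20):
    # fit the same hypothesis space on many different samples D
    fits = [np.polynomial.Polynomial.fit(*draw_sample(), deg=deg)(grid)
            for _ in range(50)]
    # how much do the fitted curves vary from sample to sample?
    print(f"degree {deg}: mean spread across fits = {np.std(fits, axis=0).mean():.3f}")
```

The simple space barely changes from sample to sample (bias), while the complex one swings wildly (variance).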
#UNDERFITTING (Bias) vs OVERFITTING (Variance)
##[fit] 4. Complexity
##[fit] amongst Models
#TRAIN AND TEST
#MODEL COMPARISON: A Large World approach
- want to choose which Hypothesis set is best
- it should be the one that minimizes risk
- but minimizing the training risk alone tells us nothing: an interpolating model can drive it to zero
- we need to minimize the training risk but not at the cost of generalization
- thus only minimize until the test-set risk starts going up (see the sketch below)
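A minimal sketch of this comparison with scikit-learn (assumed library and data; the slides don't prescribe either): minimize the training risk within each $$\cal{H}_d$$ and watch when the test risk starts to rise:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)[:, None]
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 60)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)

for d in range(1, 16):
    g = make_pipeline(PolynomialFeatures(d), LinearRegression()).fit(x_tr, y_tr)
    print(d,
          mean_squared_error(y_tr, g.predict(x_tr)),   # training risk keeps falling
          mean_squared_error(y_te, g.predict(x_te)))   # test risk eventually rises
```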
DATA SIZE MATTERS: straight line fits to a sine curve
Corollary: Must fit simpler models to less data! This will motivate the analysis of learning curves later.
##[fit] 5. Validation and
##[fit] Cross Validation
##[fit] Do we still have a test set?
Trouble:
- no discussion of the error bars on our error estimates
- "visually fitting" a value of
$$d \implies$$ contaminated test set.
The moment we use it in the learning process, it is not a test set.
#[fit]VALIDATION
- train-test is not enough, as we fit for $$d$$ on the test set and contaminate it; thus do train-validate-test
- we wrongly already attempted to fit $$d$$ on our previous test set
- choose the $$d, g^{-*}$$ combination with the lowest validation-set risk
- $$R_{val}(g^{-*}, d^*)$$ has an optimistic bias, since $$d$$ was effectively fit on the validation set
- finally, retrain on the entire train+validation set using the appropriate $$d^*$$
- this works because training a given hypothesis space on more data typically reduces the risk even further (see the sketch below)
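A sketch of the train-validate-test recipe (same assumed sklearn setup as above, hypothetical data): fit each $$g^{-}$$ on the training set only, pick $$d^*$$ on the validation set, then retrain on train+validation:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 90)[:, None]
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 90)

# split off a final test set, then split the rest into train and validation
x_rest, x_test, y_rest, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_tr, x_val, y_tr, y_val = train_test_split(x_rest, y_rest, test_size=0.25, random_state=0)

def fit_poly(d, X, Y):
    return make_pipeline(PolynomialFeatures(d), LinearRegression()).fit(X, Y)

# choose d* as the degree whose g^- has the lowest validation-set risk
val_risk = {d: mean_squared_error(y_val, fit_poly(d, x_tr, y_tr).predict(x_val))
            for d in range(1, 16)}
d_star = min(val_risk, key=val_risk.get)

# retrain on train + validation with d*, then report the untouched test-set risk
g_star = fit_poly(d_star, x_rest, y_rest)
print(d_star, mean_squared_error(y_test, g_star.predict(x_test)))
```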
#[fit]CROSS-VALIDATION
#[fit]CROSS-VALIDATION
#is
- a resampling method
- robust to outlier validation set
- allows for larger training sets
- allows for error estimates
Here we find:

- the validation process can be seen as one that estimates $$R_{out}$$ directly, on the validation set. Its critical use is in the model selection process.
- once you do that, you can estimate $$R_{out}$$ using the test set as usual, but now you have also got the benefit of a robust average and error bars.
- key subtlety: in the risk-averaging process, you are actually averaging over different $$g^-$$ models, with different parameters (see the sketch below).
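A sketch of $$K$$-fold cross-validation for the same degree choice (assumed setup as above): average the validation risk over folds, which also yields an error bar, and note that each fold trains its own $$g^-$$:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 90)[:, None]
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 90)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for d in range(1, 11):
    risks = []
    for tr_idx, val_idx in kf.split(x):
        # each fold fits a different g^- with its own parameters
        g = make_pipeline(PolynomialFeatures(d), LinearRegression()).fit(x[tr_idx], y[tr_idx])
        risks.append(mean_squared_error(y[val_idx], g.predict(x[val_idx])))
    # mean = CV risk estimate for this d; std gives its error bar
    print(d, np.mean(risks), np.std(risks))
```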
We'll see a "small-world" approach to deal with finding the right model, where we'll choose a Hypothesis set that includes very complex models, and then find a way to subset this set.
This method is called
##[fit] Regularization