- Basics of Statistical Learning: decomposition of the error
- The Curse of Dimensionality: adverse phenomena that emerge as the input space becomes high-dimensional
- Addressing the curse: signals
- A distribution $$\nu$$ generating the inputs $$x_{i}$$
- An unknown target function $$f^{*}$$ generating the labels $$y_{i} = f^{*}(x_{i})$$
In order to analyse Machine Learning algorithms and provide guarantees, we need assumptions on both of these objects.
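As a minimal, purely illustrative sketch of this setup (the choices of $$\nu$$, $$f^{*}$$, the dimensions and all names below are placeholders, not anything from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 10, 200                        # input dimension and sample size (arbitrary)
f_star = lambda x: np.sin(x.sum(-1))  # stand-in for the unknown target f*

X = rng.uniform(-1.0, 1.0, size=(n, d))  # x_i drawn i.i.d. from the distribution nu
y = f_star(X)                            # labels y_i = f*(x_i) (noiseless here)
```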
Hypothesis space (synonyms: hypothesis class, function approximation). Examples:
- Polynomials of degree $$k$$
- Neural networks
Complexity measure: a norm or other quantity that can be evaluated on a hypothesis and is meant to divide hypotheses into simple ones and complicated ones, e.g. the number of neurons in a neural network.
Another example: a Sobolev norm.
Learner: tries to find a hypothesis of low complexity.
Idea: we need a notion of comparing or measuring errors quantitatively.
Loss function: for example the squared distance (mean squared error).
Also called energy functions.
The loss is a point-wise measure (it is evaluated at a single sample).
Notion of average:
- Population average: the expectation of this point-wise measure over the data distribution; it tells us how well we will do on unseen data.
- Empirical average: the average over the training samples, commonly named the training loss.
Conclusion: we want to reduce the population error, but we can only work with the empirical error. How do these two notions of average relate to each other?
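A small self-contained sketch of the two averages for one fixed hypothesis, where the population risk is approximated by Monte Carlo on a large fresh sample (the target, the hypothesis and the sizes are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 200
f_star = lambda x: np.sin(x.sum(-1))       # stand-in for the unknown target
X = rng.uniform(-1.0, 1.0, size=(n, d))    # training inputs ~ nu
y = f_star(X)

def risk(f, X, y):
    """Point-wise squared loss averaged over the given pairs (mean squared error)."""
    return np.mean((f(X) - y) ** 2)

f_hyp = lambda x: 0.8 * np.sin(x.sum(-1))  # some fixed hypothesis

empirical = risk(f_hyp, X, y)              # empirical average: the training loss

X_big = rng.uniform(-1.0, 1.0, size=(200_000, d))    # large fresh sample ~ nu
population = risk(f_hyp, X_big, f_star(X_big))       # Monte Carlo estimate of the population risk

print(empirical, population)               # the two averages are close, but not equal
```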
How ML differs from classical statistics: we cannot afford the luxury of fixing the hypothesis in advance. During training we keep updating our initial guess for the hypothesis, so the chosen hypothesis depends on the data.
Training and test errors therefore cannot be related point-wise (at a fixed hypothesis).
Instead of looking at the error of a fixed hypothesis as a random variable, we look at the training and test errors as random functions over the hypothesis class; as objects, as functions, they should be close to each other. This is where Rademacher complexities enter.
This is the supervised learning setting.
Focus on a low-complexity hypothesis class.
Consider an algorithm as an estimator.
Convex optimization: the constrained form (minimizing the empirical risk subject to a convex complexity constraint) is not easy to use directly, which is why we also consider the penalized form.
Penalized form: introduces a Lagrange multiplier so that the constraint becomes part of the optimization objective.
Hyperparameter: $$\lambda$$ now indirectly controls the strength of the regularization.
Interpolation form: popularized by large neural networks. The functions are so expressive that we can fit the data exactly, which means an empirical risk of exactly 0. Strong assumption for this to work: there must be no noise in the labels. This luxury only occurs in certain data sets.
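A small least-squares sketch contrasting the penalized and interpolation forms under an overparameterized linear model with noiseless labels (the value of lam and every name below are illustrative choices, not the lecture's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 200                      # overparameterized: more features than samples
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)          # noiseless linear target

# Penalized form: min_w ||Xw - y||^2 + lam * ||w||^2  (ridge regression)
lam = 0.1
w_pen = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Interpolation form: among all w with Xw = y, take the minimum-norm solution
w_int = np.linalg.pinv(X) @ y       # pseudo-inverse gives the min-norm interpolant

print(np.max(np.abs(X @ w_int - y)))   # ~0: the empirical risk is exactly zero
print(np.mean((X @ w_pen - y) ** 2))   # small but nonzero training error
```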
Idea: the technique is always the same. We have a quantity we want to control, and the way to make progress is always to add and subtract an appropriate intermediate quantity.
We want a guarantee on the performance on the test set once a hypothesis $$\hat{f}$$ has been chosen, i.e. on the excess risk $$ \mathcal{R}(\hat{f}) - \textcolor{Yellow}{\inf_{f \in \mathcal{F}} \mathcal{R}(f)} $$
Infimum error: the best we could do a posteriori. If an oracle could give us as many samples as we want, together with unlimited computational resources, we could effectively select the best hypothesis.
To interpret the difference above, we decompose it into multiple differences.
Add and subtract the best we can do when restricting the complexity, $$ \inf_{f \in \mathcal{F}_\delta}\mathcal{R}(f) $$:
$$ = \big( \mathcal{R}(\hat{f}) - \inf_{f \in \mathcal{F}_\delta}\mathcal{R}(f)\big) + \textcolor{Red}{\big(\inf_{f \in \mathcal{F}_\delta}\mathcal{R}(f) - \inf_{f \in \mathcal{F}} \mathcal{R}(f)\big)} $$
Now we have one object that we can start to interpret. The red term is the difference between the error if we only minimise over hypotheses of complexity at most $$\delta$$ and the error if we minimise over the full class, $$ \inf_{f \in \mathcal{F}} \mathcal{R}(f) $$. In this term there is no empirical information, there are no hats; it is a purely population term. Thus it is completely connected with what we call approximation: if $$\delta$$ grows very large, the term becomes smaller. Therefore it can be conceived as the approximation error, $$ \textcolor{Red}{\big(\inf_{f \in \mathcal{F}_\delta}\mathcal{R}(f) - \inf_{f \in \mathcal{F}} \mathcal{R}(f)\big)} = \epsilon_{appr}. $$
Now we introduce the empirical objective over the "ball": we add and subtract $$\hat{\mathcal{R}}(\hat{f})$$ and the infimum of the training error, $$ \inf_{f \in \mathcal{F}_\delta}\hat{\mathcal{R}}(f) $$. The decomposition becomes
$$ = \textcolor{Green}{\big(\hat{\mathcal{R}}(\hat{f}) - \inf_{f \in \mathcal{F}_\delta}\hat{\mathcal{R}}(f)\big)} + \big(\mathcal{R}(\hat{f}) - \hat{\mathcal{R}}(\hat{f})\big) + \big(\inf_{f \in \mathcal{F}_\delta}\hat{\mathcal{R}}(f) - \inf_{f \in \mathcal{F}_\delta}\mathcal{R}(f)\big) + \textcolor{Red}{\epsilon_{appr}} $$
Interpretation of the terms inhabiting this decomposition. The green term,
$$ \textcolor{Green}{\big(\hat{\mathcal{R}}(\hat{f}) - \inf_{f \in \mathcal{F}_\delta}\hat{\mathcal{R}}(f)\big)} = \epsilon_{opt}, $$
is the training error of our hypothesis minus the best training error in the ball. If we are able to solve the empirical risk minimization problem exactly, this term is zero; if we are good at optimizing, this green term is small.
Interpretation: Optimization error.
Two terms remain:
- $$\mathcal{R}(\hat{f}) - \hat{\mathcal{R}}(\hat{f})$$: a point-wise difference.
- $$\inf_{f \in \mathcal{F}_\delta}\hat{\mathcal{R}}(f) - \inf_{f \in \mathcal{F}_\delta}\mathcal{R}(f)$$
Interpretation: both terms relate the population objective to the training objective.
We can unify these two terms and upper bound them by $$ \textcolor{Cyan}{2\sup_{f \in \mathcal{F}_{\delta}}|\mathcal{R}(f) - \hat{\mathcal{R}}(f)|}, $$ i.e. two times the largest fluctuation between the training and test errors over our "ball" of available hypotheses.
Conceptualization: if we have two functions and we compare their infima, the difference of the infima can always be upper bounded by the supremum of the pointwise difference.
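One way to see this: picking a near-minimizer $$f_{\epsilon}$$ of $$\mathcal{R}$$ over the ball and letting $$\epsilon \to 0$$,
$$ \inf_{f \in \mathcal{F}_\delta} \hat{\mathcal{R}}(f) - \inf_{f \in \mathcal{F}_\delta} \mathcal{R}(f) \leq \hat{\mathcal{R}}(f_{\epsilon}) - \mathcal{R}(f_{\epsilon}) + \epsilon \leq \sup_{f \in \mathcal{F}_\delta} |\hat{\mathcal{R}}(f) - \mathcal{R}(f)| + \epsilon. $$
The point-wise term $$\mathcal{R}(\hat{f}) - \hat{\mathcal{R}}(\hat{f})$$ is bounded by the same supremum directly, since $$\hat{f} \in \mathcal{F}_\delta$$; summing the two bounds gives the factor 2.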
Thus our bound on the excess test error reads
$$ \mathcal{R}(\hat{f}) - \inf_{f \in \mathcal{F}} \mathcal{R}(f) \;\leq\; \textcolor{Green}{\epsilon_{opt}} + \textcolor{Cyan}{2\sup_{f \in \mathcal{F}_{\delta}}|\mathcal{R}(f) - \hat{\mathcal{R}}(f)|} + \textcolor{Red}{\epsilon_{appr}} \;=\; \textcolor{Green}{\epsilon_{opt}} + \textcolor{Cyan}{\epsilon_{stat}} + \textcolor{Red}{\epsilon_{appr}} $$
The error made is the contribution of three different sources of error.
- $$\textcolor{Green}{\epsilon_{opt}}$$ (Optimization error): measures our ability to solve the empirical risk minimization problem efficiently.
- $$\textcolor{Cyan}{\epsilon_{stat}}$$ (Statistical error): penalizes the uniform fluctuation over the "ball" between the population risk (the test error) and the random empirical risk (the training error).
- $$\textcolor{Red}{\epsilon_{appr}}$$ (Approximation error): how well we can approximate the target function $$f^{*}$$ with small complexity.
- $$\textcolor{Yellow}{\inf_{f \in \mathcal{F}} \mathcal{R}(f)} = 0$$ if $$\mathcal{F}$$ is dense, e.g. neural networks with non-polynomial activations (Universal Approximation Theorems).
- Approximation error: exploit as much as we can the prior information we have on the target function when choosing the hypothesis class.
- Statistical error: control the uniform fluctuations between the training and test errors with a reasonable number of samples.
- Optimization error: solve these problems in an efficient way.
Conclusion: if we want to be able to learn in high dimensions, we need to be good at controlling these three errors at the same time.
Appendix questions: Question: Does the empirical error always need to have a convex constraint?
Answer: For the sake of simplicity, always assume the hypothesis space to be convex. In the context of neural networks this amounts to taking the last layer to be extremely wide (as wide as you can). Nevertheless, even if our hypothesis space is not convex, the previous error decomposition still holds. All the presented equations are dimension-free.
There is one hyperparameter, $$\delta$$, in the resulting expression for the decomposition of the error.
Question: How should this hyperparameter be chosen?
Answer: It depends on the dimensionality, but it also depends on finer properties of the functional class.
Question: How do the terms that emerged in our decomposition of the error behave as a function of the dimensionality of the input space?
Dynamic programming is where the term "curse of dimensionality" originates; the modern study of these questions is known as high-dimensional statistics.
Basic Principle of Learning: Interpolation
Interpolation propagates the information observed at the samples to nearby points: predictions are formed from the neighbours.
Similarity: learning by finding patterns similar to those already seen is a principle that suffers a lot in high dimensions.
Lipschitz continuity encapsulates the notion of locality: it is the hypothesis that the function only depends on local information, so the value of the function at one point is close to its value at a neighbouring point.
$$
f: \mathcal{X} \subseteq \mathbb{R}^{d} \to \mathbb{R} \ \text{is } \beta\text{-Lipschitz if} \quad |f(x) - f(x')| \leq \beta \Vert x - x'\Vert \quad \forall x, x' \in \mathcal{X}
$$
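A quick numerical check of this definition: estimating a lower bound on the Lipschitz constant of a test function by the best ratio over random pairs (the function, dimension and sample sizes are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(2)

f = lambda x: np.sin(3.0 * x.sum(-1))   # a smooth test function on R^d
d, m = 5, 2000                          # dimension and number of random pairs

X1 = rng.uniform(-1.0, 1.0, size=(m, d))
X2 = rng.uniform(-1.0, 1.0, size=(m, d))

# |f(x) - f(x')| / ||x - x'|| over random pairs: any such ratio lower bounds beta
ratios = np.abs(f(X1) - f(X2)) / np.linalg.norm(X1 - X2, axis=1)
print(ratios.max())
```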
Statement: if the target $$f^{*}$$ is Lipschitz, we can quantify how well the minimal-Lipschitz interpolant of the samples recovers it.
Setup: data $$ \{ (x_i, f^{*}(x_{i})) \}_{i=1,\dots,n} $$. Our hypothesis space is $$ \mathcal{F} = \{ f: \mathbb{R}^{d} \to \mathbb{R} \ \text{bounded and Lipschitz} \} $$. This space can be shown to be a Banach space, which means it carries a notion of complexity: a norm. Now we can define our estimator.
Estimator: ERM in the interpolation form, $$ \hat{f} = \operatorname{argmin}_{f \in \mathcal{F}} \big\{ \mathrm{Lip}(f) : f(x_i) = f^{*}(x_i) \ \forall i \big\} $$ i.e. among all functions passing through all the points, take one of minimal Lipschitz constant.
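One concrete way to realize such an interpolant (not necessarily the construction used in the lecture) is the McShane–Whitney extension: with $$\beta$$ the smallest Lipschitz constant consistent with the data, $$\hat{f}(x) = \min_i \big( f^{*}(x_i) + \beta \Vert x - x_i\Vert \big)$$ interpolates every sample and is $$\beta$$-Lipschitz. A sketch, with an arbitrary stand-in target:

```python
import numpy as np

rng = np.random.default_rng(3)

# Training data: x_i ~ nu, noiseless labels y_i = f*(x_i)
f_star = lambda x: np.abs(x.sum(-1))      # stand-in Lipschitz target
d, n = 3, 100
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = f_star(X)

# Smallest Lipschitz constant consistent with the data: max pairwise ratio
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(dist, np.inf)
beta = np.max(np.abs(y[:, None] - y[None, :]) / dist)

def f_hat(x):
    """McShane-Whitney extension: a beta-Lipschitz interpolant of the data."""
    return np.min(y + beta * np.linalg.norm(X - x, axis=1))

print(max(abs(f_hat(X[i]) - y[i]) for i in range(n)))  # ~0: passes through all points
```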
Question: How do we compute the error between this estimator and the ground truth?
Pick an arbitrary test point $$x$$ and consider $$|\hat{f}(x) - f^{*}(x)|$$.
Mechanism to compute bounds in machine learning: the add-and-subtract trick.
We consider the point of the point cloud that is closest to $$x$$, call it $$x_{i_0}$$.
Now we add and subtract it, so the previous expression becomes:
$$ |\hat{f}(x) - f^{*}(x)| \leq |\hat{f}(x) - \hat{f}(x_{i_0})| + |\hat{f}(x_{i_0}) - f^{*}(x_{i_0})| + |f^{*}(x_{i_0}) - f^{*}(x)| $$
We need to bound these terms. The middle term,
$$
|\hat{f}(x_{i_0}) - f^{*}(x_{i_0})|,
$$
can be replaced by 0: this point was picked from the training set and, by definition, our interpolant passes through all the training points, so the term vanishes by construction. The last term,
$$
|f^{*}(x_{i_0}) - f^{*}(x)|,
$$
forces us to use the hypothesis on $$f^{*}$$: it is Lipschitz. For the first term we use that $$\hat{f}$$ was chosen to minimize the Lipschitz constant over all interpolants; in particular, since $$f^{*}$$ is itself an interpolant of its own samples, the Lipschitz constant of $$\hat{f}$$ is at most that of $$f^{*}$$. Altogether the sum is $$ \leq 2 \Vert x_{i_0} - x \Vert, $$ two times the distance between $$x_{i_0}$$ and $$x$$ (taking the Lipschitz constant of $$f^{*}$$ to be one).
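Putting the three steps together, with $$x_{i_0}$$ the nearest sample to $$x$$ and $$\mathrm{Lip}(\hat{f}) \leq \mathrm{Lip}(f^{*}) = 1$$:
$$ |\hat{f}(x) - f^{*}(x)| \leq \underbrace{|\hat{f}(x) - \hat{f}(x_{i_0})|}_{\leq \mathrm{Lip}(\hat{f})\, \Vert x - x_{i_0}\Vert} + \underbrace{|\hat{f}(x_{i_0}) - f^{*}(x_{i_0})|}_{= 0 \ \text{(interpolation)}} + \underbrace{|f^{*}(x_{i_0}) - f^{*}(x)|}_{\leq \mathrm{Lip}(f^{*})\, \Vert x - x_{i_0}\Vert} \leq 2 \Vert x - x_{i_0}\Vert. $$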
In conclusion, having fixed the Lipschitz constant to be one, we have $$ \mathbb{E}_{x} |\hat{f}(x) - f^{*}(x)|^{2} \leq 4\, \mathbb{E}_{x} \Vert x - x_{i_0}\Vert^{2}, $$ where $$x_{i_0}$$ is defined as the closest point of the point cloud to $$x$$.
Optimal transport and the exact Wasserstein loss
The term $$4\,\mathbb{E}_{x} \Vert x - x_{i_0}\Vert^{2}$$ is a well-known quantity, the (squared) optimal transport distance: $$ 4\,\mathbb{E}_{x} \Vert x - x_{i_0}\Vert^{2} = 4 W_{2}^{2}(\nu, \hat{\nu}_{n}) $$
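A quick numerical illustration of how this quantity suffers from the curse of dimensionality: at fixed sample size, the mean squared distance to the nearest sample point grows quickly with the dimension (the sizes and the uniform distribution are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 300, 300            # point cloud size and number of fresh test points

for d in (1, 2, 5, 10, 20, 50):
    X = rng.uniform(0.0, 1.0, size=(n, d))    # point cloud ~ nu
    Xt = rng.uniform(0.0, 1.0, size=(m, d))   # fresh draws x ~ nu
    # squared distance from each fresh x to its nearest point x_{i0} in the cloud
    sq = ((Xt[:, None, :] - X[None, :, :]) ** 2).sum(-1).min(axis=1)
    print(d, sq.mean())    # Monte Carlo estimate of E_x ||x - x_{i0}||^2
```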
Wasserstein distance: a distance between two probability distributions (it has, for instance, a closed form between two Gaussians). Given a data distribution $$\nu$$ and a sample of size $$n$$ from it, $$W_{2}(\nu, \hat{\nu}_{n})$$ measures how far the empirical measure $$\hat{\nu}_{n}$$ of the sample is from $$\nu$$.
Gridding the space: this means we would need to evaluate every possible point and find the smallest value. This of course has an exponential dependency on the dimension.
Exponential blow up of complexity. Thus we need to make assumptions.
This is overcome in practice by working with objectives that are nearly convex: they possess no bad local minima.
Instead of finding a global minimum we focus on finding a local minimum.
Local minima are formalized as second-order stationary points.
Question: How hard is it to find a local minimum in high dimensions?
Answer: Much easier than finding a global one, quantitatively in terms of the iteration complexity needed to reach a given error $$\epsilon$$.
Strong assumption of no bad local minima: this might not always be the case.
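For intuition only, a minimal sketch of this kind of procedure: gradient descent with occasional small random perturbations (in the spirit of perturbed gradient descent) on a toy objective with a saddle point. The objective, step size and perturbation schedule are illustrative choices, not the algorithm from the lecture.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy non-convex objective with a saddle at the origin: f(w) = (w0^2 - 1)^2 + w1^2
f = lambda w: (w[0] ** 2 - 1.0) ** 2 + w[1] ** 2
grad = lambda w: np.array([4.0 * w[0] * (w[0] ** 2 - 1.0), 2.0 * w[1]])

w = np.array([0.0, 1.0])   # initial point whose gradient path leads to the saddle (0, 0)
eta, eps = 0.05, 1e-3      # step size and gradient-norm threshold

for t in range(2000):
    g = grad(w)
    if np.linalg.norm(g) < eps:
        # Tiny gradient: possibly a saddle, so add a small random perturbation
        w = w + 0.01 * rng.normal(size=2)
    else:
        w = w - eta * g

print(w, f(w))   # ends near a local minimum w = (+-1, 0), where f is ~0
```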