Chapters 3 & 4: A Formal Learning Model and Learning via Uniform Convergence
PAC (Probably Approximately Correct) learning
A hypothesis class $\mathcal{H}$ is PAC learnable
if there exists a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and a learning algorithm with the
following property: for every $\epsilon, \delta \in (0, 1)$, for every distribution $\mathcal{D}$ over $\mathcal{X}$, and
for every labeling function $f : \mathcal{X} \to \{0, 1\}$, if the realizability assumption holds
with respect to $\mathcal{H}, \mathcal{D}, f$, then when running the learning algorithm on $m \geq m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$ and labeled by $f$, the algorithm returns
a hypothesis $h$ such that, with probability of at least $1 - \delta$ (over the choice of the examples), $\mathcal{L}_{\mathcal{D},f}(h) \leq \epsilon$.
$\epsilon$ - the accuracy parameter (how far the output classifier is from the optimal one) - Approximately.
$\delta$ - the confidence parameter (how likely the classifier is to meet the accuracy requirement) - Probably.
$m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ defines the sample complexity (how many samples are needed).
Corollary
Every finite hypothesis class is PAC learnable with the following sample complexity:
$m_{\mathcal{H}}(\epsilon, \delta) \leq \lceil \frac{\log(|\mathcal{H}|/\delta)}{\epsilon} \rceil$
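As a quick sanity check, here is a minimal Python sketch of this bound (the function name and the numbers plugged in are our own illustrative choices):

```python
import math

def pac_sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Upper bound on m_H(eps, delta) for a finite class under
    realizability: ceil(log(|H| / delta) / eps)."""
    return math.ceil(math.log(h_size / delta) / eps)

# e.g., |H| = 1000 hypotheses, 1% error with 95% confidence:
print(pac_sample_complexity(1000, eps=0.01, delta=0.05))  # 991
```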
However, the realizability assumption quite often does not hold;
labels may not be fully determined by the features we measure;
$\Rightarrow \mathcal{D}$ should be considered as a probability distribution over $\mathcal{X}\times\mathcal{Y}$.
In other words, $\mathcal{D}$ is a joint distribution over domain points and labels.
Now we want the learning algorithm to find a predictor whose error is not much larger than the best possible error of
a predictor in some hypothesis class, which serves as a benchmark.
Agnostic PAC
A hypothesis class $\mathcal{H}$ is Agnostic PAC learnable
if there exists a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and a learning algorithm with the
following property: for every $\epsilon, \delta \in (0, 1)$ and for every distribution $\mathcal{D}$ over $\mathcal{X}\times\mathcal{Y}$, when running the learning algorithm on $m \geq m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$, the algorithm returns
a hypothesis $h$ such that, with probability of at least $1 - \delta$ (over the choice of the examples), $\mathcal{L}_{\mathcal{D}}(h) \leq \min_{h' \in \mathcal{H}} \mathcal{L}_{\mathcal{D}}(h') + \epsilon$.
If realizability holds, Agnostic PAC gives the same guarantee as PAC. When realizability does not hold, no learner can guarantee an arbitrarily small error, but that is still acceptable: the learner can declare success if its error
is not much larger than the best error achievable by a predictor from the class $\mathcal{H}$.
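To make the agnostic setting concrete, here is a minimal ERM sketch in Python; the finite class of threshold classifiers, the 10% label noise, and all constants are illustrative assumptions, not taken from the book:

```python
import random

def erm(thresholds, sample):
    """ERM over the finite class {h_t(x) = 1[x >= t]}: return the
    threshold minimizing the empirical 0-1 loss L_S(h_t)."""
    def empirical_loss(t):
        return sum((x >= t) != (y == 1) for x, y in sample) / len(sample)
    return min(thresholds, key=empirical_loss)

# Joint distribution D over X x Y: the Bayes rule is 1[x >= 0.5], but
# each label is flipped with probability 0.1, so even the best
# hypothesis in the class has true error ~0.1 -- the agnostic setting.
random.seed(0)
sample = []
for _ in range(500):
    x = random.random()
    y = int(x >= 0.5)
    if random.random() < 0.1:
        y = 1 - y  # label noise: realizability fails
    sample.append((x, y))

t_hat = erm([t / 100 for t in range(101)], sample)
print(t_hat)  # close to 0.5, so its true error is close to the best in the class
```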
Uniform convergence
A training set $\mathcal{S}$ is called $\epsilon$-representative
(w.r.t. domain $\mathcal{Z}$, hypothesis class $\mathcal{H}$, loss function $l$, and distribution $\mathcal{D}$) if
$\forall h \in \mathcal{H}, | \mathcal{L}_{\mathcal{S}}(h) - \mathcal{L}_{\mathcal{D}}(h)| \leq \epsilon$
Lemma
Assume that a training set $S$ is $\frac{\epsilon}{2}$-representative (w.r.t. domain $\mathcal{Z}$, hypothesis class $\mathcal{H}$, loss function $l$, and distribution $\mathcal{D}$). Then any output of
$\mathrm{ERM}_{\mathcal{H}}(S)$, namely, any $h_S \in \operatorname{argmin}_{h \in \mathcal{H}} \mathcal{L}_{S}(h)$, satisfies
$\mathcal{L}_{\mathcal{D}}(h_S) \leq \min_{h \in \mathcal{H}} \mathcal{L}_{\mathcal{D}}(h) + \epsilon$.
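The proof is a short chain of inequalities; writing out the standard argument in LaTeX:

```latex
% For every h in H, apply eps/2-representativeness twice and use
% the fact that h_S minimizes the empirical loss L_S:
\mathcal{L}_{\mathcal{D}}(h_S)
  \leq \mathcal{L}_{S}(h_S) + \frac{\epsilon}{2}  % S is eps/2-representative
  \leq \mathcal{L}_{S}(h) + \frac{\epsilon}{2}    % h_S is an empirical minimizer
  \leq \mathcal{L}_{\mathcal{D}}(h) + \frac{\epsilon}{2} + \frac{\epsilon}{2}
  =    \mathcal{L}_{\mathcal{D}}(h) + \epsilon.   % representativeness again
% Since this holds for every h in H, it holds for the best one.
```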
We say that a hypothesis class $\mathcal{H}$ has
the Uniform Convergence property (w.r.t. a domain $\mathcal{Z}$ and a loss function $l$) if
there exists a function $m^{\mathrm{UC}}_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ such that for every $\epsilon, \delta \in (0, 1)$ and for every probability distribution $\mathcal{D}$ over $\mathcal{Z}$, if $\mathcal{S}$ is a sample of $m \geq m^{\mathrm{UC}}_{\mathcal{H}}(\epsilon, \delta)$
examples drawn i.i.d. according to $\mathcal{D}$, then, with probability of at least $1 - \delta$, $\mathcal{S}$ is $\epsilon$-representative.
Similar to the definition of sample complexity for PAC learning, the function $m^{\mathrm{UC}}_{\mathcal{H}}$
measures the (minimal) sample complexity of obtaining the uniform convergence
property, namely, how many examples we need to ensure that, with probability of at least $1 - \delta$, the sample is $\epsilon$-representative.
The term uniform here refers to having a fixed sample size that works for all
members of $\mathcal{H}$ and over all possible probability distributions over the domain.
Corollary
Let $\mathcal{H}$ be a finite hypothesis class, let $\mathcal{Z}$ be a domain, and let $l: \mathcal{H} \times \mathcal{Z} \to [0,1]$ be a loss function. Then $\mathcal{H}$ enjoys the uniform convergence property with sample complexity:
$m^{\mathrm{UC}}_{\mathcal{H}}(\epsilon, \delta) \leq \lceil \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2} \rceil$
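A matching Python sketch of this bound (again with illustrative numbers; note the $\frac{1}{\epsilon^2}$ dependence, compared with the $\frac{1}{\epsilon}$ dependence in the realizable case):

```python
import math

def uc_sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Upper bound on m_H^UC(eps, delta) for a finite class with a
    loss bounded in [0, 1]: ceil(log(2|H| / delta) / (2 * eps^2))."""
    return math.ceil(math.log(2 * h_size / delta) / (2 * eps ** 2))

# Same illustrative |H| = 1000, eps = 0.01, delta = 0.05 as before:
print(uc_sample_complexity(1000, eps=0.01, delta=0.05))  # ~53000
```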
Questions:
Q: How does one choose a good hypothesis class?
A: Choosing a good hypothesis class is undoubtedly a very important question in practical machine learning.
Unfortunately, the best answer to the question is "based on prior knowledge of the problem under consideration".
Obviously, this does not provide any recipe for how to proceed, and it requires from the researcher a deep understanding of
the problem and experience. In a certain sense, choosing a "proper" hypothesis class is more an "art" than a "science".
Q: P49 of the book, Remark 3.2: why might the algorithm return a hypothesis outside of $\mathcal{H}$?
A: The "optimal" solution might lie slightly outside of the hypothesis class. When it is impossible to find a solution within
some constraints, it can be reasonable to relax those constraints; after relaxation, we are more likely to find a solution.
Q: P57, Remark 4.1 ("Discretization Trick"): is it really a trick to solve the problem, or something we have to deal with?
A: First of all, note that the authors of the book put the name of the remark in quotes. Given that, they
probably wanted to draw attention to the fact that, when dealing with practical tasks and computers, we have a useful tool
"to get a very good estimate of the practical sample complexity of infinite hypothesis classes".
Q: The bounds for PAC learnability and agnostic PAC learnability are different. Why, and which part of the assumption makes the difference?
A: If we look at the definitions of PAC and Agnostic PAC learnability, we find that the definitions are very similar, but
they still have some differences, which lead to different bounds. The main difference is, of course, that Agnostic PAC tries to find a hypothesis
whose error is not much larger than the best error achievable by a hypothesis from the class, while PAC tries to find a hypothesis with an arbitrarily small error. Technically, under realizability it suffices to rule out the "bad" hypotheses that happen to be consistent with the sample, which yields a $\frac{1}{\epsilon}$ dependence, while the agnostic analysis goes through uniform convergence (Hoeffding's inequality) and yields a $\frac{1}{\epsilon^2}$ dependence. Because of this difference, the function $m_{\mathcal{H}}(\epsilon, \delta)$ must be
defined differently for PAC and Agnostic PAC; see the numerical comparison below.
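To see the difference numerically, here is a small self-contained comparison of the two bounds, using the same illustrative $|\mathcal{H}| = 1000$ and $\delta = 0.05$ as above:

```python
import math

H_SIZE, DELTA = 1000, 0.05
for eps in (0.1, 0.01, 0.001):
    realizable = math.ceil(math.log(H_SIZE / DELTA) / eps)
    agnostic = math.ceil(math.log(2 * H_SIZE / DELTA) / (2 * eps ** 2))
    print(f"eps={eps}: realizable {realizable}, agnostic {agnostic}")
# Shrinking eps by a factor of 10 grows the realizable bound by 10x,
# but the agnostic bound by 100x.
```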