---
layout: post
title: Learning in directed models
---
We now turn our attention to the third and last part of the course: learning. Given a dataset, we would like to fit a model that will make useful predictions on various tasks that we care about.
A graphical model has two components: the graph structure, and the parameters of the factors induced by this graph. These components lead to two different learning tasks:
- Parameter learning, where the graph structure is known and we want to estimate the factors.
- Structure learning, where we want to estimate the graph, i.e., determine from data how the variables depend on each other.
We are going to first focus on parameter learning, and come back to structure learning in a later chapter.
We will consider parameter learning in both directed and undirected models. It turns out that the former admits easy, closed-form solutions, while the latter involves potentially intractable numerical optimization techniques. In later sections, we will look at latent variable models (which contain unobserved hidden variables that succinctly model the event of interest) as well as Bayesian learning, a general approach to statistics that offers certain advantages over more standard approaches.
Before we start our discussion of learning, let's first reflect on what it means to fit a model, and what a desirable objective for this task might be.
Let's assume that the domain is governed by some underlying distribution $$p^*$$. We are given a dataset $$D$$ of $$m$$ samples from $$p^*$$; the standard assumption is that the data instances are independent and identically distributed (iid). We are also given a family of models $$M$$, and our task is to learn some "good" model in $$M$$ that defines a distribution $$p$$.
The goal of learning is to return a model that precisely captures the distribution $$p^*$$ from which our data was sampled. This is in general not achievable, both because limited data only provides a rough approximation of the true underlying distribution and for computational reasons. Still, we want to somehow select the "best" approximation to the underlying distribution $$p^*$$.
What is "best" in this case? It depends on what we want to do:
- Density estimation: we are interested in the full distribution (so later we can compute whatever conditional probabilities we want)
- Specific prediction tasks: we are using the distribution to make a prediction, e.g., is this email spam or not?
- Structure or knowledge discovery: we are interested in the model itself, e.g., how do some genes interact with each other?
Let's assume that we want to learn the full distribution so that later we can answer any probabilistic inference query. In this setting we can view the learning problem as density estimation. We want to construct a $$p$$ that is as "close" as possible to $$p^*$$. How do we evaluate "closeness"? A natural measure is the KL divergence:
$$ KL(p^* \| p) = \sum_x p^*(x) \log \frac{p^*(x)}{p(x)} = -H(p^*) - \E_{x \sim p^*} [ \log p(x) ]. $$
The first term does not depend on $$p$$; hence, minimizing the KL divergence is equivalent to maximizing the expected log-likelihood $$\E_{x \sim p^*} [ \log p(x) ]$$.
This objective asks that $$p$$ assign high probability to instances sampled from $$p^*$$, so as to reflect the true distribution.
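To make this concrete, here is a minimal numpy sketch of the KL divergence between two discrete distributions; the numbers below are made up for illustration:

```python
import numpy as np

def kl_divergence(p_star, p):
    """KL(p_star || p) = sum_x p_star(x) * log(p_star(x) / p(x))."""
    p_star, p = np.asarray(p_star, float), np.asarray(p, float)
    mask = p_star > 0  # terms with p_star(x) = 0 contribute nothing
    return np.sum(p_star[mask] * np.log(p_star[mask] / p[mask]))

p_star = np.array([0.5, 0.3, 0.2])  # made-up "true" distribution
p      = np.array([0.4, 0.4, 0.2])  # made-up model distribution
print(kl_divergence(p_star, p))       # strictly positive, since p != p_star
print(kl_divergence(p_star, p_star))  # 0: the divergence vanishes iff p = p_star
```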
However, there is still a problem: in general we do not know $$p^*$$. We will thus approximate the expected log-likelihood $$\E_{x \sim p^*} [ \log p(x) ]$$ with the empirical log-likelihood (a Monte-Carlo estimate):

$$ \E_{x \sim p^*} [ \log p(x) ] \approx \frac{1}{|D|} \sum_{x \in D} \log p(x), $$

where $$D$$ is the dataset of iid samples from $$p^*$$.
Maximum likelihood learning is then defined as

$$ \max_{p \in M} \ \frac{1}{|D|} \sum_{x \in D} \log p(x). $$
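Here is a small sketch, on a toy discrete distribution of our choosing, of how the Monte-Carlo estimate above approaches the exact expected log-likelihood as the dataset grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: p_star is the (normally unknown) data distribution over
# three outcomes; p is some candidate model we want to score.
p_star = np.array([0.5, 0.3, 0.2])
p      = np.array([0.4, 0.4, 0.2])

# Draw a dataset D ~ p_star i.i.d., then compare the Monte-Carlo
# estimate (1/|D|) * sum_{x in D} log p(x) with the exact expectation.
D = rng.choice(3, size=10_000, p=p_star)
empirical_ll = np.mean(np.log(p[D]))
expected_ll  = np.sum(p_star * np.log(p))
print(empirical_ll, expected_ll)  # the two agree closely for large |D|
```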
The Monte Carlo samples provide empirical samples from the true data distribution. Suppose we call this empirical distribution $$\hat{p}$$: the distribution that places probability mass $$1/|D|$$ on each sample, i.e., $$\hat{p}(x) = \frac{\#\{x' \in D : x' = x\}}{|D|}$$.
The KL divergence between the density defined by the empirical distribution $$\hat{p}$$ and the model $$p$$ is

$$ KL(\hat{p} \| p) = \E_{x \sim \hat{p}} [ \log \hat{p}(x) ] - \E_{x \sim \hat{p}} [ \log p(x) ]. $$
We want to minimize this divergence. We can ignore the left term (equivalent to $$-H(\hat{p})$$), since it does not depend on $$p$$, which leaves

$$ \min_{p} \ -\E_{x \sim \hat{p}} [ \log p(x) ]. $$
We convert the problem to a maximization problem by swapping the sign:

$$ \max_{p} \ \E_{x \sim \hat{p}} [ \log p(x) ]. $$
Plugging in our definition of the empirical distribution $$\hat{p}$$:

$$ \E_{x \sim \hat{p}} [ \log p(x) ] = \sum_x \hat{p}(x) \log p(x) = \frac{1}{|D|} \sum_x \sum_{x' \in D} \mathbb{1}\{x' = x\} \log p(x). $$
Swapping the order of the two finite sums is always legal, and the sum over $$x$$ collapses onto the samples via the indicator, recovering the empirical log-likelihood

$$ \frac{1}{|D|} \sum_{x' \in D} \log p(x'). $$

Thus, maximum likelihood learning is equivalent to minimizing the KL divergence between the empirical distribution and the model.
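A quick numeric sanity check of this equivalence on a toy dataset: for any model $$p$$, the empirical log-likelihood and $$KL(\hat{p} \| p)$$ sum to the constant $$-H(\hat{p})$$, so raising the log-likelihood is exactly lowering the divergence:

```python
import numpy as np

D = np.array([0, 0, 0, 1, 1, 2])              # toy dataset over {0, 1, 2}
p_hat = np.bincount(D, minlength=3) / len(D)  # empirical distribution

def empirical_ll(p):
    return np.mean(np.log(p[D]))

def kl(q, p):
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

# For every model p, empirical_ll(p) + KL(p_hat || p) = -H(p_hat), a constant.
for p in [np.array([1/3, 1/3, 1/3]), p_hat]:
    print(empirical_ll(p), kl(p_hat, p), empirical_ll(p) + kl(p_hat, p))
```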
As a simple example, consider estimating the outcome of a biased coin. Our dataset is a set of tosses and our task is to estimate the probability of heads/tails on the next flip. We assume that the process is controlled by a probability distribution $$p^*(x)$$, where $$x \in \{h, t\}$$; our class of models $$M$$ is the set of all probability distributions over $$\{h, t\}$$.
How should we choose $$p$$ from $$M$$ if 60 out of 100 tosses are heads? Let's assume that $$p(x = h) = \theta$$ and $$p(x = t) = 1 - \theta$$; intuitively, the best guess is $$\theta = 0.6$$, and maximum likelihood learning agrees.
More generally, our log-likelihood function is simply

$$ L(\theta) = \#\text{heads} \cdot \log(\theta) + \#\text{tails} \cdot \log(1 - \theta), $$
for which the optimal solution is

$$ \theta^* = \frac{\#\text{heads}}{\#\text{heads} + \#\text{tails}}. $$
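The whole example fits in a few lines of numpy; the grid search below simply confirms the closed-form answer:

```python
import numpy as np

heads, tails = 60, 40  # the observed tosses from the example

# Evaluate L(theta) = #heads*log(theta) + #tails*log(1-theta) on a grid
# and check that the maximizer matches the closed-form solution.
thetas = np.linspace(0.01, 0.99, 99)
log_lik = heads * np.log(thetas) + tails * np.log(1 - thetas)
print(thetas[np.argmax(log_lik)])  # ~ 0.6
print(heads / (heads + tails))     # 0.6, the closed-form MLE
```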
We may now generalize this by introducing the concept of a loss function. A loss function $$L(x, p)$$ measures the loss that a model distribution $$p$$ incurs on a particular instance $$x$$. Assuming instances are sampled from $$p^*$$, our goal is to find the model that minimizes the expected loss, or risk,

$$ \E_{x \sim p^*} [ L(x, p) ] \approx \frac{1}{|D|} \sum_{x \in D} L(x, p). $$
Notice that the loss function which corresponds to maximum likelihood estimation is the log loss, $$-\log p(x)$$.
Another example of a loss is the conditional log-likelihood. Suppose we want to predict a set of variables $$y$$ given another set $$x$$. We can concentrate on predicting the conditional $$p(y \mid x)$$ and use the conditional loss $$L(x, y, p) = -\log p(y \mid x)$$. Since this loss only depends on $$p(y \mid x)$$, it suffices to estimate the conditional distribution, not the full joint.
Suppose next that our ultimate goal is structured prediction, i.e., given $$x$$ we predict $$y$$ via $$\arg\max_y p(y \mid x)$$. What loss function should we use to measure error in this setting?
One reasonable choice would be the classification error:

$$ \E_{(x, y) \sim p^*} \left[ \mathbb{1}\{ \exists y' \neq y : p(y' \mid x) \geq p(y \mid x) \} \right], $$
which is the probability over all $$(x, y)$$ pairs sampled from $$p^*$$ that we predict the wrong assignment.
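As an illustration, the sketch below estimates this classification error for a made-up joint $$p^*(x, y)$$ and a made-up model $$p(y \mid x)$$, predicting via $$\arg\max_y p(y \mid x)$$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up true joint p_star(x, y) over binary x and y (rows indexed by x),
# and a made-up model conditional p(y|x) used for prediction.
p_star_xy = np.array([[0.3, 0.1],
                      [0.2, 0.4]])
model_y_given_x = np.array([[0.7, 0.3],
                            [0.4, 0.6]])

# Sample (x, y) pairs from p_star, predict y_hat = argmax_y p(y|x),
# and estimate the probability of a wrong assignment.
flat = rng.choice(4, size=10_000, p=p_star_xy.ravel())
xs, ys = np.unravel_index(flat, p_star_xy.shape)
y_hat = np.argmax(model_y_given_x[xs], axis=1)
print(np.mean(y_hat != ys))  # Monte-Carlo estimate of the classification error
```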
The moral of the story here is that it often makes sense to choose a loss that is appropriate to the task at hand, e.g., prediction rather than full density estimation.
Empirical risk minimization can easily overfit the data. The data we have is only a sample, and there is usually a vast number of samples that we have never seen. Our model should generalize well to these unseen samples.
Thus, we typically restrict the hypothesis space of distributions that we search over. If the hypothesis space is very limited, it might not be able to represent $$p^*$$, even with unlimited data. This type of limitation is called bias, as the learner is limited in how closely it can approximate the target distribution.
If we select a highly expressive hypothesis class, we might be able to represent the data better. However, when we have a small amount of data, multiple models can fit it well, some even better than the true model. Moreover, small perturbations of $$D$$ will result in very different estimates. This limitation is called variance.
Thus, there is an inherent bias-variance trade-off when selecting the hypothesis class. One of the main challenges of machine learning is to choose a model that is sufficiently rich to be useful, yet not so complex as to overfit the training set.
High bias can be avoided by increasing the capacity of the model. High variance, in turn, can be reduced in several ways.
We may impose hard constraints, e.g., by selecting a less expressive hypothesis class, such as Bayesian networks with at most $$d$$ parents per variable. We may also introduce a soft preference for simpler models, e.g., via a regularization term added to the objective.
At training time, we minimize the empirical loss

$$ \frac{1}{|D|} \sum_{x \in D} L(x, p). $$
However, we are actually interested in minimizing the expected loss under the true distribution,

$$ \E_{x \sim p^*} [ L(x, p) ]. $$
We cannot guarantee with certainty the quality of our learned model. This is because the data $$D$$ is sampled stochastically from $$p^*$$, and we might get an unlucky sample; the best we can hope for are guarantees that hold with high probability over the choice of the dataset.
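The gap between the two losses is easy to see numerically. In the toy sketch below (where, unlike in practice, we know $$p^*$$ and can evaluate the expected loss exactly), a model fit on a small sample typically looks better on its own training data than it really is:

```python
import numpy as np

rng = np.random.default_rng(0)

p_star = np.array([0.5, 0.3, 0.2])            # true distribution (known only in this toy)
D = rng.choice(3, size=20, p=p_star)          # a small training set
p_hat = np.bincount(D, minlength=3) / len(D)  # maximum-likelihood fit
p_hat = np.clip(p_hat, 1e-12, None)           # guard against log(0) for unseen outcomes

train_loss = -np.mean(np.log(p_hat[D]))       # empirical log loss on D
true_loss  = -np.sum(p_star * np.log(p_hat))  # expected log loss under p_star
print(train_loss, true_loss)  # the training loss is typically optimistic
```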
Let us now apply this long discussion to a particular problem of interest: parameter learning in Bayesian networks.
Suppose that we are given a Bayesian network $$p(x) = \prod_{i=1}^n \theta_{x_i \mid x_{pa(i)}}$$, where the parameters $$\theta_{x_i \mid x_{pa(i)}}$$ are the entries of the CPD tables, together with a dataset $$D = \{x^{(1)}, \ldots, x^{(m)}\}$$ of iid samples. The likelihood of the data is

$$ L(\theta; D) = \prod_{j=1}^m \prod_{i=1}^n \theta_{x_i^{(j)} \mid x_{pa(i)}^{(j)}}. $$
Taking logs and combining like terms, this becomes

$$ \log L(\theta; D) = \sum_{i=1}^n \sum_{x_{pa(i)}} \sum_{x_i} \#(x_i, x_{pa(i)}) \cdot \log \theta_{x_i \mid x_{pa(i)}}, $$

where $$\#(x_i, x_{pa(i)})$$ denotes the number of times the joint assignment $$(x_i, x_{pa(i)})$$ appears in $$D$$.
Thus, maximization of the (log) likelihood function decomposes into separate maximizations for the local conditional distributions! This is essentially the same as the heads/tails example we saw earlier (except with more categories). It's a simple calculus exercise to formally show that
$$ \theta^*_{x_i \mid x_{pa(i)}} = \frac{\#(x_i, x_{pa(i)})}{\#(x_{pa(i)})}. $$
We thus conclude that in Bayesian networks with discrete variables, the maximum-likelihood estimate has a closed-form solution. Even when the variables are not discrete, the task remains simple: the log-likelihood decomposes into a sum of terms, one per factor, so each local conditional distribution can be estimated separately. The simplicity of learning is one of the most convenient features of Bayesian networks.
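Here is a minimal sketch of these counting estimates for a hypothetical two-node network $$A \rightarrow B$$; the variable names and data are made up for illustration:

```python
from collections import Counter

# Samples of (A, B) for a hypothetical network A -> B; both variables
# are binary and the data below is invented for illustration.
data = [(0, 0), (0, 1), (0, 1), (1, 1), (1, 1), (1, 0)]

pair_counts   = Counter(data)                # #(a, b)
parent_counts = Counter(a for a, _ in data)  # #(a)

# theta_{b|a} = #(a, b) / #(a): the closed-form MLE derived above.
theta = {(a, b): pair_counts[(a, b)] / parent_counts[a]
         for (a, b) in pair_counts}
print(theta)  # e.g. theta[(0, 1)] = #(A=0, B=1) / #(A=0) = 2/3
```

Note that pure maximum likelihood assigns probability zero to any configuration never seen in the data; adding pseudocounts to the counts above is a common fix, and it has a natural justification in the Bayesian framework mentioned earlier.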