
Minor changes throughout the book while preparing for my grad course #249

Open
wants to merge 64 commits into base: master

Commits (64)
958327e
whitespace dt
gwtaylor Sep 1, 2017
e46096e
fix typos dt
gwtaylor Sep 1, 2017
ed2566b
whitespace formal
gwtaylor Sep 1, 2017
a4cb0e0
fix typo formal
gwtaylor Sep 1, 2017
c6bd5fa
whitespace knn
gwtaylor Sep 1, 2017
63a03ca
fix typo knn
gwtaylor Sep 1, 2017
c9dcd68
remove auto-generated file
gwtaylor Sep 2, 2017
bc5422e
move fig knn_classifyit onto next page
gwtaylor Sep 2, 2017
f557b76
move thinkaboutit under fig knn_classifyit
gwtaylor Sep 2, 2017
c6a8bcf
change baseline for better fit of thinkaboutit
gwtaylor Sep 2, 2017
23be973
remove figure caption from \mathreview boxes
gwtaylor Sep 2, 2017
2449ae0
fix overlap problem between first two ski figs
gwtaylor Sep 2, 2017
8ea6414
add missing but referenced fig knn:uniform
gwtaylor Sep 2, 2017
2d6be51
whitespace perc
gwtaylor Sep 2, 2017
a81bf79
fix fig reference
gwtaylor Sep 2, 2017
1b976c6
remove spurious second reference to figure
gwtaylor Sep 2, 2017
47989d1
fix typos perc
gwtaylor Sep 3, 2017
367455e
remove spurious square
gwtaylor Sep 3, 2017
12263df
whitespace prac
gwtaylor Sep 3, 2017
2f85965
fix typos prac
gwtaylor Sep 3, 2017
fd792e2
add star to f in approximation error
gwtaylor Sep 3, 2017
6db7522
whitespace bias
gwtaylor Sep 3, 2017
484bb07
typos bias
gwtaylor Sep 3, 2017
ecc14a0
in this section calling hypothesis class F
gwtaylor Sep 3, 2017
b271a7c
whitespace complex
gwtaylor Sep 3, 2017
7caee1a
typos complex
gwtaylor Sep 3, 2017
63fae58
whitespace loss
gwtaylor Sep 3, 2017
d7acbad
typos loss
gwtaylor Sep 3, 2017
461b670
regularizer on the weight, not the input
gwtaylor Sep 3, 2017
e49d5b9
term in exp gets big and negative, exp goes to zero
gwtaylor Sep 3, 2017
1afb587
manipulating regularized hinge loss
gwtaylor Sep 3, 2017
a1890b3
whitespace prob
gwtaylor Sep 3, 2017
ba3dcc6
typos prob
gwtaylor Sep 3, 2017
1e85005
whitespace nnet
gwtaylor Sep 3, 2017
c0393c1
typos nnet
gwtaylor Sep 3, 2017
1ba9147
whitespace kernel
gwtaylor Sep 3, 2017
3ee01a5
ddot to dotp
gwtaylor Sep 3, 2017
3d0d973
typos kernel
gwtaylor Sep 3, 2017
6a78179
whitespace thy
gwtaylor Sep 9, 2017
5b05965
typos thy
gwtaylor Sep 9, 2017
8e58de9
whitespace ens
gwtaylor Sep 9, 2017
e320d4a
typos ens
gwtaylor Sep 9, 2017
799423f
whitespace opt
gwtaylor Sep 9, 2017
22e2523
typos opt
gwtaylor Sep 9, 2017
8ead692
whitespace unsup
gwtaylor Sep 9, 2017
be6956f
typos unsup
gwtaylor Sep 9, 2017
98e0387
whitespace em
gwtaylor Sep 9, 2017
735e165
remove redundant header (see footer)
gwtaylor Sep 9, 2017
c0badcf
fix em equation references
gwtaylor Sep 9, 2017
db3f10c
typos em
gwtaylor Sep 9, 2017
c9a2954
whitespace srl
gwtaylor Sep 9, 2017
072bd88
fix figure srl:trellis file name and ref
gwtaylor Sep 9, 2017
fa1cd67
typos srl
gwtaylor Sep 9, 2017
2688e03
fix notation in gradient of structured hinge loss
gwtaylor Sep 9, 2017
0917de8
whitespace imit
gwtaylor Sep 9, 2017
a8d2e79
typos imit
gwtaylor Sep 9, 2017
4d1b5ba
fix typo in dt
gwtaylor Sep 9, 2017
046c9e7
fix typo in dt
gwtaylor Sep 9, 2017
f8262e9
fix typo formal
gwtaylor Sep 9, 2017
f2ba28e
think about it p. 104 move up
Sep 5, 2017
febfe3a
typo fix
Sep 5, 2017
d1dd4ca
pushed figure 8.1 and 8.2 up on the page
Sep 5, 2017
efa41a8
moved fig5.12 up on page
Sep 5, 2017
e50366b
moved fig 5.1 up on page
Sep 5, 2017
26 changes: 13 additions & 13 deletions book/bias.tex
@@ -3,7 +3,7 @@ \chapter{Bias and Fairness} \label{sec:bias}
\chapterquote{Science and everyday life cannot\\and should not be separated.}{Rosalind~Franklin}

\begin{learningobjectives}
\item
\item
\end{learningobjectives}

\dependencies{\chref{sec:dt},\chref{sec:knn},\chref{sec:perc},\chref{sec:prac}}
@@ -83,7 +83,7 @@ \section{Unsupervised Adaptation}
All examples are drawn according to some fixed base distribution $\Dbase$.
Some of these are selected to go into the new distribution, and some of them are selected to go into the old distribution.
The mechanism for deciding which ones are kept and which are thrown out is governed by a \emph{selection variable}, which we call $s$.
The choice of selection-or-not, $s$, is based \emph{only} on the input example $\vx$ and not on it's label.\thinkaboutit{What could go wrong if $s$ got to look at the label, too?}
The choice of selection-or-not, $s$, is based \emph{only} on the input example $\vx$ and not on its label.\thinkaboutit{What could go wrong if $s$ got to look at the label, too?}
In particular, we define:
~
\begin{align}
@@ -171,7 +171,7 @@ \section{Supervised Adaptation}
{bias:easyadapt}%
{\FUN{EasyAdapt}(\VAR{$\langle (\vxold_n,\yold_n) \rangle_{n=1}^N$}, \VAR{$\langle (\vxnew_m, \ynew_m) \rangle_{m=1}^M$}, \VAR{$\cA$})}
{
\SETST{$D$}{$\left\langle ( \langle \VARm{\vxold_n}, \VARm{\vxold_n}, \vec 0 \rangle, \VARm{\yold_n} ) \right\rangle_{\VARm{n}=1}^{\VARm{N}}
\SETST{$D$}{$\left\langle ( \langle \VARm{\vxold_n}, \VARm{\vxold_n}, \vec 0 \rangle, \VARm{\yold_n} ) \right\rangle_{\VARm{n}=1}^{\VARm{N}}
\bigcup
\left\langle ( \langle \VARm{\vxnew_m}, \vec 0, \VARm{\vxnew_m} \rangle, \VARm{\ynew_m} ) \right\rangle_{\VARm{m}=1}^{\VARm{M}} $}
\COMMENT{union} \\ \COMMENT{of transformed data}
@@ -183,7 +183,7 @@ \section{Supervised Adaptation}
Although this approach is general, it is most effective when the two distributions are ``not too close but not too far'':
\begin{itemize}
\item If the distributions are too far, and there's little information to share, you're probably better off throwing out the old distribution data and training just on the (untransformed) new distribution data.
\item If the distributions are too close, then you might as well just take the union of the (untransformed) old and new distribution data, and training on that.
\item If the distributions are too close, then you might as well just take the union of the (untransformed) old and new distribution data, and train on that.
\end{itemize}
In general, the interplay between how far the distributions are and how much new distribution data you have is complex, and you should always try ``old only'' and ``new only'' and ``simple union'' as baselines.
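
For concreteness, a minimal sketch of the feature-augmentation step that EasyAdapt performs is below, assuming the old and new examples live in NumPy arrays of shape (N, D) and (M, D); the array and function names are illustrative only. The stacked result, together with the concatenated labels, can be handed to any off-the-shelf learner $\cA$:

    import numpy as np

    def easyadapt_augment(X_old, X_new):
        # Old examples map to <x, x, 0>; new examples map to <x, 0, x>.
        N, D = X_old.shape
        M, _ = X_new.shape
        A_old = np.hstack([X_old, X_old, np.zeros((N, D))])
        A_new = np.hstack([X_new, np.zeros((M, D)), X_new])
        # Labels are simply concatenated in the same order: y_old then y_new.
        return np.vstack([A_old, A_new])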

@@ -212,8 +212,8 @@ \section{Fairness and Data Bias}
Informally, the 80\% rule says that your rate of hiring women (for instance) must be at least 80\% of your rate of hiring men.
Formally, the rule states:
\begin{align}
\Pr(y = +1 \| \text{G} \neq \text{male})
& \geq 0.8 ~\times~ \Pr(y = +1 \| \text{G} = \text{male})
\Pr(y = +1 \| \text{G} \neq \text{male})
& \geq 0.8 ~\times~ \Pr(y = +1 \| \text{G} = \text{male})
\end{align}
Of course, gender/male can be replaced with any other protected attribute.
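
Purely as an illustration (the array names and the boolean encoding of the protected attribute are assumptions, not anything from the book), the check might look like:

    import numpy as np

    def passes_80_percent_rule(y_pred, is_male):
        # Pr(y = +1 | G != male) >= 0.8 * Pr(y = +1 | G = male) ?
        rate_other = np.mean(y_pred[~is_male] == +1)
        rate_male = np.mean(y_pred[is_male] == +1)
        return rate_other >= 0.8 * rate_male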

@@ -243,7 +243,7 @@ \section{How Badly can it Go?}
%
The question is: how badly can $f$ do on the new distribution?

We can calculate this directly.
We can calculate this directly.
%
\begin{align}
& \ep\xth{new} \nonumber \\
@@ -283,15 +283,15 @@ \section{How Badly can it Go?}
The core idea is that if we're learning a function $f$ from some hypothesis class $\cF$, and this hypothesis class isn't rich enough to peek at the 29th decimal digit of feature 1, then perhaps things are not as bad as they could be.
This motivates the idea of looking at a measure of distance between probability distributions that \emph{depends on the hypothesis class}.
A popular measure is the \concept{$d_\cA$-distance} or the \concept{discrepancy}.
The discrepancy measure distances between probability distributions based on how much two function $f$ and $f'$ in the hypothesis class can disagree on their labels.
The discrepancy measure distances between probability distributions based on how much two functions $f$ and $f'$ in the hypothesis class can disagree on their labels.
Let:
%
\begin{align}
\ep_P(f,f')
&= \Ep_{\vx \sim P} \Big[ \Ind[ f(\vx) \neq f'(\vx) ] \Big]
\end{align}
%
You can think of $\ep_P(f,f')$ as the \emph{error} of $f'$ when the ground truth is given by $f$, where the error is taken with repsect to examples drawn from $P$.
You can think of $\ep_P(f,f')$ as the \emph{error} of $f'$ when the ground truth is given by $f$, where the error is taken with respect to examples drawn from $P$.
Given a hypothesis class $\cF$, the discrepancy between $P$ and $Q$ is defined as:
%
\begin{align}
@@ -304,7 +304,7 @@ \section{How Badly can it Go?}

One very attractive property of the discrepancy is that you can estimate it from finite \emph{unlabeled} samples from $\Dold$ and $\Dnew$.
Although not obvious at first, the discrepancy is very closely related to a quantity we saw earlier in unsupervised adaptation: a classifier that distinguishes between $\Dold$ and $\Dnew$.
In fact, the discrepancy is precisely twice the \emph{accuracy} of the best classifier from $\cH$ at separating $\Dold$ from $\Dnew$.
In fact, the discrepancy is precisely twice the \emph{accuracy} of the best classifier from $\cF$ at separating $\Dold$ from $\Dnew$.

How does this work in practice?
Exactly as in the section on unsupervised adaptation, we train a classifier to distinguish between $\Dold$ and $\Dnew$.
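
A sketch of that recipe, under assumed array names and with scikit-learn used only as one possible off-the-shelf learner: label old-distribution examples 0 and new-distribution examples 1, then see how well a held-out classifier separates them.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def domain_separability(X_old, X_new):
        # Train a classifier to tell old (0) and new (1) examples apart.
        X = np.vstack([X_old, X_new])
        s = np.concatenate([np.zeros(len(X_old)), np.ones(len(X_new))])
        X_tr, X_te, s_tr, s_te = train_test_split(X, s, test_size=0.3, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)
        # Accuracy near 0.5: the distributions look alike; near 1.0: far apart.
        return clf.score(X_te, s_te)
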
@@ -324,7 +324,7 @@ \section{How Badly can it Go?}
%
\begin{align}
\underbrace{\ep\xth{new}(f)}_{\textrm{error on } \Dnew}
&\leq
&\leq
\underbrace{\ep\xth{old}(f)}_{\textrm{error on } \Dold} +
\underbrace{\ep\xth{best}}_{\textrm{minimal avg error}} +
\underbrace{d_\cA(\Dold,\Dnew)}_{\textrm{distance}}
@@ -342,7 +342,7 @@ \section{Further Reading}
TODO further reading


%%% Local Variables:
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "courseml"
%%% End:
%%% End:
16 changes: 8 additions & 8 deletions book/complex.tex
@@ -182,8 +182,8 @@ \section{Learning with Imbalanced Data} \label{sec:imbalanced}
that distribution. We will compute the expected error $\ep^w$ of
$f$ on the weighted problem:
\begin{align}
\ep^w
&= \Ep_{(\vx,y) \sim \cD^w}
\ep^w
&= \Ep_{(\vx,y) \sim \cD^w}
\Big[ \al^{y=1} \big[f(\vx) \neq y\big] \Big] \\
&= \sum_{\vx \in \cX} \sum_{y \in \pm 1}
\cD^w(\vx,y) \al^{y=1} \big[f(\vx) \neq y\big] \\
@@ -312,7 +312,7 @@ \section{Multiclass Classification}
Algorithms~\ref{alg:complex:ovatrain} and \ref{alg:complex:ovatest}.
In the testing procedure, the prediction of the $i$th classifier is
added to the overall score for class $i$. Thus, if the prediction is
positive, class $i$ gets a vote; if the prdiction is negative,
positive, class $i$ gets a vote; if the prediction is negative,
everyone else (implicitly) gets a vote. (In fact, if your learning
algorithm can output a confidence, as discussed in Section~\ref{}, you
can often do better by using the confidence as $y$, rather than a
@@ -533,7 +533,7 @@ \section{Ranking}
a large number of documents, somehow assimilating the preference
function into an overall permutation.

For notationally simplicity, let $\vx_{nij}$ denote the features
For notational simplicity, let $\vx_{nij}$ denote the features
associated with comparing document $i$ to document $j$ on query $n$.
Training is fairly straightforward. For every $n$ and every pair $i
\neq j$, we will create a binary classification example based on
@@ -603,7 +603,7 @@ \section{Ranking}
Second, rather than producing a list of scores and then calling an
arbitrary sorting algorithm, you can actually use the preference
function as the sorting function inside your own implementation of
quicksort.
quicksort.
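
A sketch of that idea, where prefers(a, b) is a hypothetical wrapper around the learned binary classifier that returns True when document a should be ranked ahead of document b:

    def sort_by_preference(docs, prefers):
        # Quicksort-style ordering driven directly by the preference function.
        if len(docs) <= 1:
            return list(docs)
        pivot, rest = docs[0], docs[1:]
        ahead = [d for d in rest if prefers(d, pivot)]
        behind = [d for d in rest if not prefers(d, pivot)]
        return sort_by_preference(ahead, prefers) + [pivot] + sort_by_preference(behind, prefers)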

We can now formalize the problem. Define a ranking as a function
$\si$ that maps the objects we are ranking (documents) to the desired
@@ -825,7 +825,7 @@ \section{Further Reading}
% \learningproblem{Collective Classification}{
% \item An input space $\cX$ and number of classes $K$
% \item An unknown distribution $\cD$ over $\cG(\cX\times[K])$
% }{A function $f : \cG(\cX) \fto \cG([K])$ minimizing:
% }{A function $f : \cG(\cX) \fto \cG([K])$ minimizing:
% $\Ep_{(V,E) \sim \cD} \left[
% \sum_{v \in V} \big[ \hat y_v \neq y_v \big]
% \right]$, where $y_v$ is the label associated with vertex $v$ in $G$
@@ -950,7 +950,7 @@ \section{Further Reading}
% ensure that your predictions at the $k$th layer are indicative of how
% well the algorithm will actually do at test time.

%%% Local Variables:
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "courseml"
%%% End:
%%% End:
29 changes: 0 additions & 29 deletions book/courseml.lot

This file was deleted.

22 changes: 11 additions & 11 deletions book/dt.tex
@@ -226,7 +226,7 @@ \section{The Decision Tree Model of Learning}
You want to find a feature that is \emph{most useful} in helping you
guess whether this student will enjoy this course.
A useful way to think about this is to look at the \concept{histogram}
of labels for each feature.
of labels for each feature.
\sidenote{A
colleague related the story of getting his 8-year old nephew to
guess a number between 1 and 100. His nephew's first four questions
@@ -247,12 +247,12 @@ \section{The Decision Tree Model of Learning}
like this course.
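
As a rough sketch of this histogram-and-guess idea (assuming binary features and labels in {+1, -1} held in NumPy arrays; the names are illustrative only), a feature can be scored by how many training labels the majority guess in each of its two answer sets would get right:

    import numpy as np

    def score_binary_feature(feature_values, labels):
        # Split examples into the NO (0) and YES (1) sets for this feature,
        # then count how many labels the majority guess in each set gets right.
        correct = 0
        for value in (0, 1):
            bucket = labels[feature_values == value]
            if len(bucket) > 0:
                correct += max(np.sum(bucket == +1), np.sum(bucket == -1))
        return correct  # higher means a more useful question to ask first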

More formally, you will consider each feature in turn. You might
consider the feature ``Is this a System's course?'' This feature has
two possible value: no and yes. Some of the training examples have an
consider the feature ``Is this a Systems course?'' This feature has
two possible values: no and yes. Some of the training examples have an
answer of ``no'' -- let's call that the ``NO'' set. Some of the
training examples have an answer of ``yes'' -- let's call that the
``YES'' set. For each set (NO and YES) we will build a histogram over
the labels. This is the second histogram in
the labels. This is the fourth histogram (from the top) in
Figure~\ref{fig:dt_histogram}. Now, suppose you were to ask this
question on a random example and observe a value of ``no.'' Further
suppose that you must \emph{immediately} guess the label for this
@@ -414,7 +414,7 @@ \section{Formalizing the Learning Problem}
Note that the loss function is something that \emph{you} must decide
on based on the goals of learning.

\begin{mathreview}{Expectated Values}
\begin{mathreview}{Expected Values}
We write $\Ep_{(\vx,y) \sim \cD} [ \ell(y, f(\vx)) ]$ for the expected loss. Expectation means ``average.'' This is saying ``if you drew a bunch of $(x,y)$ pairs independently at random from $\cD$, what would your \emph{average} loss be?% (More formally, what would be the average of $\ell(y,f(\vx))$ be over these random draws?)
More formally, if $\cD$ is a discrete probability distribution, then this expectation can be expanded as:
%
Expand All @@ -426,12 +426,12 @@ \section{Formalizing the Learning Problem}
If $D$ is a \emph{finite discrete distribution}, for instance defined by a finite data set $\{ (\vx_1,y_1), \dots, (\vx_N,y_N)$ that puts equal weight on each example (probability $1/N$), then we get:
%
\begin{align}
\Ep_{(\vx,y) \sim D} [ \ell(y, f(\vx)) ]
\Ep_{(\vx,y) \sim D} [ \ell(y, f(\vx)) ]
&= \sum_{(\vx,y) \in D} [ D(\vx,y) \ell(y, f(\vx)) ]
\becauseof{definition of expectation}\\
&= \sum_{n=1}^N [ D(\vx_n,y_n) \ell(y_n, f(\vx_n)) ]
\becauseof{$D$ is discrete and finite}\\
&= \sum_{n=1}^N [ \frac 1 N \ell(y_n, f(\vx_n)) ]
&= \sum_{n=1}^N [ \frac 1 N \ell(y_n, f(\vx_n)) ]
\becauseof{definition of $D$}\\
&= \frac 1 N \sum_{n=1}^N [ \ell(y_n, f(\vx_n)) ]
\becauseof{rearranging terms}
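
A tiny numerical illustration of that last line, with a hypothetical zero/one loss on four examples:

    import numpy as np

    y_true = np.array([+1, -1, +1, +1])
    y_pred = np.array([+1, +1, +1, -1])
    zero_one = lambda y, yhat: float(y != yhat)
    # (1/N) * sum_n loss(y_n, f(x_n)): the expectation under a uniform finite D.
    avg_loss = np.mean([zero_one(y, yh) for y, yh in zip(y_true, y_pred)])  # 0.5
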
@@ -501,7 +501,7 @@ \section{Formalizing the Learning Problem}
$\hat \vx$ to corresponding prediction $\hat y$. The key property
that $f$ should obey is that it should do well (as measured by $\ell$)
on future examples that are \emph{also} drawn from $\cD$. Formally,
it's \concept{expected loss} $\ep$ over $\cD$ with repsect to $\ell$
its \concept{expected loss} $\ep$ over $\cD$ with respect to $\ell$
should be as small as possible:
\begin{align} \label{eq:expectederror}
\ep
@@ -542,7 +542,7 @@ \section{Formalizing the Learning Problem}
\concept{generalize} beyond the training data to some future data that
it might not have seen yet!

So, putting it all together, we get a formal definition of induction
So, putting it all together, we get a formal definition of induction in
machine learning: \bigemph{Given (i) a loss function $\ell$ and (ii) a
sample $D$ from some unknown distribution $\cD$, you must compute a
function $f$ that has low expected error $\ep$ over $\cD$ with
@@ -624,7 +624,7 @@ \section{Further Reading}
\end{comment}


%%% Local Variables:
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "courseml"
%%% End:
%%% End: