
Minor changes throughout the book while preparing for my grad course #249

Open
wants to merge 64 commits into base: master

Commits (64)
958327e
whitespace dt
gwtaylor Sep 1, 2017
e46096e
fix typos dt
gwtaylor Sep 1, 2017
ed2566b
whitespace formal
gwtaylor Sep 1, 2017
a4cb0e0
fix typo formal
gwtaylor Sep 1, 2017
c6bd5fa
whitespace knn
gwtaylor Sep 1, 2017
63a03ca
fix typo knn
gwtaylor Sep 1, 2017
c9dcd68
remove auto-generated file
gwtaylor Sep 2, 2017
bc5422e
move fig knn_classifyit onto next page
gwtaylor Sep 2, 2017
f557b76
move thinkaboutit under fig knn_classifyit
gwtaylor Sep 2, 2017
c6a8bcf
change baseline for better fit of thinkaboutit
gwtaylor Sep 2, 2017
23be973
remove figure caption from \mathreview boxes
gwtaylor Sep 2, 2017
2449ae0
fix overlap problem between first two ski figs
gwtaylor Sep 2, 2017
8ea6414
add missing but referenced fig knn:uniform
gwtaylor Sep 2, 2017
2d6be51
whitespace perc
gwtaylor Sep 2, 2017
a81bf79
fix fig reference
gwtaylor Sep 2, 2017
1b976c6
remove spurious second reference to figure
gwtaylor Sep 2, 2017
47989d1
fix typos perc
gwtaylor Sep 3, 2017
367455e
remove spurious square
gwtaylor Sep 3, 2017
12263df
whitespace prac
gwtaylor Sep 3, 2017
2f85965
fix typos prac
gwtaylor Sep 3, 2017
fd792e2
add star to f in approximation error
gwtaylor Sep 3, 2017
6db7522
whitespace bias
gwtaylor Sep 3, 2017
484bb07
typos bias
gwtaylor Sep 3, 2017
ecc14a0
in this section calling hypothesis class F
gwtaylor Sep 3, 2017
b271a7c
whitespace complex
gwtaylor Sep 3, 2017
7caee1a
typos complex
gwtaylor Sep 3, 2017
63fae58
whitespace loss
gwtaylor Sep 3, 2017
d7acbad
typos loss
gwtaylor Sep 3, 2017
461b670
regularizer on the weight, not the input
gwtaylor Sep 3, 2017
e49d5b9
term in exp gets big and negative, exp goes to zero
gwtaylor Sep 3, 2017
1afb587
manipulating regularized hinge loss
gwtaylor Sep 3, 2017
a1890b3
whitespace prob
gwtaylor Sep 3, 2017
ba3dcc6
typos prob
gwtaylor Sep 3, 2017
1e85005
whitespace nnet
gwtaylor Sep 3, 2017
c0393c1
typos nnet
gwtaylor Sep 3, 2017
1ba9147
whitespace kernel
gwtaylor Sep 3, 2017
3ee01a5
ddot to dotp
gwtaylor Sep 3, 2017
3d0d973
typos kernel
gwtaylor Sep 3, 2017
6a78179
whitespace thy
gwtaylor Sep 9, 2017
5b05965
typos thy
gwtaylor Sep 9, 2017
8e58de9
whitespace ens
gwtaylor Sep 9, 2017
e320d4a
typos ens
gwtaylor Sep 9, 2017
799423f
whitespace opt
gwtaylor Sep 9, 2017
22e2523
typos opt
gwtaylor Sep 9, 2017
8ead692
whitespace unsup
gwtaylor Sep 9, 2017
be6956f
typos unsup
gwtaylor Sep 9, 2017
98e0387
whitespace em
gwtaylor Sep 9, 2017
735e165
remove redundant header (see footer)
gwtaylor Sep 9, 2017
c0badcf
fix em equation references
gwtaylor Sep 9, 2017
db3f10c
typos em
gwtaylor Sep 9, 2017
c9a2954
whitespace srl
gwtaylor Sep 9, 2017
072bd88
fix figure srl:trellis file name and ref
gwtaylor Sep 9, 2017
fa1cd67
typos srl
gwtaylor Sep 9, 2017
2688e03
fix notation in gradient of structured hinge loss
gwtaylor Sep 9, 2017
0917de8
whitespace imit
gwtaylor Sep 9, 2017
a8d2e79
typos imit
gwtaylor Sep 9, 2017
4d1b5ba
fix typo in dt
gwtaylor Sep 9, 2017
046c9e7
fix typo in dt
gwtaylor Sep 9, 2017
f8262e9
fix typo formal
gwtaylor Sep 9, 2017
f2ba28e
think about it p. 104 move up
Sep 5, 2017
febfe3a
typo fix
Sep 5, 2017
d1dd4ca
pushed figure 8.1 and 8.2 up on the page
Sep 5, 2017
efa41a8
moved fig5.12 up on page
Sep 5, 2017
e50366b
moved fig 5.1 up on page
Sep 5, 2017
26 changes: 13 additions & 13 deletions book/bias.tex
@@ -3,7 +3,7 @@ \chapter{Bias and Fairness} \label{sec:bias}
\chapterquote{Science and everyday life cannot\\and should not be separated.}{Rosalind~Franklin}

\begin{learningobjectives}
\item
\item
\end{learningobjectives}

\dependencies{\chref{sec:dt},\chref{sec:knn},\chref{sec:perc},\chref{sec:prac}}
@@ -83,7 +83,7 @@ \section{Unsupervised Adaptation}
All examples are drawn according to some fixed base distribution $\Dbase$.
Some of these are selected to go into the new distribution, and some of them are selected to go into the old distribution.
The mechanism for deciding which ones are kept and which are thrown out is governed by a \emph{selection variable}, which we call $s$.
The choice of selection-or-not, $s$, is based \emph{only} on the input example $\vx$ and not on it's label.\thinkaboutit{What could go wrong if $s$ got to look at the label, too?}
The choice of selection-or-not, $s$, is based \emph{only} on the input example $\vx$ and not on its label.\thinkaboutit{What could go wrong if $s$ got to look at the label, too?}
In particular, we define:
~
\begin{align}
@@ -171,7 +171,7 @@ \section{Supervised Adaptation}
{bias:easyadapt}%
{\FUN{EasyAdapt}(\VAR{$\langle (\vxold_n,\yold_n) \rangle_{n=1}^N$}, \VAR{$\langle (\vxnew_m, \ynew_m) \rangle_{m=1}^M$}, \VAR{$\cA$})}
{
\SETST{$D$}{$\left\langle ( \langle \VARm{\vxold_n}, \VARm{\vxold_n}, \vec 0 \rangle, \VARm{\yold_n} ) \right\rangle_{\VARm{n}=1}^{\VARm{N}}
\SETST{$D$}{$\left\langle ( \langle \VARm{\vxold_n}, \VARm{\vxold_n}, \vec 0 \rangle, \VARm{\yold_n} ) \right\rangle_{\VARm{n}=1}^{\VARm{N}}
\bigcup
\left\langle ( \langle \VARm{\vxnew_m}, \vec 0, \VARm{\vxnew_m} \rangle, \VARm{\ynew_m} ) \right\rangle_{\VARm{m}=1}^{\VARm{M}} $}
\COMMENT{union} \\ \COMMENT{of transformed data}
@@ -183,7 +183,7 @@ \section{Supervised Adaptation}
Although this approach is general, it is most effective when the two distributions are ``not too close but not too far'':
\begin{itemize}
\item If the distributions are too far, and there's little information to share, you're probably better off throwing out the old distribution data and training just on the (untransformed) new distribution data.
\item If the distributions are too close, then you might as well just take the union of the (untransformed) old and new distribution data, and training on that.
\item If the distributions are too close, then you might as well just take the union of the (untransformed) old and new distribution data, and train on that.
\end{itemize}
In general, the interplay between how far the distributions are and how much new distribution data you have is complex, and you should always try ``old only'' and ``new only'' and ``simple union'' as baselines.
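
For concreteness, a minimal sketch of the feature-augmentation step that EasyAdapt performs is below, assuming the old and new examples live in NumPy arrays of shape (N, D) and (M, D); the array and function names are illustrative only. The stacked result, together with the concatenated labels, can be handed to any off-the-shelf learner $\cA$:

    import numpy as np

    def easyadapt_augment(X_old, X_new):
        # Old examples map to <x, x, 0>; new examples map to <x, 0, x>.
        N, D = X_old.shape
        M, _ = X_new.shape
        A_old = np.hstack([X_old, X_old, np.zeros((N, D))])
        A_new = np.hstack([X_new, np.zeros((M, D)), X_new])
        # Labels are simply concatenated in the same order: y_old then y_new.
        return np.vstack([A_old, A_new])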

@@ -212,8 +212,8 @@ \section{Fairness and Data Bias}
Informally, the 80\% rule says that your rate of hiring women (for instance) must be at least 80\% of your rate of hiring men.
Formally, the rule states:
\begin{align}
\Pr(y = +1 \| \text{G} \neq \text{male})
& \geq 0.8 ~\times~ \Pr(y = +1 \| \text{G} = \text{male})
\Pr(y = +1 \| \text{G} \neq \text{male})
& \geq 0.8 ~\times~ \Pr(y = +1 \| \text{G} = \text{male})
\end{align}
Of course, gender/male can be replaced with any other protected attribute.
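
Purely as an illustration (the array names and the boolean encoding of the protected attribute are assumptions, not anything from the book), the check might look like:

    import numpy as np

    def passes_80_percent_rule(y_pred, is_male):
        # Pr(y = +1 | G != male) >= 0.8 * Pr(y = +1 | G = male) ?
        rate_other = np.mean(y_pred[~is_male] == +1)
        rate_male = np.mean(y_pred[is_male] == +1)
        return rate_other >= 0.8 * rate_male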

@@ -243,7 +243,7 @@ \section{How Badly can it Go?}
%
The question is: how badly can $f$ do on the new distribution?

We can calculate this directly.
We can calculate this directly.
%
\begin{align}
& \ep\xth{new} \nonumber \\
@@ -283,15 +283,15 @@ \section{How Badly can it Go?}
The core idea is that if we're learning a function $f$ from some hypothesis class $\cF$, and this hypothesis class isn't rich enough to peek at the 29th decimal digit of feature 1, then perhaps things are not as bad as they could be.
This motivates the idea of looking at a measure of distance between probability distributions that \emph{depends on the hypothesis class}.
A popular measure is the \concept{$d_\cA$-distance} or the \concept{discrepancy}.
The discrepancy measure distances between probability distributions based on how much two function $f$ and $f'$ in the hypothesis class can disagree on their labels.
The discrepancy measure distances between probability distributions based on how much two functions $f$ and $f'$ in the hypothesis class can disagree on their labels.
Let:
%
\begin{align}
\ep_P(f,f')
&= \Ep_{\vx \sim P} \Big[ \Ind[ f(\vx) \neq f'(\vx) ] \Big]
\end{align}
%
You can think of $\ep_P(f,f')$ as the \emph{error} of $f'$ when the ground truth is given by $f$, where the error is taken with repsect to examples drawn from $P$.
You can think of $\ep_P(f,f')$ as the \emph{error} of $f'$ when the ground truth is given by $f$, where the error is taken with respect to examples drawn from $P$.
Given a hypothesis class $\cF$, the discrepancy between $P$ and $Q$ is defined as:
%
\begin{align}
@@ -304,7 +304,7 @@ \section{How Badly can it Go?}

One very attractive property of the discrepancy is that you can estimate it from finite \emph{unlabeled} samples from $\Dold$ and $\Dnew$.
Although not obvious at first, the discrepancy is very closely related to a quantity we saw earlier in unsupervised adaptation: a classifier that distinguishes between $\Dold$ and $\Dnew$.
In fact, the discrepancy is precisely twice the \emph{accuracy} of the best classifier from $\cH$ at separating $\Dold$ from $\Dnew$.
In fact, the discrepancy is precisely twice the \emph{accuracy} of the best classifier from $\cF$ at separating $\Dold$ from $\Dnew$.

How does this work in practice?
Exactly as in the section on unsupervised adaptation, we train a classifier to distinguish between $\Dold$ and $\Dnew$.
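
A sketch of that recipe, under assumed array names and with scikit-learn used only as one possible off-the-shelf learner: label old-distribution examples 0 and new-distribution examples 1, then see how well a held-out classifier separates them.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def domain_separability(X_old, X_new):
        # Train a classifier to tell old (0) and new (1) examples apart.
        X = np.vstack([X_old, X_new])
        s = np.concatenate([np.zeros(len(X_old)), np.ones(len(X_new))])
        X_tr, X_te, s_tr, s_te = train_test_split(X, s, test_size=0.3, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)
        # Accuracy near 0.5: the distributions look alike; near 1.0: far apart.
        return clf.score(X_te, s_te)
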
@@ -324,7 +324,7 @@ \section{How Badly can it Go?}
%
\begin{align}
\underbrace{\ep\xth{new}(f)}_{\textrm{error on } \Dnew}
&\leq
&\leq
\underbrace{\ep\xth{old}(f)}_{\textrm{error on } \Dold} +
\underbrace{\ep\xth{best}}_{\textrm{minimal avg error}} +
\underbrace{d_\cA(\Dold,\Dnew)}_{\textrm{distance}}
@@ -342,7 +342,7 @@ \section{Further Reading}
TODO further reading


%%% Local Variables:
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "courseml"
%%% End:
%%% End:
16 changes: 8 additions & 8 deletions book/complex.tex
@@ -182,8 +182,8 @@ \section{Learning with Imbalanced Data} \label{sec:imbalanced}
that distribution. We will compute the expected error $\ep^w$ of
$f$ on the weighted problem:
\begin{align}
\ep^w
&= \Ep_{(\vx,y) \sim \cD^w}
\ep^w
&= \Ep_{(\vx,y) \sim \cD^w}
\Big[ \al^{y=1} \big[f(\vx) \neq y\big] \Big] \\
&= \sum_{\vx \in \cX} \sum_{y \in \pm 1}
\cD^w(\vx,y) \al^{y=1} \big[f(\vx) \neq y\big] \\
@@ -312,7 +312,7 @@ \section{Multiclass Classification}
Algorithms~\ref{alg:complex:ovatrain} and \ref{alg:complex:ovatest}.
In the testing procedure, the prediction of the $i$th classifier is
added to the overall score for class $i$. Thus, if the prediction is
positive, class $i$ gets a vote; if the prdiction is negative,
positive, class $i$ gets a vote; if the prediction is negative,
everyone else (implicitly) gets a vote. (In fact, if your learning
algorithm can output a confidence, as discussed in Section~\ref{}, you
can often do better by using the confidence as $y$, rather than a
@@ -533,7 +533,7 @@ \section{Ranking}
a large number of documents, somehow assimilating the preference
function into an overall permutation.

For notationally simplicity, let $\vx_{nij}$ denote the features
For notational simplicity, let $\vx_{nij}$ denote the features
associated with comparing document $i$ to document $j$ on query $n$.
Training is fairly straightforward. For every $n$ and every pair $i
\neq j$, we will create a binary classification example based on
@@ -603,7 +603,7 @@ \section{Ranking}
Second, rather than producing a list of scores and then calling an
arbitrary sorting algorithm, you can actually use the preference
function as the sorting function inside your own implementation of
quicksort.
quicksort.
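
A sketch of that idea, where prefers(a, b) is a hypothetical wrapper around the learned binary classifier that returns True when document a should be ranked ahead of document b:

    def sort_by_preference(docs, prefers):
        # Quicksort-style ordering driven directly by the preference function.
        if len(docs) <= 1:
            return list(docs)
        pivot, rest = docs[0], docs[1:]
        ahead = [d for d in rest if prefers(d, pivot)]
        behind = [d for d in rest if not prefers(d, pivot)]
        return sort_by_preference(ahead, prefers) + [pivot] + sort_by_preference(behind, prefers)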

We can now formalize the problem. Define a ranking as a function
$\si$ that maps the objects we are ranking (documents) to the desired
@@ -825,7 +825,7 @@ \section{Further Reading}
% \learningproblem{Collective Classification}{
% \item An input space $\cX$ and number of classes $K$
% \item An unknown distribution $\cD$ over $\cG(\cX\times[K])$
% }{A function $f : \cG(\cX) \fto \cG([K])$ minimizing:
% }{A function $f : \cG(\cX) \fto \cG([K])$ minimizing:
% $\Ep_{(V,E) \sim \cD} \left[
% \sum_{v \in V} \big[ \hat y_v \neq y_v \big]
% \right]$, where $y_v$ is the label associated with vertex $v$ in $G$
@@ -950,7 +950,7 @@ \section{Further Reading}
% ensure that your predictions at the $k$th layer are indicative of how
% well the algorithm will actually do at test time.

%%% Local Variables:
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "courseml"
%%% End:
%%% End:
29 changes: 0 additions & 29 deletions book/courseml.lot

This file was deleted.

22 changes: 11 additions & 11 deletions book/dt.tex
@@ -226,7 +226,7 @@ \section{The Decision Tree Model of Learning}
You want to find a feature that is \emph{most useful} in helping you
guess whether this student will enjoy this course.
A useful way to think about this is to look at the \concept{histogram}
of labels for each feature.
of labels for each feature.
\sidenote{A
colleague related the story of getting his 8-year old nephew to
guess a number between 1 and 100. His nephew's first four questions
@@ -247,12 +247,12 @@ \section{The Decision Tree Model of Learning}
like this course.
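
As a rough sketch of this histogram-and-guess idea (assuming binary features and labels in {+1, -1} held in NumPy arrays; the names are illustrative only), a feature can be scored by how many training labels the majority guess in each of its two answer sets would get right:

    import numpy as np

    def score_binary_feature(feature_values, labels):
        # Split examples into the NO (0) and YES (1) sets for this feature,
        # then count how many labels the majority guess in each set gets right.
        correct = 0
        for value in (0, 1):
            bucket = labels[feature_values == value]
            if len(bucket) > 0:
                correct += max(np.sum(bucket == +1), np.sum(bucket == -1))
        return correct  # higher means a more useful question to ask first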

More formally, you will consider each feature in turn. You might
consider the feature ``Is this a System's course?'' This feature has
two possible value: no and yes. Some of the training examples have an
consider the feature ``Is this a Systems course?'' This feature has
two possible values: no and yes. Some of the training examples have an
answer of ``no'' -- let's call that the ``NO'' set. Some of the
training examples have an answer of ``yes'' -- let's call that the
``YES'' set. For each set (NO and YES) we will build a histogram over
the labels. This is the second histogram in
the labels. This is the fourth histogram (from the top) in
Figure~\ref{fig:dt_histogram}. Now, suppose you were to ask this
question on a random example and observe a value of ``no.'' Further
suppose that you must \emph{immediately} guess the label for this
@@ -414,7 +414,7 @@ \section{Formalizing the Learning Problem}
Note that the loss function is something that \emph{you} must decide
on based on the goals of learning.

\begin{mathreview}{Expectated Values}
\begin{mathreview}{Expected Values}
We write $\Ep_{(\vx,y) \sim \cD} [ \ell(y, f(\vx)) ]$ for the expected loss. Expectation means ``average.'' This is saying ``if you drew a bunch of $(x,y)$ pairs independently at random from $\cD$, what would your \emph{average} loss be?% (More formally, what would be the average of $\ell(y,f(\vx))$ be over these random draws?)
More formally, if $\cD$ is a discrete probability distribution, then this expectation can be expanded as:
%
Expand All @@ -426,12 +426,12 @@ \section{Formalizing the Learning Problem}
If $D$ is a \emph{finite discrete distribution}, for instance defined by a finite data set $\{ (\vx_1,y_1), \dots, (\vx_N,y_N)$ that puts equal weight on each example (probability $1/N$), then we get:
%
\begin{align}
\Ep_{(\vx,y) \sim D} [ \ell(y, f(\vx)) ]
\Ep_{(\vx,y) \sim D} [ \ell(y, f(\vx)) ]
&= \sum_{(\vx,y) \in D} [ D(\vx,y) \ell(y, f(\vx)) ]
\becauseof{definition of expectation}\\
&= \sum_{n=1}^N [ D(\vx_n,y_n) \ell(y_n, f(\vx_n)) ]
\becauseof{$D$ is discrete and finite}\\
&= \sum_{n=1}^N [ \frac 1 N \ell(y_n, f(\vx_n)) ]
&= \sum_{n=1}^N [ \frac 1 N \ell(y_n, f(\vx_n)) ]
\becauseof{definition of $D$}\\
&= \frac 1 N \sum_{n=1}^N [ \ell(y_n, f(\vx_n)) ]
\becauseof{rearranging terms}
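
A tiny numerical illustration of that last line, with a hypothetical zero/one loss on four examples:

    import numpy as np

    y_true = np.array([+1, -1, +1, +1])
    y_pred = np.array([+1, +1, +1, -1])
    zero_one = lambda y, yhat: float(y != yhat)
    # (1/N) * sum_n loss(y_n, f(x_n)): the expectation under a uniform finite D.
    avg_loss = np.mean([zero_one(y, yh) for y, yh in zip(y_true, y_pred)])  # 0.5
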
@@ -501,7 +501,7 @@ \section{Formalizing the Learning Problem}
$\hat \vx$ to corresponding prediction $\hat y$. The key property
that $f$ should obey is that it should do well (as measured by $\ell$)
on future examples that are \emph{also} drawn from $\cD$. Formally,
it's \concept{expected loss} $\ep$ over $\cD$ with repsect to $\ell$
its \concept{expected loss} $\ep$ over $\cD$ with respect to $\ell$
should be as small as possible:
\begin{align} \label{eq:expectederror}
\ep
@@ -542,7 +542,7 @@ \section{Formalizing the Learning Problem}
\concept{generalize} beyond the training data to some future data that
it might not have seen yet!

So, putting it all together, we get a formal definition of induction
So, putting it all together, we get a formal definition of induction in
machine learning: \bigemph{Given (i) a loss function $\ell$ and (ii) a
sample $D$ from some unknown distribution $\cD$, you must compute a
function $f$ that has low expected error $\ep$ over $\cD$ with
@@ -624,7 +624,7 @@ \section{Further Reading}
\end{comment}


%%% Local Variables:
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "courseml"
%%% End:
%%% End: