diff --git a/notes/04_density-transform/1_density-transform.tex b/notes/04_density-transform/1_density-transform.tex
new file mode 100644
index 0000000..43a0f5a
--- /dev/null
+++ b/notes/04_density-transform/1_density-transform.tex
@@ -0,0 +1,310 @@
+
+\section{The ICA problem:}
+
+Let $\vec s = (s_1, s_2,...,s_N)^\top$ denote the concatenation of independent sources
+and $\vec x \in \R^N$ describe our observations. $\vec x$ relates to $\vec s$ through a
+\emph{linear transformation} $\vec A$:
+
+\begin{equation}
+\label{eq:ica}
+\vec x = \vec A \, \vec s.
+\end{equation}
+
+We refer to $\vec A$ as the \emph{mixing matrix} and to Eq.~\ref{eq:ica} as the \emph{ICA problem},
+which consists of recovering $\vec s$ from observing only $\vec x$.
+
+\underline{Example scenario:}
+
+Two speakers are placed in a room and emit signals $s_1$ and $s_2$.
+The speakers operate independently of one another.
+Two microphones are placed in the room and start recording.
+The first microphone is placed slightly closer to speaker 2, while
+the second microphone is placed slightly closer to speaker 1.
+$x_1$ and $x_2$ denote the recordings of the first and second microphone, respectively.
+When we listen to the recordings we expect to hear a mix of $s_1$ and $s_2$.
+Since microphone 1 was placed closer to speaker 2, when we listen only to $x_1$ we hear more of $s_2$ than of $s_1$.
+The opposite holds when we listen only to $x_2$.
+
+Acoustic systems are linear. This means that $x_1$ is a superposition of \emph{both} sources $s_1$ and $s_2$.
+We will assume here that the contribution of a source $s_i$
+to an observation $x_j$ is inversely proportional to the distance between the source and the microphone,
+e.g. $x_1 = a_{11} s_1 + a_{12} s_2$ with $a_{1i} \propto 1/d_{1i}$ for the distances $d_{1i}$.
+The distance-contribution relationship is \emph{linear}. We don't need this to be any more realistic.
+
+If we had a measurement of the distance between each microphone and each speaker,
+we could tell exactly what the contribution of each of $s_1$ and $s_2$ is to each recorded observation.
+If we know the exact contribution of a source to an observation, we can look at both observations and recover each source in full.
+
+This is what ICA tries to solve, except that it does not have any knowledge about the spatial setting. It is blind.
+
+\underline{Outline:}
+
+Before we tackle ICA itself, we first look at the more basic principle of \emph{density transformation} and
+the \emph{conservation of probability}.\\
+We start with more specific cases of applying density transformations,
+namely \emph{pseudo random number generators} and what the inverse of \emph{cumulative distribution functions (cdf)} can be used for.\\
+Finally we discuss how to generalize this in order to transform one probability density function (pdf) into another.
+
+\subsection{PRNG:}
+
+How can we sample from the uniform distribution over $\lbrack0, 1)$?
+
+\begin{itemize}
+\item Create a sequence, preferably with a long period.
+\item minimal patterns within subsequences
+\item determinism has an advantage:
+\begin{enumerate}
+  \item \emph{reproducible} sequences
+  \item efficiency: the starting element or ``seed'' and the length of the sequence are sufficient to represent the entire sequence.
+\end{enumerate}
+\end{itemize}
+
+\underline{Linear congruential generator (LCG):}
+
+Start with a seed $y_0 \in \overbrace{\left\{0,\ldots,m-1\right\}}^{=:\;\mathcal{M}}$ with $m \in \N$ ($m$ controls the granularity).
+The next sample $y_t$ is computed as:
+
+\begin{equation}
+y_t = \left( \, a \; y_{t-1} \; + \; b \, \right) \, \text{mod} \; m,
+\end{equation}
+where\\[-0.7cm]
+\begin{align*}
+a \in \mathcal{M}&\; \text{is the multiplier,} \\
+b \in \mathcal{M}&\; \text{is the increment.}
+\end{align*}
+
+Then $u_i = \frac{y_i}{m}$ is approximately $\mathcal{U}\lbrack0, 1)$-distributed.
+
+Although finicky and requiring careful parameterization, the LCG gives us a means of drawing from a uniform distribution.
+Next, we look at how to draw samples of a random variable $X$ with a desired pdf $p_X(x)$
+using uniformly sampled values $\tilde z \in [0,1]$.
+
+\subsection{Inverse CDF:}
+
+If $F_{X}(x)$ is the cumulative
+distribution function (cdf) of a random variable $X$, then the
+random variable $Z = F_{X}(X)$ is uniformly distributed on the
+interval $[0,1]$. This result provides
+% (after inverting the relationship)
+a general recipe to generate
+samples $\tilde x$ of a random variable $X$ with a desired pdf $p_X(x)$
+from uniformly distributed random numbers $\tilde z \in [0,1]$:
+\begin{enumerate}
+\item Compute the cdf $F_X(x)$ of the desired pdf $p_X(x)$
+
+\begin{equation}
+F_X(x) = P(X \leq x) = \int_{-\infty}^{x} p_X(y)\,dy
+\end{equation}
+
+The cdf is a one-to-one mapping from the domain of $X$ to the interval $[0,1]$.
+If $Z$ is a uniform random variable, then $X=F_X^{-1}(Z)$ has the cdf $F_X$.
+
+\item Determine the inverse transformation $F^{-1}$.
+
+\item Sample uniformly distributed numbers (in $[0,1]$), $\tilde z$.
+\item Get the samples $\tilde x=F^{-1}(\tilde z)$ from $X$.
+\end{enumerate}
+
+At this point we can use a PRNG to sample from the uniform distribution and,
+by plugging those samples into the inverse cdf, we can obtain samples from a desired pdf.
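+
+The following is a minimal sketch of this pipeline (an illustration, not part of the original material): an LCG with the widely used parameters $a=1664525$, $b=1013904223$, $m=2^{32}$ produces approximately uniform samples, and the inverse cdf of the exponential distribution, $F^{-1}(z) = -\ln(1-z)/\lambda$, turns them into samples with pdf $p_X(x) = \lambda e^{-\lambda x}$:
+
+\begin{verbatim}
+import math
+
+def lcg(seed, n, a=1664525, b=1013904223, m=2**32):
+    """Linear congruential generator: y_t = (a*y_{t-1} + b) mod m."""
+    y, samples = seed, []
+    for _ in range(n):
+        y = (a * y + b) % m
+        samples.append(y / m)        # u_i = y_i / m  lies in [0, 1)
+    return samples
+
+def exp_inverse_cdf(z, lam=1.0):
+    """Inverse cdf of the exponential distribution F(x) = 1 - exp(-lam*x)."""
+    return -math.log(1.0 - z) / lam
+
+uniform = lcg(seed=42, n=100000)
+exponential = [exp_inverse_cdf(z) for z in uniform]
+print(sum(exponential) / len(exponential))   # close to 1/lam = 1
+\end{verbatim}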
+
+\newpage
+
+\section{Density Transformation:}
+
+Let $X_1$ and $X_2$ be jointly continuous random variables with
+density function $f_{X_1, X_2}$:
+
+$$
+f_{X_1, X_2}(x_1, x_2) = f(\vec x) \qquad \vec x \in \Omega \subset \R^2
+$$
+
+and let $\vec u = \vec u(\vec x) = ( u_1(\vec x), u_2(\vec x)) = ( u_1(x_1, x_2), u_2(x_1, x_2)) $ be a one-to-one mapping/transformation.
+
+%\begin{figure}
+%\centering
+\includegraphics[width=0.7\textwidth]{img/u.pdf}
+%\caption*{$\vec{\psi} \in \mathcal{F} = \overline{\operatorname{span} \vec{\phi}(\mathbb{R}^N)}$}
+%\end{figure}
+
+The area of the small rectangle in $\Omega$ is $A = dx_1\, dx_2$.\\
+
+The goal is to show that in order for probability to be conserved, the probability mass contained in corresponding regions of both spaces has to be equal.
+We will therefore demonstrate that:
+$$
+\int_{\Omega} f(\vec{x}) \mathbf{d}\vec{x}
+=\int_{u(\Omega)} f({\vec x(\vec u)}) \frac{1}{\left|\det \frac{\partial \vec{u}(\vec{x})}{\partial \vec{x}} \right|} \mathbf{d}\vec{u}.
+$$
+
+\begin{itemize}
+\item Because $dx_1$ and $dx_2$ are \emph{infinitesimally small} we can consider the mapping
+$\vec u$ to act as a linear transformation, resulting in a different shape in $u(\Omega)$.
+\item The shape of the shaded area in $u(\Omega)$ is \emph{approximately} a parallelogram.
+\item We will compute the ratio of the areas between the two spaces.
+\item If we want to go back from $u(\Omega)$ to $\Omega$ we only need $\frac{1}{\text{ratio}}$.
+\end{itemize}
+
+\underline{Important:}
+Approximating the above as a linear transformation (i.e. a small rectangle turns into a parallelogram) only holds for very small $d\vec x$.
+
+To get the area of the parallelogram in $u(\Omega)$ we need to find out what $\vec u(dx_1, dx_2)$ is.
+The area of the parallelogram then becomes the magnitude of the cross product of the two transformed edge vectors.
+
+\newpage
+
+We first only consider the vector due to $dx_1$ in blue:
+
+\includegraphics[width=0.75\textwidth]{img/x1.pdf}
+
+The difference vector between the two corresponding points in $u(\Omega)$ becomes:
+
+\begin{equation*}
+\begin{array}{r}
+\rmat{
+u_1 (x_1 + dx_1, x_2)\\
+u_2 (x_1 + dx_1, x_2)
+} -
+\rmat{
+u_1 (x_1, x_2)\\
+u_2 (x_1, x_2)
+} \\[0.7cm]
+=
+\rmat{
+u_1 (x_1 + dx_1, x_2) - u_1 (x_1, x_2)\\
+u_2 (x_1 + dx_1, x_2) - u_2 (x_1, x_2)
+}
+\end{array}
+\end{equation*}
+
+Because $dx_1$ is so small, we can approximate the transformed vector by the derivative,
+which is essentially taking the limit. The difference vector in $u(\Omega)$ becomes:
+
+$$
+u: \vec e_1 \mapsto
+\rmat{
+\frac{\partial u_1}{\partial x_1} \Delta x_1\\[0.2cm]
+\frac{\partial u_2}{\partial x_1} \Delta x_1
+}
+$$
+
+\question{What about $dx_2$?}
+
+- We use the same procedure (on the red vector $\vec e_2$) and get:
+
+$$
+u: \vec e_2 \mapsto
+\rmat{
+\frac{\partial u_1}{\partial x_2} \Delta x_2\\[0.2cm]
+\frac{\partial u_2}{\partial x_2} \Delta x_2
+}
+$$
+
+The area of the parallelogram spanned by the two difference vectors in $u(\Omega)$
+is the magnitude of their cross product:
+\begin{align*}
+&\left| \;
+\rmat{
+\frac{\partial u_1}{\partial x_1} \Delta x_1\\[0.2cm]
+\frac{\partial u_2}{\partial x_1} \Delta x_1
+}
+\times
+\rmat{
+\frac{\partial u_1}{\partial x_2} \Delta x_2\\[0.2cm]
+\frac{\partial u_2}{\partial x_2} \Delta x_2
+}
+\; \right|\\
+&=
+\left| \;
+\underbrace{
+\frac{\partial u_1}{\partial x_1} \frac{\partial u_2}{\partial x_2}
+\; - \;
+\frac{\partial u_1}{\partial x_2} \frac{\partial u_2}{\partial x_1}
+}_{\text{the Jacobian determinant}}
+\; \right| \Delta x_1 \Delta x_2 \\
+&=
+\left| \; \det \quad
+\underbrace{
+\left(
+\frac{\partial (u_1, u_2)}{\partial (x_1,x_2)}
+\right)
+}_{\text{the Jacobian}}
+\; \right| \Delta x_1 \Delta x_2
+\end{align*}
+
+This matrix of partial derivatives is called the \emph{Jacobian}:
+$$
+\frac{\partial (u_1, u_2)}{\partial (x_1,x_2)} =
+\frac{\partial (u_1(\vec x), u_2(\vec x))}{\partial (x_1,x_2)} =
+\frac{\partial (\vec u(\vec x))}{\partial (\vec x)} =
+\underbrace{
+\rmat{
+{\partial u_1}/{\partial x_1} & {\partial u_1}/{\partial x_2}\\[0.2cm]
+{\partial u_2}/{\partial x_1} & {\partial u_2}/{\partial x_2}
+}%rmat
+}_{
+\substack{
+\text{matrix of}\\
+\text{partial derivatives}
+}%substack
+}
+$$
+
+If we are given an area spanned by two small vectors (e.g. $\vec e_1$, $\vec e_2$)
+we can compute the area of the corresponding parallelogram in $u(\Omega)$
+by multiplying the original area by the Jacobian determinant.
+
+\question{What if we want to transform from $u(\Omega)$ back to $\Omega$?}
+
+\includegraphics[width=0.7\textwidth]{img/reverse.pdf}
+
+We apply the inverse mapping. A very small rectangle in $u(\Omega)$ transforms into a parallelogram in $\Omega$.
+This reverse transformation requires the inverse of the above matrix of partial derivatives.
+The matrix of partial derivatives tells us how to transform infinitesimally small vectors back and forth.
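+
+As a quick numerical sanity check (a sketch with a made-up one-to-one mapping, not part of the original material), we can estimate both Jacobians by finite differences and verify that the determinant of the reverse transformation is the reciprocal of the forward one -- which is exactly the question below:
+
+\begin{verbatim}
+import math
+
+def u(x1, x2):
+    # a made-up one-to-one mapping for illustration
+    return math.exp(x1), x1 + x2
+
+def u_inv(u1, u2):
+    # its inverse, derived by hand
+    return math.log(u1), u2 - math.log(u1)
+
+def jacobian_det(f, x1, x2, h=1e-6):
+    """Finite-difference estimate of det(df/dx) at (x1, x2)."""
+    du1_dx1 = (f(x1 + h, x2)[0] - f(x1 - h, x2)[0]) / (2 * h)
+    du1_dx2 = (f(x1, x2 + h)[0] - f(x1, x2 - h)[0]) / (2 * h)
+    du2_dx1 = (f(x1 + h, x2)[1] - f(x1 - h, x2)[1]) / (2 * h)
+    du2_dx2 = (f(x1, x2 + h)[1] - f(x1, x2 - h)[1]) / (2 * h)
+    return du1_dx1 * du2_dx2 - du1_dx2 * du2_dx1
+
+x = (0.5, -1.0)
+det_fwd = jacobian_det(u, *x)            # = exp(0.5)
+det_bwd = jacobian_det(u_inv, *u(*x))    # = 1 / exp(0.5)
+print(det_fwd * det_bwd)                 # approximately 1
+\end{verbatim}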
+
+%Transformation between the spaces for infinitesimally small vectors:
+
+%$$
+%\rmat{
+%u_1(x_1) & u_1(x_2)\\
+%u_2(x_1) & u_2(x_2)
+%}
+%\rmat{
+%a\\
+%b
+%}
+%=
+%\rmat{
+%u_1(x_1)\\
+%u_2(x_1)
+%}
+%a
+%+
+%\rmat{
+%u_1(x_2)\\
+%u_2(x_2)
+%}
+%b
+%$$
+
+\question{Do we really have to compute the inverse of the matrix?}
+
+- No, the determinant of the inverse matrix is the reciprocal of the determinant of the original matrix:
+$$
+\frac{1}{\left| \det \left( \frac{\partial \vec u}{\partial \vec x} \right) \right|} =
+{\left| \det \left( \frac{\partial \vec x}{\partial \vec u} \right) \right|}
+$$
+
+\underline{Transformation between probability densities:}
+
+Conservation of probability: The area represents the probability of the event;
+transforming it into another space should not cause any increase or decrease in the probability of the event.
+
+Therefore, if we multiply the area of the rectangle in $\Omega$ by the pdf $f(\vec x)$ at $\vec x$, we get the probability mass of the rectangle, which must equal that of the corresponding parallelogram in $u(\Omega)$:
+
+$$
+\int_{\Omega} f(\vec{x}) \mathbf{d}\vec{x}
+=\int_{u(\Omega)} f({\vec x(\vec u)}) \left|\det \frac{\partial \vec{x}(\vec{u})}{\partial \vec{u}} \right| \mathbf{d}\vec{u}
+$$
+$$
+=\int_{u(\Omega)} f({\vec x(\vec u)}) \frac{1}{\left|\det \frac{\partial \vec{u}(\vec{x})}{\partial \vec{x}} \right|} \mathbf{d}\vec{u}.
+$$
+
diff --git a/notes/04_density-transform/Makefile b/notes/04_density-transform/Makefile
new file mode 100644
index 0000000..dfe51f7
--- /dev/null
+++ b/notes/04_density-transform/Makefile
@@ -0,0 +1,40 @@
+all: slides notes clean
+#all: handout
+
+projname = tutorial
+targetname = $(projname)_$(shell basename $(CURDIR))
+compile = pdflatex
+projnameS = $(projname).slides
+projnameH = $(projname).handout
+projnameA = $(projname).notes
+
+slides: $(projname).slides.tex $(projname).tex
+	$(compile) $(projname).slides.tex
+#	bibtex $(projname).slides
+#	$(compile) --interaction=batchmode $(projname).slides.tex
+#	$(compile) --interaction=batchmode $(projname).slides.tex
+	mv $(projname).slides.pdf $(targetname).slides.pdf
+
+handout: $(projname).handout.tex $(projname).tex
+	$(compile) $(projname).handout.tex
+	mv $(projname).handout.pdf $(targetname).handout.pdf
+
+# Repeat compilation for the references to show up correctly
+notes: $(projname).notes.tex $(projname).tex
+	$(compile) $(projname).notes.tex
+#	bibtex $(projname).notes
+#	$(compile) --interaction=batchmode $(projname).notes.tex
+	$(compile) --interaction=batchmode $(projname).notes.tex
+	mv $(projname).notes.pdf $(targetname).notes.pdf
+
+clean: cleans cleanh cleana
+
+cleans:
+	rm -f $(projnameS).aux $(projnameS).bbl $(projnameS).log $(projnameS).out $(projnameS).toc $(projnameS).lof $(projnameS).glo $(projnameS).glsdefs $(projnameS).idx $(projnameS).ilg $(projnameS).ind $(projnameS).loa $(projnameS).lot $(projnameS).loe $(projnameS).snm $(projnameS).nav
+
+cleanh:
+	rm -f $(projnameH).aux $(projnameH).bbl $(projnameH).log $(projnameH).out $(projnameH).toc $(projnameH).lof $(projnameH).glo $(projnameH).glsdefs $(projnameH).idx $(projnameH).ilg $(projnameH).ind $(projnameH).loa $(projnameH).lot $(projnameH).loe $(projnameH).snm $(projnameH).nav
+
+cleana:
+	rm -f $(projnameA).aux $(projnameA).bbl $(projnameA).log $(projnameA).out $(projnameA).toc $(projnameA).lof $(projnameA).glo $(projnameA).glsdefs $(projnameA).idx $(projnameA).ilg $(projnameA).ind $(projnameA).loa $(projnameA).lot $(projnameA).loe $(projnameA).snm $(projnameA).nav
+
diff --git a/notes/04_density-transform/beamercolorthemetub.sty b/notes/04_density-transform/beamercolorthemetub.sty
new file mode 100644
index 0000000..c41d22a
--- /dev/null
+++ b/notes/04_density-transform/beamercolorthemetub.sty
@@ -0,0 +1,48 @@
+% Copyright 2004 by Madhusudan Singh
+%
+% This file may be distributed and/or modified
+%
+% 1. under the LaTeX Project Public License and/or
+% 2. under the GNU Public License.
+%
+% See the file doc/licenses/LICENSE for more details.
+
+%\ProvidesPackageRCS $Header: beamercolorthemetub.sty, v a01 2011/11/18 09:11:41 tujl $
+
+\mode<presentation>
+
+\definecolor{darkred}{rgb}{0.8,0,0}
+
+\setbeamercolor{section in toc}{fg=black,bg=white}
+\setbeamercolor{alerted text}{fg=darkred!80!gray}
+
+\setbeamercolor*{palette primary}{fg=darkred!60!black,bg=gray!30!white}
+\setbeamercolor*{palette secondary}{fg=darkred!70!black,bg=gray!15!white}
+\setbeamercolor*{palette tertiary}{bg=darkred!80!black,fg=gray!10!white}
+\setbeamercolor*{palette quaternary}{fg=darkred,bg=gray!5!white}
+
+\setbeamercolor*{sidebar}{fg=darkred,bg=gray!15!white}
+
+\setbeamercolor*{palette sidebar primary}{fg=darkred!15!black}
+\setbeamercolor*{palette sidebar secondary}{fg=white}
+\setbeamercolor*{palette sidebar tertiary}{fg=darkred!50!black}
+\setbeamercolor*{palette sidebar quaternary}{fg=gray!15!white}
+
+%\setbeamercolor*{titlelike}{parent=palette primary}
+\setbeamercolor{titlelike}{parent=palette primary,fg=darkred}
+\setbeamercolor{frametitle}{bg=gray!15!white}
+\setbeamercolor{frametitle right}{bg=gray!60!white}
+
+%\setbeamercolor{Beispiel title}{bg=white,fg=black}
+
+\setbeamercolor*{separation line}{}
+\setbeamercolor*{fine separation line}{}
+
+%\setbeamercolor{itemize item}{fg=darkred,bg=white}
+%\setbeamercolor{itemize subitem}{fg=darkred!60!white,bg=white}
+%\setbeamercolor{local structure}{fg=darkred,bg=white}
+\setbeamercolor{local structure}{fg=gray,bg=white}
+\setbeamercolor{structure}{fg=darkred!80!black,bg=white}
+\setbeamercolor{block title}{bg=gray!10!white}
+\mode<all>
+
diff --git a/notes/04_density-transform/beamerthemeTUBerlin.sty b/notes/04_density-transform/beamerthemeTUBerlin.sty
new file mode 100644
index 0000000..1ce3fd7
--- /dev/null
+++ b/notes/04_density-transform/beamerthemeTUBerlin.sty
@@ -0,0 +1,22 @@
+% Copyright 2004 by Madhusudan Singh
+%
+% This file may be distributed and/or modified
+%
+% 1. under the LaTeX Project Public License and/or
+% 2. under the GNU Public License.
+%
+% See the file doc/licenses/LICENSE for more details.
+
+%\ProvidesPackageRCS $Header: beamerthemeTUBerlin.sty, v a01 2011/11/18 09:11:41 tujl $
+\mode<presentation>
+
+\useinnertheme[shadow=true]{rounded}
+\useoutertheme{infolines}
+\usecolortheme{tub}
+
+\setbeamerfont{frametitle}{size=\normalsize}
+\setbeamerfont{block title}{size={}}
+%\setbeamerfont{structure}{series=\bfseries}
+\setbeamercolor{titlelike}{parent=structure,bg=white}
+\mode<all>
+
diff --git a/notes/04_density-transform/bibliography.bib b/notes/04_density-transform/bibliography.bib
new file mode 100644
index 0000000..948691f
--- /dev/null
+++ b/notes/04_density-transform/bibliography.bib
@@ -0,0 +1,29 @@
+@book{sutton1998introduction,
+  title={Introduction to reinforcement learning},
+  author={Sutton, Richard S and Barto, Andrew G and others},
+  volume={135},
+  year={1998},
+  publisher={MIT press Cambridge}
+}
+@Book{Bertsekas07,
+  author = {D. P. Bertsekas},
+  title = {Dynamic Programming and Optimal Control},
+  publisher ={Athena Scientific},
+  year = {2007},
+  volume = {2},
+  edition = {3rd},
+  url = {http://www.control.ece.ntua.gr/UndergraduateCourses/ProxTexnSAE/Bertsekas.pdf}
+}
+@Article{Watkins92,
+  author = {C. Watkins and P. Dayan},
+  title = {Q-learning},
+  journal = {Machine Learning},
+  year = {1992},
+  OPTkey = {},
+  volume = {8},
+  OPTnumber = {},
+  pages = {279--292},
+  OPTmonth = {},
+  OPTnote = {},
+  OPTannote = {}
+}
diff --git a/notes/04_density-transform/img/reverse.pdf b/notes/04_density-transform/img/reverse.pdf
new file mode 100644
index 0000000..cc267dc
Binary files /dev/null and b/notes/04_density-transform/img/reverse.pdf differ
diff --git a/notes/04_density-transform/img/reverse.svg b/notes/04_density-transform/img/reverse.svg
new file mode 100644
index 0000000..1d5adb3
--- /dev/null
+++ b/notes/04_density-transform/img/reverse.svg
@@ -0,0 +1,3253 @@
+ (SVG markup omitted: Inkscape drawing of the reverse mapping figure)
diff --git a/notes/04_density-transform/img/u.pdf b/notes/04_density-transform/img/u.pdf
new file mode 100644
index 0000000..249a0fb
Binary files /dev/null and b/notes/04_density-transform/img/u.pdf differ
diff --git a/notes/04_density-transform/img/u.svg b/notes/04_density-transform/img/u.svg
new file mode 100644
index 0000000..90d1897
--- /dev/null
+++ b/notes/04_density-transform/img/u.svg
@@ -0,0 +1,4059 @@
+ (SVG markup omitted: Inkscape drawing of the mapping $u$ figure)
diff --git a/notes/04_density-transform/img/x.svg b/notes/04_density-transform/img/x.svg
new file mode 100644
index 0000000..0ceb480
--- /dev/null
+++ b/notes/04_density-transform/img/x.svg
@@ -0,0 +1,1591 @@
+ (SVG markup omitted: Inkscape drawing of the rectangle in $\Omega$)
diff --git a/notes/04_density-transform/img/x1.pdf b/notes/04_density-transform/img/x1.pdf
new file mode 100644
index 0000000..c23a119
Binary files /dev/null and b/notes/04_density-transform/img/x1.pdf differ
diff --git a/notes/04_density-transform/img/x1.svg b/notes/04_density-transform/img/x1.svg
new file mode 100644
index 0000000..cf54497
--- /dev/null
+++ b/notes/04_density-transform/img/x1.svg
@@ -0,0 +1,4967 @@
+ (SVG markup omitted: Inkscape drawing of the $dx_1$ edge vector figure)
diff --git a/notes/04_density-transform/tutorial.handout.tex b/notes/04_density-transform/tutorial.handout.tex
new file mode 100644
index 0000000..c016f5c
--- /dev/null
+++ b/notes/04_density-transform/tutorial.handout.tex
@@ -0,0 +1,14 @@
+\documentclass[handout,ignorenonframetext]{beamer}
+\newcounter{baslide}
+\setcounter{baslide}{1}
+
+\let\oldframe
+\frame
+\let\oldendframe
+\endframe
+
+\def\frame{\oldframe \label{baslide\roman{baslide}}%
+\addtocounter{baslide}{1}}
+\def\endframe{\oldendframe}
+
+\input{tutorial}
diff --git a/notes/04_density-transform/tutorial.notes.tex b/notes/04_density-transform/tutorial.notes.tex
new file mode 100644
index 0000000..c5da1a8
--- /dev/null
+++ b/notes/04_density-transform/tutorial.notes.tex
@@ -0,0 +1,17 @@
+\documentclass{../../latex/minotes}
+\input{../../latex/customcommands}
+
+\numberwithin{equation}{section}
+\numberwithin{figure}{section}
+
+\let\oldframe\frame
+\let\oldendframe\endframe
+
+\newcommand{\notesonly}[1]{#1}
+
+\newcommand{\mystackrel}[2]{\stackrel{\mathmakebox[\widthof{#1}]{#2}}{=}}
+
+% frame titles only effective in presentation mode
+\renewcommand{\frametitle}[1]{}
+
+\input{tutorial}
diff --git a/notes/04_density-transform/tutorial.slides.tex b/notes/04_density-transform/tutorial.slides.tex
new file mode 100644
index 0000000..5a3735c
--- /dev/null
+++ b/notes/04_density-transform/tutorial.slides.tex
@@ -0,0 +1,11 @@
+\input{../../latex/headerMIslides}
+\input{../../latex/customcommands}
+
+\subtitle{1.1 Intro \& 1.2 Connectionist Neuron}
+\mathtoolsset{showonlyrefs}
+
+\newcommand{\slidesonly}[1]{#1}
+
+\newcommand{\mystackrel}[2]{\stackrel{\mathmakebox[\widthof{#1}]{#2}}{=}}
+
+\input{tutorial}
diff --git a/notes/04_density-transform/tutorial.tex b/notes/04_density-transform/tutorial.tex
new file mode 100644
index 0000000..5219c27
--- /dev/null
+++ b/notes/04_density-transform/tutorial.tex
@@ -0,0 +1,80 @@
+\usepackage[authoryear,round]{natbib}
+\usepackage{multirow}
+
+\newcommand{\sheetnum}{%
+  04
+}
+%\setcounter{section}{\sheetnum-3}
+\newcommand{\tutorialtitle}{%
+  Density Transformation
+}
+\newcommand{\tutorialtitleshort}{%
+  Density Transformation
+}
+% for slides
+\subtitle{\sheetnum \tutorialtitle}
+
+%\maxdeadcycles=1000 % Workaround for ! Output loop---100 consecutive dead cycles because of too many figures
+
+% The following use of algorithms does not work well with the notes:
+%
+%
+%
+% instead use the following for your algorithms:
+%
+%\begin{figure}[!t]
+%\removelatexerror
+%\begin{algorithm}[H]
+  % your algo here
+  %\label{alg:algolabel}
+  %\caption{algocaption}
+%\end{algorithm}
+%\end{figure}
+%\begin{algorithm}
+% Below is the definition for the command \removelatexerror:
+\makeatletter
+\newcommand{\removelatexerror}{\let\@latex@error\@gobble}
+\makeatother
+
+\begin{document} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\sheet{\sheetnum}{\tutorialtitleshort}
+
+\ttopic{\tutorialtitle}
+
+\columnratio{0.2,0.8}
+\begin{paracol}{2}
+%\setlength{\columnseprule}{0.1pt}
+%\setlength{\columnsep}{5em}
+
+\begin{rightcolumn}
+
+% notes version will ignore it
+\begin{frame}
+\titlepage
+\end{frame}
+
+\begin{frame}
+\tableofcontents
+\end{frame}
+
+\newpage
+
+\mode<all>
+\input{./1_density-transform}
+\mode*
+
+\clearpage
+
+%\section{References}
+%\begin{frame}[allowframebreaks] \frametitle{References}
+  %\scriptsize
+  %\bibliographystyle{plainnat}
+  %\bibliography{bibliography}
+%\end{frame}
+
+\end{rightcolumn}
+\end{paracol}
+
+\end{document}
diff --git a/notes/07_stochopt/1_stochopt.tex b/notes/07_stochopt/1_stochopt.tex
new file mode 100644
index 0000000..d7ce30a
--- /dev/null
+++ b/notes/07_stochopt/1_stochopt.tex
@@ -0,0 +1,828 @@
+
+\footnote{credit to Timm Lochmann for the content on MCMC}
+
+\begin{frame}
+\underline{Outline}
+\begin{itemize}
+\item Problems with how we've been optimizing so far
+  \begin{itemize}
+  \item Exploration vs. Exploitation
+  \item What about discrete optimization?
+  \end{itemize}
+\item Simulated annealing
+  \begin{itemize}
+  \item How does $\beta$ modulate $W$?
+  \item The stationary distribution $P_{(\vec s)}$?
+  \item How does $\beta$ modulate $P_{(\vec s)}$?
+  \end{itemize}
+\item MCMC
+\item Mean-field annealing
+  \begin{itemize}
+  \item derivation and algorithm -- use slides
+  \item variational approach for further reading
+  \end{itemize}
+\end{itemize}
+\end{frame}
+
+\section{Exploration vs. Exploitation}
+\begin{frame}
+
+\begin{figure}[ht]
+  \centering
+  \begin{tabular}{c c}
+  \includegraphics[height=3.5cm]{img/gradient-descent.pdf} &
+  \includegraphics[height=3.5cm]{img/gradient-descent_local.pdf}
+  \end{tabular}
+  \caption{Learning by gradient descent}\label{fig:graddescent}
+\end{figure}
+
+\end{frame}
+
+%An
+%introduction illustrating the underlying analogy can be found in
+%\textcite[ch. 7]{DudaEtAl2001}. \textcite{Murphy2012} gives a good
+%and extensive discussion of undirected graphical models (Markov Random
+%fields, ch~19), variational inference (ch~21; mean field for the ISING
+%model, ch~21.3), and Monte Carlo Methods (ch~23), as well as MCMC
+%methods (ch~24). Further information regarding variational methods can
+%be found in \textcite{Bishop2006}.
+
+Learning is about tuning model parameters to fit some objective given training data.
+For simple models with only a few parameters one can formulate an analytic solution that optimizes the objective and yields the optimal parameters directly.
+As soon as the number of parameters increases, we opt for iterative gradient-based methods for finding the extrema of the objective function. If we were trying to minimize some cost function $E$,
+iteratively updating the parameters $\vec w$ by moving them in the direction of steepest descent leads to the location of an extremum.
+However, the cost function may contain multiple extrema, and there is no guarantee that gradient-based learning will take us to the \emph{global} rather than merely a \emph{local} optimum.
+Following the gradient on the assumption that it will lead to a solution that represents the global optimum is considered a \emph{greedy} approach to learning.
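+
+As a minimal illustration (a sketch with a made-up cost function, not part of the original material), gradient descent on the double-well $E(w) = w^4 - 2w^2 + 0.5\,w$ ends in whichever minimum the initialization happens to lie near:
+
+\begin{verbatim}
+def grad_E(w):
+    # derivative of the double-well cost E(w) = w**4 - 2*w**2 + 0.5*w
+    return 4 * w**3 - 4 * w + 0.5
+
+def gradient_descent(w, eta=0.01, steps=2000):
+    for _ in range(steps):
+        w -= eta * grad_E(w)
+    return w
+
+# the end point depends entirely on the initialization:
+print(gradient_descent(-0.8))  # global minimum, near w = -1.06
+print(gradient_descent(+0.8))  # local minimum,  near w = +0.93
+\end{verbatim}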
+
+Completely abandoning such assumptions leads to a \emph{random search} for the optimal parameters. When the previous set of parameters has no influence on the choice of weights in the next iteration,
+our learning approach is dominated by \emph{exploration}, as opposed to the \emph{exploitation} of learning in a greedy fashion.
+
+\question{What are the advantages and disadvantages of exploration?}
+
+The advantage of exploration is that we discard the current set of parameters whenever a newly drawn set yields a better cost value. Therefore, exploration is not prone to getting stuck inside a local optimum. The obvious disadvantage is that there are no guarantees on how long it will take to find the global optimum; we never know whether we have converged or not. Exploitation, i.e.\ the greedy approach (with an appropriate learning rate schedule), is able to converge. However, whether it reaches a global or a local solution depends on the starting position.
+
+\begin{frame}
+\slidesonly{
+\frametitle{Exploration vs. Exploitation}
+
+}
+\begin{figure}[ht]
+  %\centering
+  \begin{tabular}{c c}
+  exploration & exploitation\\
+  \includegraphics[height=3.0cm]{img/exploration.pdf} &
+  \includegraphics[height=3.0cm]{img/exploitation.pdf}\\
+  random search & greedy search/hill ``climbing''
+  \end{tabular}
+  \caption{exploration vs. exploitation}
+  \label{fig:exploration-exploitation}
+\end{figure}
+
+\end{frame}
+
+%\underline{Motivation 1:} We will look at how stochastic optimization can find a tradeoff between the two modes exploration and exploitation.
+
+\section{Discrete Optimization}
+
+\begin{frame}
+
+\slidesonly{
+\begin{block}{Supervised \& unsupervised learning $\rightarrow$ evaluation of cost function $E$}
+Find the arguments that optimize $E$.
+\begin{itemize}
+  \item real-valued arguments: gradient based techniques (e.g. ICA unmixing matrices)
+  \item discrete arguments: ??? (e.g. for cluster assignment)
+\end{itemize}
+\end{block}
+}
+\notesonly{
+So far we've been concerned with optimization problems with real-valued arguments. The free parameters form a continuous space. The directions of the principal components in PCA and the unmixing matrix we are trying to find in ICA are real-valued arguments to their respective problems.
+Gradient-based solutions are well suited for real-valued arguments, as we tune the weights to optimize the cost function.
+
+But what if the problem we are trying to optimize operates on discrete arguments? This is the case for a problem such as K-means clustering. K-means clustering involves finding arguments with which to assign an observation to one of multiple clusters. The cost function that measures the quality of the assignments is continuous, but the arguments we optimize over, which effectively assign each observation to one cluster instead of another, are discrete variables.
Below is an example of such an assignment:
+
+}
+
+\begin{figure}[ht]
+  \centering
+  \includegraphics[height=3.5cm]{img/clustering.pdf}
+  \caption{Clustering involves discrete-valued arguments.}
+  \label{fig:clustering}
+\end{figure}
+
+\end{frame}
+\newpage
+\begin{frame}
+\slidesonly{\frametitle{Formalization of the discrete optimization problem}}
+\notesonly{
+\textbf{Formalization of the discrete optimization problem:}
+}
+\begin{block}{Setting}
+\begin{itemize}
+  \item discrete variables $s_i, \ i = 1, \ldots, N\quad$ (e.g. $s_i \in \{+1, -1\}$ \notesonly{``binary units''} or $s_i \in \mathbb N$)
+  % $s_i \in \{1, 2, \dots 9 \} $
+  \item \indent short-hand notation: $\vec{s}$ (``state'') -- { $\{\vec{s}\}$ is called state space }
+  \item {cost function:} $E: \vec{s} \mapsto E_{(\vec{s})} \in \mathbb{R}$ -- { not restricted to learning problems}
+\end{itemize}
+\end{block}
+
+We will now focus on \textbf{minimization} problems.
+
+\begin{block}{Goal: find state $\vec{s}^*$, such that:}
+\begin{equation*}
+  E \eqexcl \min \qquad (\text{desirable global minimum for the cost})
+\end{equation*}
+Consequently,
+\begin{equation*}
+  \vec{s}^* := \argmin_{\vec{s}} E_{(\vec{s})}.
+\end{equation*}
+\end{block}
+\end{frame}
+
+\begin{frame}
+\frametitle{Efficient strategies for discrete optimization}
+\notesonly{
+We want to find the best configuration of a discrete system (e.g.\
+state of interacting binary units). Evaluating the full search space works only for
+very small problems ($\rightarrow 2^N$ possibilities for a problem with $N$
+binary units; the search space is the set of corners of a
+hypercube). A greedy strategy will often get trapped if there are
+multiple local minima.
+}
+
+\begin{itemize}
+\item Evolutionary and genetic algorithms (GA) have been
+  motivated by \emph{biological} phenomena of adaptation through
+  \emph{mutation} and \emph{selection}.
+\item GA and Monte-Carlo-Markov-Chain methods based on a \emph{proposal} and
+  \emph{acceptance} strategy ($\rightarrow$ Metropolis) can be interpreted
+  as ``learning via trial and error''. \emph{Simulated annealing} falls within MCMC.
+\end{itemize}
+\end{frame}
+
+\section{Simulated (stochastic) annealing}
+
+\begin{frame}
+
+\emph{Simulated annealing}\footnote{
+The ``annealing'' part in the name comes from metallurgy. As metals cool down, the movement of the atoms slows down and their kinetic energy decreases until the metal eventually hardens.
+}
+\notesonly{is a method for stochastic optimization with which we can find a tradeoff between exploitation and exploration. More importantly, it is designed to cope with the optimization of discrete-valued arguments. Simulated annealing is
+motivated by phenomena in physics related to the organization and
+alignment of microscopic components forming specific macroscopic
+configurations.
+}
+\slidesonly{\\$\leadsto$ exploitation and exploration tradeoff}
+\slidesonly{\\$\leadsto$ (more importantly) discrete optimization\\\vspace{0.5cm}}
+
+\underline{What does simulated annealing do?}
+
+In the case of \emph{minimization} it:
+
+\begin{enumerate}
+\item describes a strategy for switching between states
+\item increases the probability of landing in lower-energy states\footnote{This preference for lower-energy states is specific to \emph{minimization} problems.}
+\item allows for escaping \textbf{local} optima
+\end{enumerate}
+
+\end{frame}
+
+\begin{frame}
+\slidesonly{
+\frametitle{Simulated Annealing; the algorithm}
+
+Use \emph{inverse} temperature $\beta = 1/T$\\\vspace{0.5cm}
+}
+\notesonly{
+\textbf{Simulated Annealing, the algorithm}
+
+We will use the inverse temperature $\beta = 1/T$\\\vspace{1cm}
+}
+
+\texttt{initialization:} $\vec{s}_0, \, \beta_0$ small ($\leadsto$ high temperature) \\
+
+\texttt{BEGIN Annealing loop} ($t=1,2,\dots$)\\
+
+\oident$\vec{s}_t = \vec{s}_{t-1}$ {\tiny(initialization of inner loop)} \\
+
+\oident \texttt{BEGIN State update loop} ($M$ iterations)\\
+
+\begin{itemize}
+\item \texttt{choose a new candidate state} $\vec{s}$ randomly {(but ``close'' to $\vec{s}_t$)}\\
+
+\item \texttt{calculate difference in cost:}
+  \oident$\Delta E = E_{(\vec{s})}- E_{(\vec{s}_t)}$\\
+\item
+\texttt{switch} $\vec{s}_t$ \texttt{to} $\vec{s}$ \texttt{ with probability}
+  $\color{red}\mathrm{W}_{(\vec{s}_t \rightarrow \vec{s})} =
+  \frac{1}{1 + \exp( \beta_t \Delta E)}$
+\texttt{\underline{otherwise} keep the previous state} $\vec{s}_t$
+\end{itemize}
+\oident\texttt{END State update loop} \\
+
+\oident$\color{blue}\beta_t = \tau \beta_{t-1}\qquad $ {\footnotesize($\tau>1\implies$ increase of $\beta$)} \\ % by practical ``exponential'' rule -- but not optimal
+
+\texttt{END Annealing loop}
+
+\end{frame}
+
+\begin{frame} \frametitle{Transition probability}
+%  \vspace{-0.5cm}
+  \begin{center}\includegraphics[width=8cm]{img/section3_fig1}
+  \end{center}
+  \begin{center}
+  \vspace{-0.5cm}
+  $\mathrm{W}_{(\vec{s}_t \rightarrow \vec{s})} = \frac{1}{1 + \exp( \beta_t \Delta E)}, \quad \Delta E = E_{(\vec{s})}- E_{(\vec{s}_t)}$\\
+  \end{center}
+\end{frame}
+
+The transition probability $\mathrm{W}$ measures the probability of switching to the new state $\vec s$ or remaining at $\vec s_t$. It depends on (A) the difference in cost and (B) the inverse temperature. The difference $\Delta E$ is the only way of measuring whether we gain any improvement from the transition or not. We use the inverse temperature $\beta_t$ to control how strongly we favor such transitions, i.e.\ whether we prefer a more conservative system.
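+
+A minimal sketch of the annealing loop above (illustrative only; the bit-flip neighborhood and the toy cost function are made up for this example):
+
+\begin{verbatim}
+import math, random
+
+def simulated_annealing(E, s0, beta0=0.1, tau=1.1, n_anneal=50, M=500):
+    """Minimize E over bit-vector states via simulated annealing."""
+    s, beta = list(s0), beta0
+    for _ in range(n_anneal):                      # annealing loop
+        for _ in range(M):                         # state update loop
+            cand = list(s)
+            cand[random.randrange(len(cand))] ^= 1 # flip one bit ("close" state)
+            dE = E(cand) - E(s)
+            if random.random() < 1.0 / (1.0 + math.exp(beta * dE)):
+                s = cand                           # switch with probability W
+        beta *= tau                                # increase inverse temperature
+    return s
+
+# toy cost: number of bits disagreeing with a hidden target pattern
+target = [1, 0, 1, 1, 0, 0, 1, 0]
+E = lambda s: sum(si != ti for si, ti in zip(s, target))
+print(simulated_annealing(E, [0] * 8))   # very likely recovers the target
+\end{verbatim}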
The figure below illustrates how the transition probability can be modulated by the inverse temperature $\beta$ during the annealing process:
+
+\begin{frame} \frametitle{Modulating $\mathrm W$ during the annealing process}
+%  \textbf{ limiting cases for high vs.\ low temperature:}
+  \begin{center}
+  \includegraphics[width=10cm]{img/switchfuns}
+\vspace{5mm}
+  \oident\oident\begin{tabular}[h]{c c c}
+  low $\beta$ (high temperature) & intermediate $\beta$ & high $\beta$\\\\
+  \includegraphics[width=3cm]{img/section3_fig4}
+  & \hspace{-0.5cm}\includegraphics[width=3cm]{img/section3_fig5}
+  & \includegraphics[width=3cm]{img/section3_fig6}
+  \end{tabular}
+\vspace{5mm}
+  \end{center}
+  \vspace{-2cm}
+  \begin{tikzpicture}[scale=0.75]
+\draw[->] (0,0) -- (1,0);
+\draw[->] (0,0) -- (0,1.25);
+% \draw[lightgray, very thick] (-1,0.5) -- (.9,0.5);
+% \draw (0,0) node[anchor=north]{0};
+\draw (0,1.25) node[anchor=west]{$E$};
+\draw (1,0) node[anchor=west]{$\vec{s}$};
+% \foreach \y in {0.5,1} \draw (0,\y) node[anchor=south east] {$\y$};
+% \foreach \y in {0.5,1} \draw (-1pt,\y) -- (1pt,\y);
+\end{tikzpicture}
+
+\question{Which range of $\beta$ corresponds to \emph{exploration}/\emph{exploitation}?}\\
+
+\end{frame}
+
+In the case of:
+
+\begin{itemize}
+\item low $\beta \corresponds$ high temperature:\\
+The transition probability is nearly constant ($W=0.5$) regardless of $\Delta E$. It is equally probable to accept a transition or to remain at the same state. Therefore we can regard this as \emph{exploration} mode.
+\item intermediate $\beta$:\\
+Recall that we are currently considering a minimization problem. $\Delta E < 0$ whenever the new state is of lower cost than the previous one. We are no longer indifferent to this difference, but are more likely to accept transitions to lower-energy states. Our process is still stochastic; therefore it is not guaranteed that we will accept every transition that yields a lower cost. We may either reject the transition and remain at the same higher-cost state or accept a transition to a higher-cost state. Such transitions are less likely to happen, but are still possible. This is where stochastic optimization is able to escape a \textbf{local} optimum and resume the search elsewhere instead of opting for the greedy approach. To reiterate, this is a stochastic process: if we sample long enough, we will find that intermediate values of $\beta$ are more likely to yield samples from the ``global'' minimum of this curve than from the higher ``local'' minimum in this particular cost function.
+\item high $\beta \corresponds$ low temperature:\\
+This is the \emph{exploitation} mode. It reflects the behavior of a greedy learning algorithm such as gradient descent. Whenever we compare the cost of two successive states and find that $\Delta E < 0$, the new state is of lower cost and more desirable for our minimization objective. The transition probability in this case will almost always accept a transition to a lower-cost state. Consequently, it will almost always reject transitions to higher-cost states, and the probability of remaining in a high-cost state is also very low. If the next sample is of lower cost, we are very likely to accept that transition.
Looking again at the stochastic nature of our process: if we repeat the process multiple times and register how often each \emph{run} (not a single iteration) leads us to either minimum on this particular curve, we will find that finding the global minimum is just as likely as reaching the local minimum. This is because the initial state is the only decisive factor. High $\beta$ means we are in exploitation mode: we are only going to accept a transition if it lowers our cost. If we restrict our sampling to ``nearby'' states, we are emulating gradient descent; as soon as we have determined the direction of descent, we will follow it. The decisive factor is: where did we start? Since, in our case, the two valleys around the two minima are equally ``wide'', starting inside one is as likely as starting inside the other. If we are already in the vicinity of one minimum, the probability of transitioning \emph{outside} is very low -- and this holds regardless of whether we are in the vicinity of the global or the local minimum.
+\end{itemize}
+
+
+\begin{frame}\frametitle{Annealing schedule \& convergence}
+Convergence to the global optimum is guaranteed if $\beta_t \sim \ln (t)$\footnote{Geman and Geman show this in: Geman, S., and D. Geman (1984). Stochastic relaxation, Gibbs distribution, and the bayesian restoration of images. IEEE Trans. Pattern Anal Machine Intell. 6, 721-741. }
+
+\begin{itemize}
+  % \itR robust optimization procedure
+  \itR but: $\beta_t \sim \ln t$ is \textbf{too slow} for practical problems
+  \itR therefore: $\beta_{t+1} = \tau \beta_t, \quad \tau \in [1.01,1.30]$
+  (exponential annealing)
+  \itR additionally: the \texttt{State Update loop} has to be iterated often enough, e.g. $M=500-2000$. \slidesonly{$\leadsto$ thermal equilibrium}
+  \notesonly{This is required for reaching thermal equilibrium. Another way to look at it is the need to capture the stationary distribution of the states in relation to their cost. We will talk more about this when we discuss the Gibbs distribution.}
+\end{itemize}
+\end{frame}
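+
+To see why the logarithmic schedule is impractical, a quick illustrative computation (assuming, as an example, $\beta_0 = 0.1$): after $10^6$ steps, $\beta_t = \ln t$ has only reached $\approx 13.8$, a value the exponential schedule with $\tau = 1.05$ exceeds after roughly 100 steps.
+
+\begin{verbatim}
+import math
+
+print(math.log(1_000_000))           # ~13.8 after 10^6 steps (log schedule)
+beta, t = 0.1, 0
+while beta < math.log(1_000_000):    # exponential schedule, tau = 1.05
+    beta *= 1.05
+    t += 1
+print(t)                             # ~101 steps to reach the same beta
+\end{verbatim}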
+
+\section{The stationary distribution}
+
+\begin{frame}
+
+
+\slidesonly{
+  \begin{center}
+  \oident\oident\begin{tabular}[h]{c c c}
+  low $\beta$ (high temperature) & intermediate $\beta$ & high $\beta$\\\\
+  \includegraphics[width=3cm]{img/section3_fig4}
+  & \hspace{-0.5cm}\includegraphics[width=3cm]{img/section3_fig5}
+  & \includegraphics[width=3cm]{img/section3_fig6}
+  \end{tabular}
+\vspace{5mm}
+  \end{center}
+  \vspace{-2cm}
+  \begin{tikzpicture}[scale=0.75]
+\draw[->] (0,0) -- (1,0);
+\draw[->] (0,0) -- (0,1.25);
+% \draw[lightgray, very thick] (-1,0.5) -- (.9,0.5);
+% \draw (0,0) node[anchor=north]{0};
+\draw (0,1.25) node[anchor=west]{$E$};
+\draw (1,0) node[anchor=west]{$\vec{s}$};
+% \foreach \y in {0.5,1} \draw (0,\y) node[anchor=south east] {$\y$};
+% \foreach \y in {0.5,1} \draw (-1pt,\y) -- (1pt,\y);
+\end{tikzpicture}
+
+Missing a probability distribution across states.
+}
+
+\notesonly{
+As we discussed the effect of $\beta$ on the transition probability, we saw how this controls exploration and exploitation.
+By talking about the probability of descending vs. jumping to a completely different location, we also talked about the probability of landing inside the valley surrounding the global minimum vs. that surrounding a \emph{local} one. Therefore, it becomes necessary to define a measure of probability for each possible state that $\vec s$ can take.
+This measure needs to fulfill the following requirements:
+\begin{enumerate}
+\item Reflect the probability distribution across states
+\item For constant $\beta$ it should converge to the stationary distribution. That is, the probability of transitioning from some state $\vec s$ to $\vec s'$ should be equal to that of the reverse transition. This is what is meant by ``thermal equilibrium''.
+\end{enumerate}
+}
+\end{frame}
+
+
+
+\begin{frame}{Calculation of the stationary distribution}
+\question{How do we find this stationary distribution?}
+\pause
+\vspace{-0.5cm}
+\begin{equation*}
+  \underbrace{ \substack{ \text{probability of} \\
+      \text{transition }
+      \vec{s} \rightarrow \vec{s}^{'}} }_{
+    P_{(\vec{s})} \mathrm{W}_{(\vec{s} \rightarrow
+      \vec{s}^{'})} }
+  =
+  \underbrace{ \substack{ \text{probability of} \\
+      \text{transition }
+      \vec{s}^{'} \rightarrow \vec{s} } }_{
+    P_{(\vec{s}^{'})} \mathrm{W}_{(\vec{s}^{'} \rightarrow
+      \vec{s})} }
+\end{equation*}
+With $\Delta E = E_{(\vec{s}^{'})} - E_{(\vec{s})}$ as in the algorithm:
+\begin{equation*}
+  \begin{array}{ll}
+  \frac{P_{(\vec{s})}}{P_{(\vec{s}^{'})}}
+  & = \frac{\mathrm{W}_{(\vec{s}^{'} \rightarrow \vec{s})}}{
+      \mathrm{W}_{(\vec{s} \rightarrow \vec{s}^{'})}}
+  = \frac{1 + \exp\big\{ \beta \big( E_{(\vec{s}^{'})} - E_{(\vec{s})}
+      \big) \big\} }{1 + \exp\big\{ \beta \big( E_{(\vec{s})} -
+      E_{(\vec{s}^{'})}\big) \big\} }
+  = \frac{1 + \exp( \beta \Delta E)}{1 + \exp( -\beta \Delta E)} \\\\
+  \pause
+  & = \exp( \beta \Delta E) \frac{1 + \exp( -\beta \Delta E)}{
+      1 + \exp( -\beta \Delta E) }
+  = \exp( \beta \Delta E ) \\\\
+  \pause
+  & = \exp\big\{ \beta \big( E_{(\vec{s}^{'})} - E_{(\vec{s})}\big) \big\}
+  = \exp\big\{ \beta E_{(\vec{s}^{'})} - \beta E_{(\vec{s})} \big\}\\\\
+  &= \frac{\exp\left( -\beta E_{(\vec{s})}\right)}{\exp\left( -\beta E_{(\vec{s}^{'})} \right)}
+  \slidesonly{
+  \qquad \small \text{condition is fulfilled by the Gibbs distribution.}
+  }
+  \end{array}
+\end{equation*}
+\notesonly{
+The condition is fulfilled by the \emph{Gibbs-Boltzmann} distribution:
+}
+
+\end{frame}
+\begin{frame}\frametitle{The Gibbs distribution}
+\notesonly{
+The Gibbs (or Boltzmann) distribution from statistical physics gives the
+probability of a system being in state $\vec s$ with energy $E_{(\vec s)}$ as
+}
+\begin{equation} \label{eq:gibbs}
+P_{(\vec{s})} := \frac{1}{Z} \exp \Big(-\frac{E_{(\vec s)}}{k_b T}\Big)
+= \frac{1}{Z} \exp \Big(-\beta E_{(\vec s)} \Big)
+\end{equation}
+
+where the normalization constant / partition function $Z$ is defined as:
+
+
+\begin{equation} \label{eq:partition}
+Z := \sum\limits_{\vec{s}} \exp \Big(-\frac{E_{(\vec s)}}{k_b T}\Big) = \sum\limits_{\vec{s}} \exp(-\beta E_{(\vec s)})
+\end{equation}
+
+\notesonly{
+The partition function $Z$ ensures that $P_{(\vec{s})}$ is a probability
+measure, and the Boltzmann constant $k_b = 1.38 \cdot 10^{-23} J/K$
+sets the scale of the \emph{temperature} $T$. This means the
+probability of observing a state is fully determined by its energy.
+
+%For sampling algorithms one often constructs a Markov chain whose
+%stationary distribution is a Gibbs-distribution for a specified cost
+%(or energy-) function.
+}
+
+\end{frame}
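+
+A small numerical illustration (with made-up energies, not part of the original material) of how $\beta$ shapes the Gibbs distribution -- previewing the next frame:
+
+\begin{verbatim}
+import math
+
+E = [1.0, 0.2, 0.9, 0.0, 0.6]        # made-up energies of five states
+
+def gibbs(beta):
+    w = [math.exp(-beta * e) for e in E]
+    Z = sum(w)                       # partition function
+    return [wi / Z for wi in w]
+
+for beta in (0.1, 1.0, 10.0):
+    print(beta, [round(p, 3) for p in gibbs(beta)])
+# beta = 0.1 : nearly uniform ("delocalized")
+# beta = 10  : most of the mass sits on the E = 0 state
+\end{verbatim}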
+
+\begin{frame}\frametitle{Cost vs. probability distribution}
+\question{How does $\beta$ modulate $P_{(\vec s)}$?}
+
+$E_{(\vec{s})}$
+\vspace{-0.2cm}
+\begin{figure}[h]
+  \centering
+\includegraphics[width=12cm]{img/section3_fig7}
+\[ \begin{array}{ll}
+  \beta \downarrow:\pause
+  & \text{broad, ``delocalized'' distribution} \\\vspace{-0.3cm}\\
+  \beta \uparrow:\pause
+  & \text{distribution localized around (global) minima}
+\end{array} \]
+\end{figure}
+
+
+\end{frame}
+
+\newpage
+
+
+We will now further formalize what is going on with simulated annealing.
+
+\section{Markov Chain Monte Carlo (MCMC)}
+
+
+\begin{frame}
+
+
+
+Whenever it is difficult to evaluate the joint distribution,\\
+opt for ``learning through trial and error''.\\
+
+\notesonly{
+MCMC methods are a popular
+strategy in situations where it is hard to evaluate the joint (e.g.\
+posterior) distribution of a random variable analytically, but where it is easy to
+sample from the conditional distribution (and finally the joint
+distribution). Using samples from such a posterior distribution, one
+can estimate many summaries of interest without explicitly knowing the
+joint distribution.
+}
+
+\slidesonly{
+\vspace{1cm}
+What is:
+
+\begin{itemize}
+\item a Markov Chain
+\item the Markov property
+\item the stationary distribution
+\item Monte Carlo
+\item the Metropolis algorithm
+\item Metropolis sampling (Monte Carlo + Markov Chain) - simulated annealing
+\item deterministic annealing (variational approach) - for further reading
+\end{itemize}
+
+The following is for providing a framework around stochastic optimization.
+}
+
+\end{frame}
+\begin{frame} \frametitle{Markov chain and the Markov property}
+Consider a family of random variables $X_t$, $t=1,2,\ldots$, which describe a stochastic process. $x_t$ denotes the state at time $t$.
+
+\notesonly{
+A sequence of such variables is referred to as a \emph{Markov chain} whenever $X_{t+1}$ depends only on its predecessor $X_t$ and is \emph{independent} of all values previous to that:
+}
+
+\begin{align}
+\label{eq:markovprop}
+P(X_{t+1} &= x_{t+1} | X_{t} = x_{t}, X_{t-1} = x_{t-1}, \ldots, X_{0} = x_{0})\\
+&= P(X_{t+1} = x_{t+1} | X_{t} = x_{t})
+\end{align}
+
+\notesonly{\eqref{eq:markovprop}}
+\slidesonly{This} is referred to as the \emph{Markov property}.
+\notesonly{The conditional distribution of $X_t$ depends only on its
+predecessor. (In the more general case of Markov Random Fields, the
+Markov property induces a ``local neighborhood''/set of \emph{parents}
+which fully determines the conditional distribution.)}
+The transition probabilities between subsequent states
+$$
+P(X_{t+1}=j|X_{t}=i)=p_{ij}
+$$
+can be described via the stochastic matrix $M = \{m_{ij}\}_{ij}$ with
+$m_{ij}=p_{ij}$.
+\\
+
+\pause
+
+\slidesonly{Exactly what $W$ in simulated annealing describes for the states $\vec s$:
+}
+\notesonly{
+The transition probability $W$ we just encountered in simulated annealing with states $\vec s$ describes exactly this:
+}
+\begin{align}
+\label{eq:markovproptransition}
+W_{s_i \rightarrow s_j} = P(X_{t+1} = s_j | X_{t} = s_i)
+\end{align}
+
+\begin{center}
+s.t. $W_{s_i \rightarrow s_j} > 0 \;\;\forall (i,j)\quad$
+and $\quad\sum_j W_{s_i \rightarrow s_j} = 1 \;\;\forall i$.
+\end{center}
+\end{frame}
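+
+A small numerical illustration (with a made-up transition matrix): iterating the chain's distribution converges to a distribution $\pi$ that is unchanged by further transitions -- the stationary distribution introduced on the next frame:
+
+\begin{verbatim}
+# rows i: current state, columns j: next state (row-stochastic, made up)
+M = [[0.9, 0.1],
+     [0.4, 0.6]]
+
+p = [1.0, 0.0]                       # initial distribution
+for _ in range(100):
+    p = [sum(p[i] * M[i][j] for i in range(2)) for j in range(2)]
+print(p)                             # converges to pi = [0.8, 0.2]
+\end{verbatim}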
+
+\begin{frame}\frametitle{Stationary distribution:}
+\label{sec:stat-distr}
+The \emph{stationary distribution} of a homogeneous Markov chain is a
+distribution $\pi=[\pi_1,\pi_2,..., \pi_N]$ (a row vector) such that
+\begin{equation}
+\pi M = \pi
+\end{equation}
+where, using $p(x_{t}=i) = \pi_i$ at stationarity,
+\begin{align}
+p(x_{t+1}=j) &= \sum_i p(x_{t+1}=j,x_{t}=i) \\
+&= \sum_i p(x_{t+1}=j|x_{t}=i)\,p(x_{t}=i)\\
+&= \sum_i m_{ij} \, \pi_i = (\pi M)_j
+\end{align}
+and can be derived from the eigen-decomposition of the transition matrix
+(given certain conditions on the Markov chain $\rightarrow$ irreducibility, recurrence etc.)
+
+If the Markov chain is \emph{reversible}, the stationary distribution is
+characterized by a \emph{detailed balance} between going from one
+state to another one, i.e.\
+$$
+\pi_i p_{ij} = \pi_j p_{ji}
+$$
+\end{frame}
+
+\begin{frame}\frametitle{Monte Carlo}
+\notesonly{
+Monte Carlo methods
+have been used to } evaluate certain integrals with stochastic methods.\\
+
+Example:\\
+Estimate the number $\pi$ by sampling points uniformly from the unit square and counting how
+many of them have distance smaller than 1 from the origin. These $N_\mathrm{A}$ points fall into a
+quarter circle of area $A = \pi r^2/4$ with $r = 1$, so
+$4 A/F_\mathrm{square} = 4 N_\mathrm{A}/N \approx \pi$.
+\\
+\end{frame}
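+
+A minimal sketch of this estimate:
+
+\begin{verbatim}
+import random
+
+random.seed(0)
+N = 100_000
+n_inside = sum(random.random()**2 + random.random()**2 < 1.0
+               for _ in range(N))    # points inside the quarter circle
+print(4 * n_inside / N)              # approximately 3.14
+\end{verbatim}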
\begin{frame}
\slidesonly{
\frametitle{Monte Carlo}
Evaluating probability of states (e.g.\ to compute integrals, expectation values
etc.) is difficult.
Easier to sample from conditional distributions.\\

Approaches:
\begin{enumerate}
\item Gibbs sampler:\\
  Sample from conditional distributions to produce a Markov chain with the joint
posterior density as its stationary distribution.
\item Metropolis algorithm:\\
  sampling strategy with which to accept or reject transitions
\end{enumerate}
}
\end{frame}

While it can be difficult to evaluate the
probability of states (e.g.\ to compute integrals, expectation values
etc.), it is possible to sample from the joint distribution by
sequential sampling from conditional distributions that are easier to
compute. There are two approaches: the \emph{Gibbs sampler} and
\emph{Metropolis} type algorithms (note: this material is taken nearly
verbatim from the roundtable discussion in Kass et al., 1997).

\subsection{Gibbs sampler:}
\label{sec:gibbs-sampler}

The Gibbs sampler \citep{GemanGeman1984} samples from the collection
of full (or complete) conditional distributions and it does, under
fairly broad conditions, produce a Markov chain with the joint
posterior density as its stationary distribution. A tutorial can be
found in \citep{CasellaGeorge1992}.

\newpage
\begin{frame}\frametitle{Metropolis algorithm}

\notesonly{
When it is difficult to directly sample from the conditionals, one may
instead simulate from a different Markov chain with a different
stationary distribution, and then modify it in such a way that a
new Markov chain is constructed which has the targeted posterior as its
stationary distribution. This is done by the Metropolis--Hastings
algorithm: it samples from a prespecified candidate (proposal)
distribution and subsequently uses an accept--reject step (see Kass
et al., 1997).
}

The \emph{Metropolis algorithm} describes the sampling strategy with which to accept or reject transitions, given a \emph{symmetric} proposal distribution (the proposed state is similar/``close'' to the previous state, e.g.\ a single bit flip):
\begin{equation}
P(Y_n = x_j | X_{n} = x_i) = P(Y_n = x_i | X_{n} = x_j)
\end{equation}

IF $\Delta E < 0$ (minimization) then \\

\oident accept transition \\

ELSE \\

\oident sample $\varepsilon$ from $\mathcal{U} \lbrack 0, 1 \rbrack$ \\

\oident IF $\varepsilon < \exp \left( - \beta \Delta E \right)$ then \\

\oident\oident accept transition \\

\oident ELSE \\

\oident\oident reject transition and remain at the same state \\

\vspace{0.5cm}
Simulated annealing follows the Metropolis algorithm.

\end{frame}
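A minimal sketch of a single Metropolis chain at fixed $\beta$ (our own; the energy function and the bit-flip proposal are illustrative assumptions, not part of the lecture material). Simulated annealing would wrap this loop in a schedule that gradually increases $\beta$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def energy(s):
    # hypothetical energy: neighbouring units prefer to agree
    return -np.sum(s[:-1] * s[1:])

s = rng.choice([-1, 1], size=20)   # initial state vector
beta = 1.0
for _ in range(10_000):
    j = rng.integers(len(s))       # propose: flip one unit (symmetric)
    s_new = s.copy(); s_new[j] *= -1
    dE = energy(s_new) - energy(s)
    if dE < 0 or rng.random() < np.exp(-beta * dE):
        s = s_new                  # accept transition
    # else: reject and remain in the current state
\end{verbatim}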
\begin{frame}\frametitle{Metropolis sampling}
\label{sec:metropolis-sampling}
The MCMC strategy of Metropolis sampling uses a 2-step approach
\begin{enumerate}
\item Markov Chain $\rightarrow$ Proposal Density
\item Monte Carlo $\rightarrow$ Acceptance Probability
\end{enumerate}

\textbf{Proposal Distribution:}
The proposal density determines which nodes to \emph{poll} (``select
and test'') next and needs to fulfil the following properties:
\begin{itemize}
\item nonnegative: $\tau_{ij} \geq 0$
\item normalized: $\sum_j \tau_{ij} =1$
\item symmetric: $\tau_{ij}=\tau_{ji}$
\end{itemize}

\end{frame}

\newpage

\subsection{Acceptance Probability:}
This probability specifies how likely it is to go from state $i$ to
$j$. In our scenario, it depends only on the energy levels of the two states (and
the current value of the temperature).

\subsection{Example:} Using the difference in energy levels
$\Delta_{ij} = E_j -E_i$, the sigmoidal acceptance rule is given as
$$
p_\mathrm{switch}(i \rightarrow j) = p_{ij} = \frac{1}{1+e^{\Delta_{ij}}} = \frac{1}{1+e^{(E_j -E_i)}}.
$$
From the assumption of detailed balance then follows
\begin{eqnarray*}
  \pi_i p_{ij} & = & \pi_j p_{ji}\\
  \frac{\pi_i}{\pi_j}& = & \frac{p_{ji}}{ p_{ij}}
= \frac{1+\exp(\Delta_{ij})}{1+\exp(\Delta_{ji})}
= \frac{1+\exp(\Delta_{ij})}{1+\exp(-\Delta_{ij})}\\
& = & \exp(\Delta_{ij}) \frac{1+\exp(\Delta_{ij})}{\exp(\Delta_{ij})+1}
= \exp(\Delta_{ij})\\
&= &\exp(E_j-E_i) = \frac{\exp(-E_i)/Z}{\exp(-E_j)/Z}
\end{eqnarray*}
demonstrating that
$\pi_i = p(X_i)= \frac{1}{Z} \exp \big(-\frac{E_i}{k_B T}\big)$
is a consistent choice.
\\

\begin{itemize}
\item Determination of the \emph{transition probability} $p(X_{t+1}|X_{t})$
  requires only knowledge of $E_i - E_j$. Applying the Metropolis
  algorithm with these parameters will lead to a stationary
  distribution $\pi$ with $\pi_i= p(x_i)$.
\item This means we can sample from the distribution $p_\beta(x)$
  without knowing $Z=\sum_{i \in I} \exp(-\beta E_i)$, which is
  difficult to estimate for large state spaces $I$, e.g.\ of size $2^N$.
\item Acceptance depends only on the \emph{energy difference} between
  the current and the suggested state. This difference depends only on the
  \emph{neighborhood} of the polled unit and not necessarily on the full
  system!
\end{itemize}

\question{How does this all tie back to Simulated Annealing?}

Making use of the temperature parameter (and an annealing schedule),
Metropolis sampling can be used to optimize cost functions.

\textbf{Motivation:} At each temperature, the MCMC relaxes into the
stationary distribution. For lower and lower temperatures (higher $\beta$), the entropy
of this distribution becomes lower and lower, more ``peaked'', i.e.\
only the \emph{most} probable states have nonzero probability.
The probability of states that initially had moderate or low
probability is pulled towards zero.
\textbf{Also}: the average
energy becomes lower and lower. This motivates simulated annealing as
an optimization approach.

\begin{itemize}
\item For each temperature, relax towards the stationary distribution
\item gradually decrease the temperature
\item[$\leadsto$] observed states represent approximately ``optimal solutions''
\end{itemize}
If the shape of the distribution changes smoothly and cooling is ``slow
enough'', this will result in a distribution concentrated on the
minimum-energy states, i.e.\ it will end up in an optimal solution with
probability 1.

\subsection{Simulated (stochastic) annealing (revisited)} \label{sec:stochastic-annealing}
``classical'' MCMC approach: select a variable, stochastically accept the
change in state for that variable.

Depending on the distribution, sampling might be the only feasible
approach. There are different strategies: Gibbs sampling when direct
sampling from the conditional distributions is possible, and
proposal--reject (Metropolis) schemes otherwise.


\textbf{Caveat:} Sampling takes a lot of time and depends on the
annealing schedule. In some cases, deterministic strategies can make
use of efficient approximations of the true distribution and thereby
significantly speed up the optimization process.


\section{Deterministic annealing} \label{sec:determ-anne}

\begin{frame}

Mean field methods provide a specific type of variational approach for
optimization problems (\cite{Bishop2006,Murphy2012}). This is useful
more generally for density estimation ($\rightarrow$ Variational
Bayes), where it often provides a more efficient option than sampling
techniques if the posterior distribution can be well approximated by a
factorizing distribution. The following description of the variational
approach closely follows \citep[ch. 21.3]{Murphy2012}.

\textbf{more about this in the notes.}\\
\textbf{switch to mean field lecture slides}

\end{frame}

\begin{eqnarray*}
  \label{eq:1}
  L(q_i) & = & \sum_x \prod_i q_i(x_i) \Big( \log \tilde{p}(x) - \sum_k \log q_k(x_k) \Big) \\
  & = & \sum_{x_j} \sum_{x_{-j}} q_j (x_j) \prod_{i \neq j} q_i(x_i) \Big( \log \tilde{p}(x) - \sum_k \log q_k(x_k) \Big) \\
  & = & \sum_{x_j} q_j(x_j) \sum_{x_{-j}} \prod_{i \neq j} q_i(x_i) \log \tilde{p}(x) \\
& & - \sum_{x_j} q_j(x_j) \sum_{x_{-j}} \prod_{i \neq j} q_i(x_i) \Big(\log q_j(x_j)+ \sum_{k \neq j} \log q_k(x_k) \Big) \\
& = & \sum_{x_j} q_j(x_j) \log f_j(x_j) - \sum_{x_j} q_j(x_j) \log q_j(x_j) + \mathrm{const}
\end{eqnarray*}
where we introduced the definition
$$
\log f_j(x_j):= \sum_{x_{-j}} \prod_{i \neq j} q_i(x_i) \log
\tilde{p}(x).
$$
Although $f_j$ is not a proper distribution (not normalized), the last
term can be interpreted as a KL-divergence
$$
L(q_j) \sim - \dkl(q_j||f_j).
$$
Therefore, $L(q) = - \dkl(q||\tilde{p})$ can be maximized by
minimizing $\dkl(q_j||f_j)$, i.e.\ setting $q_j = f_j$ via
$$
q_j(x_j) = \frac{1}{Z_j} \exp\big(\E_{-q_j}[\log \tilde{p}(x)]\big)
$$
Ignoring the normalization constant, we get as the optimal component $q_j$
\begin{equation}
  \label{eq:MeanField}
\log q_j(x_j) = \E_{-q_j}[\log \tilde{p}(x)]
\end{equation}

\subsection{Alternative motivation of the mean-field annealing
  approach} \label{sec:motivation} At each step we actually know
$p_{\Delta E}(\mathrm{flip})$, which enables us to estimate
$E[X_i|X_{-i}]$, i.e.\ the average value of $X_i$ given its parents.

Mean field annealing generalizes this approach by making use of the fact that
the marginal distributions of the $X_i$ are known in this way, and then using the
following relation: with $P(v_j) = \frac{1}{1+e^{-v_j/T}}$ we can write
$$
\langle x_j \rangle = (+1) P(v_j) +(-1)(1-P(v_j)) = 2 P(v_j) -1 = \tanh(v_j/2T)
$$
Although $v_j$ is not known exactly (it depends on the exact states of the
other $X_i$ and their interactions), we can approximate it as
$$
v_j \approx \langle v_j \rangle = \Big\langle \sum_i w_{ji} x_i \Big\rangle =
  \sum_i w_{ji} \langle x_i \rangle
$$
where $\langle v_j \rangle$ is called the ``mean field'' and gives the average value of $x_j$ as
$$\langle x_j \rangle \approx \tanh\frac{\langle v_j\rangle}{2T}$$

\textbf{Relation between the sigmoidal and tanh:} Using the sigmoidal acceptance probability is equivalent to using $\tanh$ for \emph{states} $x \in \{-1,+1\}$ of the RV, i.e.\ $\tanh(z) \in (-1,1)$ for $z \in (-\infty,+\infty)$.

\begin{equation}
\tanh(x) = \quad\frac{e^{2x}-1}{e^{2x}+1}
  \quad = \frac{1-e^{-2x}}{1+e^{-2x}} \quad = \frac{2-(1+e^{-2x})}{e^{-2x}+1} \quad = \frac{2}{1+e^{-2x}} -1
\end{equation}
or the other way round:
$$
\frac{1}{1+e^{-x}} = \frac{1}{2}\left(1+\tanh(\frac{1}{2}x)\right)
$$
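As an illustration, a minimal sketch (ours; the couplings $w_{ji}$ and the temperature $T$ are made up) of the resulting self-consistent fixed-point iteration: each mean activity is repeatedly set to the tanh of its mean field.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, T = 10, 0.5
W = rng.normal(size=(n, n)); W = (W + W.T) / 2  # symmetric couplings
np.fill_diagonal(W, 0.0)                        # no self-coupling

m = np.full(n, 1e-3)              # mean activities <x_j>, near zero
for _ in range(200):
    v = W @ m                     # mean fields <v_j> = sum_i w_ji <x_i>
    m = np.tanh(v / (2 * T))      # self-consistency update
print(m)
\end{verbatim}

In deterministic annealing one would repeat this relaxation while gradually lowering $T$.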
\section{Boltzmann machines}
\label{sec:boltzmann-machines}

The approach can be generalized to include hidden units; details on Boltzmann
machines can be found in \cite{AartsKorst1990}. Restricted BMs (RBMs) have no connections among the observable units or among the hidden units, which simplifies the computation of the conditional probabilities necessary for learning considerably. RBMs were developed (among others) by P. Smolensky, who called them ``Harmoniums'' (\cite{Smolensky1986}).
\begin{itemize}
\item general model of distributions
\item parameters ($W$) can be learned from samples
\item works for systems with non-observable (latent) variables
\item interesting model of cognitive processes
\item structured Boltzmann machines as efficient pattern recognition systems
\end{itemize}

diff --git a/notes/07_stochopt/Makefile b/notes/07_stochopt/Makefile
new file mode 100644
index 0000000..2c4af1b
--- /dev/null
+++ b/notes/07_stochopt/Makefile
@@ -0,0 +1,40 @@
all: slides notes clean
#all: handout

projname = tutorial
targetname = $(projname)_$(shell basename $(CURDIR))
compile = pdflatex
projnameS = $(projname).slides
projnameH = $(projname).handout
projnameA = $(projname).notes

slides: $(projname).slides.tex $(projname).tex
	$(compile) $(projname).slides.tex
# bibtex $(projname).slides
# $(compile) --interaction=batchmode $(projname).slides.tex
# $(compile) --interaction=batchmode $(projname).slides.tex
	mv $(projname).slides.pdf $(targetname).slides.pdf

handout: $(projname).handout.tex $(projname).tex
	$(compile) $(projname).handout.tex
	mv $(projname).handout.pdf $(targetname).handout.pdf

# Repeat compilation for the references to show up correctly
notes: $(projname).notes.tex $(projname).tex
	$(compile) $(projname).notes.tex
# bibtex $(projname).notes
# $(compile) --interaction=batchmode $(projname).notes.tex
# $(compile) --interaction=batchmode $(projname).notes.tex
	mv $(projname).notes.pdf $(targetname).notes.pdf

clean: cleans cleanh cleana

cleans:
	rm -f $(projnameS).aux $(projnameS).bbl $(projnameS).log $(projnameS).out $(projnameS).toc $(projnameS).lof $(projnameS).glo $(projnameS).glsdefs $(projnameS).idx $(projnameS).ilg $(projnameS).ind $(projnameS).loa $(projnameS).lot $(projnameS).loe $(projnameS).snm $(projnameS).nav

cleanh:
	rm
-f $(projnameH).aux $(projnameH).bbl $(projnameH).log $(projnameH).out $(projnameH).toc $(projnameH).lof $(projnameH).glo $(projnameH).glsdefs $(projnameH).idx $(projnameH).ilg $(projnameH).ind $(projnameH).loa $(projnameH).lot $(projnameH).loe $(projnameH).snm $(projnameH).nav + +cleana: + rm -f $(projnameA).aux $(projnameA).bbl $(projnameA).log $(projnameA).out $(projnameA).toc $(projnameA).lof $(projnameA).glo $(projnameA).glsdefs $(projnameA).idx $(projnameA).ilg $(projnameA).ind $(projnameA).loa $(projnameA).lot $(projnameA).loe $(projnameA).snm $(projnameA).nav + diff --git a/notes/07_stochopt/beamercolorthemetub.sty b/notes/07_stochopt/beamercolorthemetub.sty new file mode 100644 index 0000000..c41d22a --- /dev/null +++ b/notes/07_stochopt/beamercolorthemetub.sty @@ -0,0 +1,48 @@ +% Copyright 2004 by Madhusudan Singh +% +% This file may be distributed and/or modified +% +% 1. under the LaTeX Project Public License and/or +% 2. under the GNU Public License. +% +% See the file doc/licenses/LICENSE for more details. + +%\ProvidesPackageRCS $Header: beamercolorthemetub.sty, v a01 2011/11/18 09:11:41 tujl $ + +\mode + +\definecolor{darkred}{rgb}{0.8,0,0} + +\setbeamercolor{section in toc}{fg=black,bg=white} +\setbeamercolor{alerted text}{fg=darkred!80!gray} + +\setbeamercolor*{palette primary}{fg=darkred!60!black,bg=gray!30!white} +\setbeamercolor*{palette secondary}{fg=darkred!70!black,bg=gray!15!white} +\setbeamercolor*{palette tertiary}{bg=darkred!80!black,fg=gray!10!white} +\setbeamercolor*{palette quaternary}{fg=darkred,bg=gray!5!white} + +\setbeamercolor*{sidebar}{fg=darkred,bg=gray!15!white} + +\setbeamercolor*{palette sidebar primary}{fg=darkred!15!black} +\setbeamercolor*{palette sidebar secondary}{fg=white} +\setbeamercolor*{palette sidebar tertiary}{fg=darkred!50!black} +\setbeamercolor*{palette sidebar quaternary}{fg=gray!15!white} + +%\setbeamercolor*{titlelike}{parent=palette primary} +\setbeamercolor{titlelike}{parent=palette primary,fg=darkred} +\setbeamercolor{frametitle}{bg=gray!15!white} +\setbeamercolor{frametitle right}{bg=gray!60!white} + +%\setbeamercolor{Beispiel title}{bg=white,fg=black} + +\setbeamercolor*{separation line}{} +\setbeamercolor*{fine separation line}{} + +%\setbeamercolor{itemize item}{fg=darkred,bg=white} +%\setbeamercolor{itemize subitem}{fg=darkred!60!white,bg=white} +%\setbeamercolor{local structure}{fg=darkred,bg=white} +\setbeamercolor{local structure}{fg=gray,bg=white} +\setbeamercolor{structure}{fg=darkred!80!black,bg=white} +\setbeamercolor{block title}{bg=gray!10!white} +\mode + diff --git a/notes/07_stochopt/beamerthemeTUBerlin.sty b/notes/07_stochopt/beamerthemeTUBerlin.sty new file mode 100644 index 0000000..1ce3fd7 --- /dev/null +++ b/notes/07_stochopt/beamerthemeTUBerlin.sty @@ -0,0 +1,22 @@ +% Copyright 2004 by Madhusudan Singh +% +% This file may be distributed and/or modified +% +% 1. under the LaTeX Project Public License and/or +% 2. under the GNU Public License. +% +% See the file doc/licenses/LICENSE for more details. 
+ +%\ProvidesPackageRCS $Header: beamerthemeTUBerlin.sty, v a01 2011/11/18 09:11:41 tujl $ +\mode + +\useinnertheme[shadow=true]{rounded} +\useoutertheme{infolines} +\usecolortheme{tub} + +\setbeamerfont{frametitle}{size=\normalsize} +\setbeamerfont{block title}{size={}} +%\setbeamerfont{structure}{series=\bfseries} +\setbeamercolor{titlelike}{parent=structure,bg=white} +\mode + diff --git a/notes/07_stochopt/bibliography.bib b/notes/07_stochopt/bibliography.bib new file mode 100644 index 0000000..948691f --- /dev/null +++ b/notes/07_stochopt/bibliography.bib @@ -0,0 +1,29 @@ +@book{sutton1998introduction, + title={Introduction to reinforcement learning}, + author={Sutton, Richard S and Barto, Andrew G and others}, + volume={135}, + year={1998}, + publisher={MIT press Cambridge} +} +@Book{Bertsekas07, + author = {D. P. Bertsekas}, + title = {Dynamic Programming and Optimal Control}, + publisher ={Athena Scientific}, + year = {2007}, + volume = {2}, + edition = {3rd}, + url = {http://www.control.ece.ntua.gr/UndergraduateCourses/ProxTexnSAE/Bertsekas.pdf} +} +@Article{Watkins92, + author = {C. Watkins and P. Dayan}, + title = {Q-learning}, + journal = {Machine Learning}, + year = {1992}, + OPTkey = {}, + volume = {8}, + OPTnumber = {}, + pages = {279--292}, + OPTmonth = {}, + OPTnote = {}, + OPTannote = {} +} diff --git a/notes/07_stochopt/img/Markov_chain.pdf b/notes/07_stochopt/img/Markov_chain.pdf new file mode 100644 index 0000000..f1cee7e Binary files /dev/null and b/notes/07_stochopt/img/Markov_chain.pdf differ diff --git a/notes/07_stochopt/img/clustering.pdf b/notes/07_stochopt/img/clustering.pdf new file mode 100644 index 0000000..4e3bd7d Binary files /dev/null and b/notes/07_stochopt/img/clustering.pdf differ diff --git a/notes/07_stochopt/img/exploitation.pdf b/notes/07_stochopt/img/exploitation.pdf new file mode 100644 index 0000000..8b4321e Binary files /dev/null and b/notes/07_stochopt/img/exploitation.pdf differ diff --git a/notes/07_stochopt/img/exploitation.svg b/notes/07_stochopt/img/exploitation.svg new file mode 100644 index 0000000..bff7c89 --- /dev/null +++ b/notes/07_stochopt/img/exploitation.svg @@ -0,0 +1,122 @@ + +image/svg+xml1 +2 +3 +4 + \ No newline at end of file diff --git a/notes/07_stochopt/img/exploration.pdf b/notes/07_stochopt/img/exploration.pdf new file mode 100644 index 0000000..7324a28 Binary files /dev/null and b/notes/07_stochopt/img/exploration.pdf differ diff --git a/notes/07_stochopt/img/gradient-descent.pdf b/notes/07_stochopt/img/gradient-descent.pdf new file mode 100644 index 0000000..e033a1c Binary files /dev/null and b/notes/07_stochopt/img/gradient-descent.pdf differ diff --git a/notes/07_stochopt/img/gradient-descent.svg b/notes/07_stochopt/img/gradient-descent.svg new file mode 100644 index 0000000..6eb5a67 --- /dev/null +++ b/notes/07_stochopt/img/gradient-descent.svg @@ -0,0 +1,369 @@ + + + +image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/notes/07_stochopt/img/gradient-descent_local.pdf b/notes/07_stochopt/img/gradient-descent_local.pdf new file mode 100644 index 0000000..926a24c Binary files /dev/null and b/notes/07_stochopt/img/gradient-descent_local.pdf differ diff --git a/notes/07_stochopt/img/section3_fig1.pdf b/notes/07_stochopt/img/section3_fig1.pdf new file mode 100644 index 0000000..8e89b96 Binary files /dev/null and b/notes/07_stochopt/img/section3_fig1.pdf differ diff --git 
a/notes/07_stochopt/img/section3_fig4.pdf b/notes/07_stochopt/img/section3_fig4.pdf new file mode 100644 index 0000000..cef216a Binary files /dev/null and b/notes/07_stochopt/img/section3_fig4.pdf differ diff --git a/notes/07_stochopt/img/section3_fig5.pdf b/notes/07_stochopt/img/section3_fig5.pdf new file mode 100644 index 0000000..b8ac802 Binary files /dev/null and b/notes/07_stochopt/img/section3_fig5.pdf differ diff --git a/notes/07_stochopt/img/section3_fig6.pdf b/notes/07_stochopt/img/section3_fig6.pdf new file mode 100644 index 0000000..025ed6b Binary files /dev/null and b/notes/07_stochopt/img/section3_fig6.pdf differ diff --git a/notes/07_stochopt/img/section3_fig7.pdf b/notes/07_stochopt/img/section3_fig7.pdf new file mode 100644 index 0000000..3cbe81c Binary files /dev/null and b/notes/07_stochopt/img/section3_fig7.pdf differ diff --git a/notes/07_stochopt/img/switchfuns.pdf b/notes/07_stochopt/img/switchfuns.pdf new file mode 100644 index 0000000..9ec93b5 Binary files /dev/null and b/notes/07_stochopt/img/switchfuns.pdf differ diff --git a/notes/07_stochopt/img/switchfuns.svg b/notes/07_stochopt/img/switchfuns.svg new file mode 100644 index 0000000..54415e0 --- /dev/null +++ b/notes/07_stochopt/img/switchfuns.svg @@ -0,0 +1,454 @@ + + + +image/svg+xml +E +−6 +−3 +0 +3 +6 +0.0 +0.5 +1.0 +W + +E +−6 +−3 +0 +3 +6 + +E +−6 +−3 +0 +3 +6 + \ No newline at end of file diff --git a/notes/07_stochopt/tutorial.handout.tex b/notes/07_stochopt/tutorial.handout.tex new file mode 100644 index 0000000..c016f5c --- /dev/null +++ b/notes/07_stochopt/tutorial.handout.tex @@ -0,0 +1,14 @@ +\documentclass[handout,ignorenonframetext]{beamer} +\newcounter{baslide} +\setcounter{baslide}{1} + +\let\oldframe +\frame +\let\oldendframe +\endframe + +\def\frame{\oldframe \label{baslide\roman{baslide}}% +\addtocounter{baslide}{1}} +\def\endframe{\oldendframe} + +\input{tutorial} diff --git a/notes/07_stochopt/tutorial.notes.tex b/notes/07_stochopt/tutorial.notes.tex new file mode 100644 index 0000000..c5da1a8 --- /dev/null +++ b/notes/07_stochopt/tutorial.notes.tex @@ -0,0 +1,17 @@ +\documentclass{../../latex/minotes} +\input{../../latex/customcommands} + +\numberwithin{equation}{section} +\numberwithin{figure}{section} + +\let\oldframe\frame +\let\oldendframe\endframe + +\newcommand{\notesonly}[1]{#1} + +\newcommand{\mystackrel}[2]{\stackrel{\mathmakebox[\widthof{#1}]{#2}}{=}} + +% frame titles only effective in presentation mode +\renewcommand{\frametitle}[1]{} + +\input{tutorial} diff --git a/notes/07_stochopt/tutorial.slides.tex b/notes/07_stochopt/tutorial.slides.tex new file mode 100644 index 0000000..5a3735c --- /dev/null +++ b/notes/07_stochopt/tutorial.slides.tex @@ -0,0 +1,11 @@ +\input{../../latex/headerMIslides} +\input{../../latex/customcommands} + +\subtitle{1.1 Intro \& 1.2 Connectionist Neuron} +\mathtoolsset{showonlyrefs} + +\newcommand{\slidesonly}[1]{#1} + +\newcommand{\mystackrel}[2]{\stackrel{\mathmakebox[\widthof{#1}]{#2}}{=}} + +\input{tutorial} diff --git a/notes/07_stochopt/tutorial.tex b/notes/07_stochopt/tutorial.tex new file mode 100644 index 0000000..cd9d56e --- /dev/null +++ b/notes/07_stochopt/tutorial.tex @@ -0,0 +1,80 @@ +\usepackage[authoryear,round]{natbib} +\usepackage{multirow} + +\newcommand{\sheetnum}{% + 07 +} +%\setcounter{section}{\sheetnum-3} +\newcommand{\tutorialtitle}{% + Stochastic Optimization +} +\newcommand{\tutorialtitleshort}{% + Stochastic Optimization +} +% for slides +\subtitle{\sheetnum \tutorialtitle} + +\maxdeadcycles=1000 % Workaround for ! 
Output loop---100 consecutive dead cycles because of too many figures

% The following use of algorithms does not work well with the notes:
%
%
%
% instead use the following for your algorithms:
%
%\begin{figure}[!t]
%\removelatexerror
%\begin{algorithm}[H]
  % your algo here
  %\label{alg:algolabel}
  %\caption{algocaption}
%\end{algorithm}
%\end{figure}
%\begin{algorithm}
% Below is the definition for the command \removelatexerror:
\makeatletter
\newcommand{\removelatexerror}{\let\@latex@error\@gobble}
\makeatother

\begin{document} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\sheet{\sheetnum}{\tutorialtitleshort}

\ttopic{\tutorialtitle}

\columnratio{0.2,0.8}
\begin{paracol}{2}
%\setlength{\columnseprule}{0.1pt}
%\setlength{\columnsep}{5em}

\begin{rightcolumn}

% notes version will ignore it
\begin{frame}
\titlepage
\end{frame}

\begin{frame}
\tableofcontents
\end{frame}

\newpage

\mode
\input{./1_stochopt}
\mode*

\clearpage

%\section{References}
%\begin{frame}[allowframebreaks] \frametitle{References}
  %\scriptsize
  %\bibliographystyle{plainnat}
  %\bibliography{bibliography}
%\end{frame}

\end{rightcolumn}
\end{paracol}

\end{document}

diff --git a/notes/08_clustering/1_clustering.tex b/notes/08_clustering/1_clustering.tex
new file mode 100644
index 0000000..bc35874
--- /dev/null
+++ b/notes/08_clustering/1_clustering.tex
@@ -0,0 +1,964 @@


\begin{frame}
\underline{Outline}
\begin{itemize}
\item K-means clustering
  \begin{itemize}
  \item the batch algorithm
  \item the online algorithm
  \item mini-batch learning
  \item the soft clustering algorithm
  \end{itemize}
\item pairwise clustering
\end{itemize}
\end{frame}

\begin{frame}
\section{K-means}

\underline{In plain English:}\\

Data is observed. We want to be able to describe some ``structure'' in the data.
We base our approach on something very simple and intuitive:
a set of objects (points) that share common features tends to fall in some region,
while a second set of objects that also share common features, but which are
different from the first set, falls in another region.
The ``structure'' we are describing groups, or ``clusters'', a collection of points based on their proximity to a region.
A point that is farther away from this collection and closer to another is grouped with the points associated with that other region.
We do have to decide how many clusters we think exist in a dataset.

\question{What does clustering give us?}

Eventually, instead of describing each point by its absolute location, we will be able to describe it by the cluster it is assigned to.
We will also be able to describe the entire dataset by the partitions we've found that separate the clusters.

\end{frame}
\begin{frame}

\underline{Problem:}\\

Group the dataset $\left\{ \vec x^{(1)}, \vec x^{(2)}, \ldots, \vec x^{(p)} \right\}$, where $\vec x \in \R^N$, into $M$ clusters\footnote{$M$ is actually equivalent to the $K$ in $K$-means.}.
\underline{The K-means approach:}\\

Determine $M$ ``good'' prototypes $\vec w_1, \ldots, \vec w_M \in \R^N$ and assign each data point $\vec x^{(\alpha)}$ to the prototype closest to it.\\

TODO: image

The assignment is formalized by introducing the following assignment variable $m_q^{(\alpha)}$ for each point $\alpha$ and each cluster $q$:

\begin{equation}
\label{eq:assignment}
  m_q^{(\alpha)} := \left\{ \begin{array}{ll}
     1, & \text{if } \vec{x}^{(\alpha)} \text{ belongs to cluster } q
     \\\\
     0, & \text{else}
  \end{array} \right.
\end{equation}

$q$ is used as the cluster index. We have a total of $p \cdot M$ assignment variables for a dataset with $p$ points and a choice of $M$ clusters.

$m_q^{(\alpha)}$ is a binary variable, and with the normalization

\begin{equation}
  \label{eq:assignmentnormalization}
  \sum\limits_q m_q^{(\alpha)} = 1,
\end{equation}
we limit the assignment of each point to exactly one cluster. We refer to this as a ``hard'' assignment\footnote{This will be relaxed when we talk about ``soft'' K-means.}.

A solution to our problem is one that finds the locations of the prototypes $\vec w_q$ and assigns all points to them optimally in terms of the following cost, which measures the average distance between a prototype and the subset of points assigned to it:

\begin{align}
\label{eq:kmeanscost}
E_{ \big[ \big\{ m_q^{(\alpha)} \big\}, \big\{ \vec{w}_q \big\}
    \big] }^T &= \frac{1}{p} \sum\limits_{q,\alpha} m_q^{(\alpha)}
    \big( \vec{x}^{(\alpha)} - \vec{w}_q \big)^2\\
    &= \frac{1}{p} \sum_{\alpha=1}^{p} \sum_{q=1}^{M} m_q^{(\alpha)}
    \big( \vec{x}^{(\alpha)} - \vec{w}_q \big)^2 \eqexcl \underset{\big\{ m_q^{(\alpha)} \big\}, \big\{ \vec{w}_q \big\}}{\min}
\end{align}

\notesonly{
When we consider the definition of $m_q^{(\alpha)}$ as \emph{binary variables} as provided by \eqref{eq:assignment} and their normalization as in \eqref{eq:assignmentnormalization}, we realize that for each point $\alpha$ only one of its $m_q^{(\alpha)}$ is non-zero (and equal to $1$). Therefore, we effectively evaluate the distance $\big( \vec{x}^{(\alpha)} - \vec{w}_q \big)^2$ only \textbf{once} for that point $\alpha$, namely its distance to the one cluster the point is assigned to.

This simplifies our expression for the cost function:
}
\begin{align}
\label{eq:kmeanscostsimple}
E_{ \big[ \big\{ m_q^{(\alpha)} \big\}, \big\{ \vec{w}_q \big\}
    \big] }^T
    &= \frac{1}{p} \sum_{\alpha=1}^{p}
    \big( \vec{x}^{(\alpha)} - \vec{w}_{q(\alpha)} \big)^2 \eqexcl \underset{\big\{ m_q^{(\alpha)} \big\}, \big\{ \vec{w}_q \big\}}{\min}
\end{align}


\notesonly
{
where $q(\alpha)$ denotes the cluster that point $\alpha$ is assigned to. The cost is minimized over the set of discrete variables $\big\{ m_q^{(\alpha)} \big\}$ as well as the set of continuous variables $\big\{ \vec{w}_q \big\}$.

Minimizing the average distance between each prototype $\vec w_q$ and the points assigned to it leads to partitioning the data into $M$ clusters such that a point is assigned to a cluster $q$ because it is closest to $\vec w_q$.
}
\end{frame}
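To keep the notation concrete, a small sketch (ours, assuming numpy arrays) that evaluates $E^T$ in the simplified form \eqref{eq:kmeanscostsimple}, with each point assigned to its nearest prototype:

\begin{verbatim}
import numpy as np

def kmeans_cost(X, W):
    """E^T for data X (p x N) and prototypes W (M x N), with each
    point assigned to its nearest prototype."""
    # squared distances between every point and every prototype: (p, M)
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean()  # mean over points of the closest distance
\end{verbatim}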
\begin{frame}
\section{Batch K-means: Algorithm overview}

\notesonly{
The way we've approached previous optimization problems was to derive how the parameters are updated to minimize the cost function, and only then to look at the algorithm that makes use of the derived update rules to find a solution.
Given the simplicity of the K-means algorithm, we will instead first look at the algorithm that implements K-means, to understand how it operates and which aspects of the solution it finds are interesting. We will then follow this by discussing the derivations behind the algorithm (e.g.\ the update rule).

\underline{How does K-means work?}
}

An iterative two-step procedure after initializing $\big\{ \vec{w}_q \big\}$:
\begin{enumerate}
\item \textbf{assign} each point to its \emph{nearest} prototype (nearest $\corresponds$ smallest Euclidean distance),
\item \textbf{update} all cluster prototypes by moving them to their cluster's center of mass.
\end{enumerate}



\end{frame}

\begin{frame}
\section{Batch K-means: Algorithm details}

\vspace{-0.4cm}
\begin{figure}[!th]
\footnotesize
\removelatexerror
\begin{algorithm}[H]
\DontPrintSemicolon
  random initialization of prototypes, e.g.\ $\vec{w}_{q} = \langle \vec{x} \rangle +\vec{\eta}_{q}, \hspace{0.2cm} \vec{\eta}_{q} \text{ small random vector}$\;
  \Begin(loop){
  (1) choose $m_q^{(\alpha)}$ such that $E^T$ is minimal for the given prototypes\;
\[ m_q^{(\alpha)} = \left\{ \begin{array}{ll}
     1, & \text{if } q = \argmin_{\gamma} \big| \vec{x}^{(\alpha)}
     - \vec{w}_{\gamma} \big| \\
     0, & \text{else}
\end{array} \right. \]
$\Rightarrow$ assign \textbf{every} data point to its nearest prototype \;
\;
(2) choose $\vec{w}_q$ such that $E^T$ is minimal for the \textbf{new} assignments\;
\[ \vec{w}_q = \frac{\sum\limits_{\alpha} m_q^{(\alpha)} \vec{x}^{(\alpha)}}{
   \sum\limits_{\alpha} m_q^{(\alpha)}}
\]
$\Rightarrow$ set $\vec{w}_q$ to the center of mass of its assigned data
}
  \label{alg:batch-k-means}
  \caption{batch K-means}
\end{algorithm}
\end{figure}
\end{frame}

The steps for finding the optimal prototype locations and assignments are described in Algorithm~\ref{alg:batch-k-means}.
It is important to understand that a single iteration takes \textbf{all} points into account. This is not a loop that iterates over the individual points: each iteration deals with the assignment of \textbf{all} points and updates the locations of the prototypes considering all data points collectively.
The stopping criterion for batch K-means (Algorithm~\ref{alg:batch-k-means}), i.e.\ when to stop the loop, can be chosen by detecting that the prototypes no longer move. Alternatively, one could keep track of the difference in the value of $E^T$ between two successive iterations until it falls below some desired threshold.

Things to note about \emph{batch K-means} (a code sketch follows after this list):

\begin{enumerate}
\item $E^T$ is \emph{non-increasing}\footnote{Non-increasing $\corresponds$ decreasing OR unchanged.}. We either move to an improved solution with lower cost or the cost remains unchanged.
\item This is a nonconvex optimization problem. This means that we may arrive at local optima which yield bad solutions. Additionally, the global minimum is not unique. There exist $M! = \prod_{i=1}^{M} i$ different assignments with the same lowest cost. This is a result of permutation symmetry. If we suddenly relabel all points assigned to cluster $4 \rightarrow 7$ and simultaneously relabel all points originally assigned to $7 \rightarrow 4$, the cost remains unchanged.
The algorithm is indifferent to the index value of a cluster. The cluster indices are simply ``names'' to differentiate between the clusters, and changing these names does not alter the solution obtained.
\end{enumerate}
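The code sketch announced above (our own, assuming numpy; initialization and stopping criterion follow Algorithm~\ref{alg:batch-k-means}):

\begin{verbatim}
import numpy as np

def batch_kmeans(X, M, rng=np.random.default_rng(0), tol=1e-8):
    p, N = X.shape
    W = X.mean(axis=0) + 0.01 * rng.standard_normal((M, N))  # mean + noise
    while True:
        # step 1: assign every point to its nearest prototype
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)
        q = d2.argmin(axis=1)                            # (p,) cluster indices
        # step 2: move each prototype to the centre of mass of its points
        W_new = np.array([X[q == k].mean(axis=0) if np.any(q == k) else W[k]
                          for k in range(M)])
        if ((W_new - W) ** 2).sum() < tol:               # prototypes stopped moving
            return W_new, q
        W = W_new
\end{verbatim}

Note that both steps operate on all $p$ points at once, in line with the batch nature of the algorithm.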
\begin{frame}

\question{How is K-means an optimization algorithm?}

\slidesonly{
Show how updating the prototypes by $\vec{w}_q = \frac{\sum\limits_{\alpha} m_q^{(\alpha)}
  \vec{x}^{(\alpha)}}{\sum\limits_{\alpha} m_q^{(\alpha)}}$ minimizes $E^T$.
}

\end{frame}

We've seen how one would implement K-means. More specifically, we've seen that the prototypes $\vec w_q$ are updated in every iteration such that they represent the mean of the points assigned to them in that iteration. However, we have yet to show that this iterative method effectively minimizes $E^T$. We do so by evaluating the gradient and finding its zero-crossings; we are effectively looking for the extrema of the cost function $E^T$.

Recall the definition of the cost function:

\begin{frame}
\begin{equation}
E_{ \big[ \big\{ m_q^{(\alpha)} \big\}, \big\{ \vec{w}_q \big\}
    \big] }^T = \frac{1}{p} \sum\limits_{q,\alpha} m_q^{(\alpha)}
    \big( \vec{x}^{(\alpha)} - \vec{w}_q \big)^2 \eqexcl \min
\end{equation}

The condition for an extremum is as follows:
\begin{align}
  \frac{\partial}{\partial \vec{w}_q} E^T
  &=
  \frac{\partial}{\partial \vec{w}_q} \bigg\{ \frac{1}{p}
  \sum\limits_{q', \, \alpha} m_{q^{'}}^{(\alpha)}
  \big( \vec{x}^{(\alpha)} - \vec{w}_{q^{'}} \big)^2 \bigg\} \\
  & = -\frac{2}{p} \sum\limits_{\alpha} m_q^{(\alpha)}
  \big( \vec{x}^{(\alpha)} - \vec{w}_q \big) \eqexcl 0
\end{align}
\slidesonly{

Solve for $\vec w_q$...
}

\end{frame}
\begin{frame}

Solve for $\vec w_q$:


\begin{align}
  -\frac{2}{p} \sum\limits_{\alpha} m_q^{(\alpha)}
  \big( \vec{x}^{(\alpha)} - \vec{w}_q \big) &= 0 \quad (\text{\small omit the constant}\; -2/p),\\
  \sum\limits_{\alpha} m_q^{(\alpha)}
  \big( \vec{x}^{(\alpha)} - \vec{w}_q \big) &= 0\\
  \sum\limits_{\alpha} m_q^{(\alpha)}
  \vec{x}^{(\alpha)} &= \sum\limits_{\alpha} m_q^{(\alpha)} \vec{w}_q \\
  \vec w_q \;\text{\small does not depend on}\; &\alpha, \;\text{\small use the sum as a normalization factor},\\
  \leadsto \vec{w}_q &= \frac{\sum\limits_{\alpha} m_q^{(\alpha)}
  \vec{x}^{(\alpha)}}{\sum\limits_{\alpha} m_q^{(\alpha)}}
\end{align}

\end{frame}
\begin{frame}

\slidesonly{
$$
  \leadsto \vec{w}_q = \frac{\sum\limits_{\alpha} m_q^{(\alpha)}
  \vec{x}^{(\alpha)}}{\sum\limits_{\alpha} m_q^{(\alpha)}}
$$
}
Now we've found the extremum.
We still need to identify whether this corresponds to a maximum or a minimum:


The condition for a minimum is that the matrix of second-order partial derivatives, the \emph{Hessian}, is positive definite:

\begin{align}
  \frac{\partial^2}{\partial \mathrm{w}_{qi} \partial \mathrm{w}_{
  q'j}} \big\{ E^T\big\}
  &=
  \frac{\partial^2}{\partial \mathrm{w}_{qi} \partial \mathrm{w}_{
  q'j}} \bigg\{ \frac{1}{p} \sum\limits_{q^{''}, \alpha}
  m_{q^{''}}^{(\alpha)} \big( \vec{x}^{(\alpha)} - \vec{w}_{q^{''}}
  \big)^2 \bigg\} \\
  & = \frac{\partial}{\partial \mathrm{w}_{q^{'}j}} \bigg\{
  -\frac{2}{p} \sum\limits_{\alpha} m_q^{(\alpha)}
  \big( \mathrm{x}_i^{(\alpha)} - (\vec{w})_{qi}
  \big) \bigg\} \\
\notesonly{
  & = \left( \frac{2}{p} \sum\limits_{\alpha} m_q^{(\alpha)} \right)
  \delta_{ij} \delta_{qq^{'}}
  }
\end{align}
\end{frame}

\begin{frame}
\slidesonly{
\begin{align}
  \frac{\partial^2}{\partial \mathrm{w}_{qi} \partial \mathrm{w}_{
  q'j}} \big\{ E^T\big\}
  &=
  \frac{\partial^2}{\partial \mathrm{w}_{qi} \partial \mathrm{w}_{
  q'j}} \bigg\{ \frac{1}{p} \sum\limits_{q^{''}, \alpha}
  m_{q^{''}}^{(\alpha)} \big( \vec{x}^{(\alpha)} - \vec{w}_{q^{''}}
  \big)^2 \bigg\} \\
  & = \left( \frac{2}{p} \sum\limits_{\alpha} m_q^{(\alpha)} \right)
  \delta_{ij} \delta_{qq^{'}}
\end{align}
}

where $\delta_{ij}$ and $\delta_{qq^{'}}$ are Kronecker deltas ($\delta_{ij}=1$ iff $i=j$ and $=0$ otherwise).
Since $\frac{2}{p} \sum\limits_{\alpha} m_q^{(\alpha)}$ is always positive \notesonly{(cf.\ \eqref{eq:assignmentnormalization})}:

\begin{itemize}
  \itR The Hessian is a diagonal matrix with all positive entries $\,\to\,$ the condition for a minimum is always satisfied.
  \itR however: minimizing $E^{T}$ is not a convex optimization problem (because of the combination of steps 1 and 2).
\end{itemize}

\end{frame}

% --------------------------------------------------------------------------
\begin{frame} \frametitle{Interpreting the solution}
\begin{figure}[h]
\includegraphics[width=5.0cm]{img/section4_fig2_withincluster}
\end{figure}
$$
  E_{ \big[ \big\{ m_q^{(\alpha)} \big\}, \big\{ \vec{w}_q \big\}
    \big] }^T = \frac{1}{p} \sum\limits_{q,\alpha} m_q^{(\alpha)}
    \big( \vec{x}^{(\alpha)} - \vec{w}_q \big)^2
$$
\begin{itemize}
  \itR If $\vec{w}_q$ is the center of mass $\implies$ $E^T$ equals the (mean within-cluster) variance.
  \itR $E^T$ is non-increasing in every step and $E^T$ is bounded from below $\,\to\,$ K-means clustering converges to a (local) optimum of $E^T$.
\itR $E^T$ at the solution can be interpreted as the ``size'' (variance) of the clusters.
\end{itemize}
\end{frame}

\begin{frame}
\section{On-line K-means}

We now look at a variant of K-means that partitions the data in an \emph{online} fashion.
This allows the clustering to adapt while the data arrive, one point at a time; the procedure is given in Algorithm~\ref{alg:on-line-k-means}.

\begin{figure}[!th]
\footnotesize
\removelatexerror
\begin{algorithm}[H]
  \DontPrintSemicolon
  random initialization of prototypes, e.g.\ $\vec{w}_{q} = \langle \vec{x} \rangle +\vec{\eta}_{q},\hspace{0.2cm} \vec{\eta}_{q} \text{ small random vector}$\;
  select learning step: $0 < \varepsilon \ll 1$\;
  \Begin(loop){
  choose a data point $\vec{x}^{(\alpha)}$ \;
  assign data point to its closest prototype $q$\;
  \[ q = \argmin_{\gamma} \big| \vec{x}^{(\alpha)} - \vec{w}_{\gamma} \big| \]
  change corresponding prototype according to\;
  \[ \Delta \vec{w}_q = \varepsilon \big( \vec{x}^{(\alpha)} - \vec{w}_{q} \big) \]
  change $\varepsilon$ \;
  }
  \label{alg:on-line-k-means}
  \caption{On-line K-Means}
\end{algorithm}
\end{figure}

\end{frame}
% --------------------------------------------------------------------------

% --------------------------------------------------------------------------
\begin{frame}
\frametitle{Further differences between batch and online K-means:}

Online K-means...
\begin{enumerate}
\item adapts to non-stationary data (streaming data)
\slidesonly{
\item less memory
\item faster
\begin{itemize}
  \item Step 1 updates the assignment of a single point instead of all
  \item Step 2 no longer iterates through all points
\end{itemize}
}
\notesonly{
\item mitigates the memory footprint of the algorithm by avoiding having to keep the entire dataset in memory
\item mitigates the time complexity of the algorithm in that
  \begin{itemize}
  \item we no longer have to update the assignments of all data points (step 1) and
  \item updating the prototypes no longer requires iterating through all points (step 2).
  \end{itemize}
  }
\slidesonly{
\item ``Noisiness'' of online K-means allows it to escape local minima.

}
  \notesonly{
\item Online K-means is more robust than batch learning w.r.t.\ convergence to local minima:
\begin{itemize}
\item The noisy nature of online K-means gives it a better chance of ``escaping'' local minima.
\item In batch K-means, $E^T$ either decreases or remains unchanged. Therefore, batch K-means cannot escape from a local minimum.
\end{itemize}
}
\notesonly{
\item The quality of the solution found by online K-means depends on choosing an appropriate schedule for $\varepsilon$: the Robbins-Monro conditions
}
\slidesonly{
\item Choose a learning rate schedule for $\varepsilon$: Robbins-Monro conditions
}
\notesonly{(cf.\ Fig.~\ref{fig:annealingScheduleKMeans2})}
\begin{figure}[h!]
  \centering
\includegraphics[height=3cm]{img/section4_fig4}
  \caption{Decaying learning rate schedule to satisfy the Robbins-Monro conditions.}
  \label{fig:annealingScheduleKMeans2}
\end{figure}
\end{enumerate}

A compromise between batch and online K-means is \emph{mini-batch K-means}:
one modifies online K-means to operate on a small subset (mini-batch) of the data at a time.

\end{frame}
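A sketch of the update step of Algorithm~\ref{alg:on-line-k-means} (ours; the decaying learning rate $\varepsilon_t = t_0/(t_0+t)$ is one simple choice compatible with the Robbins-Monro conditions):

\begin{verbatim}
import numpy as np

def online_kmeans(stream, W, t0=100.0):
    """Update prototypes W (M x N) from an iterable of data points."""
    for t, x in enumerate(stream, start=1):
        eps = t0 / (t0 + t)                       # decaying learning rate
        q = ((x - W) ** 2).sum(axis=1).argmin()   # nearest prototype
        W[q] += eps * (x - W[q])                  # move it towards the point
    return W
\end{verbatim}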
\begin{frame}
\section{Pairwise clustering}

\textbf{Recall} that K-means clusters points based on their proximity to some prototype.\\

\notesonly{
Another approach to describing a similar ``structure'' in the data
can be based on the following:}
\slidesonly{
\begin{figure}[h!]
  \centering
\includegraphics[height=3cm]{img/clustering}
\end{figure}
But also:\\
}
Points that are ``close'' to one another have more in common than points that are far away from one another.
\notesonly{
We cluster points based on their \emph{proximity to one another}.
A point that is further away from one collection of points is grouped with the other points that are closer to it. Pairwise clustering is about grouping points based on their pairwise relations.
}
We will first discuss clustering based on \emph{pairwise distances} and then extend this to \emph{soft clustering}.

\end{frame}

\begin{frame}
\section{Pairwise Clustering: The data}

\notesonly{
Each data point $\vec x^{(\alpha)}$ with $\alpha = 1, \ldots, p$ is represented through its relation to all other points in the dataset, obtained by measuring pairwise distances.
}

Let $d_{\alpha \alpha^{'}}$ be the pairwise distance between any two points $\vec x^{(\alpha)}$ and $\vec x^{(\alpha')}$. Computing all pairwise distances yields the \emph{distance matrix} $\big\{ d_{\alpha \alpha^{'}} \big\}$:

\begin{equation}
\label{eq:pairwisedistdef}
d: \mathbb{R}^N \times \mathbb{R}^N
  \rightarrow \mathbb{R}_0^+ \quad\text{ i.e.}\;\; d \ge 0
\end{equation}

The components of the distance matrix are subject to the following constraints:
\begin{itemize}
\item The distance of a point to itself is zero.
\item The distance matrix is symmetric.
\end{itemize}

\end{frame}
\begin{frame}\frametitle{Choice of distance measure}

A simple and common choice of measure is the squared Euclidean distance:

\begin{equation}
\label{eq:pairwisedisteuclidean}
d_{\alpha \alpha^{'}} := \frac{1}{2} \big(
  \vec{x}^{(\alpha)} - \vec{x}^{(\alpha^{'})} \big)^2
\end{equation}

\notesonly{
Another possibility for populating the components of the distance matrix is to base this high-dimensional relation of one point to another on \emph{scalar products}: the ``kernel trick''\footnote{As encountered in Kernel-PCA}.
The distance is then measured for a pair of points in some high-dimensional space $\vec\phi$:
}
\slidesonly{
Or elements derived via a ``kernel trick'':
}

$$
  \vec{\phi}: \vec{x}^{(\alpha)} \rightarrow
  \vec{\phi}_{\big( \vec{x}^{(\alpha)} \big)}
  \equiv \vec{\phi}^{(\alpha)}
$$

\begin{align}
  d_{\alpha \alpha^{'}}
  & = \frac{1}{2} \big( \vec{\phi}^{(\alpha)}
  - \vec{\phi}^{(\alpha^{'})} \big)^2 \\
  & = \frac{1}{2} \Big\{ \big( \vec{\phi}^{(\alpha)}
  \big)^2 - 2\big( \vec{\phi}^{(\alpha)} \big)^\top
  \vec{\phi}^{(\alpha^{'})} + \big(
  \vec{\phi}^{(\alpha^{'})} \big)^2 \Big\} \\
  & = \frac{1}{2} \bigg\{ k_{\big( \vec{x}^{(\alpha)},
  \vec{x}^{(\alpha)} \big)}
  - 2k_{\big(\vec{x}^{(\alpha)},
  \vec{x}^{(\alpha^{'})} \big)}
  + k_{\big(\vec{x}^{(\alpha^{'})},
  \vec{x}^{(\alpha^{'})} \big)}
  \bigg\}
\end{align}

\end{frame}

\begin{frame}
\slidesonly{\frametitle{Other sources for pairwise distances}}

A further alternative is not to measure pairwise distances explicitly: the data may already be represented in a pairwise fashion as the result of an algorithm. For example, sequence-alignment procedures and graph-similarity measures produce pairwise representations.\\

We will rely on the \emph{squared Euclidean distance} for our pairwise clustering algorithm.

\end{frame}

\begin{frame}
\frametitle{Pairwise Clustering: Problem statement}

\begin{itemize}
\itr set of clusters (partitions): $q = 1, \ldots, M$
\itr observations (feature vectors): $\vec{x}^{(\alpha)}, \;
\alpha = 1, \ldots, p; \; \vec{x}^{(\alpha)} \in \mathbb{R}^N$
\itr binary assignment variable:
$$ m_q^{(\alpha)} := \left\{ \begin{array}{ll}
     1, & \text{if object } \alpha \text{ belongs to cluster } q \\\\
     0, & \text{otherwise}
  \end{array} \right.
+$$ +\itr distance matrix $d_{\alpha \alpha^{'}}$ populated using squared Euclidean distance: +$$ + d_{\alpha \alpha^{'}} := \frac{1}{2} \big( \vec{x}^{(\alpha)} + - \vec{x}^{(\alpha^{'})} \big)^2 +$$ +\end{itemize} +\end{frame} +\begin{frame} +\frametitle{Cost function \& model selection} +\begin{align} +E_{ \big[ \big\{ m_q^{(\alpha)} \big\} \big] } + &= \frac{1}{2p} \sum\limits_q \sum\limits_{\alpha} + \overbrace{ \frac{\sum\limits_{\alpha^{'}} + m_q^{(\alpha)}m_q^{(\alpha^{'})} d_{\alpha \alpha^{'}}}{ + \sum\limits_{\alpha'} m_q^{(\alpha')} + }}^{ \substack{ \text{avg. distance between} \\ + \alpha \text{ and \textbf{all other} objects $\alpha'$} \\ + \text{from the \textbf{same} cluster } q}}\\ +&= \frac{1}{2p} \sum\limits_q \frac{ \sum\limits_{\alpha \alpha^{'}} + m_q^{(\alpha)} m_q^{(\alpha^{'})} \big( \vec{x}^{(\alpha)} + -\vec{x}^{(\alpha^{'})} \big)^2 }{ + \sum\limits_{\alpha} m_q^{(\alpha)}} + \eqexcl \min +\end{align} + +\end{frame} + +\begin{frame}\frametitle{Relation of pairwise clustering to K-means} + +When we choose squared Euclidean distance as the distance measure for pairwise clustering, we can show that this choice let's pairwise clustering effectively find the same solution as K-means clustering. + +\end{frame} + +\begin{frame}\slidesonly{\frametitle{Relation of pairwise clustering to K-means}} + +\begin{equation} + \begin{array}{ll} + E_{\big[ \big\{ m_q^{(\alpha)} \big\} \big]} +\only<1> { + & = \frac{1}{2p} \sum\limits_q \frac{ \sum\limits_{\alpha \alpha^{'}} + m_q^{(\alpha)} m_q^{(\alpha^{'})} \big( \vec{x}^{(\alpha)} + -\vec{x}^{(\alpha^{'})} \big)^2 }{ + \sum\limits_{\alpha} m_q^{(\alpha)}} \\\\ + & = \frac{1}{2p} \sum\limits_q \frac{ + \sum\limits_{\alpha} \sum\limits_{\alpha^{'}} + m_q^{(\alpha)} m_q^{(\alpha^{'})} \Big\{ \big( + \vec{x}^{(\alpha)} \big)^2 - 2\big( \vec{x}^{(\alpha)} \big)^\top + \vec{x}^{(\alpha^{'})} + \big( \vec{x}^{(\alpha^{'})} \big)^2 + \Big\} + } + { \sum\limits_{\alpha} m_q^{(\alpha)} } \\\\ +} +\only<1,2> { + & = \frac{1}{2p} \sum\limits_q \Bigg\{ + \frac{ + \sum\limits_{\alpha} {\color{blue}\sum\limits_{\alpha^{'}}} + m_q^{(\alpha)} {\color{blue}m_q^{(\alpha^{'})} } + \big( \vec{x}^{(\alpha)} \big)^2 } + { \color{red}\sum\limits_{\alpha} {m_q^{(\alpha)}} + } \\ + &\qquad- 2 + \frac{ \sum\limits_{\alpha} \sum\limits_{\alpha^{'}} + m_q^{(\alpha)} m_q^{(\alpha^{'})} + \big( \vec{x}^{(\alpha)} \big)^\top + \vec{x}^{(\alpha^{'})} } + { \sum\limits_{\alpha} m_q^{(\alpha)} } + + + \frac{ {\color{red}\sum\limits_{\alpha}} \sum\limits_{\alpha^{'}} + {\color{red}m_q^{(\alpha)}}m_q^{(\alpha^{'})} + \big( \vec{x}^{(\alpha^{'})} \big)^2 + } + {\color{red}\sum\limits_{\alpha} m_q^{(\alpha)} } \Bigg\} + \\ +} + \pause +\only<2> { + & + {\scriptscriptstyle + \text{with} \; + {\color{red}\sum\limits_{\alpha} m_q^{(\alpha)}} = {\color{blue}\sum\limits_{\alpha'} m_q^{(\alpha')}} + \; \text{follows:} + } + \\ + & = \frac{1}{2p} \sum\limits_q \Bigg\{ + \frac{ + \sum\limits_{\alpha} + m_q^{(\alpha)} + \big( \vec{x}^{(\alpha)} \big)^2 {\color{blue}\sum\limits_{\alpha^{'}} m_q^{(\alpha^{'})}} } + { \color{blue} \sum\limits_{\alpha^{'}} m_q^{(\alpha^{'})} } \\ + &\qquad - 2 + \frac{ \sum\limits_{\alpha} \sum\limits_{\alpha^{'}} + m_q^{(\alpha)} m_q^{(\alpha^{'})} + \big( \vec{x}^{(\alpha)} \big)^\top + \vec{x}^{(\alpha^{'})} } + { \sum\limits_{\alpha^{'}} m_q^{(\alpha^{'})} } + + + \frac{ \sum\limits_{\alpha} m_q^{(\alpha)^{'}} + \big( \vec{x}^{(\alpha^{'})} \big)^2 {\color{red}\sum\limits_{\alpha} m_q^{(\alpha)}} + } + { \color{red}\sum\limits_{\alpha} 
m_q^{(\alpha)} } \Bigg\} + \\\\ +} + \pause +\only<3> { + & = \frac{1}{2p} \sum\limits_q \Bigg\{ + { + \sum\limits_{\alpha} + m_q^{(\alpha)} + \big( \vec{x}^{(\alpha)} \big)^2 } + \\ + &\quad + - 2 \Big( \sum\limits_{\alpha} m_q^{(\alpha)} \big( + \vec{x}^{(\alpha)} \big)^\top \Big) + \underbrace{ \frac{ \sum\limits_{\alpha^{'}} m_q^{(\alpha^{'})} + \vec{x}^{(\alpha^{'})} }{ \sum\limits_{\alpha^{'}} + m_q^{(\alpha^{'})} } }_{ + \substack{ \eqexcl \vec{w}_q \\ + \substack{\text{centroid =} \\ + \text{center of mass}\\ + %\text{ ({\it cf. \ref{kmeans_modelselection}})} + } + } + } + + + { \sum\limits_{\alpha^{'}} m_q^{(\alpha^{'})} + \big( \vec{x}^{(\alpha^{'})} \big)^2 } + \Bigg\} + \\ + &{\scriptscriptstyle + \text{with} + \; { \sum\limits_{\alpha} m_q^{(\alpha)} + \big( \vec{x}^{(\alpha)} \big)^2 } + = { \sum\limits_{\alpha^{'}} m_q^{(\alpha^{'})} + \big( \vec{x}^{(\alpha^{'})} \big)^2 } + \; \text{follows:} + } + \\ + & = \frac{1}{2p} \sum\limits_q \Big\{ + 2\, \sum\limits_{\alpha} + m_q^{(\alpha)} + \big( \vec{x}^{(\alpha)} \big)^2 + - 2 + \Big( + \sum\limits_{\alpha} m_q^{(\alpha)} \big( + \vec{x}^{(\alpha)} \big)^\top + \Big) \, + \vec{w}_q + \Big\} \\\\ +} +\pause +\only<4>{ + & = \frac{1}{p} \sum\limits_{q, \alpha} m_q^{(\alpha)} \Big\{ + \big( \vec{x}^{(\alpha)} \big)^2 - \big( \vec{x}^{(\alpha)} + \big)^\top \vec{w}_q \Big\} \\\\ + & = \frac{1}{p} \sum\limits_{q, \alpha} m_q^{(\alpha)} \Big\{ + \big( \vec{x}^{(\alpha)} \Big)^2 - \big( \vec{x}^{(\alpha)} + \big)^\top \vec{w}_q - \vec{w}_q^2 + + \vec{w}_q^2 \Big\} \\ + + + &{\scriptscriptstyle\text{with} \; + \vec w_q^2 = \vec w_q^\top \vec w_q = \frac{ \sum\limits_{\alpha} m_q^{(\alpha)} \big( + \vec{x}^{(\alpha)} \big)^\top }{ + \sum\limits_{\alpha} m_q^{(\alpha)}} + \cdot \vec{w}_q \; + \; \text{follows:} + } + \\ + & = \frac{1}{p} \sum\limits_{q, \alpha} m_q^{(\alpha)} \Big\{ + \big( \vec{x}^{(\alpha)} \big)^2 - 2 \big( \vec{x}^{(\alpha)} + \big)^\top \vec{w}_q + \vec{w}_q^2 \Big\} \\\\ + & = \frac{1}{p} \sum\limits_{q, \alpha} m_q^{(\alpha)} \big( + \vec{x}^{(\alpha)} - \vec{w}_q \big)^2 \\\\ + & = E_{\big[ \big\{ m_q^{(\alpha)} \big\}, \big\{ \vec{w}_q \big\} \big]} + \corresponds \text{ cost function for K-Means} + } + \end{array} +\end{equation} +\end{frame} + +\begin{frame} +We've seen that Pairwise Clustering based on squared Euclidean distance boils down to K-means. \notesonly{Pairwise clustering can therefore be considered a generelization of K-means. One can also think of K-means as a special case of pairwise clustering.} +\slidesonly{ +K-menas is a special case of pairwise clustering. +} + +Recall that the assignment variables a defined as binary to reflect hard assignments. Therefore, the optimization problem for pairwise clustering in the general case is a \emph{discrete optimization problem}. + +\notesonly{ +The consequence of this is that gradient-based methods are no longer applicable and that rather methods of combinatorial optimization are needed. We've encountered two variants for such: +} + +\begin{itemize} +\item simulated (stochastic) annealing. Fairly simple, robust to local minima but slow. +\item mean-field (deterministic) annealing. An effective approximation for stochastic optimization which can be computed much faster. +\end{itemize} + +\end{frame} +\begin{frame} + +Recall from mean-field annealing: +\notesonly{ +The individual state variables $s_k$ in the state vector $\vec s$ were assumed to be \emph{independent}. 
This implies that the moments factorize under the approximating distribution $Q$, i.e.
\begin{equation}
\label{eq:factorizingmoments}
  \langle \Pi_k s_k \rangle_Q = \Pi_k \langle s_k\rangle_Q.
\end{equation}
When we use mean-field annealing in the context of pairwise clustering, it is the assignment variables $m_q^{(\alpha)}$ that stand for the state variables. The assignment is based on \textbf{pairwise} distances; this violates our assumption of having independent state variables: $m_q^{(\alpha)}$ is not completely independent of $m_q^{(\alpha')}$.
 Therefore, the
calculation of moments and mean-fields must be adapted.
}
\slidesonly{
\begin{itemize}
\item $s_k$ in the state vector $\vec s$ were assumed to be \emph{independent}
\item $\implies
  \langle \Pi_k s_k \rangle_Q = \Pi_k \langle s_k\rangle_Q
  \quad \!\! (\substack{\text{moments} \\ \text{factorize}})$
\item mean-field annealing in the context of pairwise clustering: $m_q^{(\alpha)}$ as state variables
\item \textbf{but} $m_q^{(\alpha)}$ is \textbf{not} completely independent of $m_q^{(\alpha')}$
\end{itemize}
}

\end{frame}

\begin{frame}
\section{The mean-field approximation for pairwise clustering}
  Nomenclature ($\otimes$ $\rightarrow$ \emph{set-product, cartesian product})\\

  \begin{tabular}{r l p{9cm}}
$\big\{ \vec{m}^{(\alpha)} \big\}$: & & set of all $M$-dimensional binary vectors $\big( m_1^{(\alpha)}, m_2^{(\alpha)}, \ldots,
  m_M^{(\alpha)} \big)^\top$ which fulfill the normalization condition (exactly one element equals 1). \\\\
$\mathscr{M}$: & & $\big\{ \vec{m}^{(1)} \big\} \otimes \big\{ \vec{m}^{(2)} \big\} \otimes \ldots \otimes \big\{ \vec{m}^{(p)} \big\}$\\
& & set-product (cartesian product) of all possible binary assignment variables, i.e.\ all possible valid assignments for the full dataset\\\\
$\mathscr{M}_{\gamma}$:& & $\big\{ \vec{m}^{(1)} \big\} \otimes \ldots \otimes \big\{ \vec{m}^{(\gamma - 1)} \big\} \otimes
  \big\{ \vec{m}^{(\gamma + 1)} \big\} \otimes \ldots \otimes
  \big\{ \vec{m}^{(p)} \big\}$\\
& &\ set of all possible assignments for all data points \\& & \hspace{0.03cm} except $\gamma$
\end{tabular}

\end{frame}

\begin{frame}[t]
\slidesonly{\frametitle{The mean-field approximation for pairwise clustering}}
\begin{block}{assignment noise $\rightarrow$ Gibbs distribution}
$$
  P_{ \big( \big\{ m_q^{(\alpha)} \big\} \big) }
  = \frac{1}{Z_p} \exp \Big\{ -\beta
  \overbrace{
  E_{\big[ \big\{ m_q^{(\alpha)} \big\} \big]}
  }^{= \, E_p}
  \Big\}
$$
where
$$
  Z_p = \sum\limits_{\mathscr{M}} \exp \Big\{ -\beta
  E_p
  \Big\}
$$
\end{block}
\notesonly{
This is approximated by the mean-fields:
}
\begin{block}{factorizing distribution}
$$
  Q_{ \big[ \big\{ m_q^{(\alpha)} \big\} \big] }
  = \frac{1}{Z_Q} \exp \Big\{ -\beta \sum\limits_{q, \gamma}
  m_q^{(\gamma)} \underbrace{ e_q^{(\gamma)} }_{
  \text{{\tiny mean-fields}} } \Big\}
$$
where:
$$
  Z_Q = \sum\limits_{\mathscr{M}} \exp \Big\{ -\beta \sum\limits_{q,
  \gamma} m_q^{(\gamma)} e_q^{(\gamma)} \Big\}
$$
\end{block}
\end{frame}

\begin{frame}\frametitle{Recap: calculation of the moments (general mean-field case)}
The factorization of the distribution $Q$ simplifies the calculation of the moments.
This is based on the individual state variables being \emph{uncorrelated}.
\begin{equation}\label{eq:factorizingMoments}
  \Big< f_{(\vec{s}/s_l)} g_{(s_l)} \Big>_Q
  = \frac{1}{Z_Q} \sum\limits_{\vec{s}} f_{(\vec{s}/s_l)}
  g_{(s_l)} \exp \Big( -\beta \sum\limits_k e_k s_k \Big)
\end{equation}

\end{frame}
\begin{frame}
\slidesonly{
\frametitle{Factorizing moments (general mean-field case)}
\vspace{-0.5cm}
\begin{equation}
  \Big< f_{(\vec{s}/s_l)} g_{(s_l)} \Big>_Q
  = \frac{1}{Z_Q} \sum\limits_{\vec{s}} f_{(\vec{s}/s_l)}
  g_{(s_l)} \exp \Big( -\beta \sum\limits_k e_k s_k \Big)
\end{equation}
}
\begin{eqnarray*}
  & = & \frac{1}{Z_Q} \bigg[ \sum\limits_{\vec{s}/s_l} f_{(\vec{s}/s_l)}
  \exp \Big( -\beta \sum\limits_{k \neq l} e_k s_k \Big) \bigg]
  \bigg[ \sum\limits_{s_l} g_{(s_l)} \exp \Big( -\beta e_l
  s_l \Big) \bigg] \\\\
  & = & \frac{1}{Z_Q} \bigg[ \sum\limits_{\vec{s}/s_l} f_{(\vec{s}/s_l)}
  \exp \Big( -\beta \sum\limits_{k \neq l} e_k s_k \Big) \bigg]\\
  &&\qquad\qquad
  \frac{\color{blue}{\sum\limits_{s_l} \exp(-\beta e_l s_l)}}{\sum\limits_{s_l}
  \exp(-\beta e_l s_l)}
  \bigg[ {\color{blue}\sum\limits_{s_l}} g_{(s_l)} \color{blue}{\exp \Big( -\beta e_l
  s_l \Big)} \bigg] \\\\
  & = &\big< f_{(\vec{s}/s_l)} \big>_Q \frac{\sum\limits_{s_l}
  g_{(s_l)} \exp(-\beta e_l s_l)}{\sum\limits_{s_l}
  \exp(-\beta e_l s_l)} =
  \underbrace{ \big< f_{(\vec{s}/s_l)} \big>_Q \cdot \big< g_{(s_l)} \big>_Q }_{ \substack{ \text{factorization of moments} \\
  \rightarrow \text{uncorrelated variables}} }
\end{eqnarray*}

\end{frame}

\begin{frame}
\slidesonly{\frametitle{Calculation of moments}}
$$
  \begin{array}{lll}
  \big< m_q^{(\gamma)} \big>_Q
  & = \frac{1}{Z_Q} \sum\limits_{\mathscr{M}} m_q^{(\gamma)}
  \exp \Big\{ -\beta \sum\limits_{r, \delta}
  m_{r}^{(\delta)} e_{r}^{(\delta)} \Big\}
  \end{array}
$$

\begin{itemize}
\itR The factorization of $Q$ simplifies the calculation of the moments.
\end{itemize}

\end{frame}

\begin{frame}
\slidesonly{\frametitle{Calculation of moments (derivation)}
The factorization in \eqref{eq:factorizingMoments} regarding valid assignments $\big\{ \vec{m}^{(\gamma)} \big\}$ for observation $\gamma$, and the rest of the variables excluding $\gamma$ (i.e.\ $\mathscr{M}_{\gamma}$), holds for any functions $f,g$:
}
\begin{align}
  \sum\limits_{\mathscr{M}} \Big[ f_{ \big( \big\{ m_p^{(\delta)}
  \big| \delta \neq \gamma \big\} \big) }
  \cdot \, g_{ \big( \big\{ m_p^{(\delta)} \big| \delta = \gamma
  \big\} \big) }
  \Big]\\
  \qquad
  \qquad\qquad= \Big[ \sum\limits_{\mathscr{M}_{\gamma}} f_{ \big( \big\{
  m_p^{(\delta)} \big| \delta \neq \gamma \big\} \big) }
  \Big] \cdot \Big[ \sum\limits_{\big\{ \vec{m}^{(\gamma)}
  \big\} } g_{ \big( \big\{ m_p^{(\delta)} \big| \delta =
  \gamma \big\} \big) } \Big]
\end{align}
this gives
%% this formula has been seriously confusing / wrong (indices?) so double check
\begin{equation}
  \begin{array}{ll}
  \big< m_q^{(\gamma)} \big>_Q
  & = \frac{ \Bigg[ \sum\limits_{\mathscr{M}_{\gamma}} \exp \Big\{ -\beta
  \sum\limits_{r, \delta \neq \gamma} m_{r}^{(\delta)}
  e_{r}^{(\delta)} \Big\} \Bigg] \cdot \Bigg[
  \overbrace{ \sum\limits_{ \big\{ \vec{m}^{(\gamma)} \big\} }
  m_q^{(\gamma)} }^{\substack{ \text{only the term with} \\
  m_q^{(\gamma)} = 1 \\
  \text{remains} }}
  \exp \Big\{ -\beta \sum\limits_{r} m_{r}^{(\gamma)}
  e_{r}^{(\gamma)} \Big\} \Bigg] }{
  \underbrace{
  \Bigg[ \sum\limits_{\mathscr{M}_{\gamma}} \exp
  \Big\{ -\beta \sum\limits_{r, \delta \neq \gamma}
  m_{r}^{(\delta)}e_{r}^{(\delta)} \Big\} \Bigg]
  }_{ \text{first terms cancel} }
  \cdot \Bigg[ \sum\limits_{ \big\{ \vec{m}^{(\gamma)}
  \big\} } \exp \Big\{ -\beta
  \underbrace{ \sum\limits_{r} m_{r}^{(\gamma)}
  e_{r}^{(\gamma)} }_{
  \substack{ \text{only one term of this} \\
  \text{sum remains for every} \\
  \text{term of the previous } 
  \text{sum}} }
  \Big\} \Bigg] } \\\\
  & = \frac{ \exp \big\{ -\beta\, m_q^{(\gamma)} e_q^{(\gamma)} \big\} }{
  \sum\limits_{r} \exp \big\{ -\beta \,
  m_{r}^{(\gamma)} e_{r}^{(\gamma)} \big\} }
  \; \underbrace{=}_{\substack{\text{only the } \\ m_r^{(\gamma)} = 1 \\\text{ stays}}} \; \underbrace{\frac{ \exp \big\{ -\beta e_q^{(\gamma)} \big\} }{
  \sum\limits_{r} \exp \big\{ -\beta
  e_{r}^{(\gamma)} \big\} }}_{\text{soft-max of the mean-fields}}
  \end{array}
\end{equation}
\end{frame}

\begin{frame}
\frametitle{Solution for calculating the moments}
$$
  \begin{array}{lll}
  \big< m_q^{(\gamma)} \big>_Q
  & = \frac{ \exp \big\{ -\beta\, m_q^{(\gamma)} e_q^{(\gamma)} \big\} }{
  \sum\limits_{r} \exp \big\{ -\beta \,
  m_{r}^{(\gamma)} e_{r}^{(\gamma)} \big\} }
  \; = \underbrace{\;\; \frac{ \exp \big\{ -\beta e_q^{(\gamma)} \big\} }{
  \sum\limits_{r} \exp \big\{ -\beta
  e_{r}^{(\gamma)} \big\} }\;\;}_{\text{soft-max of the mean-fields}}
  \end{array}
$$
\begin{block}{Intuition for the above result: ``soft'' clustering}
\begin{itemize}
\itr $\sum_r \langle m_r^{(\gamma)} \rangle = 1$ and $\big< m_q^{(\gamma)} \big>_Q \in [0, 1]$ $\Rightarrow$ assignment probabilities
\itr $\beta \rightarrow \infty: \big< m_q^{(\gamma)} \big>_Q \in \{0, 1\} $ $\Rightarrow$ ``hard'' assignments (cf.\ K-means)
\end{itemize}
\end{block}
\end{frame}
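The soft-max of the mean-fields is cheap to compute; a small sketch (ours, assuming the mean-fields $e_q^{(\gamma)}$ are given as a $p \times M$ numpy array):

\begin{verbatim}
import numpy as np

def soft_assignments(E, beta):
    """Soft-max of mean-fields E (p x M) at inverse temperature beta:
    row gamma holds <m_q^(gamma)> for all clusters q."""
    A = np.exp(-beta * (E - E.min(axis=1, keepdims=True)))  # shift for stability
    return A / A.sum(axis=1, keepdims=True)                 # rows sum to one
\end{verbatim}

For $\beta \rightarrow \infty$ each row approaches a one-hot vector, i.e.\ the hard assignments of K-means.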
+\begin{frame}
+\frametitle{Soft clustering}
+\notesonly{
+We have so far considered the case of assigning each data point to exactly \textbf{one} cluster. This was enforced by the binary definition of
+the assignment variables $m_q^{(\alpha)}$. This ``hard'' assignment is now relaxed, leading to a so-called ``soft'' or ``fuzzy'' assignment.
+}
+Each data point $\vec x^{(\alpha)}$ is assigned to \textbf{all} clusters simultaneously, but with different strengths.
+The definition of the assignment variable for a point $\alpha$ becomes:
+\begin{equation}
+\label{eq:assignvarsoft}
+\big< m_q^{(\alpha)} \big> \in [0, 1]
+\end{equation}
+
+such that
+
+\begin{equation}
+\label{eq:assignvarsoftnormalize}
+\sum_{q=1}^{M}\big< m_q^{(\alpha)} \big> = 1.
+\end{equation}
+
+\notesonly{
+The normalization in \eqref{eq:assignvarsoftnormalize} ensures that point $\alpha$ is completely assigned and allows us to interpret the assignment variables as \emph{assignment probabilities}. See Fig.~\ref{fig:clusteringsoft} for an example.
+The purpose of defining the assignment probabilities as ``expectations'' $\big< \cdot \big>$ will become clear when we discuss the soft clustering algorithm.
+}
+\begin{figure}[h!]
+  \centering
+\includegraphics[height=4cm]{img/clustering_soft}
+  \caption{Example of soft clustering. Point $9$ is assigned to cluster $1$ with a slightly higher probability than to cluster $2$: $\big< m_1^{(9)} \big> = 0.51$, $\big< m_2^{(9)} \big> = 0.49$.}
+  \label{fig:clusteringsoft}
+\end{figure}
+\end{frame}
+
+Going back to the mean-field approximation:
+
+The approximation is obtained by minimizing the KL-divergence between the distributions $P$ and $Q$.
+
+\begin{frame}[t] \slidesonly{\frametitle{Minimization of the KL-divergence}}
+\begin{block}{Mean-field equation (cf.\ the section on Stochastic Optimization for how to arrive at this result)}
+$$
+  \fbox{$ \frac{\partial}{\partial e_l} \big< E_P \big>_Q
+  - \sum\limits_k e_k \frac{\partial}{\partial e_l} \big< s_k \big>_Q = 0
+  $}
+$$
+\end{block}
+$$ \frac{\partial \big< E_P \big>_Q}{\partial e_q^{(\alpha)}}
+  - \sum\limits_{r, \gamma} \frac{
+    \overbrace{ \partial \big< m_r^{(\gamma)} \big>_Q }^{
+    \substack{ \text{depends only on} \\
+    \text{data point } \gamma }}}{
+    \partial e_q^{(\alpha)}}
+    e_r^{(\gamma)} \eqexcl 0
+$$
+$$
+  \frac{\partial \big< E_P \big>_Q}{\partial e_q^{(\alpha)}}
+  - \sum\limits_r \frac{\partial \big< m_r^{(\alpha)} \big>_Q}{
+    \partial e_q^{(\alpha)}}
+    e_r^{(\alpha)} \eqexcl 0
+$$
+
+We continue from here with the lecture slides.
+\end{frame}
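+
+To give a feeling for what solving these equations self-consistently looks
+like, below is a minimal sketch of a soft clustering loop. It assumes the
+standard soft K-means setting, i.e.\ fields
+$e_q^{(\alpha)} = \frac{1}{2} \|\vec x^{(\alpha)} - \vec w_q\|^2$ with
+prototypes $\vec w_q$; this specific choice of cost is our assumption here,
+not something derived above. The loop alternates between the soft-max
+assignments and a self-consistent update of the prototypes.
+\begin{verbatim}
+import numpy as np
+
+rng = np.random.default_rng(1)
+# toy data: two Gaussian blobs in 2D
+X = np.vstack([rng.normal(-2.0, 0.5, (50, 2)),
+               rng.normal(+2.0, 0.5, (50, 2))])
+M, beta = 2, 2.0
+W = X[rng.choice(len(X), M, replace=False)]   # initial prototypes
+
+for _ in range(50):
+    # fields: e_q^(alpha) = 0.5 * ||x^(alpha) - w_q||^2, shape (N, M)
+    E = 0.5 * ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)
+    # assignments <m_q^(alpha)>: row-wise softmax of -beta * E
+    A = np.exp(-beta * (E - E.min(axis=1, keepdims=True)))
+    A /= A.sum(axis=1, keepdims=True)
+    # prototypes: assignment-weighted means (self-consistent update)
+    W = (A.T @ X) / A.sum(axis=0)[:, None]
+print(W)   # one prototype per blob
+\end{verbatim}
+For large $\beta$ the assignments become one-hot and the loop reduces to
+ordinary K-means.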
diff --git a/notes/08_clustering/Makefile b/notes/08_clustering/Makefile
new file mode 100644
index 0000000..dfe51f7
--- /dev/null
+++ b/notes/08_clustering/Makefile
@@ -0,0 +1,40 @@
+all: slides notes clean
+#all: handout
+
+projname = tutorial
+targetname = $(projname)_$(shell basename $(CURDIR))
+compile = pdflatex
+projnameS = $(projname).slides
+projnameH = $(projname).handout
+projnameA = $(projname).notes
+
+slides: $(projname).slides.tex $(projname).tex
+	$(compile) $(projname).slides.tex
+#	bibtex $(projname).slides
+#	$(compile) --interaction=batchmode $(projname).slides.tex
+#	$(compile) --interaction=batchmode $(projname).slides.tex
+	mv $(projname).slides.pdf $(targetname).slides.pdf
+
+handout: $(projname).handout.tex $(projname).tex
+	$(compile) $(projname).handout.tex
+	mv $(projname).handout.pdf $(targetname).handout.pdf
+
+# Repeat compilation for the references to show up correctly
+notes: $(projname).notes.tex $(projname).tex
+	$(compile) $(projname).notes.tex
+#	bibtex $(projname).notes
+#	$(compile) --interaction=batchmode $(projname).notes.tex
+	$(compile) --interaction=batchmode $(projname).notes.tex
+	mv $(projname).notes.pdf $(targetname).notes.pdf
+
+clean: cleans cleanh cleana
+
+cleans:
+	rm -f $(projnameS).aux $(projnameS).bbl $(projnameS).log $(projnameS).out $(projnameS).toc $(projnameS).lof $(projnameS).glo $(projnameS).glsdefs $(projnameS).idx $(projnameS).ilg $(projnameS).ind $(projnameS).loa $(projnameS).lot $(projnameS).loe $(projnameS).snm $(projnameS).nav
+
+cleanh:
+	rm -f $(projnameH).aux $(projnameH).bbl $(projnameH).log $(projnameH).out $(projnameH).toc $(projnameH).lof $(projnameH).glo $(projnameH).glsdefs $(projnameH).idx $(projnameH).ilg $(projnameH).ind $(projnameH).loa $(projnameH).lot $(projnameH).loe $(projnameH).snm $(projnameH).nav
+
+cleana:
+	rm -f $(projnameA).aux $(projnameA).bbl $(projnameA).log $(projnameA).out $(projnameA).toc $(projnameA).lof $(projnameA).glo $(projnameA).glsdefs $(projnameA).idx $(projnameA).ilg $(projnameA).ind $(projnameA).loa $(projnameA).lot $(projnameA).loe $(projnameA).snm $(projnameA).nav
+
diff --git a/notes/08_clustering/beamercolorthemetub.sty b/notes/08_clustering/beamercolorthemetub.sty
new file mode 100644
index 0000000..c41d22a
--- /dev/null
+++ b/notes/08_clustering/beamercolorthemetub.sty
@@ -0,0 +1,48 @@
+% Copyright 2004 by Madhusudan Singh
+%
+% This file may be distributed and/or modified
+%
+% 1. under the LaTeX Project Public License and/or
+% 2. under the GNU Public License.
+%
+% See the file doc/licenses/LICENSE for more details.
+
+%\ProvidesPackageRCS $Header: beamercolorthemetub.sty, v a01 2011/11/18 09:11:41 tujl $
+
+\mode<presentation>
+
+\definecolor{darkred}{rgb}{0.8,0,0}
+
+\setbeamercolor{section in toc}{fg=black,bg=white}
+\setbeamercolor{alerted text}{fg=darkred!80!gray}
+
+\setbeamercolor*{palette primary}{fg=darkred!60!black,bg=gray!30!white}
+\setbeamercolor*{palette secondary}{fg=darkred!70!black,bg=gray!15!white}
+\setbeamercolor*{palette tertiary}{bg=darkred!80!black,fg=gray!10!white}
+\setbeamercolor*{palette quaternary}{fg=darkred,bg=gray!5!white}
+
+\setbeamercolor*{sidebar}{fg=darkred,bg=gray!15!white}
+
+\setbeamercolor*{palette sidebar primary}{fg=darkred!15!black}
+\setbeamercolor*{palette sidebar secondary}{fg=white}
+\setbeamercolor*{palette sidebar tertiary}{fg=darkred!50!black}
+\setbeamercolor*{palette sidebar quaternary}{fg=gray!15!white}
+
+%\setbeamercolor*{titlelike}{parent=palette primary}
+\setbeamercolor{titlelike}{parent=palette primary,fg=darkred}
+\setbeamercolor{frametitle}{bg=gray!15!white}
+\setbeamercolor{frametitle right}{bg=gray!60!white}
+
+%\setbeamercolor{Beispiel title}{bg=white,fg=black}
+
+\setbeamercolor*{separation line}{}
+\setbeamercolor*{fine separation line}{}
+
+%\setbeamercolor{itemize item}{fg=darkred,bg=white}
+%\setbeamercolor{itemize subitem}{fg=darkred!60!white,bg=white}
+%\setbeamercolor{local structure}{fg=darkred,bg=white}
+\setbeamercolor{local structure}{fg=gray,bg=white}
+\setbeamercolor{structure}{fg=darkred!80!black,bg=white}
+\setbeamercolor{block title}{bg=gray!10!white}
+\mode<all>
+
diff --git a/notes/08_clustering/beamerthemeTUBerlin.sty b/notes/08_clustering/beamerthemeTUBerlin.sty
new file mode 100644
index 0000000..1ce3fd7
--- /dev/null
+++ b/notes/08_clustering/beamerthemeTUBerlin.sty
@@ -0,0 +1,22 @@
+% Copyright 2004 by Madhusudan Singh
+%
+% This file may be distributed and/or modified
+%
+% 1. under the LaTeX Project Public License and/or
+% 2. under the GNU Public License.
+%
+% See the file doc/licenses/LICENSE for more details.
+
+%\ProvidesPackageRCS $Header: beamerthemeTUBerlin.sty, v a01 2011/11/18 09:11:41 tujl $
+\mode<presentation>
+
+\useinnertheme[shadow=true]{rounded}
+\useoutertheme{infolines}
+\usecolortheme{tub}
+
+\setbeamerfont{frametitle}{size=\normalsize}
+\setbeamerfont{block title}{size={}}
+%\setbeamerfont{structure}{series=\bfseries}
+\setbeamercolor{titlelike}{parent=structure,bg=white}
+\mode<all>
+
diff --git a/notes/08_clustering/bibliography.bib b/notes/08_clustering/bibliography.bib
new file mode 100644
index 0000000..948691f
--- /dev/null
+++ b/notes/08_clustering/bibliography.bib
@@ -0,0 +1,29 @@
+@book{sutton1998introduction,
+  title     = {Introduction to reinforcement learning},
+  author    = {Sutton, Richard S and Barto, Andrew G and others},
+  volume    = {135},
+  year      = {1998},
+  publisher = {MIT Press, Cambridge}
+}
+@book{Bertsekas07,
+  author    = {D. P. Bertsekas},
+  title     = {Dynamic Programming and Optimal Control},
+  publisher = {Athena Scientific},
+  year      = {2007},
+  volume    = {2},
+  edition   = {3rd},
+  url       = {http://www.control.ece.ntua.gr/UndergraduateCourses/ProxTexnSAE/Bertsekas.pdf}
+}
+@article{Watkins92,
+  author    = {C. Watkins and P. Dayan},
+  title     = {Q-learning},
+  journal   = {Machine Learning},
+  year      = {1992},
+  volume    = {8},
+  pages     = {279--292}
+}
diff --git a/notes/08_clustering/img/clustering.pdf b/notes/08_clustering/img/clustering.pdf
new file mode 100644
index 0000000..4e3bd7d
Binary files /dev/null and b/notes/08_clustering/img/clustering.pdf differ
diff --git a/notes/08_clustering/img/clustering_soft.pdf b/notes/08_clustering/img/clustering_soft.pdf
new file mode 100644
index 0000000..54e4498
Binary files /dev/null and b/notes/08_clustering/img/clustering_soft.pdf differ
diff --git a/notes/08_clustering/img/section4_fig2_nocaption.pdf b/notes/08_clustering/img/section4_fig2_nocaption.pdf
new file mode 100644
index 0000000..5e852b7
Binary files /dev/null and b/notes/08_clustering/img/section4_fig2_nocaption.pdf differ
diff --git a/notes/08_clustering/img/section4_fig2_withincluster.pdf b/notes/08_clustering/img/section4_fig2_withincluster.pdf
new file mode 100644
index 0000000..a05aada
Binary files /dev/null and b/notes/08_clustering/img/section4_fig2_withincluster.pdf differ
diff --git a/notes/08_clustering/img/section4_fig4.pdf b/notes/08_clustering/img/section4_fig4.pdf
new file mode 100644
index 0000000..ef9bc6d
Binary files /dev/null and b/notes/08_clustering/img/section4_fig4.pdf differ
diff --git a/notes/08_clustering/tutorial.handout.tex b/notes/08_clustering/tutorial.handout.tex
new file mode 100644
index 0000000..c016f5c
--- /dev/null
+++ b/notes/08_clustering/tutorial.handout.tex
@@ -0,0 +1,14 @@
+\documentclass[handout,ignorenonframetext]{beamer}
+\newcounter{baslide}
+\setcounter{baslide}{1}
+
+\let\oldframe\frame
+\let\oldendframe\endframe
+
+\def\frame{\oldframe \label{baslide\roman{baslide}}%
+\addtocounter{baslide}{1}}
+\def\endframe{\oldendframe}
+
+\input{tutorial}
diff --git a/notes/08_clustering/tutorial.notes.tex b/notes/08_clustering/tutorial.notes.tex
new file mode 100644
index 0000000..c5da1a8
--- /dev/null
+++ b/notes/08_clustering/tutorial.notes.tex
@@ -0,0 +1,17 @@
+\documentclass{../../latex/minotes}
+\input{../../latex/customcommands}
+
+\numberwithin{equation}{section}
+\numberwithin{figure}{section}
+
+\let\oldframe\frame
+\let\oldendframe\endframe
+
+\newcommand{\notesonly}[1]{#1}
+
+\newcommand{\mystackrel}[2]{\stackrel{\mathmakebox[\widthof{#1}]{#2}}{=}}
+
+% frame titles are only effective in presentation mode
+\renewcommand{\frametitle}[1]{}
+
+\input{tutorial}
diff --git a/notes/08_clustering/tutorial.slides.tex b/notes/08_clustering/tutorial.slides.tex
new file mode 100644
index 0000000..5a3735c
--- /dev/null
+++ b/notes/08_clustering/tutorial.slides.tex
@@ -0,0 +1,11 @@
+\input{../../latex/headerMIslides}
+\input{../../latex/customcommands}
+
+\subtitle{1.1 Intro \& 1.2 Connectionist Neuron}
+\mathtoolsset{showonlyrefs}
+
+\newcommand{\slidesonly}[1]{#1}
+
+\newcommand{\mystackrel}[2]{\stackrel{\mathmakebox[\widthof{#1}]{#2}}{=}}
+
+\input{tutorial}
diff --git a/notes/08_clustering/tutorial.tex b/notes/08_clustering/tutorial.tex
new file mode 100644
index 0000000..54339bb
--- /dev/null
+++ b/notes/08_clustering/tutorial.tex
@@ -0,0 +1,80 @@
+\usepackage[authoryear,round]{natbib}
+\usepackage{multirow}
+
+\newcommand{\sheetnum}{%
+  08
+}
+%\setcounter{section}{\sheetnum-3}
+\newcommand{\tutorialtitle}{%
+  K-means Clustering and Pairwise Clustering
+}
+\newcommand{\tutorialtitleshort}{%
+  Clustering
+}
+% for slides
+\subtitle{\sheetnum \tutorialtitle}
+
+%\maxdeadcycles=1000 % Workaround
for ``! Output loop---100 consecutive dead cycles'' caused by too many figures
+
+% The following use of algorithms does not work well with the notes:
+%
+%
+%
+%
+% instead use the following for your algorithms:
+%
+%\begin{figure}[!t]
+%\removelatexerror
+%\begin{algorithm}[H]
+  % your algo here
+  %\label{alg:algolabel}
+  %\caption{algocaption}
+%\end{algorithm}
+%\end{figure}
+% Below is the definition for the command \removelatexerror:
+\makeatletter
+\newcommand{\removelatexerror}{\let\@latex@error\@gobble}
+\makeatother
+
+\begin{document} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\sheet{\sheetnum}{\tutorialtitleshort}
+
+\ttopic{\tutorialtitle}
+
+\columnratio{0.2,0.8}
+\begin{paracol}{2}
+%\setlength{\columnseprule}{0.1pt}
+%\setlength{\columnsep}{5em}
+
+\begin{rightcolumn}
+
+% the notes version will ignore the title page
+\begin{frame}
+\titlepage
+\end{frame}
+
+\begin{frame}
+\tableofcontents
+\end{frame}
+
+\newpage
+
+\mode<all>
+\input{./1_clustering}
+\mode*
+
+\clearpage
+
+%\section{References}
+%\begin{frame}[allowframebreaks] \frametitle{References}
+  %\scriptsize
+  %\bibliographystyle{plainnat}
+  %\bibliography{bibliography}
+%\end{frame}
+
+\end{rightcolumn}
+\end{paracol}
+
+\end{document}