Merge pull request #17 from kashefy/stochopt
Stochopt
Showing 23 changed files with 1,335 additions and 869 deletions.
@@ -0,0 +1,23 @@
\section{Limitations in how we've been optimizing so far}

\begin{frame}

Our iterative gradient-based optimizations:
\begin{itemize}
\item assume the optimum ``up ahead'' is the best solution,

\begin{figure}[ht]
\centering
\begin{tabular}{c c}
\includegraphics[height=3.5cm]{img/gradient-descent.pdf} &
\includegraphics[height=3.5cm]{img/gradient-descent_local.pdf}
\end{tabular}
\notesonly{
\caption{Learning by gradient descent}\label{fig:graddescent}
}
\end{figure}

\item don't handle discrete optimization.
\end{itemize}

\end{frame}
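
\notesonly{
A minimal sketch (Python, assuming NumPy and a hypothetical one-dimensional cost $E(w) = w^4 - 3w^2 + w$) of how a purely gradient-based update ends up in whichever minimum lies ``up ahead'' of the initialization:
}

\begin{verbatim}
import numpy as np

def E(w):
    # hypothetical 1-D cost with one local and one global minimum
    return w**4 - 3*w**2 + w

def dE(w):
    # analytic gradient of the cost above
    return 4*w**3 - 6*w + 1

def gradient_descent(w0, eta=0.01, steps=2000):
    w = w0
    for _ in range(steps):
        w -= eta * dE(w)          # greedily follow the gradient downhill
    return w

# the final state depends entirely on the starting point
print(gradient_descent(w0=+2.0))  # converges to the (local) minimum on the right
print(gradient_descent(w0=-2.0))  # converges to the (global) minimum on the left
\end{verbatim}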
@@ -0,0 +1,51 @@
\section{Exploration vs. Exploitation}

\notesonly{
%An
%introduction illustrating the underlying analogy can be found in
%\textcite[ch. 7]{DudaEtAl2001}. \textcite{Murphy2012} gives a good
%and extensive discussion of undirected graphical models (Markov Random
%fields, ch~19), variational inference (ch~21; mean field for the ISING
%model, ch~21.3), and Monte Carlo Methods (ch~23), as well as MCMC
%methods (ch~24). Further information regarding variational methods can
%be found in \textcite{Bishop2006}.

Learning is about tuning model parameters to fit some objective given training data.
For simple models with only a few parameters one can formulate an analytic solution that optimizes the objective and yields the optimal parameters directly.
As soon as the number of parameters increases, we opt for iterative gradient-based methods to find the extrema of the objective function. If we are trying to minimize some cost function $E$,
iteratively updating the parameters $\vec w$ by moving them in the direction of steepest descent leads to an extremum.
However, the cost function may contain multiple extrema, and there is no guarantee that gradient-based learning will take us to the \emph{global} optimum rather than merely a \emph{local} one.
Following the gradient under the assumption that it will lead to the global optimum is considered a \emph{greedy} approach to learning.

Completely abandoning such assumptions leads to a \emph{random search} for the optimal parameters. When the previous set of parameters has no influence on the choice of weights in the next iteration,
our learning approach is dominated by \emph{exploration}; when we learn in a greedy fashion, it is dominated by \emph{exploitation}.
}
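
\notesonly{
As a minimal sketch (Python, assuming NumPy and a hypothetical one-dimensional cost with a local and a global minimum), the two extremes can be contrasted as follows: pure exploration draws each candidate independently of the previous one, while pure exploitation greedily follows the gradient from its starting point.
}

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def E(w):
    # hypothetical 1-D cost with a local and a global minimum
    return w**4 - 3*w**2 + w

def explore(steps=1000):
    # pure exploration: each candidate is independent of the previous one
    best_w, best_E = None, np.inf
    for _ in range(steps):
        w = rng.uniform(-2.0, 2.0)
        if E(w) < best_E:
            best_w, best_E = w, E(w)
    return best_w

def exploit(w0, eta=0.01, steps=1000):
    # pure exploitation: greedy descent along a numerical gradient
    w = w0
    for _ in range(steps):
        grad = (E(w + 1e-5) - E(w - 1e-5)) / 2e-5
        w -= eta * grad
    return w

print(explore())        # tends to land near the global minimum; no convergence guarantee
print(exploit(w0=2.0))  # converges, but only to the minimum nearest the start
\end{verbatim}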

\begin{frame}{Exploration vs. Exploitation}

\begin{figure}[ht]
%\centering
\begin{tabular}{c c}
\visible<2->{exploration} & exploitation\\
\visible<2->{\includegraphics[height=3.0cm]{img/exploration.pdf}} &
\includegraphics[height=3.0cm]{img/exploitation.pdf} \\
\visible<2->{random search} & greedy search/hill ``climbing''
\end{tabular}
\notesonly{
\caption{exploration vs. exploitation}
}
\label{fig:exploration-exploitation}
\end{figure}

\pause

\question{What are the advantages and disadvantages of exploration?}

\notesonly{
The advantage of exploration is that we only ever discard the current set of parameters when the next set yields a better cost value, so it is not prone to getting stuck in a local optimum. The obvious disadvantage is that there is no guarantee on how long it will take to find the global optimum; we never know whether we have converged. Exploitation, i.e.\ the greedy approach (with an appropriate learning rate schedule), is able to converge. However, whether it reaches a global or only a local solution depends on the starting position.
}

\end{frame}

%\underline{Motivation 1:} We will look at how stochastic optimization can find a tradeoff between the two modes exploration and exploitation.
@@ -0,0 +1,97 @@
\section{Discrete Optimization}

\begin{frame}

\slidesonly{
\begin{block}{Supervised \& unsupervised learning $\rightarrow$ evaluation of cost function $E$}
Find the arguments that optimize $E$.
\begin{itemize}
\item real-valued arguments: gradient-based techniques (e.g. ICA unmixing matrices)
\item discrete arguments: ??? (e.g. for cluster assignment)
\end{itemize}
\end{block}
}
\notesonly{
So far we've been concerned with optimization problems with real-valued arguments. The free parameters form a continuous space. The directions of the principal components in PCA and the unmixing matrix we are trying to find in ICA are real-valued arguments to their respective problems.
Gradient-based solutions are well suited for real-valued arguments as we tune the weights to optimize the cost function.

But what if the problem we are trying to optimize operates on discrete arguments? This could be the case if we were tackling a problem such as K-means clustering. K-means clustering involves finding arguments with which to assign each observation to one of multiple clusters. The cost function that measures the quality of the assignments is continuous, but the arguments we optimize over, which effectively assign each observation to one cluster rather than another, are discrete variables. Below is an example of such an assignment:

}

\only<1>{

\begin{figure}[ht]
\centering
\includegraphics[height=3.5cm]{img/clustering.pdf}
\caption{Clustering involves discrete-valued arguments.}
\label{fig:clustering}
\end{figure}
}

\only<2>{

\begin{center}
\includegraphics[height=3.5cm]{img/cyberscooty-switch_1-5}
\notesonly{\captionof{figure}{A combinatorial problem}}
\end{center}
}

\end{frame}
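
\notesonly{
A minimal sketch (Python, assuming NumPy and a hypothetical toy data set) of the point made above: the clustering cost is a real-valued function, but the assignment vector it takes as argument is discrete.
}

\begin{verbatim}
import numpy as np

# hypothetical toy data: five 2-D observations forming two point clouds
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [2.0, 2.1], [2.2, 1.9]])

def clustering_cost(assignments, X, k=2):
    # assignments[i] is the (discrete) cluster index of observation i;
    # the returned cost (sum of squared distances to cluster means) is continuous
    cost = 0.0
    for c in range(k):
        members = X[assignments == c]
        if len(members) > 0:
            cost += np.sum((members - members.mean(axis=0)) ** 2)
    return cost

good = np.array([0, 0, 0, 1, 1])  # groups the two clouds correctly
bad  = np.array([0, 1, 0, 1, 0])  # mixes them
print(clustering_cost(good, X), clustering_cost(bad, X))
\end{verbatim}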

%\newpage

\subsection{Formalization of the discrete optimization problem}

\begin{frame}{\subsecname}

\begin{block}{Setting}
\begin{itemize}
\item discrete variables $s_i, \ i = 1, \ldots, N\quad$ (e.g. $s_i \in \{+1, -1\}$ \notesonly{``binary units''} or $s_i \in \mathbb N$)
% $s_i \in \{1, 2, \dots 9 \} $
\item \indent short-hand notation: $\vec{s}$ (``state'') -- { $\{\vec{s}\}$ is called the state space }
\item {cost function:} $E: \vec{s} \mapsto E_{(\vec{s})} \in \mathbb{R}$ -- { not restricted to learning problems}
\end{itemize}
\end{block}

We will focus on \textbf{minimization} problems.

\begin{block}{Goal: find state $\vec{s}^*$, such that:}
\begin{equation*}
E \eqexcl \min \qquad (\text{desirable global minimum for the cost})
\end{equation*}
Consequently,
\begin{equation*}
\vec{s}^* := \argmin_{\vec{s}}\, E_{(\vec{s})}.
\end{equation*}
\end{block}
\end{frame}
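
\notesonly{
A minimal sketch (Python, assuming NumPy and a hypothetical quadratic cost over binary units) of the formalization above: for small $N$ the state $\vec{s}^*$ can be found by brute force, evaluating $E_{(\vec{s})}$ at every one of the $2^N$ states.
}

\begin{verbatim}
import itertools
import numpy as np

rng = np.random.default_rng(0)

N = 10                              # small enough to enumerate all 2**N states
W = rng.normal(size=(N, N))
W = (W + W.T) / 2                   # hypothetical symmetric coupling matrix
np.fill_diagonal(W, 0.0)

def E(s):
    # hypothetical quadratic cost over binary units s_i in {-1, +1}
    return -0.5 * s @ W @ s

best_s, best_E = None, np.inf
for bits in itertools.product([-1.0, 1.0], repeat=N):  # corners of the hypercube
    s = np.array(bits)
    if E(s) < best_E:
        best_s, best_E = s, E(s)

print(best_s, best_E)               # exact global minimum -- only feasible for small N
\end{verbatim}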

\subsubsection{Strategies for discrete optimization}

\begin{frame}{\subsubsecname}

\notesonly{
We want to find the best configuration of a discrete system (e.g.\
the state of interacting binary units). Evaluating the full search space works only for
very small problems ($\rightarrow 2^N$ possibilities for a problem with $N$
binary units; the search space is the set of corners of a
hypercube). A greedy strategy will often get trapped if there are
multiple local minima.
}

\begin{itemize}
\item Evolutionary and genetic algorithms (GA) have been
motivated by \emph{biological} phenomena of adaptation through
\emph{mutation} and \emph{selection}.

\vspace{5mm}

\item GA and Markov-chain Monte-Carlo (MCMC) methods based on a \emph{proposal} and
\emph{acceptance} strategy ($\rightarrow$ Metropolis) can be interpreted
as ``learning via trial and error''. \emph{Simulated annealing} falls within MCMC.
\end{itemize}
\end{frame}
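
\notesonly{
A minimal sketch (Python, assuming NumPy and the same kind of hypothetical quadratic cost over binary units) of the proposal-and-acceptance idea at a fixed temperature $T$: a single unit is flipped as a proposal, downhill moves are always accepted, and uphill moves are accepted with probability $\exp(-\Delta E / T)$. Lowering $T$ over time would turn this into simulated annealing, trading exploration for exploitation.
}

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

N = 50
W = rng.normal(size=(N, N))
W = (W + W.T) / 2                     # hypothetical symmetric coupling matrix
np.fill_diagonal(W, 0.0)

def E(s):
    # hypothetical quadratic cost over binary units s_i in {-1, +1}
    return -0.5 * s @ W @ s

def metropolis(steps=20000, T=1.0):
    s = rng.choice([-1.0, 1.0], size=N)   # random initial state
    current = E(s)
    for _ in range(steps):
        i = rng.integers(N)               # proposal: flip a single unit
        s[i] = -s[i]
        proposed = E(s)
        dE = proposed - current
        # acceptance: always go downhill, go uphill with probability exp(-dE/T)
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            current = proposed            # accept
        else:
            s[i] = -s[i]                  # reject: undo the flip
    return s, current

s_final, E_final = metropolis()
print(E_final)
\end{verbatim}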