
Commit

Merge pull request #13 from kashefy/dt
density transformation makeover
kashefy authored May 14, 2020
2 parents 95d8d87 + 6694b38 commit 20d72a4
Showing 26 changed files with 2,127 additions and 312 deletions.
2 changes: 1 addition & 1 deletion notes/00_lagrange/1_lagrange.tex
@@ -59,7 +59,7 @@ \subsection{Gradient ascent - unconstrained optimization}
\begin{equation}
\vec \nabla f =
\frac{\partial f}{\partial \vec w} =
\rmat{\frac{\partial f}{\partial w_1} \\[0.2cm] \frac{\partial f}{\partial w_1} }
\rmat{\frac{\partial f}{\partial w_1} \\[0.2cm] \frac{\partial f}{\partial w_2} }
\end{equation}

\end{frame}
21 changes: 21 additions & 0 deletions notes/03_kernel-pca/5_apply.tex
@@ -53,3 +53,24 @@ \subsection{A note on implementation}
The kernel function is symmetric: $k(\vec x^{(\alpha)}, \vec x^{(\beta)}) = k(\vec x^{(\beta)}, \vec x^{(\alpha)})$. One can exploit this to reduce how many times the kernel function is actually evaluated while traversing the training samples when computing $\widetilde {\vec{K}}$.

\end{frame}
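
% A minimal sketch of the symmetry trick above (plain NumPy; the kernel choice and the function and variable names are just illustrative):
%
% import numpy as np
%
% def rbf(x, y, sigma=1.0):
%     # RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
%     return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
%
% def gram_matrix(X, kernel=rbf):
%     # X has shape (p, N): p training samples in N dimensions.
%     p = X.shape[0]
%     K = np.empty((p, p))
%     for a in range(p):
%         for b in range(a, p):        # evaluate the kernel for the upper triangle only
%             K[a, b] = kernel(X[a], X[b])
%             K[b, a] = K[a, b]        # symmetry: k(x_a, x_b) = k(x_b, x_a)
%     return K                         # centering to obtain K-tilde is a separate step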

%TL;DR: Picking the right parameters for the kernel depends on how well that parameter solves your problem (good dim. reduction, reflects certain assumptions about the data - e.g. expected degree of polynomial). We treat them as hyperparameters, so "tuning" them is usually done through cross-validation rather than gradient-based optimization.

%The long version. This is actually how I end up with more content for the tutorial notes, not always but every now and then it happens. So thank you for the question:

%Finding good parameters for the kernels depends on the data and the task you are trying to solve.
%This comes across as a very generic answer, but let's look at two examples. I will use the RBF kernel, which is parameterized by a single parameter sigma. You can easily extend this to other kernels. It also doesn't matter if we're using this kernel for an unsupervised method (e.g. Kernel PCA) or a supervised one (e.g. SVM - don't feel left out if you don't know what an SVM is - just don't tell anyone - people are so judgmental these days tsts):

%Example A: We want to reduce the dimensionality of some high-dimensional data (i.e. compression). We start off with N dimensions (N is large) and p points (lots of points). Kernel PCA is going to give us p PCs. We perform Kernel PCA once with sigma = sigma_1 and a second time with sigma = sigma_2.
%How do we know which sigma value to use for dimensionality reduction?

%Suggestion 1: Create a scree plot for the eigenvalues obtained from using sigma_1 and compare it to the scree plot from using sigma_2. The one that gives you "more variance explained" with the fewest PCs is an indication that that sigma could be better than the other. A scree plot that shows a lot of variance explained for the first couple of PCs and then suddenly drops for all the rest is an indicator that those PCs are enough for good dimensionality reduction. You'll sometimes run into situations where you don't think the comparison shows a clear winner. Therefore,...

%Suggestion 2: Measure and compare the reconstruction error between the two sigmas. Reconstruction error is exactly what we're looking for when doing dimensionality reduction. It is more intuitive than looking at a scree plot, but it involves reconstruction, and that involves approximations (cf. lecture slide 1.4 #23), so there's a drawback.
%Something to keep in mind when using reconstruction error is that for the same sigma you get one value for the reconstruction error when you measure it on the data that you used in training (i.e. solving the eigenvalue problem) vs. the value you get when reconstructing test data. This is where cross-validation comes in. We're going to talk about cross-validation later in the course when we talk about density estimation. Basically, you want to know how well your model does when you feed it data that it has never seen before. If it does well on training data but badly on unseen data, then you can't really deploy this model. So before deploying it you need to measure this performance (e.g. reconstruction error) on unseen data. This is called cross-validation, and you can use it to compare how well one sigma does vs. another. How well does one sigma do on unseen data vs. the other sigma? The sigma with the better cross-validation performance is the one you go with.
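
% A rough sketch of Suggestion 2 (assuming scikit-learn's KernelPCA, which parameterizes the RBF kernel by gamma = 1 / (2 * sigma^2); names and split sizes are illustrative):
%
% import numpy as np
% from sklearn.decomposition import KernelPCA
% from sklearn.model_selection import train_test_split
%
% def validation_reconstruction_error(X, sigma, n_components=10, seed=0):
%     X_train, X_val = train_test_split(X, test_size=0.3, random_state=seed)
%     kpca = KernelPCA(n_components=n_components, kernel='rbf',
%                      gamma=1.0 / (2.0 * sigma ** 2),
%                      fit_inverse_transform=True)
%     kpca.fit(X_train)                          # "training": solve the eigenvalue problem
%     X_val_rec = kpca.inverse_transform(kpca.transform(X_val))
%     return np.mean((X_val - X_val_rec) ** 2)   # reconstruction error on unseen data
%
% # keep the sigma with the lower validation error, e.g.:
% # best_sigma = min(sigma_1, sigma_2, key=lambda s: validation_reconstruction_error(X, s))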

%Example B: Using Kernel PCA to preprocess data before feeding it into a classifier:
%You've picked a kernel (e.g. RBF) and you're trying different values for sigma. You do Kernel PCA and project the data onto the PCs. You feed the projections into a classifier to tell you if this is an image of a cat or a dog (TODO: cat vs. dog is boring, replace with something that has more bling). You measure the classifier's performance and you find out it gives the correct answer 80% of the time. Redo the above with another sigma and you get 87%. You can use the classifier's performance as a measure for deciding which sigma to use. It's still Kernel PCA, but your hyperparameter selection is based on something completely different. Still, it's a well-justified criterion for the task that you are trying to solve. Calling a parameter a "hyperparameter" implies that you don't use gradient-based optimization to tune it but rather something like cross-validation. The reason for not using gradient-based optimization could be that your performance measure or cost function w.r.t. that parameter is not necessarily differentiable, or that it would make the optimization too complicated because of all the free parameters.
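
% A rough sketch of Example B (again assuming scikit-learn; the toy data, the logistic-regression classifier and the gamma grid are placeholders):
%
% from sklearn.datasets import make_classification
% from sklearn.decomposition import KernelPCA
% from sklearn.linear_model import LogisticRegression
% from sklearn.model_selection import GridSearchCV
% from sklearn.pipeline import Pipeline
%
% X, y = make_classification(n_samples=300, n_features=20, random_state=0)  # stand-in data
%
% pipe = Pipeline([
%     ('kpca', KernelPCA(n_components=10, kernel='rbf')),
%     ('clf', LogisticRegression(max_iter=1000)),
% ])
% # cross-validated classification accuracy decides between candidate kernel widths
% # (sklearn's gamma = 1 / (2 * sigma^2)):
% search = GridSearchCV(pipe, {'kpca__gamma': [0.01, 0.1, 1.0]}, cv=5)
% search.fit(X, y)
% print(search.best_params_, search.best_score_)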

%A note on the Neural Network kernel (aka tanh Kernel, aka sigmoid kernel):
%The kernel itself is not a neural network. It got the name because the expression tanh(scale * x^T x' + offset) resembles an expression we are used to seeing in neural networks. What values do you pick for the scale and the offset? Scroll up, the same applies: it depends on the data and the task.
55 changes: 55 additions & 0 deletions notes/04_density-transform/0_ica_problem.tex
@@ -0,0 +1,55 @@

\section{The ICA problem}

\begin{frame}{\secname}

Let $\vec s = (s_1, s_2, \ldots, s_N)^\top$ denote the concatenation of independent sources
and $\vec x \in \R^N$ describe our observations. $\vec x$ relates to $\vec s$ through a
\emph{linear transformation} $\vec A$:

\begin{equation}
\label{eq:ica}
\vec x = \vec A \, \vec s
\slidesonly{\hspace{3cm}\text{(the ICA problem)}\hspace{-4cm}}
\end{equation}

We refer to $\vec A$ as the \emph{mixing matrix}\notesonly{ and Eq.~\ref{eq:ica} as the \emph{ICA problem}}.
Solving the ICA problem means recovering $\vec s$ from observing only $\vec x$.

\end{frame}
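
\notesonly{
For the two-speaker scenario described next, i.e.\ $N=2$, Eq.~\ref{eq:ica} written out component-wise (with $A_{ij}$ denoting the entries of $\vec A$) reads
\begin{equation}
x_1 = A_{11}\, s_1 + A_{12}\, s_2, \qquad
x_2 = A_{21}\, s_1 + A_{22}\, s_2,
\end{equation}
i.e.\ every observation is a weighted sum of \emph{all} sources.
}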

\subsection{Example scenario}

\begin{frame}{\subsecname}

\begin{center}
\includegraphics[width=0.7\textwidth]{img/setting}
\notesonly{ \captionof{figure}{Example scenario} }
\end{center}

\notesonly{

Two speakers are placed in a room and emit signals $s_1$ and $s_2$.
The speakers operate independently of one another.
Two microphones are placed in the room and start recording.
The first microphone is placed slightly closer to speaker 2, while
the second microphone is placed slightly closer to speaker 1.
$x_1$ and $x_2$ denote the recordings of the first and second microphone respectively.
When we listen to the recordings we expect to hear a mix of $s_1$ and $s_2$.
Since microphone 1 was placed closer to speaker 2, when we only listen to $x_1$ we hear more of $s_2$ than $s_1$.
The opposite can be said when we listen only to $x_2$.

Acoustic systems are linear. This means that $x_1$ is a superposition of \emph{both} sources $s_1$ and $s_2$.
We will assume here that the contribution of a source $s_i$
to an observation $x_j$ is inversely proportional to the distance between the source and the microphone.
The exact attenuation model is not important; what matters is that each observation depends \emph{linearly} on the sources. We don't need this to be any more realistic.

If we had a measurement of the distance between each microphone and each speaker,
we could tell exactly what the contribution of each of $s_1$ and $s_2$ is to each recorded observation.
If we knew the exact contribution of a source to an observation, we could look at both observations and recover each source in full.
}

\pause
This is what ICA tries to solve, except that it does not have any knowledge about the spatial setting. It is blind.

\end{frame}
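
% A toy numerical sketch of the scenario above (the signals, distances and the inverse-distance weighting are made up for illustration; assumes NumPy):
%
% import numpy as np
%
% t = np.linspace(0.0, 1.0, 1000)
% s1 = np.sin(2 * np.pi * 5 * t)            # signal emitted by speaker 1
% s2 = np.sign(np.sin(2 * np.pi * 3 * t))   # signal emitted by speaker 2
% S = np.vstack([s1, s2])                   # sources, shape (2, T)
%
% # distances between microphones (rows) and speakers (columns);
% # microphone 1 is slightly closer to speaker 2 and vice versa
% d = np.array([[2.0, 1.0],
%               [1.0, 2.0]])
% A = 1.0 / d                               # contribution inversely proportional to distance
% X = A @ S                                 # observations x = A s at every time step
%
% # X[0] (microphone 1) contains more of s2 than of s1, and vice versa for X[1].
% # ICA has to recover S from X alone, without knowing A or the distances.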