diff --git a/src/q7/ml-ELEC2870/summary/images/MLP.png b/src/q7/ml-ELEC2870/summary/images/MLP.png new file mode 100644 index 000000000..206898a58 Binary files /dev/null and b/src/q7/ml-ELEC2870/summary/images/MLP.png differ diff --git a/src/q7/ml-ELEC2870/summary/images/RBFN.png b/src/q7/ml-ELEC2870/summary/images/RBFN.png new file mode 100644 index 000000000..1e034329c Binary files /dev/null and b/src/q7/ml-ELEC2870/summary/images/RBFN.png differ diff --git a/src/q7/ml-ELEC2870/summary/images/VQ-eval.png b/src/q7/ml-ELEC2870/summary/images/VQ-eval.png new file mode 100644 index 000000000..878cfa1d1 Binary files /dev/null and b/src/q7/ml-ELEC2870/summary/images/VQ-eval.png differ diff --git a/src/q7/ml-ELEC2870/summary/images/activation-fct.png b/src/q7/ml-ELEC2870/summary/images/activation-fct.png new file mode 100644 index 000000000..cfa764004 Binary files /dev/null and b/src/q7/ml-ELEC2870/summary/images/activation-fct.png differ diff --git a/src/q7/ml-ELEC2870/summary/biasvariancecompromise.png b/src/q7/ml-ELEC2870/summary/images/biasvariancecompromise.png similarity index 100% rename from src/q7/ml-ELEC2870/summary/biasvariancecompromise.png rename to src/q7/ml-ELEC2870/summary/images/biasvariancecompromise.png diff --git a/src/q7/ml-ELEC2870/summary/images/linear-classification.png b/src/q7/ml-ELEC2870/summary/images/linear-classification.png new file mode 100644 index 000000000..63ff636b4 Binary files /dev/null and b/src/q7/ml-ELEC2870/summary/images/linear-classification.png differ diff --git a/src/q7/ml-ELEC2870/summary/images/linear-regression.png b/src/q7/ml-ELEC2870/summary/images/linear-regression.png new file mode 100644 index 000000000..2557ae2ca Binary files /dev/null and b/src/q7/ml-ELEC2870/summary/images/linear-regression.png differ diff --git a/src/q7/ml-ELEC2870/summary/images/uncorr-vs-indep.png b/src/q7/ml-ELEC2870/summary/images/uncorr-vs-indep.png new file mode 100644 index 000000000..5289b34cc Binary files /dev/null and b/src/q7/ml-ELEC2870/summary/images/uncorr-vs-indep.png differ diff --git a/src/q7/ml-ELEC2870/summary/images/vq-effects.png b/src/q7/ml-ELEC2870/summary/images/vq-effects.png new file mode 100644 index 000000000..2c53bb8ca Binary files /dev/null and b/src/q7/ml-ELEC2870/summary/images/vq-effects.png differ diff --git a/src/q7/ml-ELEC2870/summary/ml-ELEC2870-summary.tex b/src/q7/ml-ELEC2870/summary/ml-ELEC2870-summary.tex index f2fb1bc73..6b7b2c270 100644 --- a/src/q7/ml-ELEC2870/summary/ml-ELEC2870-summary.tex +++ b/src/q7/ml-ELEC2870/summary/ml-ELEC2870-summary.tex @@ -3,626 +3,1296 @@ \usepackage{esdiff} -\hypertitle{Machine Learning : regression dimensionality reduction and data visualization}{7}{ELEC}{2870} -{Guillaume Derval\and Benoît Legat} -{Michel Verleysen and John Lee} - -\section{Introduction} -\subsection{What is machine learning} - -The role of machine learning is to build a machine that learns a model from the data. -In statistics, one often supposes that the data follow a certain distribution and try to approximate the value of those parameters. -On the other hand, in machine learning we try to make the least amount of assumption on the data -to let enough degrees of freedom for the machine to learn a good model. 
- -Machine Learning can be decomposed in 5 steps, -only the last 3 are covered in this course: -\begin{enumerate} - \item - Preprocessing - \item - Feature generation - \item - Feature selection (Section~\ref{sec:featuresel}) - \item - Model generation (Section~\ref{sec:modelgen}) - \item - Validation -\end{enumerate} - -\subsection{History} -\begin{center} - \begin{tabular}{lll} - 1940--1965 & Hebb & Biological learning rule\\ - & McCulloch \& Pitts & Binary decision units\\ - & Rosenblatt & Perceptron, learning\\ - 1969 & Minsky \& Papert & Limits to Perceptron\\ - 1974 & Werbos & Back-propagation\\ - 1980s & Hopfield & Recurrent networks\\ - & Parker, LeCun, Rumelhart, McClelland & Back-propagation\\ - & Kohonen & Self-Organizing Maps\\ - 1990s & Vapnik & Support Vector Machines\\ - & Many & Feature selection, model selection, evaluation... - \end{tabular} -\end{center} +\usepackage{color} +\usepackage{mathtools} +\usepackage{wrapfig} -\subsection{Curse of dimensionality} -We need at least as much examples (data) as dimensions. -If we cut each dimension by $x$, you have $x^n$ data samples. -We need at least some data in each region. -It increases exponentially. -So we need feature selection! +\DeclareMathOperator*{\plim}{plim} -All our algorithms are sensitive to this problem, just differently. +%to display a warning sign (code from https://tex.stackexchange.com/a/159701) +\usepackage{newunicodechar} +\newcommand\warn{% + \makebox[1.2em][c]{% + \makebox[0pt][c]{\raisebox{.1em}{\tiny!}}% + \makebox[0pt][c]{\color{red}$\bigtriangleup$}}}% -\subsection{Learning with a Teacher: Supervised learning} -Learning with a labelled set, i.e. input-output examples; see \cite[p.~64]{haykin2009neural}. - -Examples: -\begin{itemize} - \item Regression: function approximation. - \item Classification: heteroassociation \cite[p.~68]{haykin2009neural}, pattern recognition \cite[p.~69]{haykin2009neural}. -\end{itemize} -Note that classification can be seen as a particular case of regression, -e.g. a regression with discrete ``height'' values. - -\subsection{Learning without a Teacher} -No labelled set. -\subsubsection{Reinforcement Learning} -Learning with a critic and an environment; see \cite[p.~66]{haykin2009neural}. -The output of the learning machine is perceived by the environment. -The critic extracts the input vector $x$ and a \emph{primary reinforcement signal} from the environment -and gives to the learning system a \emph{heuristic reinforcement signal}. - -The learning system has therefore an input vector from the environment and a heuristic reinforcement signal from the critic that is used in place of the know output of the labelled set to learn. +\hypertitle{Machine Learning : regression dimensionality reduction and data visualization}{7}{ELEC}{2870} +{Guillaume Derval\and Benoît Legat \and Valentin Lemaire \and Gauthier de Moffarts \and Brieuc de Voghel \and Séverine Mersch-Mersch} +{Michel Verleysen and John Lee} -\subsubsection{Unsupervised Learning or self-organized learning} -No labelled set and no teacher; see \cite[p.~67]{haykin2009neural}. +\newcommand{\ex}[1]{\\ • exemple: #1} +\newcommand{\newpar}{\vspace{\baselineskip} \noindent} -\begin{itemize} - \item Clustering: autoassociation \cite[p.~68]{haykin2009neural} - \item Adaptative filtering. - \item Visualization. -\end{itemize} +\section{Introduction} + \subsection{What is machine learning} + + The role of machine learning is to build a machine that learns a model from the data. 
+ In statistics, one often supposes that the data follow a certain distribution and try to approximate the value of those parameters. + On the other hand, in machine learning we try to make the least amount of assumption on the data + to let enough degrees of freedom for the machine to learn a good model. + + Machine Learning can be decomposed in 5 steps, + only the last 3 are covered in this course: + \begin{enumerate} + \item + Preprocessing + \item + Feature generation + \item + Feature selection (Section~\ref{sec:featuresel}) + \item + Model generation (Section~\ref{sec:modelgen}) + \item + Validation + \end{enumerate} + + \subsection{History} + \begin{center} + \begin{tabular}{lll} + 1940--1965 & Hebb & Biological learning rule \\ + & McCulloch \& Pitts & Binary decision units \\ + & Rosenblatt & Perceptron, learning \\ + 1969 & Minsky \& Papert & Limits to Perceptron \\ + 1974 & Werbos & Back-propagation \\ + 1980s & Hopfield & Recurrent networks \\ + & Parker, LeCun, Rumelhart, McClelland & Back-propagation \\ + & Kohonen & Self-Organizing Maps \\ + 1990s & Vapnik & Support Vector Machines \\ + & Many & Feature selection, model selection, evaluation... + \end{tabular} + \end{center} + + \subsection{Overfitting} + How to detect overfitting: + \begin{itemize} + \item by ploting a projection (hard) + \item by looking at the errors of the training set and test set. If they are very different, the model overfits. + \item by looking at the model parameters. They are usually large when it overfits. + \end{itemize} + + Having a few sample or a large model order increases the risk of overfitting. + + We can limit this risk by using {\color{blue} regularization} in addition to the MSE (mean of squared errors): + $$ + \tilde{E}(\mathbf{w})=\frac{1}{N} \sum_{n=1}^{N}\left\{y\left(x_{n}, \mathbf{w}\right)-t_{n}\right\}^{2}{\color{blue} +\frac{\lambda}{N}\|\mathbf{w}\|^{2}} + $$ + + \subsection{Curse of dimensionality} + % trying to write a better explanation + % https://www.mygreatlearning.com/blog/understanding-curse-of-dimensionality/ + % \subsubsection{Problem} + % \paragraph{} The quantity of data needed increases exponentially with the number of features (dimensions). + % \paragraph{} With 2 features (young-old and male-female) a population can be divided into 4 groups (young males, young females, ...). When increasing the number of features (adding rich-poor for example), the number of groups is multiplied (8 groups : young rich males, young poor males, ...). + % \paragraph{Data Sparsity} When the training samples are missing some of the combinations of features. + % \paragraph{Distance Concentration} + % \subsubsection{Dimensionality reduction techniques} + % \paragraph{} + + + % old explanation + % \subsubsection{Data sparsity} + We need at least as much examples (data) as dimensions. + If we cut each dimension by $x$, you have $x^n$ data samples. + We need at least some data in each region. + It increases exponentially. + So we need feature selection! + + All our algorithms are sensitive to this problem, just differently. + \subsection{Supervised learning} + Building an in-out relation thanks to data with know desired output (label, value, etc.). + It is used for \textbf{classification} and \textbf{regression} problems. + \subsection{Unsupervised learning} + Extracting some useful information from data without supplementary knowledge. + It is used for \textbf{clustering}, \textbf{adaptive filtering}, and \textbf{visualization} problems. 
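To close this introduction, a small numerical illustration of the overfitting checks listed above (a minimal sketch, not part of the original summary; it assumes NumPy is available). It fits a high-order polynomial with and without the regularization term $\frac{\lambda}{N}\|\mathbf{w}\|^{2}$ and prints the training error, the test error and the weight norm:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression: few noisy samples of a sine wave, so overfitting is likely.
x_tr = rng.uniform(0, 1, 15)
x_te = rng.uniform(0, 1, 200)
t_tr = np.sin(2 * np.pi * x_tr) + 0.2 * rng.standard_normal(x_tr.size)
t_te = np.sin(2 * np.pi * x_te) + 0.2 * rng.standard_normal(x_te.size)

def design(x, order):
    # Polynomial features [1, x, x^2, ..., x^order]
    return np.vander(x, order + 1, increasing=True)

def fit(x, t, order, lam):
    # Minimizes (1/N)||Xw - t||^2 + (lam/N)||w||^2, written as an augmented
    # least-squares problem so that lam = 0 is handled without trouble.
    X = design(x, order)
    A = np.vstack([X, np.sqrt(lam) * np.eye(X.shape[1])])
    b = np.concatenate([t, np.zeros(X.shape[1])])
    return np.linalg.lstsq(A, b, rcond=None)[0]

def mse(x, t, w, order):
    return np.mean((design(x, order) @ w - t) ** 2)

for lam in (0.0, 1e-3):
    w = fit(x_tr, t_tr, order=12, lam=lam)
    print(lam, mse(x_tr, t_tr, w, 12), mse(x_te, t_te, w, 12), np.linalg.norm(w))
\end{verbatim}
Without regularization the training error is much lower than the test error and the weights are large, which are exactly the symptoms listed above; the regularized fit trades a slightly higher training error for a lower test error.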
\section{Principal Component Analysis (PCA)} -Suppose the features are stored by rows in $X$ -and each column represents a sample. In the following, the data are considered centered: $\bar{X} \rightarrow X$. -\subsection{Background} -\begin{itemize} -\item \textbf{Decorrelated}: Data $(X,Y)$ are decorrelated iff $C_{X,Y}$ is in the form of $\begin{pmatrix} - \sigma_X^2 & 0 \\ - 0 & \sigma_Y^2 -\end{pmatrix}$ (diagonal). -The covariance of $N$ variables is computed using resp. $C_X = XX^\Tr / N$. -\item \textbf{Whiteness}: Data $(X,Y)$ is white iff $C_{X,Y}$ is the identity matrix (and $\bar{X}=\bar{Y}=0$). -\item \textbf{EVD}: Let $V$ be the matrix whose columns contain the eigenvectors of $C_X$, we have $C_X = V \Lambda V^\Tr$. -\item \textbf{Decorrelation}: Find a transformation $W$ such that $WX$ is decorrelated. - Works for $W = V^\Tr$. -\item \textbf{Whitening}: Find a transformation $W$ such that $WX$ is white. - Works for $W = \Lambda^{-\frac{1}{2}} V^\Tr$. -\end{itemize} -\subsection{PCA Axes} -\begin{itemize} -\item Find axes that maximise the variance after projection -\item Best choice at each step is the eigenvector with the maximum eigenvalue -\item \textbf{Variance kept}: sum of the eigenvalues taken / sum of all the eigenvalues -\item \textbf{Error}: sum of the eigenvalues not taken / sum of all the eigenvalues -\item \textbf{Whiteness invariance}: rotations are white too (rotation matrix $T$ is orthogonal) -\end{itemize} + Suppose the features are stored by rows in $X$ + and each column represents a sample. The goal of PCA is to find a new coordinate system where variables are decorrelated or even white (using a linear transformation of data $Y = WX$ and to reduce the number of variables in the new coordinate system to minimize the loss of information when backprojecting to the initial coordinate system. + \subsection{Background} + \paragraph{Moments} $\mu_X = \mathbb{E}_X[X^n] = \int_{-\infty}^\infty x^n f_X(x) dx$. 1st order moment is the expectation of a Random Variable, second-order central moment is the variance. + Standard Deviation is $$\sigma_X = \sqrt{\mathbb{E}_X[(X-\mu_X)^2]}$$ + \paragraph{Covariance and Correlation} + $$C_{X,Y} = Cov[X_i, X_j] = \mathbb{E}_X[(X_i - \mu_{X_i})(X_j - \mu_{X_j})] = \mathbb{E}[X_iX_j] - \mathbb{E}_{X_i}[X_i] \mathbb{E}_{X_j}[X_j]$$ + $$Cor[X_i, X_j] = \frac{Cov[X_i, X_j]}{\sigma_{X_i}\sigma_{X_j}}$$ + Correlation coefficient r ($\in [-1, 1]$) tells us how well a regression line fits the data. A null correlation coefficient means there is no \textbf{linear} relationship between x and y. + \paragraph{Statistical Independence} Having all elements of x independent means that : + $$f_X(X) = \prod_{i=1}^M f_{X_i}(x_i)$$ + Which means the covariance matrix is diagonal ($\mathbb{E}_X[X_i, X_j] = \mathbb{E}_{X_i}[X_i]\mathbb{E}_{X_j}[X_j]$ for $i \neq j$. It is a sufficient condition to get a diagonal covariance but not necessary. + \paragraph{Decorrelated}: Data $(X,Y)$ are decorrelated iff $C_{X,Y}$ is in the form of $\begin{pmatrix} + \sigma_X^2 & 0 \\ + 0 & \sigma_Y^2 + \end{pmatrix}$ (diagonal). + The covariance of $N$ variables is computed using resp. $C_X = XX^\Tr / N$. + \paragraph{Whiteness}: Data $(X,Y)$ is white iff $C_{X,Y}$ is the identity matrix (and $\bar{X}=\bar{Y}=0$). 
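The two definitions above are easy to check numerically (a minimal sketch, not course code; NumPy assumed, and it uses the eigenvalue decomposition of the covariance matrix that is detailed below):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

# Centered 2-D data with one column per sample, as in the text.
N = 10000
X = np.array([[2.0, 0.0], [1.5, 0.5]]) @ rng.standard_normal((2, N))
X = X - X.mean(axis=1, keepdims=True)

C_X = X @ X.T / N                       # sample covariance
lam, V = np.linalg.eigh(C_X)            # C_X = V diag(lam) V^T

Y_dec = V.T @ X                         # decorrelated: covariance ~ diagonal
Y_whi = np.diag(lam ** -0.5) @ V.T @ X  # white: covariance ~ identity

print(np.round(Y_dec @ Y_dec.T / N, 3))
print(np.round(Y_whi @ Y_whi.T / N, 3))
\end{verbatim}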
+ \paragraph{Sample Mean and Covariance} + Since Data are sampled from a Random Vector: $X = [x_k]_{1\leq k \leq N}$, we have: + $$\mathbb{E}_X[g(x)] \approx \sum_{k=1}^N g(x_k)f_X(x_k) \approx \frac{1}{N}\sum_{k=1}^N g(x_k)$$ + Which implies: + $$\hat{\mu}_{X_i} = \frac{1}{N}\sum_{k=1}^N x_{ik} \approx \mathbb{E}_X[X_i] = \mathbb{E}_{X_i}[X_i]$$ + And (with $1 = [1 ... 1]^\Tr$) $$\hat{\mu}_X = \left[\frac{1}{N}\sum_{k=1}^N x_{ik}\right]_{1 \leq i \leq N} = \frac{1}{N} X1$$\\ + In the same way: $$\hat{\Sigma}_X = \left[\frac{1}{N}\sum_{k=1}^N(x_{ik}- \hat{\mu}_{X_i})(x_{jk}- \hat{\mu}_{X_j})\right]_{1\leq i, j \leq N} = \frac{1}{N} (XC)(XC)^\Tr = \frac{1}{N} XCCX^\Tr $$\\ + Where $C = I - \frac{1}{N}1 1^\Tr$ is the centering matrix. + + \paragraph{Positive Semi Definiteness (PSD)} A square, symmetric matrix A is positive semi definite (PSD) iff $x^\Tr A x \geq 0$ for every non-zero column vector $x$. It is noted : $A \succeq 0$. A covariance matrix is always PSD as any matrix $A$ that can be written in the form $A = BB^\Tr$ is PSD. + + \paragraph{Orhtogonal matrices} U is orthogonal iff $U^\Tr U = I = UU^\Tr$. It implies $U^{-1} = U^\Tr$. The columns of $U$ are perpendicular to each other and have unit length. + + \paragraph{Eigenvalue Decomposition (EVD)} + If a matrix $A$ is square, symmetric and real-valued, it can be written as : $A = V \Lambda V^\Tr = \sum_{i=1}^M \lambda_i v_i v_i^\Tr$ with $V$ orhtogonal and $\Lambda$ diagonal. The columns of $V$ are eigenvectors and the elements of $\Lambda$ are eigenvalues. Since $C_X$ is PSD : $C_X = V \Lambda V^\Tr$ with $\lambda_i \geq 0 \;\;\forall i$\\ + + Since the relation is linear, eigenvalues and vectors can be switched in any order, convention is to sort them in descending order of eigenvalues (eigenvalues and vectors must be in the same order) : $\lambda_1 \geq \lambda_2 \geq … \geq \lambda_M$.\\ + + The EVD solves optimization problems : + \begin{align*} + \max_x x^\Tr Ax &\text{ subject to } x^\Tr x = 1\;\;\; \rightarrow\text{ solution: } x^* = v_1\\ + \min_x x^\Tr Ax &\text{ subject to } x^\Tr x = 1\;\;\; \rightarrow\text{ solution: } x^* = v_M\\ + \end{align*} + \subsection{Decorrelation} + For the following, we assume zero-mean: $\hat{\mu}_X = 0$, $XC = X$ and $\hat{\Sigma}_X = \frac{1}{N}XX^\Tr$. + We search for a linear transformation that $Y = WX$ has a diagonal covariance. Since Covariance is symmetric and PSD : $\hat{\Sigma}_X = V\Lambda V^\Tr$.\\ + + One possible solution is then $W = V^\Tr$ (up to axes permutations and sign flips). Indeed we have (because $V$ is orthogonal) : + $$\hat{\Sigma}_Y = \frac{1}{N} YY^\Tr = \frac{1}{N}(V^\Tr X) (V^\Tr X)^\Tr = V^\Tr V \Lambda V^\Tr V = \Lambda$$ + Meaning we have a diagonal covariance. + + \paragraph{Whitening} If we want the new variables to be white, we need $W$ such that $Y = WX$ has identity covariance. This works for $W = \Lambda^{-\frac{1}{2}} V^\Tr$. + + \subsection{Dimensionality Reduction} + We wish to create $P$ new features with $P \leq M$. Let us consider an orthogonal matrix $U$ and $U_P$ its restriction : $U_P = [u_i]_{1\leq i\leq P}$ and $U_M = U$. We see that $U^\Tr_P U_P = I$ but in most cases : $U_P U^\Tr_P \neq I$.\\ + + Defining $y = U^\Tr_P x$ projects $x$ from $\mathbb{R}^M$ to $\mathbb{R}^P$. If we take $U = V^\Tr$ and truncate it, we will have a projection that maximizes variance along $P$ orthogonal directions. 
The solution is then $W = V^\Tr_P$\\ + + We wish to quantify how much variance of the original distribution is preserved from the original space $X$ to the projected (reduced) space $Y$. This is computed as the ratio: + $$\rho = \frac{\sum_{i=1}^P \lambda_i}{\sum_{i=1}^M \lambda_i}$$ + \paragraph{PCA Backprojection} + If we wish to send projection $Y = V_P^\Tr X$ back to $\mathbb{R}^M$: + $$\tilde{X} = V_PY = V_P V_P^\Tr X$$. + There is information loss, so $V_P V_P^\Tr \neq I$ except if $P=M$ in which case $\tilde{X} = X$ + + \paragraph{Residues} + The Frobenius norm for matrices (extension of Euclidean norm for matrices) is defined as: + $$\|A\|_F^2 = \sum_{i,j} a^2_{ij} = Tr(A^\Tr A)$$ + PCA minimzes $\|\hat{\Sigma}_X - \hat{\Sigma}_{\tilde{X}}\|^2_F$ but also the reconstruction error (and therefore the residues) : $\|X-\tilde{X}\|_F^2$ + + \paragraph{Singular Value Decomposition (SVD)} + Instead of computing the Covariance matrix $\hat{\Sigma}_X$ and then apply the Eigenvalue Decomposition, we can use the Singular Value Decomposition for rectangular matrices on $X$: + $$X = USV^\Tr$$ where $U$ and $V$ are orthogonal matrices and $S$ is a diagonal matrix of the same shape as $X$.\\ + Indeed, + $$XX^\Tr = (USV^\Tr)(USV^\Tr)^\Tr = USV^\Tr VS^\Tr U^\Tr = U(SS^\Tr)U^\Tr$$ + Where $SS^\Tr$ is square and diagonal and $U$ is orthogonal. Hence $U$ contains eigenvectors of $XX^\Tr$ and $SS^\Tr$ contains its eigenvalues : $\lambda_i = s_i^2$. And we can do the PCA dimensionality reduction with $Y = \sqrt{N}S^{-1}_{PP}U_P^\Tr$. \\ + + \paragraph{Gram Matrix} + The Gram matrix is the dual of the covariance matrix. For centered data (must use $CX$ otherwise) it is equal to $\Gamma_X = X^\Tr X$ (opposed to $\hat{\Sigma}_X = \frac{1}{N}XX^\Tr$). We can once again have the same PCA transformation by applying EVD to $\Gamma_X = U\Lambda U^\Tr$ and having : $Y = \sqrt{N}\Lambda_{P,P}^{-\frac{1}{2}}U^T_P$. + + \paragraph{PCA Steps} + \begin{enumerate} + \item Sample centering. + \item Rotations/reflections around the origin with orthogonal projection matrix. + \item Dimensionality reduction by keeping in the projection matrix the eigenvectors associated with the larges eigenvalues (variance). + \item \textit{(Optional)} Whitening to obtain unit variance. + \end{enumerate} + + \paragraph{Conclusions} + \begin{itemize} + \item We wish to find axes that maximise the variance after projection. + \item Best choice at each step is the eigenvector with the maximum eigenvalue. + \item \textbf{Variance kept}: $\rho = \frac{\sum_{i=1}^P \lambda_i}{\sum_{i=1}^M \lambda_i}$ + \item \textbf{Error}: $1 - \rho$ + \item \textbf{Whiteness invariance}: rotations are white too (rotation matrix $T$ is orthogonal) + \item PCA is linear and thus suboptimal for data with strongly nonlinear dependencies between variables. + \item PCA can be made adaptive for streaming data with the use of a stochastic gradient. + \end{itemize} \section{Independent Component Analysis (ICA)} -\begin{itemize} -\item \textbf{Make features independent} -\item \textbf{Independance} between $X,Y$: $E[f(X)g(Y)]=E[f(X)]E[g(X)] \forall\ f,g$, $P(X|Y)=P(X)P(Y)$ -\item Try to find independent unknown signal $s(t)$ by applying a matrix $W$ to the measured signals $x(t)=As(t)$ ($A$ unknown). We must find $W\approx A^{-1}$. -\item \textbf{Hypothesis}: mixture linear and additive (matrix $A$), random variables are signals, no delays, $E[s_is_i^\Tr ]=1$, $A$ constant over time. -\item To find $W$, we try to make $y=Wx$ independent. 
-\item \textbf{Indeterminacies}: order of signals, scaling factor. -\item \textbf{Whitening}: whitening $x$ allows to reduces $W$ to an orthogonal matrix ($V^\Tr A$ is orthogonal), reducing the number of parameters from $n^2$ to $n(n-1)/2$. -\item \textbf{ICA} is thus the successive application of \textbf{PCA} and a \textbf{rotation}. -\end{itemize} -\subsection{Non-gaussian approach} -\begin{itemize} -\item PDF of a sum of $n$ independent random variables tends to be Gaussian. -\item Try to find $W$ such that output of PDF are as different as possible from Gaussian. -\item \textbf{Minimum differential entropy} Trying to minimise entropy of $y$. -\item \textbf{Negentropy}: Difference of entropy with a gaussian: $J(x)=\int p_x(u)\log\frac{p_x(u)}{p_{x_G}(u)} \dif u$. Maximise $J(y)$. -\item \textbf{Kurtosis}: $\kappa_4(X)=E[X^4]-3E[X^2]^2$. For Gaussian functions, $\kappa_4(X_G)=0$. For others, $\kappa_4(X_G)\neq 0$. Try to maximise $\sum_{i=1}^m|\kappa_4(Y_i)|$. -\item \textbf{Gram-Charlier Expansion} approximate PDF around a gaussian PDF. -\end{itemize} -\subsection{Minimum dependance approach} -\begin{itemize} -\item \textbf{Minimise mutual information} (need joint PDF) -\item \textbf{Minimise sum of entropy} -\end{itemize} -\subsection{In practice} -\begin{itemize} -\item We do not know PDF. -\item Density estimation to approximate PDF. -\item Direct estimation of the independance (cross-cumulants). -\end{itemize} - + \subsection{Background} + In PCA we want find a transformation that decorrelates features. In ICA we want to create a transformation that makes the variables as independent as possible. + \begin{itemize} + \item \textbf{Uncorellation} between $X$, $Y$ : $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ + \item \textbf{Independence} between $X$, $Y$ : $\mathbb{E}[f(X)g(Y)] = \mathbb{E}[f(X)]\mathbb{E}[g(Y)]$ for any non-linear functions $f$ and $g$. + \end{itemize} + Correlation measures the existence of a linear relation between variables, dependence measures the existence of any relation between variables.\\ + + \begin{wrapfigure}{r}{0.3\textwidth} + \begin{center} + \includegraphics[width=0.18\textwidth]{images/uncorr-vs-indep.png} + \end{center} + \caption{Uncorrelation vs Independence} + \label{fig:uncorr-vs-indep} + \end{wrapfigure} + + The goal of ICA is to measure independence between signals and to maximize independence (tune a transformation to get independent outputs). Independence can be written as: + \begin{align*} + P_X(X|Y) = & \;\;\frac{P_{X,Y}(X,Y)}{P_Y(Y)} \text{\,\,(general case , conditional distribution)}\\ + \text{(given independence)}\,\,\Rightarrow &\;\;P_{X,Y}(X,Y) = P_X(X)P_Y(Y) = P_X(X|Y)P_Y(Y) + \end{align*} + + Figure \ref{fig:uncorr-vs-indep} shows the difference between PCA that creates a maximum variance projection (top image) where there is no correlation but there is dependence and ICA (bottom image) which has minimum dependence directions (cannot say anything on Y knowing X).\\ + + Whitening is preserved for any rotation while independence is conserved only for rotations of $k\frac{\pi}{2}$ with $k$ integer, up to permutation and sign. + + \paragraph{Difference between ICA and PCA} + \begin{itemize} + \item Independence is stronger than uncorrelation. + \item If $V^\Tr$ is a whitening matrix, $UV^\Tr$ with $U$ orthogonal is a whitening matrix too. A whitening matrix is highly non-unique (up to any rotation matrix $U$) + \item $W^\Tr$ of ICA is unique, up to a few indeterminacies. 
+ \end{itemize} + + \subsection{ICA Problem} + We have independent (unknown) source signals + $$s(t) = [s_1(t), s_2(t), …, s_n(t)]^\Tr$$ + Measured signals + $$x(t) = [x_1(t), x_2(t), …, x_n(t)]^\Tr$$ + And a linear mixing : $x(t) = As(t)$ with $A$ unknown. We wish to determine $W^\Tr \approx A^{-1}$ such that $y = W^\Tr x = W^\Tr A s \approx s$ and use the independence hypothesis to find $W^\Tr$. + + \paragraph{Identifiability Theorem} ICA solves this by measuring independance between the signals $y_i(t)$ and when it is reached, it is known that $y(t) \approx s(t)$. + + \paragraph{Indeterminacies} ICA yields the original signals up to these indeterminacies: the order of signals as independence is symmetric and a scaling factor multiplying each signal : + $$x_i = \sum_{j=1}^n a_{ij}s_j = \sum_{j=1}^n \alpha a_{ij} \frac{s_j}{\alpha}$$ + The true solution is $W^\Tr = P D A^{-1}$ where D is a diagonal matrix to re-scale signals, and P is a permutation matrix (not very important in real applications). + + \paragraph{Solution assumptions and problem hypotheses} + \begin{itemize} + \item Mixtures are linear and additive + \item Random variables are seen as signals in time + \item Without (phase) delay + \item Since the magnitude of $s_i$ cannot be known it is arbitrarily fixed such that $\mathbb{E}[s_i s_i^\Tr] =1$, meaning that $\mathbb{E}[s s^\Tr] = I$. + \end{itemize} + + \paragraph{PCA before ICA} PCA is used as a preprocessing step to ICA. Indeed we unmix $z$, the whitened signals of $x$ rather than directly $x$. Indeed, if $z$ is white, $V^T A$ is orthogonal : + $$I = \mathbb{E}[zz^\Tr] = V^\Tr A \mathbb{E}[s s^\Tr] A^\Tr V = (V^\Tr A) (V^\Tr A)^\Tr$$ + And if $V^\Tr A$ is orthogonal, $W$ is orthogonal as well since we want white outputs : $$I = \mathbb{E}[yy^\Tr] = W^\Tr \mathbb{E}[zz^\Tr]W $$ + And if $W$ is orthogonal, it is symmetric and there are only $\frac{n(n-1)}{2}$ parameters to find rather than $n^2$. + + \paragraph{Gaussian Case} When signals are Gaussians, it is impossible to find original signals, indeed, in the Gaussian case, since density is perfectly described by mean and variance, uncorrelation and independence are identical, therefore, rotation after whitening is useless. In that case other information is needed (temporal structure, etc.). An additional assumption to ICA is then that no source is Gaussian. + + \noindent According to the central limit theorem, the PDF of a sum of n independent random variables tends to be Gaussian. + + \subsection{Non-gaussian approach} + \paragraph{General Idea} if we consider an output of the ICA process: $y_i = c_i^\Tr s$ with $c_i = w_i^\Tr V^\Tr A$. If $c_i$ is not a solution to the ICA problem then $y_i$ is still a mixture and $c_i \approx \alpha [\pm 1/n,\pm 1/n, …, \pm 1/n]^\Tr$. This is close to the mean of several RV and by the CLT $y_i$ will be closer to a Gaussian than to a source. However, if $c_i$ is a solution to the ICA problem, $c_i = \alpha [0,…, \pm 1, …, 0]^\Tr $, which is far from the assumptions of the CLT (no sum of RV) and we will be far from Gaussianity since no source is Gaussian. The following methods measure closeness to a gaussian. + + \paragraph{Method 1 : Minimum differential entropy} For this method we will use entropy: + \begin{itemize} + \item Discrete case: $H(x) = -\sum_{i=1}^K p_i log(p_i)$, which is minimum when $\exists i | p_i = 1, p_j = 0 \text{ for } j \neq i$ : $H(x) = 0$ and maximum when $p_i = 1/K$ : $H(x) = log(K)$. 
+ \item Continuous case (differential entropy) : $h(U) = -\int_{-\infty}^\infty p_U(u)log(p_U(u))du$ which is maximum when $U$ follows a Gaussian : $h(u) \leq h(u_G) = \frac{1}{2}log(2\pi e \sigma^2)$ where $\sigma^2$ is the variance of $u$. And in $n$ dimensions: + $$h(X) \leq \frac{1}{2}log\left((2\pi e)^n det \Sigma\right)$$ + Where $\Sigma$ is the covariance matrix, $\Sigma = I$ for $h(Z)$. + \end{itemize} + The goal is to find $W$ such that output entropies are low, i.e. finding $w_i$ such that the pdfs of $y_i$ are far from Gaussians (with condition that $y_i$ have unit variance). + + \paragraph{Method 2 : Negentropy} Negentropy is defined as the difference with respect to the cross-entropy of a Gaussian with same variance : + $$J(X) = h(X, X_G) - h(X) \geq 0$$ + $$J(X) = \int p_X(u) log\left(\frac{p_X(u)}{p_{X_G}(u)}\right)du$$ + The goal is to find $W$ such that $J(y)$ is maximum (with condition that $y_i$ have unit variance). + + \noindent The difficulty of this method is that we don't know the samples's pdf's. + + \paragraph{Method 3 : Moments and cumulants} Rather than using PDFs for $y_i$ we use their distribution properties: moments and cumulants. And in particular the curtosis (4-th order centered moment): + $$\kappa_4(X) = \mathbb{E}[X^4] - 3(\mathbb{E}[X^2])^2$$ + For a Gaussian PDF, we have $\kappa_4(X_G) = 0$ and for (almost) any non-Gaussian PDF: $|\kappa_4(X)|>0$.\\ + The goal is then to find $W$ such that $\sum_{i=1}^m|\kappa_4(Y_i)|$ is maximum (with condition that $y_i$ have unit variance).\\ + + This method is based on the Gram-Charlier Expansion (analog to Taylor expansion) which approximates a PDF around the Gaussian PDF $\varphi$ (4th order approximate) : + $$p_X(\xi) \approx \varphi(\xi)\left(1 + \kappa_3(X)\frac{H_3(\xi)}{3!} + \kappa_4(X)\frac{H_5(\xi)}{4!}\right)$$ + Where $H(\cdot)$ are the Chebyshev-Hermite polynomials. This can be done because Kurtosis is linear if $s_i$ are independent. + + \subsection{Minimum dependence approach} + \paragraph{Method 1 : Minimum Mutual Information} Mutual information is defined as: + $$I(X) = \int P_X(u) log\left(\frac{P_X(u)}{\prod_{i=1}^n P_{X_i}(u_i)}\right)du \geq 0$$ + Which is exactly null iff each $x_i$ is independent. It can be expressed as: + \begin{align*} + I(y) &= \sum_{i=1}^m h(y_i) - h(y)\\ + &= \sum_{i=1}^m h(y_i) - h(Wz)\\ + &= \sum_{i=1}^m h(y_i) - h(z) - log(|det(W)|) + \end{align*} + And $|det(W)| = 1$ as $W$ is orthogonal. + The goal is to find $W$ such that $I(y)$ is minimum. But it is difficult to estimate the joint PDF of y and the computational cost is high. An alternative is to minimize the sum of output marginal entropies : $\sum_{i=1}^n h(y_i)$ where there is no need to estimate a joint PDF. + + \paragraph{Method 2 : Cancelling cross-cumulants} We extend the idea of PCA (diagonalization of the covariance matrix and scaling) to take into account moments of order higher than 2. It is then a diagonalization of the 4-th order tensor (4 dimensions) of cross-cumulants in order to go further than (linear) decorrelation. + $$cum(X_i, X_j, X_k, X_l) = \mathbb{E}[X_i X_j X_k X_l] - \mathbb{E}[X_i X_j]\mathbb{E}[X_k X_l] - \mathbb{E}[X_i X_k]\mathbb{E}[X_j X_l] - \mathbb{E}[X_i X_l]\mathbb{E}[X_j X_k]$$ + To reach independence, all cross-cumulants must be equal to 0, i.e. find a rotation matrix via the JADE algorithm (Joint Approximate Diagonalization of Eigenmatrices) + + \subsection{In practice} + Independence metrics require the knowledge of the PDF which is unknown. 
In practice we use density estimation (difficult) or measure independence indirectly (cross-cumulants). + + \paragraph{PCA in ICA} pre-whitening reduces the number of elements to estimate but can lead to computational problems and errors if PCA failed. + + \paragraph{Objective functions} It is also possible to minimze objective functions (non-Gaussianity and independence measures) at the risk of getting a local minima. Minimization algorithms include (natural) gradient descend, diagonalization of cross-cumulants, fixed-point iteration, etc. + + \paragraph{Criterions} The criterions evaluated are local criterions that can lead to local minima. The true measure of independence only leads to a global minima. + + \subsection{Extensions of ICA} + The basic model is : $x_i(t) = \sum_{j=1}^N a_{ij} s_j(t)$ for $i = 1, …, N$ + \paragraph{N observed mixtures, M sources} + \begin{itemize} + \item $N > M$, $M$ must be estimated and during the PCA stage, there is a reduction from $N$ to $M$ signals and then source separation is applied (ICA). + \item $M > N$, the $N$ ``most powerful'' sources will be estimated by ICA but will be corrupted by the $M - N$ other sources. + \item $N >> M$, projection in signal subspace must be used. + \end{itemize} + + \paragraph{Noisy observations} if the observations are noisy: + $$x_i(t) = \sum_{j=1}^N a_{ij}s_j(t) + n_i(t) \text{ for } i = 1, …, N$$ + $$x(t) = As(t) + n(t)$$ + Even if $W$ is estimated perfectly: + $$y(t) = W^\Tr x(t) = W^\Tr A s(t) + W^\Tr n(t)$$ + + \paragraph{Other extensions} The following problems require \emph{specific algorithms}: + \begin{itemize} + \item \textbf{Convolutive mixtures} + $$x(t) = A(t) * s(t)$$ + $$x_i(t) = \sum_{j=1}^N \sum_{k=0}^{p-1} a_{ij}(k) s_j(t-k) \text{ for } i = 1, …, N$$ + \item \textbf{Post non-linear mixtures} + $$x_i(t) =f_i\left(\sum_{j=1}^N a_{ij} s_j(t)\right) \text{ for } i = 1, …, N$$ + \item \textbf{Non-linear mixtures} + $$x_i(t) =f_i(s_1(t), …, s_N(t)) \text{ for } i = 1, …, N$$ + \item \textbf{Ill-conditioned mixtures} the rows of $A$ are identical ($A$ not invertible) or very similar . + \end{itemize} \section{Feature selection} -\label{sec:featuresel} -\begin{itemize} -\item Most of the time, we have lots of dimensions, and not enough data -\item Distance means ``nothing'' in high-dimensional data -\item \textbf{Reducing the dimensionality} -\end{itemize} - -\subsection{Filters} -\begin{itemize} -\item Does not use models -\item \textbf{Correlation}: $r=\frac{\sum_{j=1}^N((x_j-\bar{x})(y_j-\bar{y}))}{\sqrt{\sum_{j=1}^N((x_j-\bar{x})^2)\sum_{j=1}^N((y_j-\bar{y})^2)}}$ - -\item \textbf{Entropy}: measure of uncertainty of a random variable: $H(x) = -E[\log(P[x])]$. 
-\item \textbf{Mutual information}: $I(y;x) = H(y)-H(y|x)=H(x)-H(x|y)$ -\item \textbf{Relevance}: $I(x_i,t)$ (trying to maximise) -\item \textbf{Redundancy}: $I(x_i,x_j)$ (trying to minimise) -\item \textbf{Multi-objective optimisation} -\end{itemize} -\subsection{Wrappers} -\begin{itemize} -\item Uses models (select features that makes the model behave better) -\item Slower than filters -\item Relevance criterion easy to estimate -\item Less intuitive than filters -\end{itemize} -\subsection{Greedy search method} -\begin{itemize} -\item Testing $2^d$ subset is too much -\item Be greedy -\item Initial subset, simple update strategy + stop strategy -\item \textbf{Forward search}: begin with the empty subset, add the best feature, then stop when it decreases the performance -\item \textbf{Backward search}: idem, but with all the features as initial subset, then remove features -\item \textbf{Forward-backward}: allow to add and remove variable -\item \textbf{Genetic algorithms} -\end{itemize} -\subsection{Embedded methods} -\begin{itemize} -\item Feature selection + Model selection together -\end{itemize} - + \label{sec:featuresel} + + \subsection{Background} + \paragraph{Motivation} Many data analysis tools suffer from high dimensionality: + \begin{itemize} + \item \textbf{Linear tools} PCA is based on the covariance matrix which grows in the square of the dimension and is poorly estimated on a finite number of data. Similar problems arise for Linear Discriminant Analysis (LDA), Partial Least Squares (PLS), etc. + \item \textbf{Non-Linear tools} Tools of the form $y = f(x_1, x_2, …, x_d, \theta)$ suffer as well. As $d$ increases, so does the size of $\theta$. And as $\theta$ results from the minimzation of a non-convex functions there is a higher risk of local minima, numerical problems (flats and high slopes), convergence issues, … This affects Multilayer Perceptrons (MLP), Gaussian Mixtures (RBF), Self-Organizing maps (SOM), etc. + \end{itemize} + + \paragraph{Curse of dimensionality examples} + \begin{itemize} + \item The number of points on a grid increases exponentially with the dimension. + \item \textit{Silverman, 1996} The number of Gaussian kernels necessary to approximate a Gaussian distribution grows exponentially with the dimension of the data space. To estimate parameters there is need of too much data. Histograms are used in higher dimension and are a compromise between accuracy and number of data in each bin. + \item The volume of a sphere in high dimensions rapidly tends to 0 with the dimension of the space. + $$V(d) = \frac{\pi^{d/2}}{\Gamma(d/2+1)} r^d$$ + \item The ratio of a sphere to a cube also rapidly tends to $0$ in high dimensions (all points are in the cube, but in the exterior of the sphere). + \item Volume ratio of embedded spheres. In high dimensions, volume is ``concentrated'' in the outer parts of the sphere. + \item The percentage of points following a multi-dimensional Gaussian inside a sphere of a fixed radius tends to zero. + \item The probability to find a point at distance $r$ of the center for a multi-dimensional multi-normal distribution is higher for higher dimensions. + \item Euclidean norms concentrate for random vectors with i.i.d. components in $[0, 1]$ are in $[0, \sqrt{d}]$ and concentrate around their expectations in higher dimensions (they don't disciminate anymore). + \item In the same way distances concentrate. Indeed pairwise distances seem nearly equal for all points. Relative contrast vanishes as dimension increases. 
If $\lim_{d\rightarrow\infty} \frac{\sqrt{Var[\|X\|_2}]}{\mathbb{E}[\|X\|_2]} = 0$ then $$ \plim_{d\rightarrow \infty}\frac{d_{max} - d_{min}}{d_{min}}=0$$ + \end{itemize} + \paragraph{Uses} In theory it is not useful, more information means easier task and models can ignore irrelevant features (weights set to 0). However, lots of inputs means lots of parameters and large input space. This implies the curse of dimensionality and the risk of overfitting. Feature selection can be used for many reasons. + \begin{itemize} + \item To visualize data in lower dimensions + \item To get insight about the causes of a problem + \item To reduce data collection time/cost, computation time, etc. + \end{itemize} + + \paragraph{Key elements} Subset relevance assessment (among $2^d - 1$ possible subsets of feature, which is the best) and Optimal subset search (how not to consider $2^d - 1$ subsets). + + \paragraph{Filter Approach} Filters are a model-free technique that use a relevance criterion that is not the final criterion to decide if a feature is useful or not. A variable (or set of variables) is relevant if it is statistically dependent on $y$: $P(y | x_i) \neq P(y)$ + + \paragraph{Wrapper approach} Wrappers use the (non)linear model to evaluate performance of feature selection (many models to design). A variable (or set of) is relevant if the model built on it shows good performances : $\min_f (y-f(x_i))^2 \approx 0$ + + + \subsection{Filters} + \subsubsection{Correlation} + \paragraph{Definition} Correlation between a random variable $X$ and a random variable $Y$ is + $$\rho_{XY} = \frac{\mathbb{E}[(X - \mathbb{E}[X]) (Y - \mathbb{E}[Y])]}{\sqrt{\mathbb{E}[(X - \mathbb{E}[X])^2] \cdot \mathbb{E}[(Y- \mathbb{E}[Y])]^2} }$$ + And is estimated as + $$r=\frac{\sum_{j=1}^N((x_j-\bar{x})(y_j-\bar{y}))}{\sqrt{\sum_{j=1}^N((x_j-\bar{x})^2)\sum_{j=1}^N((y_j-\bar{y})^2)}}$$ + + It measures linear dependencies, $r \in [-1, 1]$ and is null if there is no \textbf{linear} relation but low correlation does not mean the absence of relationship. High correlation does not mean causality either. + + \paragraph{Properties} It is linear and parametric (makes the hypothesis of a linear model), does not explain (correlation does not imply causation), is hard to define for more than two variables, and is very sensitive to outliers. + + \subsubsection{Mutual Information} + \paragraph{Definition} Mutual information between a random variable $X$ and a random variable $Y$ measures how the uncertainty of $Y$ is reduced when $X$ is known (and vice versa). + + \paragraph{Entropy} The entropy of a random variable is a measure on its uncertainty + \begin{align*} + H(Y) &= -\mathbb{E}[log(P[Y])]\\ + &= - \sum_{y \in \Omega} log(P[y])P[y] \text{ for $Y$ discrete}\\ + &= - \int log(P[y])dy \text{ for $Y$ continuous} + \end{align*} + It can be interpreted as the average number of bits needed to describe $y$. + + \paragraph{Conditional Entropy} $H(Y | X)$ measures the uncertainty on $Y$ when $X$ is known + $$H(Y | X) = H(Y, X) - H(X)$$ + If $Y$ and $X$ are independent, $H(Y | X) = H(Y)$. The uncertainty on $Y$ is the same if we know $X$ as if we don't. + + \paragraph{Mutual Information} The mutual information between a random variable $X$ and a random variable $Y$ is the difference between the entropy of $Y$ and the entropy of $Y$ when $X$ is known. + $$I(Y, X) = H(Y) - H(Y|X) = H(X) - H(X |Y)$$ + + \paragraph{Properties} $I(Y, Y) = H(Y)$ and $I(Y, X)$ is always non-negative and less than $min(H(Y), H(X))$. 
If $X$ and $Y$ are independent, then $I(Y, X) = 0$. It identifies non-linear relationships between variables. + + \paragraph{Relevance of a set of features} $X$ and $Y$ can be vectors. If $X$ is a subset of features, its relevance may still be evaluated. + + \paragraph{Issues} It is very difficult to estimate $I(Y, X)$; histograms, kernels and splines suffer from the curse of dimensionality as well. k-NN estimators are more robust. + + \paragraph{In practice} When dealing with bivariate mutual information only, the goal is to maximize relevance $I(X_i, Y)$ while reducing redundancy $I(X_i, X_j)$. This is a multi-objective criterion that depends on the weighing between the two criteria. + + \paragraph{Extensions} Some extensions to mutual information exist for tri-variate mutual information (interaction measure, etc.) + + \paragraph{Stopping criterion} In theory, mutual information of a set of variables can only increase when adding variables. It is a bad idea to stop when MI decreases (the estimation is not valid anymore). It is better to evaluate the statistical relevance (hypothesis test) of adding a new variable. Usually a fixed maximum number of variables is set. + \subsection{Wrappers} + \paragraph{General idea} Building models and evaluating them on the final criterion. However, each model needs to be trained and fitted for hyper-parameters (cross-validation) and its performances must be evaluated (double cross-validation) which is computationally expensive for some models. + + \subsubsection{Filter-Wrapper Comparison} + Table below shows a comparison between the filter and wrapper approach + \begin{center} + \begin{tabular}{|c|p{0.3\textwidth}|p{0.3\textwidth}|} + \hline + & \textbf{Filters} & \textbf{Wrappers} \\\hline + Advantages & \textbf{Fast} (only one model is built) and \textbf{intuitive} (identify statistical dependency) & Relevance criterion is \textbf{easy to estimate} and they are \textbf{model-aware} (identify optimal subset for optimal model)\\\hline + Disadvantages & Relevance criterion is \textbf{hard to estimate} and they are \textbf{model-ignorant} (most relevant subset might not be optimal for subsequent model) & \textbf{Slow} (must build lots of models) and \textbf{not intuitive} (features for best model might not actually be most explanatory variables)\\\hline + \end{tabular} + \end{center} + + \subsection{Greedy search method} + In theory all $2^d - 1$ subsets should be tried and evaluated but in practice, a greedy procedure is used to + \begin{enumerate} + \item Define an initial subset + \item Choose a strategy to update subset + \item Decide when to stop + \end{enumerate} + + \noindent It makes the hypothesis that the best subset can be constructed iteratively. + + \subsubsection{Forward Search} + \paragraph{Initial subset} An empty set is chosen. + \paragraph{Update strategy} Either filter approach (add feature that increases the most the relevance/redundancy compromise) or wrapper approach (add features that increases the most the performances of the model) + \paragraph{Stopping} If using a filter, a supplementary criterion is needed, if using a wrapper, stop when adding features increases the generalization error. + + \subsubsection{Backward search} + \paragraph{Initial subset} An full set is chosen. 
+ \paragraph{Update strategy} Either filter approach (remove feature that decreases the most the relevance/redundancy compromise) or wrapper approach (remove features that decreases the most the performances of the model) + \paragraph{Stopping} If using a filter, a supplementary criterion is needed, if using a wrapper, stop when removing features increases the generalization error. + + \subsubsection{Forward-Backward Search} + At each step, consider all additions and removals of one variable and select the best result. This makes sense with a wrapper approach but with the filter approach the mutual information cannot increase with less variables. + + \subsection{Genetic Algorithms} + This method is a random exploration of the space of subsets. It is based on the following steps: + \begin{itemize} + \item Draw initial population (candidate subsets) + \item Select best individuals + \item Apply cross-overs and mutation on individuals + \item Repeat from 2 until new population is generated + \item Select best individuals and repeat from 1 + \end{itemize} + + \subsection{Embedded Methods} + \paragraph{General Idea} Creating the model and restrict the number of features used by the model together. Optimizing the whole can only be better than separating into 2 parts. + + \paragraph{Example} Lasso is a linear model that does this by regularizing its weights while optimizing the model: $y = WX$ and + $$\min_W \frac{1}{N}\sum_{j=1}^N (y^j - Wx^j)^2 + \mu \|W\|_1$$ + Usually (Ridge Regression) the regularization term is a 2-norm. Here, the 1-norm reduces the effective number of features used by the model, both in theory and in practice. + + \subsection{Applications and examples} + \paragraph{Very high dimensional data} For the infrared spectra (Tecator) the data is very high dimensional (sometimes 1000, 2000 wavelengths). Estimation of Mutual Information is very hard. Two solutions are possible : + \begin{itemize} + \item Ranking the $X_i$ according to their mutual information with the output (does not take into account the relations between the $X_i$ + \item Forward selection on $I((X_i, X_j, X_k, …), Y)$ but the curse of dimensionality prevents a good estimation of the mutual information + \end{itemize} + A solution is to take a set of features selected by forward search (until decrease) and a set of features selected by ranking (until the union of the two reaches $N$ elements) and try the $2^N$ evaluation (either by filter or wrapper approach) + + \paragraph{Time series} Time series can be modeled using different sets of data: + \begin{itemize} + \item Sampling frequency + \begin{align*} + x(t+1) &= f(x(t), x(t-1), x(t-2), x(t-3), …)\\ + x(t+1) &= f(x(t), x(t-1), x(t-4), x(t-7), …) + \end{align*} + \item Size of the regressor + \begin{align*} + x(t+1) &= f(x(t), x(t-1), x(t-2), x(t-3), …)\\ + x(t+1) &= f(x(t), x(t-1), x(t-2)) + \end{align*} + \item Non-contiguous values + \begin{align*} + x(t+1) &= f(x(t), x(t-1), x(t-2), x(t-3),x(t-4), …)\\ + x(t+1) &= f(x(t), x(t-1), x(t-7), x(t-8), x(t-14), …) + \end{align*} + \end{itemize} + If a time series has an intrinsic dimension of $q$ then the size of the regressor must be in $\{q, …, 2q+1\}$. Meaning that in the $2q+1$ space there exists a $q$ surface without intersecting points and it is possible to project from $2q+1$ to $q$ dimensions. \\ + + In practice, many inputs variables are taken, $2q+1$ are selected through feature selection and if needed they are projected on $q$ new variables. 
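As a concrete illustration of the greedy wrapper procedures of this section, here is a minimal forward search (a sketch only, not the course implementation; NumPy assumed, the helper names are chosen for illustration, and the ``model'' is a plain least-squares fit whose validation error drives the search):
\begin{verbatim}
import numpy as np

def val_error(X_tr, y_tr, X_va, y_va, subset):
    # Wrapper criterion: validation MSE of a least-squares model on the subset.
    A_tr = np.c_[np.ones(len(X_tr)), X_tr[:, subset]]
    A_va = np.c_[np.ones(len(X_va)), X_va[:, subset]]
    w = np.linalg.lstsq(A_tr, y_tr, rcond=None)[0]
    return np.mean((A_va @ w - y_va) ** 2)

def forward_search(X_tr, y_tr, X_va, y_va):
    selected, best_err = [], np.inf
    remaining = list(range(X_tr.shape[1]))
    while remaining:
        # Try adding each remaining feature and keep the best candidate.
        err, j = min((val_error(X_tr, y_tr, X_va, y_va, selected + [j]), j)
                     for j in remaining)
        if err >= best_err:      # stop when adding a feature no longer helps
            break
        best_err = err
        selected.append(j)
        remaining.remove(j)
    return selected, best_err

# Toy data: only features 0 and 3 are informative.
rng = np.random.default_rng(2)
X = rng.standard_normal((300, 8))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.standard_normal(300)
print(forward_search(X[:200], y[:200], X[200:], y[200:]))
\end{verbatim}
On this toy problem the search typically selects the two informative features and then stops; a real wrapper would estimate the error by (double) cross-validation rather than a single validation split, as discussed above.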
\section{Model generation} -\label{sec:modelgen} -\subsection{Linear regression} -\begin{itemize} -\item \textbf{Main notation}: $y = w^\Tr x$, with $x=(1,x_1,x_2,...,x_N)^\Tr $ and $w=(w_0,w_1,w_2,...,w_N)^\Tr $. -\item \textbf{Error criterion (most common)}: $\frac{1}{P}\sum_{p=1}^P(t_p - w^\Tr x_p)^2=\frac{1}{P}\|t-w^\Tr X\|^2$ -\end{itemize} -\subsubsection{Pseudo-inverse} -\begin{itemize} -\item \textbf{Derivative of error criterion}: $\left(\diffp{E}{w}\right)^{\!\Tr} = \frac{2}{P}(w^\Tr X-t^\Tr )X^\Tr $ -\item \textbf{Minimum error at}: $w=(XX^\Tr )^{-1}Xt$ -\end{itemize} -\subsubsection{Gradient descent} -\begin{itemize} -\item \textbf{Gradient descent}: $x(t+1) = x(t)-\alpha \diffp*{f}{x}{x(t)}$ -\item \textbf{Gradient descent on error criterion}: $w(t+1) = w(t) + \frac{2\alpha}{P}X(t-X^\Tr w(t))$ -\item \textbf{Stochastic gradient descent}: As we can write that $E=\frac{1}{P}\sum_{p=1}^PE_p$, we can optimize the $E_k$ one at a time. $w(t+1) = w(t) + \frac{2\alpha}{P}(t-x_k^\Tr w(t))x_k$ -\end{itemize} -\subsubsection{Linear associative memory} -\begin{itemize} -\item Learning: $w=\sum_{p=1}^P t_px_p$. -\item Magic: iff $x$'s are orthogonal and normalized. -\end{itemize} -\subsubsection{Perceptron} -\begin{itemize} -\item With $sgn$ as activation function, classification model ($-1$ or $1$) -\item Object is correctly classified if $w^\Tr x_kt_k > 0$ -\item \textbf{Ideal error criterion}: $E = \sum_{w^\Tr x_kt_k < 0} 1$ (not continuous) -\item \textbf{Perceptron criterion}: $E = -\sum_{w^\Tr x_kt_k < 0} w^\Tr x_kt_k$ (continuous, but gradient is only piece-wise linear) -\item \textbf{Perceptron convergence theorem}: If a set is linearly separable, then gradient descent on the perceptron criterion will converge in a finite number of steps. Proof: verify that $\|w\|$ grows slower than $\sqrt{n}$ and that $w^\Tr_{sol}w$ is bounded below by a linear function of $n$. -\end{itemize} -\subsection{Multi-layer perceptron} -\subsubsection{Single layer} -\begin{itemize} -\item \textbf{Activation function}: $\sigma(w^\Tr x)$ -\item \textbf{Error function}: $E = \frac{1}{P}\sum_{p=1}^P(t_p - \sigma(w^\Tr x_p))^2$ -\item \textbf{Stochastic gradient descent rule}: $w(t+1) = w(t) + \frac{2\alpha}{P}(t-x_k^\Tr w(t))x_k \diffp*{\sigma}{p}{p=w(t)^\Tr x_k}$ -\end{itemize} -\subsubsection{Multi-layer} -\begin{itemize} -\item \textbf{Model}: $y(x) = h(w^{(2)}g(w^{(1)}x))$, $h$ can be linear, but not $g$ (otherwise, this 2--layer network can be represented by a single--layer network). -\item \textbf{Universal approximation property}: 2--layer MLP can approximate everything to an arbitrary precision if there are enough hidden units. -\item \textbf{Learning}: Algorithm - \begin{enumerate} - \item Apply an input vector $x_k$ through the network, computing all values - \item Evaluate error on the last layer (difference with $t_k$) - \item Back-propagate the error on each layer - \item Compute all the derivatives of $E$ - \item Gradient descent step - \end{enumerate} -\end{itemize} - -\subsubsection{Weight adjustment} - -Improvement to gradient descent: -\begin{itemize} -\item \textbf{Momentum}: add $\beta(w(t)-w(t-1))$ after the gradient descent step ($\beta\approx 0.9$) -\item \textbf{Adaptive learning rate}: making $\alpha$ a function, increasing (resp. decreasing) by a constant when the new direction is nearly the same as previous one (resp. not the same) (scalar product is $> 0$) -\end{itemize} - -Second-order methods + adaptive step size. 
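The pseudo-inverse and gradient-descent rules above can be compared on a toy problem (an illustrative sketch, not part of the original notes; NumPy assumed, with the samples stored as columns of $X$ and a constant $1$ as first component, as in the notation above):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)

# P samples of dimension 3 (a constant 1 plus two inputs), stored as columns of X.
P = 200
X = np.vstack([np.ones(P), rng.standard_normal((2, P))])
w_true = np.array([1.0, -2.0, 0.5])
t = w_true @ X + 0.05 * rng.standard_normal(P)

# Pseudo-inverse solution: w = (X X^T)^{-1} X t
w_pinv = np.linalg.solve(X @ X.T, X @ t)

# Batch gradient descent on E = (1/P) ||t - X^T w||^2
w = np.zeros(3)
alpha = 0.1
for _ in range(2000):
    w = w + (2 * alpha / P) * X @ (t - X.T @ w)

print(np.round(w_pinv, 3), np.round(w, 3))   # both close to w_true
\end{verbatim}
Both estimates coincide, up to the tolerance of the iterative scheme, with the least-squares solution.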
- -\subsection{Radial-basis function networks} -\begin{itemize} -\item \textbf{Cover's theorem}: The probability of $\phi$--separability ($\exists \textrm{ vector } w$ that linearly separates $\phi(X)$) tends to one if $\phi_i$ are not linear and more than the number of dimensions. -\item \textbf{RBFN}: We search for $F:R_D\rightarrow R$ such that $F(x_p)=t_p, \ \forall p \in P$. \\ -Solution (with regularization $\lambda$): $$F(x) = \sum_{p=1}^P w_pG(x,x_p)$$ $$w=(G+\lambda I)^{-1}t$$ ($G+\lambda I$ non-singular if $x_k$ distincts (Michelli)). -\item \textbf{Generalized RBFN}: $$F(x) = \sum_{i=1}^K w_p\phi(\|x-c_i\|)$$ $$\phi(\|x-c_i\|)=\exp\left(-\frac{\|x-c_i\|^2}{2\sigma_i^2}\right)$$ -\item \textbf{Park \& Sandberg}: For any function $f$, it exists a function $F$ with an unspecified $K$ that approximates $f$ as good as willed. -\item \textbf{Learning}: - \begin{enumerate} - \item \textbf{Centers} $c_i$: use vector quantization to find the centers ($c_i$ are centroids). - \item \textbf{Widths} $\sigma_i$: choose them according to the standard deviation around the centroids. - \item \textbf{Weigths} $w_i$: with $c_i$ and $\sigma_i$, finding $w_i$ becomes linear. Simply use pseudo-inverse or SVD. - \end{enumerate} -\item \textbf{Standard deviation of the widths}: - \begin{itemize} - \item $\sigma = \frac{d_{max}}{\sqrt{2K}}$ where $d_{max}$ is the max distance between centroids - \item $\sigma_i = \frac{1}{q}\sqrt{\sum_{j=1}{q}\|c_i-c_j\|^2}$ scans the $q$ nearest centroids - \item $\sigma_i = r \min_j(\|c_i-c_j\|)$ where $r$ is an overlap constant - \end{itemize} -\item \textbf{Improvements}: - \begin{itemize} - \item Optionnal computation step: gradient descent on all parameters - \item Add constant and linear terms to $F(x)$ - \item Normalize the $\phi_i$ to make them bounded to $[0,1]$ - \end{itemize} -\end{itemize} - -\subsection{Vector Quantization} -\subsubsection{Background} -\begin{itemize} -\item \textbf{Goal}: find $Q$ vectors(\textbf{centroids}) ($Q
\ll P$) that minimize the least square error.
-\item \textbf{Weighted distances} $d_W(x_i,y_j) = (x_i-y_j)^\Tr W(x_i-y_j)$.
- \begin{itemize}
- \item \textbf{Mahalanobis distance} $W=C_X^{-1}$ (inverse of the data covariance matrix)
- \item \textbf{$W$ symmetrical} $d_W(x_i,y_j) = d_2(Px_i,Py_j)$ with $W=P^\Tr P$
- \end{itemize}
-\end{itemize}
-\subsubsection{Lloyd's principle}
-\begin{itemize}
-\item \textbf{Lloyd's Principle} VQ is a two stage process: $x_i \rightarrow \text{encoder } \alpha \rightarrow j \rightarrow \text{decoder } \beta \rightarrow y_j$.
-\item \textbf{$\alpha$ known}: $\beta(j) = \argmin_{y_j}(E(d(x_i,y_j)|\alpha(x_i)=j))$ (center of mass)
-\item \textbf{$\beta$ known}: $\alpha(x_i) = \argmin_j(d(x_i,y_j)) \text{ where } y_j =\beta(j)$ (nearest neighbour)
-\item \textbf{Lloyd's algorithm}
- \begin{enumerate}
- \item Choose initial centroids
- \item Evaluate current error
- \item If error small enough, stop
- \item Replace each centroid $y_j$ with the center-of-mass of the points associated with it
- \item Go back to step 2
- \end{enumerate}
- \begin{itemize}
- \item \textbf{Hypothesis}: The probability of finding a point on the border of a Voronoi zone is $0$
- \item \textbf{Limitation}: Local minimum of the error
- \end{itemize}
-\item \textbf{Initialisation of Lloyd's algo}:
- \begin{itemize}
- \item At random in the input space (no data taken into account)
- \item First $Q$ points of $X$
- \item Randomly chosen points in $X$
- \item Product codes: product of scalar quantizers
- \item Growing initial set (choose randomly in $X$, but keep only if distance is sufficient)
- \item Pairwise nearest neighbor (begin with all $X$, and at each step merge the two closest centroids). Variant: merge the two centroids giving the lowest increase of error.
- \item Splitting: begin with two centroids, apply Lloyd's, then disturb them with two new, etc.
- \end{itemize}
-\end{itemize}
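A compact sketch of Lloyd's algorithm as described above (illustration only, NumPy assumed; the centroids are initialised with $Q$ randomly chosen points of $X$, one of the initialisation options listed):
\begin{verbatim}
import numpy as np

def lloyd(X, Q, n_iter=50, seed=0):
    # X: (P, d) data points; returns (Q, d) centroids.
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), Q, replace=False)].copy()   # init: Q random points of X
    for _ in range(n_iter):           # fixed iteration count instead of an error test
        # Encoder: assign every point to its nearest centroid.
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        j = d2.argmin(axis=1)
        # Decoder: move each centroid to the center of mass of its points.
        for q in range(Q):
            if np.any(j == q):
                Y[q] = X[j == q].mean(axis=0)
    return Y

X = np.random.default_rng(4).standard_normal((500, 2))
print(lloyd(X, Q=5))
\end{verbatim}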
-\subsubsection{Competitive learning}
-\begin{itemize}
- \item For each $x_i$, change the position of the nearest centroid: $y_k(t+1)=y_k(t)+\alpha(x_i-y_k)$
- \item $E = \int(x-y_{j(x)})^2 \, p(x) \dif x$
- \item Robbins-Monro conditions on $\alpha$: $\sum_{t=0}^\infty \alpha(t) = \infty$ and $\sum_{t=0}^\infty \alpha(t)^2 < \infty$
- \item Limitations: convergence to a local minimum; some centroids may be ``lost'' (never selected, hence never updated)
- \item \textbf{Frequency-sensitive learning}: multiply the distances by a function $u$ that is incremented each time this centroid is selected
-\end{itemize}
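The same quantization can be obtained online with the competitive-learning rule above (a sketch, NumPy assumed; the step size $\alpha(t)=1/t$ is one choice that satisfies the Robbins-Monro conditions):
\begin{verbatim}
import numpy as np

def competitive_learning(X, Q, n_epochs=20, seed=0):
    # X: (P, d) data; returns (Q, d) centroids adapted one sample at a time.
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), Q, replace=False)].copy()
    t = 1
    for _ in range(n_epochs):
        for x in rng.permutation(X):
            alpha = 1.0 / t                  # Robbins-Monro compliant step size
            k = np.argmin(((Y - x) ** 2).sum(axis=1))
            Y[k] += alpha * (x - Y[k])       # move only the winner towards x
            t += 1
    return Y

X = np.random.default_rng(5).standard_normal((500, 2))
print(competitive_learning(X, Q=5))
\end{verbatim}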
-\subsubsection{Learning Vector Quantization}
-\paragraph{LVQ1}
-\begin{itemize}
-\item With classes
-\item Same algo as competitive learning if centroid and $x_i$ are in the same class: $y_k(t+1)=y_k(t)+\alpha(x_i-y_k)$
-\item But if the nearest centroid belongs to a different class than $x_i$, move it away: $y_k(t+1)=y_k(t)-\alpha(x_i-y_k)$
-\end{itemize}
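A single LVQ1 update step, as a sketch (NumPy assumed; \texttt{Y} holds the centroids and \texttt{c} their class labels, names chosen here for illustration):
\begin{verbatim}
import numpy as np

def lvq1_step(Y, c, x_i, t_i, alpha=0.05):
    # Y: (Q, d) centroids, c: (Q,) centroid classes, x_i: one sample, t_i: its class.
    k = np.argmin(((Y - x_i) ** 2).sum(axis=1))   # winner = nearest centroid
    sign = 1.0 if c[k] == t_i else -1.0           # attract if same class, repel otherwise
    Y[k] = Y[k] + sign * alpha * (x_i - Y[k])
    return Y

Y = np.array([[0.0, 0.0], [1.0, 1.0]])
c = np.array([0, 1])
print(lvq1_step(Y, c, x_i=np.array([0.2, 0.1]), t_i=1))  # winner has the wrong class: repelled
\end{verbatim}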
-\paragraph{LVQ2}
-\begin{itemize}
-\item Tries to reach the Bayes (optimal) boundary
-\item Selection of two winners
-\item Move the two winners if
- \begin{enumerate}
- \item One winner ($y_{a}$) is in the same class as $x_i$ AND
- \item The other one ($y_{b}$) is not AND
- \item $x_i$ is in between the two centroids, in a window of fixed size
- \end{enumerate}
-\item Move: $y_a(t+1)=y_a(t)+\alpha(x_i-y_a(t))$ (towards $x_i$) and $y_b(t+1)=y_b(t)-\alpha(x_i-y_b(t))$ (away from $x_i$)
-\end{itemize}
-\paragraph{Improvements}
-\begin{itemize}
-\item \textbf{Forgetting factor}: progressively ignore data from boundary
-\item \textbf{Dynamic Vector Quantization}: Add centroids if needed ($x_i$ of class 1 is the nearest point to a centroid of class 2, or the distance between a point and its centroid is too large)
-\end{itemize}
-\subsubsection{Winner-take-all/Winner-take-most}
-\begin{itemize}
-\item Adapt not only the winner but other centroids as well
-\item Example: \textbf{Neural gas}: add a factor $e^{-h(j,x_i)/\lambda}$ on the updates, where $h$ is the rank of the centroid when ordered by distance.
-\end{itemize}
-
-\subsection{Self-Organizing Maps}
-\begin{itemize}
-\item VQ with a supplementary grid space
-\item \textbf{Topology conservation}: centroids that are close in the grid space are more likely to be close in the data space
-\item \textbf{Adaptation of weights}: Take the winner $k=\argmax_i(w_i^\Tr x)$ and update the weights: $$w_j(t+1)= w_j(t) + \alpha(t)(x-w_j(t))\,V(k,j)$$
-\item $\alpha(t)$ and $r(t)$ decrease over time.
-\item $\alpha$ must follow Robbins-Monro conditions.
-\item \textbf{Hard neighbourhood}: $$V(k,j)=\left\{ \begin{array}{l l}
-1 & \text{if } d(k,j)