# Short Introduction to PCA

In Principal Component Analysis (PCA), we would like to convert our high-dimensional dataset onto a lower-dimensional space while keeping as much information as possible. Typically, this is done to avoid curse of dimensionality effects or for the purposes of data visualization.

In broad strokes, PCA reduces the dimensionality of our dataset in a way that minimizes (certain aspects of) the amount of information we throw away by projecting our $p$-dimensional feature set onto a lower-dimensional subspace.

We take the mean-centered, normalized $p$ by $n$ dataset $X$ with $p$ features and $n$ training examples. We apply principal component analysis (PCA) to project $p$-dimensional data into a q-dimensional sub-space where $q\leq p$.

PCA works by first computing the covariance matrix of our features (alternatively, computing the scatter matrix and scaling the eigenvalues later), the covariance matrix somewhat confusingly being denoted with sigma, not to be confused with summation notation:

$\Sigma = \sum\limits_{k=1}^{n} (x_k-m)(x_k-m)^T$

where $x_k$ is a datapoint and $m$ is the $p$-dimensional vector of mean values across the $n$ examples.

The diagonal values of the covariance matrix $\Sigma$ correspond to the variance of each feature, while the (symmetrical) off-diagonal values correspond to covariance between any two features. As a square, symmetric matrix, the covariance matrix can be readily used to find the set of $p$ scalar eigenvalues (corresponding to the new diagonal) and their corresponding eigenvectors, such that for all $v$ and $\lambda$:

$\Sigma v = \lambda v$

where $v$ is one of our eigenvectors and $\lambda$ its corresponding eigenvalue

Now we have a set of eigenvectors in our $p$-dimensional space that will serve as the new orthogonal “axes” of of the $q$-dimensional features subspace. The computed eigenvalues have a magnitude that corresponds to the variance along each new “axis” of its associated eigenvector, so we can see that each axis “captures” a different amount of the information and diversity in our dataset. In the graph below, note the difference in how much data the 1st and 2nd dimension components capture.

What we would like to do is reduce the dimensionality of our dataset while controlling and minimizing the amount of information lost in doing so. The eigenvectors with the lowest eigenvalues generally have the least information about our dataset, so it makes most sense to eliminate these eigenvectors first. The sum of the eigenvalues is the total explained variance, and accordingly the percentage of explained variance associated with each eigenvector/eigenvalue pair is eigenvalue $\lambda$ divided by the total variance $\Sigma = \sum\limits_{i=1}^{p} \lambda_i$. Looking at the eigenvalues sorted in descending order, we pick a subset such that the cumulative variance of the data is captured in our $q$-dimensional subspace up to a specified amount, e.g. we eliminate components such that at least 90% of the variance is retained, or until we are satisfied with the dimension of our new subspace.

We then transform the data by projecting it onto the q-dimensional subspace. Intuitively, we have created a new set of axes and kept the onces which do a good job of capturing the variance in our data. Each eigenvector is centered in our data “cloud,” and we have chosen the axes within that cloud pointing in a $p$-dimensional direction that both maximizes the captured variance of our data and minimizes the distance between our new axis and the datapoints. To perform this last step, we create our new dataset $Y$ by taking our new $q$-dimensional matrix $W$, transposing it, and multiplying it with our original dataset:

$Y = W^{T}X$

Note that this process is not the same as feature selection; the orthogonal basis vectors we created do not directly correspond to specific features of the original dataset. Nor should PCA be used as a substitute for feature selection because while it does eliminate the dimensionality of the data, techniques specifically designed for this purpose such as L1 regularization or library instances of feature selection should perform at least as well in doing so, and probably better. PCA is primarily employed to reduce the dimensionality of the data, agnostic of features, and for purposes of visualization.

In practice, singular value decomposition (SVD) also gets us eigenvectors and their corresponding eigenvalues, and is often used for computational precision due to possible rounding errors in the extra step of computing the covariance matrix.
___

[An excellent step-by-step walkthrough in Python by Sebastian Raschka](http://sebastianraschka.com/Articles/2014_pca_step_by_step.html#sections)