Dimensionality Reduction - A Visual Primer

We use these techniques for data reduction often, sometimes blindly. Here I attempt to provide an intuitive explanation of how they work, using images and examples.

A Crash Course in SVD and PCA

At their core, PCA and SVD are methods to reduce the dimensionality of data.

With high order data sets (i.e. data sets where each observation is described by many measurements), it is useful to simplify the matrix to a “lower order”. The lower order matrix has fewer dimensions, facilitating both visualization and further analysis in a couple of different ways:

In the most common example, dimensionality reduction lets you take a high order data set and plot it in two or three dimensions, i.e. onto a two dimensional plane (like a sheet of paper). This reduction allows mere mortals to find patterns in their data.

When generating training data for modeling / machine learning techniques, this reduction improves performance in two ways: in power, by reducing problems with overfitting your data, and in speed, by limiting the number of calculations required.

A contrived example

I’m only going to walk through SVD here, but you should know that PCA and SVD are essentially the same operation on a matrix; PCA simply centers the data prior to the transformation.
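To make that relationship concrete, here is a minimal sketch (my own, not from the original post) showing that svd() applied to a column-centered matrix recovers the same rotation that prcomp() reports:

set.seed(42)
m <- matrix(rnorm(200), nrow = 20, ncol = 10)
centered <- scale(m, center = TRUE, scale = FALSE)   # subtract each column mean
pca <- prcomp(m, center = TRUE, scale. = FALSE)
s <- svd(centered)
# The right singular vectors match PCA's rotation (columns may differ in sign).
all.equal(abs(unname(pca$rotation)), abs(s$v))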

Let’s start with a random matrix, where 10 individuals (rows) each have 40 observations (columns).

# color palette used by image() throughout (assumed; not defined in this excerpt)
colors <- colorRampPalette(c("blue", "white", "red"))(64)

# 10 individuals (rows) x 40 observations (columns) of pure noise
dataMatrix <- matrix(rnorm(400), nrow = 10)
image(dataMatrix, col = colors, axes = FALSE)
axis(1, at = seq(0, 1, length.out = 10), labels = 1:10)
axis(2, at = seq(0, 1, length.out = 40), labels = 1:40)

Now let’s add some structure to the data:
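The code for this step isn’t shown in this excerpt, so here is one hypothetical way to produce the kind of block pattern described below: shift one block of the matrix upward so that half the individuals and half the measurements stand out from the noise.

# Hypothetical sketch only; not the original post's code.
dataMatrix[6:10, 21:40] <- dataMatrix[6:10, 21:40] + 5
image(dataMatrix, col = colors, axes = FALSE)
axis(1, at = seq(0, 1, length.out = 10), labels = 1:10)
axis(2, at = seq(0, 1, length.out = 40), labels = 1:40)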

Let’s verify that structure with some hierarchical clustering.
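The clustering code isn’t included either; a minimal sketch, assuming the structured dataMatrix from above, is to let heatmap() do the work:

# heatmap() runs hclust() on both rows and columns and reorders the image accordingly.
heatmap(dataMatrix, col = colors)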

It sure looks like there is a marked difference between the samples on the left hand side and those on the right hand side, and that the variables in the top half are much less random than the variables in the bottom half.

Let’s look a little at how our data is structured:
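The summary plots implied here aren’t shown in the excerpt; a simple stand-in (my own) is to plot the mean of each row and of each column:

par(mfrow = c(1, 2))
plot(rowMeans(dataMatrix), xlab = "Row (individual)", ylab = "Row mean")
plot(colMeans(dataMatrix), xlab = "Column (measurement)", ylab = "Column mean")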

There’s a pretty clear pattern in the data: rows one through five have an average value of about 0.2 while rows six through ten average about 2.7, and columns one through twenty center around 0 while columns twenty-one through forty center around 2.5.

Let’s see if we can’t simplify this a little using singular value decomposition. SVD describes the original data as the product of three other matrices, given by the equation X = U D V^T. For future reference, U’s columns contain the left singular vectors, V’s columns contain the right singular vectors, and the diagonal matrix D contains the singular values.

svd.data <- svd(dataMatrix)  # returns a list with components u, d (singular values), and v

This function returns the three components that can be recombined to recreate the original data set. They are stored in the returned list under the names used in the equation above (svd.data$u, svd.data$d, svd.data$v); note that d comes back as a plain vector of singular values, which is why it is wrapped in diag() below to rebuild the diagonal matrix. Just to prove the math is right, let’s recombine the three matrices and see if we get our original matrix back.

reconstituted <- svd.data$u %*% diag(svd.data$d) %*% t(svd.data$v)

par(mfrow = c(1, 2))

image(reconstituted, col = colors, axes = FALSE, main = "Reconstituted")
axis(1, at = seq(0, 1, length.out = 10), labels = 1:10)
axis(2, at = seq(0, 1, length.out = 40), labels = 1:40)

image(dataMatrix, col = colors, axes = FALSE, main = "Original")
axis(1, at = seq(0, 1, length.out = 10), labels = 1:10)
axis(2, at = seq(0, 1, length.out = 40), labels = 1:40)

Looks good. Let’s look at the component data that SVD returns.

Let’s start by looking at the first columns of the matrices U and V.

par(mfrow = c(1, 2))
plot(svd.data$v[, 1], xlab = "", ylab = "", main = "First Column of V")
plot(svd.data$u[, 1], xlab = "", ylab = "", main = "First Column of U")

These two plots should look very familiar: the first looks a lot like the column means plot from earlier, and the second like the row means plot. That’s because they are closely related to those summaries, so SVD has picked up the same pattern. But why was the FIRST column so important?

We need to look at the diagonal matrix D. Its singular values tell us how much each component contributes; squaring them and dividing by their sum gives the proportion of variance explained by each component.

plot(svd.data$d^2 / sum(svd.data$d^2), main = "Variance Explained by Each PC", xlab = "PC", ylab = "Variance Explained")

As you can see, essentially all of the variance is explained by the first component; in this case our data can be described almost entirely in one dimension. To test this, let’s see if we can recreate the original matrix using only the first singular value and the first columns of U and V.

recon_1_col <- svd.data$u[, 1] %*% diag(svd.data$d[1], 1, 1) %*% t(svd.data$v[, 1])

par(mfrow = c(1, 2))

image(recon_1_col, col = colors, axes = FALSE, main = "Reconstituted")
axis(1, at = seq(0, 1, length.out = 10), labels = 1:10)
axis(2, at = seq(0, 1, length.out = 40), labels = 1:40)

image(dataMatrix, col = colors, axes = FALSE, main = "Original")
axis(1, at = seq(0, 1, length.out = 10), labels = 1:10)
axis(2, at = seq(0, 1, length.out = 40), labels = 1:40)

The reconstituted data looks amazingly close to our original data set.

For those who haven’t done matrix multiplication in a while, you can think of this reconstitution in the following light:

  • The column of U supplies one value for each row of the matrix (roughly a per-individual profile).

  • The column of V supplies one value for each column (roughly a per-measurement profile).

  • The singular value in D is a scaling applied to all values.

Multiplying them together produces a matrix of 10 rows by 40 columns, where each cell is the product of the corresponding U value, V value, and the singular value.
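If it helps, the one-component reconstruction above is just a scaled outer product; this quick check (my own addition) confirms the two forms agree:

# d[1] times the outer product of the first columns of U and V equals the
# rank-1 reconstruction computed earlier.
outer_form <- svd.data$d[1] * (svd.data$u[, 1] %o% svd.data$v[, 1])
all.equal(outer_form, recon_1_col)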

Let’s see what happens as we allow more of our components back in when reconstituting the original matrix.
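The figure for that comparison isn’t shown here; one way you might produce it (a sketch of my own) is to rebuild the matrix from the first k components for a few values of k and image each result:

par(mfrow = c(1, 3))
for (k in c(1, 2, 5)) {
  # keep the first k singular values/vectors and multiply them back together
  approx_k <- svd.data$u[, 1:k, drop = FALSE] %*%
    diag(svd.data$d[1:k], k, k) %*%
    t(svd.data$v[, 1:k, drop = FALSE])
  image(approx_k, col = colors, axes = FALSE, main = paste(k, "component(s)"))
}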

As we add more components back into the reconstituted matrix, we get closer and closer to the original. This is because each additional component captures some of the variance left unexplained by the components before it.

So let’s see if we can separate our data based on a single dimension:

plot(svd.data$u[, 1], rep(0, 10), xlab = "Transformed Coordinate", ylab = "", main = "Data plotted as a number line")

Here we’ve managed to take 10 data points that were described by 40 separate measurements and simplify them to 10 data points described by a single measurement, turning what could be a complex situation into one that is easy to visualize and understand.

plot(svd.data$v[, 1], rep(0, 40), type = "n", ylab = "", yaxt = "n")
text(svd.data$v[, 1], jitter(rep(0, 40), 1, 1), labels = 1:40, cex = 0.7)

What does the corresponding vector from V tell us? Mainly, how much “pull” each measurement (column) exerts in that direction. You could use this information to find the measurements that are most important in your data set.
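As a small illustration (my own addition), you could rank the measurements by the absolute size of their loading on that first right singular vector:

# Columns with the largest absolute loadings contribute most to this direction.
head(order(abs(svd.data$v[, 1]), decreasing = TRUE))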

General limitations of these methods.

  • These methods do not handle missing data well. If your data has missing values, you’ll need to devise a way to impute them (one common approach is sketched after this list).

  • Outliers will strongly affect the transformation (and your ability to reduce the data).

  • They assume your data can be described well by means and covariances.
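On the first point: svd() simply errors out when the matrix contains missing values, so here is a minimal sketch of one common workaround (my assumption, not a recommendation from the post): fill each missing cell with its column mean before decomposing.

# Hypothetical helper: replace NAs in each column with that column's mean.
impute_col_means <- function(m) {
  for (j in seq_len(ncol(m))) {
    m[is.na(m[, j]), j] <- mean(m[, j], na.rm = TRUE)
  }
  m
}

m <- dataMatrix
m[2, 7] <- NA                    # pretend one value is missing
svd(impute_col_means(m))$d[1]    # svd() now runs without error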