I recently came across an interesting article which aims to explain the Principal Component Analysis (PCA) to a technical audience. The article cites a definition from the "Pattern Recognition and Machine Learning" textbook by Christopher M. Bishop:
PCA can be defined as the orthogonal projection of the data onto a lower dimensional linear space, known as the principal subspace, such that the variance of the projected data is maximized.
Reading this quote reminded me of my long-standing impression that the way most people talk about PCA, even though technically correct, is often too complicated, incomplete and misses some very important points.
I have this impression about many concepts in statistics (I'm looking at you Receiver Operator Characteristics), but it doesn't have to be this way. Understanding PCA can be as simple as understanding how legos work.
I decided to try and give you another angle on PCA, one which I believe is simpler and hits on the main points. This also gives me an opportunity to talk about legos, which I find to be a plus in all occasions.
MATHEMATICS OF BUILDING BLOCKS
A lot of math is about building blocks. Eigenvectors are building blocks, so are prime numbers, Fourier modes (and for that matter all special functions), terms in a Taylor series etc. This is because, just like with most things, it is often easier to study complex mathematical objects by breaking them into a set of building blocks first and then studying the building blocks.
The mathematical building blocks can vary widely, but they are often defined such that they have two common properties:
- They are linear - meaning that we can multiply them with a number and add them up.
- They are orthogonal - meaning that if we "project" any building block onto another, we will get 0. In other words, information contained inside each building block is unique to that building block. This, as it turns out, has deep implications.
In order to explain what this means in practice, consider for example a 1D time series.
To a good approximation I can always write this time series as:
PCA IS ABOUT CONSTRUCTING OPTIMAL BUILDING BLOCKS
Most people who are familiar with PCA have seen it written in this form¹:
Most of the data we deal with is some form of a matrix. Images are matrices of pixels. Tables are matrices of features and observations. Often we want to find approximations of our data which minimise information loss, for reasons ranging anywhere from data compression to the "curse of dimensionality".
We use PCA to alleviate these problems by default. But why? We could have used any number of matrix decompositions to reduce data dimensionality. We could also choose any building blocks we want. What's so special about PCA?
It turns out that PCA gives the optimal building blocks for a matrix, assuming that your building blocks are linear. In fact, there is a theorem (i.e. the Eckart–Young theorem) which shows that if you choose any fixed number building blocks...
As an illustration, imagine I asked you to build a lego house, without a limit on the number or type of building blocks you could use. You might come up with something like this:
Now imagine I asked you to build a house, but you could only use 10 lego pieces. You would probably look for 10 pieces which would allow you to include the most dominant features of a house: walls, roof, maybe a door or a window.
Perhaps something like this:
PCA is very much like building a lego house out of a finite set of pieces: you are trying to find the best linear approximation to your data by using only a few building blocks.
And this is precisely why we use PCA: because it gives us building blocks such that they add up to the best² possible representation of our data with any fixed number of building blocks. (This is what people mean when they say that PCA maximises explained variance.)
Keep in mind that PCA gives the best possible building blocks if you require them to be linear, but outside of this scope, you might be able to do better (e.g. embedding methods).
PCA IS A COMPRESSION ALGORITHM
Another way to think about the lego house example from the previous section is to think of it as a compressed representation of a house. Both objects we built represent the similar information: the object is a house. One does it with many pieces, the other with only 10.
Let's look at another example. As we can decompose almost any matrix into building blocks using PCA, I can do this with any image, such as this one:
Let's see what happens when we start rebuilding this image from the building blocks and coefficients given by PCA, order by order (i.e. keeping only the building block associated with the highest coefficient, then highest two, highest three...):
Even with only one building block, PCA tries to approximate the general composition of the image. It's the best it can do with one building block - look for the most dominant features in the image (It identifies that there is a dark area in the upper left, and that there is a light area in the middle).
Adding additional building blocks picks up the dark region on the bottom of the image. Ten blocks already starts to look like the original image, and by the time we reach 30 we have already recovered most of the original information in the image.
Let's see what we've achieved in terms of the compression rate. The original image was 200 x 170 = 34000 float pixels. Keeping the first 30 building blocks means that we managed to represent this image with only 30 row vectors of dimension 200, 30 column vectors of dimension 170, and 30 coefficients which amounts to 30 x (170+200) + 30 = 11130 floats. In other words, we can reconstruct the original image to a good degree with a compression rate of roughly 70% using PCA. Not bad!
This is what dimensional reduction is all about. Finding a compressed representation of your data which tries to minimises information loss. It is precisely what PCA does for you.
I hope I managed to demystify PCA for you at least a little bit and to convince you to think about PCA not in terms of convoluted definitions, but in terms of simple building blocks that you use to construct approximations to your data.
My personal experience was that once I started thinking about linear algebra, PCA, vectors etc. in terms of building blocks, many things simply clicked in and started making sense, even beyond linear algebra and matrices.
For all of you who are more on the math side, I'll give you a very technical example: if you ever used Fourier transforms, you probably know that sines and cosines (Fourier modes) appear as solutions to a particular differential equation with a specific boundary condition (e.g. the wave equation), which in 1D can be written as:
I will leave you, for now, with a word of advice: always learn more, always strive to understand better and don't be afraid to think outside the box!
¹ Although not technically correct, Singular Value Decomposition (SVM) and PCA are commonly used as synonyms. In this article I do the same.
² "The best" here means that the difference between the original data and the approximated is minimised (e.g. Hilbert-Schmidt norm of the matrix difference).