Sunday, May 10, 2009

Understanding Principal Component Analysis via cool Gapminder graphs



Gapminder.org is a wonderful site full of "statistical porn". This chart in particular is a fascinating graph that demonstrates the correlation between income and child mortality rates. It is also a great example to teach about a cool statistical tool: "Principal Component Analysis".

In this graph of regions there is an obvious negative correlation between infant mortality and income illustrated by the fact that the data points scatter along a line from upper left to lower right. In other words, if you knew only the infant mortality rate or the income of a region you could make a reasonable guess at the other.

Principal Component Analysis (PCA) is a statistical tool that’s very useful in situations like this. PCA delivers a new set of axes that are well aligned to correlated data like this -- I've illustrated them here with black and red lines. For each axis, it also returns a “variance strength” which I’ve represented as the length of the black and red axes. (Actually I just hand approximated these axes by eye for the purposes of illustration).

The strongest new axis returned by PCA (the black one) aligns well with the primary axis of the data. In other words, if one were forced to summarize a region with a single number it would be best to do so with the position along this black axis. The zero point on the axis is arbitrary but is usually positioned in the center of the data (the mean). Positive valued points along this black axis would be those regions further toward the lower right and negative valued regions would be those further toward the upper left. Let’s call this new axis “wealth” to separate it in our minds from “income” which is the horizontal axis of the original data set. Increases in “wealth” represent an increase in income and drop in infant mortality simultaneously.

The second axis returned by PCA is shown as the red axis. Countries that lie far off the main diagonal trend-line (black axis) have particularly unique infant mortality rates given their wealth which we’ll assume is because of something unique about their health care systems. Points well below the black axis are regions that have very good health care given their wealth and those above it have particularly poor health care given their wealth.

Because PCA gives us convenient axes that are well aligned to the data, it makes senses to just rotate the graph to align to these new axes as illustrated here. Nothing has changed here, we've simply made the graph easier to read.



Before you even look at specific regions on these new axes, one could guess that socialist countries would score more negatively along this red axis and those whose economy is heavily biased towards mineral extraction -- where income tends to be very unevenly distributed -- would score more positively. Indeed, this is confirmed. The most obvious outliers below the black axis are Cuba and Vietnam where communist governments have directed the economy to spend disproportionately on health care and the outliers on the other side are: Saudi Arabia, South Africa, and Botswana -- all regions heavily dependent on resource extraction where the mean income statistics hide the reality that few are doing very well while the vast majority are in extreme relative poverty.

One particularly interesting outlier is Washington DC which is located as far along the red axis as is Botswana! In other words, based on this realigned graph, you might guess that the wealth in DC is as unevenly distributed as it is in Botswana. Fascinating! (The observation is probably at least partially explained by the fact that it is the only all urban "state" and urban areas will tend to have wider income distributions than rural/suburban areas.) Also note that all of the points in the United States (orange) are well into positive territory on the red axis -- our health care system is as messed up relative to our wealth as is Chad, Bhutan, and Kazakhstan -- countries with completely screwed-up governmental agendas. Think of it this way: the degree to which our infant mortality rates are "good" owes everything to our wealth and is despite the variables independent of wealth! In other words, countries that provide average health-care relative to their wealth like El Salvador, Ukraine, Australia and the UK fall right on the black axis but we fall significantly above that line -- roughly the same place as countries that are, independent of their wealth, really messed up like Chad and Kazakhstan. (A caveat: the chart is on a log scale so the comparative analysis is more subtle than I'm making it out here.)

PCA returns not only the direction of the new axes but also the variance of the data along those axes. To understand this, imagine for a moment that all the regions of the world had exactly the same health care given their income; in this case all the points would align perfectly along the main trend line (the black axis) and the variance of the red axis would be zero. In this imaginary case, the data would be “one dimensional”, that is income and infant mortality would be one in the same statement; if you knew one, you'd know the other exactly. Now imagine the opposite scenario. Imagine that there was no relationship at all between income and infant mortality; in that case we would see a scattering of points all over the place and there wouldn’t any obvious trend lines. Neither of these imaginary scenarios are what we see in the actual data. It isn’t quite a line along the black axis but neither is it a buckshot scattering of points, so we can say the data is somewhere between 1 dimensional and 2 dimensional. If both variances are large and equal to each other, then the system is 2 dimensional while if one of the variances is large while the other is near zero, then we know the system is nearly 1 dimensional. In other words, PCA permits you to summarize complicated data by finding axes of low variance and simply eliminate them. This technique is called “dimensional reduction” and is a very powerful tool for summarizing complicated data sets such as would arise if we looked at more than two variables. For example, we might include: car ownership, water accessibility, education, average adult height, etc to the analysis at which point performing a dimensional reduction would help to get our heads around any simplifications we might wish to make.

2 comments:

Edward said...

Nice PCA example! I love the Gapminder visualizations...

Ram said...

Thanks, Zack ! great analogy as always and that last paragraph really cleared things up about PCA in my head...