Sunday, May 10, 2009

Understanding Principal Component Analysis via cool Gapminder graphs



Gapminder.org is a wonderful site full of "statistical porn". This chart in particular is a fascinating graph that demonstrates the correlation between income and child mortality rates. It is also a great example to teach about a cool statistical tool: "Principal Component Analysis".

In this graph of regions there is an obvious negative correlation between infant mortality and income illustrated by the fact that the data points scatter along a line from upper left to lower right. In other words, if you knew only the infant mortality rate or the income of a region you could make a reasonable guess at the other.

Principal Component Analysis (PCA) is a statistical tool that’s very useful in situations like this. PCA delivers a new set of axes that are well aligned to correlated data -- I've illustrated them here with black and red lines. For each axis, it also returns a “variance strength” which I’ve represented as the length of the black and red axes. (Actually, I just approximated these axes by eye for the purposes of illustration.)

The strongest new axis returned by PCA (the black one) aligns well with the primary axis of the data. In other words, if one were forced to summarize a region with a single number it would be best to do so with the position along this black axis. The zero point on the axis is arbitrary but is usually positioned in the center of the data (the mean). Positive valued points along this black axis would be those regions further toward the lower right and negative valued regions would be those further toward the upper left. Let’s call this new axis “wealth” to separate it in our minds from “income” which is the horizontal axis of the original data set. Increases in “wealth” represent an increase in income and drop in infant mortality simultaneously.
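For the curious, here is a minimal sketch of how those two axes could be computed for two-variable data. The numbers are made up for illustration (they are not the Gapminder data), and the names `principal_axes` and `wealth` are just my own labels. For two variables the axes fall out of the 2x2 covariance matrix in closed form, so no statistics library is needed.

```python
import math

# Made-up (log10 income, log10 child mortality) pairs, for illustration only.
points = [(2.6, 2.2), (3.0, 2.0), (3.4, 1.7), (3.8, 1.3), (4.2, 1.0), (4.6, 0.6)]

def principal_axes(pts):
    """Return (mean, [(variance, unit_direction), ...]), strongest axis first."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum((x - mx) ** 2 for x, _ in pts) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in pts) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in pts) / (n - 1)
    # Angle of the strongest axis, measured from the horizontal (income) axis.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    # Eigenvalues of the covariance matrix are the variances along each axis.
    mid = (sxx + syy) / 2
    half_diff = math.hypot((sxx - syy) / 2, sxy)
    first = (mid + half_diff, (math.cos(theta), math.sin(theta)))
    second = (mid - half_diff, (-math.sin(theta), math.cos(theta)))
    return (mx, my), [first, second]

def wealth(pt, mean, direction):
    """Position of a point along an axis, with zero at the mean of the data."""
    return (pt[0] - mean[0]) * direction[0] + (pt[1] - mean[1]) * direction[1]

mean, axes = principal_axes(points)
black_variance, black_direction = axes[0]
print("black axis direction:", black_direction)
print("wealth of richest point:", wealth(points[-1], mean, black_direction))
```

With negatively correlated data like this, the strongest direction points toward the lower right, so the richest, lowest-mortality point gets a positive "wealth" and the poorest a negative one.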

The second axis returned by PCA is shown as the red axis. Countries that lie far off the main diagonal trend line (black axis) have unusual infant mortality rates given their wealth, which we’ll assume is because of something unusual about their health care systems. Points well below the black axis are regions that have very good health care given their wealth and those above it have particularly poor health care given their wealth.

Because PCA gives us convenient axes that are well aligned to the data, it makes sense to just rotate the graph to align to these new axes as illustrated here. Nothing has changed here; we've simply made the graph easier to read.
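The rotation itself is only a few lines. This is a sketch under the same assumptions as before: made-up data, and a hypothetical helper name `rotate_to_principal_axes`. Each point is re-expressed as (position along the black axis, position along the red axis).

```python
import math

# Made-up (log10 income, log10 child mortality) pairs, for illustration only.
points = [(2.6, 2.2), (3.0, 2.0), (3.4, 1.7), (3.8, 1.3), (4.2, 1.0), (4.6, 0.6)]

def rotate_to_principal_axes(pts):
    """Re-express 2-D points as (along strongest axis, along second axis)."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    centered = [(x - mx, y - my) for x, y in pts]
    # Sums of squares and cross products; the common 1/(n-1) factor
    # does not affect the angle, so it is left out.
    sxx = sum(u * u for u, _ in centered)
    syy = sum(v * v for _, v in centered)
    sxy = sum(u * v for u, v in centered)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # A pure rotation: first coordinate is "wealth", second is the red axis.
    return [(u * c + v * s, -u * s + v * c) for u, v in centered]

rotated = rotate_to_principal_axes(points)
for original, new in zip(points, rotated):
    print(original, "->", (round(new[0], 3), round(new[1], 3)))
```

Because this is only a rotation about the mean, the distance between any two points is the same before and after, which is the sense in which nothing has changed. In the rotated coordinates the two columns are also uncorrelated with each other.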



Before you even look at specific regions on these new axes, one could guess that socialist countries would score more negatively along this red axis and those whose economy is heavily biased towards mineral extraction -- where income tends to be very unevenly distributed -- would score more positively. Indeed, this is confirmed. The most obvious outliers below the black axis are Cuba and Vietnam where communist governments have directed the economy to spend disproportionately on health care and the outliers on the other side are: Saudi Arabia, South Africa, and Botswana -- all regions heavily dependent on resource extraction where the mean income statistics hide the reality that few are doing very well while the vast majority are in extreme relative poverty.

One particularly interesting outlier is Washington DC which is located as far along the red axis as is Botswana! In other words, based on this realigned graph, you might guess that the wealth in DC is as unevenly distributed as it is in Botswana. Fascinating! (The observation is probably at least partially explained by the fact that it is the only all-urban "state" and urban areas will tend to have wider income distributions than rural/suburban areas.) Also note that all of the points in the United States (orange) are well into positive territory on the red axis -- our health care system is as messed up relative to our wealth as are those of Chad, Bhutan, and Kazakhstan -- countries with completely screwed-up governmental agendas. Think of it this way: the degree to which our infant mortality rates are "good" owes everything to our wealth and is despite the variables independent of wealth! In other words, countries that provide average health care relative to their wealth like El Salvador, Ukraine, Australia and the UK fall right on the black axis but we fall significantly above that line -- roughly the same place as countries that are, independent of their wealth, really messed up like Chad and Kazakhstan. (A caveat: the chart is on a log scale so the comparative analysis is more subtle than I'm making it out to be here.)

PCA returns not only the direction of the new axes but also the variance of the data along those axes. To understand this, imagine for a moment that all the regions of the world had exactly the same health care given their income; in this case all the points would align perfectly along the main trend line (the black axis) and the variance of the red axis would be zero. In this imaginary case, the data would be “one dimensional”, that is, income and infant mortality would be one and the same statement; if you knew one, you'd know the other exactly. Now imagine the opposite scenario. Imagine that there was no relationship at all between income and infant mortality; in that case we would see a scattering of points all over the place and there wouldn’t be any obvious trend lines. Neither of these imaginary scenarios is what we see in the actual data. It isn’t quite a line along the black axis but neither is it a buckshot scattering of points, so we can say the data is somewhere between 1 dimensional and 2 dimensional. If both variances are large and equal to each other, then the system is 2 dimensional while if one of the variances is large while the other is near zero, then we know the system is nearly 1 dimensional. In other words, PCA permits you to summarize complicated data by finding axes of low variance and simply eliminating them. This technique is called “dimensionality reduction” and is a very powerful tool for summarizing complicated data sets such as would arise if we looked at more than two variables. For example, we might add car ownership, water accessibility, education, average adult height, etc. to the analysis, at which point performing a dimensionality reduction would help us get our heads around any simplifications we might wish to make.
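To make the "somewhere between 1 dimensional and 2 dimensional" idea concrete, here is a small sketch that measures what share of the total variance lies along the strongest axis. The three data sets are invented to mimic the three scenarios above, and `variance_split` and `first_axis_share` are hypothetical names of my own.

```python
import math

def variance_split(pts):
    """Return (variance along strongest axis, variance along second axis)."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in pts) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in pts) / (n - 1)
    # Eigenvalues of the 2x2 covariance matrix, in closed form.
    mid = (sxx + syy) / 2
    half_diff = math.hypot((sxx - syy) / 2, sxy)
    return mid + half_diff, mid - half_diff

def first_axis_share(pts):
    """Fraction of total variance captured by the strongest axis (0.5 to 1.0)."""
    major, minor = variance_split(pts)
    return major / (major + minor)

# Invented data for the three scenarios.
on_a_line = [(x, 5 - 2 * x) for x in range(6)]           # perfectly correlated
buckshot = [(1, 0), (0, 1), (-1, 0), (0, -1)]            # no trend at all
in_between = [(0, 5.5), (1, 2.5), (2, 1.5), (3, -1.5), (4, -2.5), (5, -5.5)]

for name, data in [("line", on_a_line), ("buckshot", buckshot), ("between", in_between)]:
    print(name, round(first_axis_share(data), 3))
```

A share near 1.0 says the second axis can be dropped with little loss, which is the dimensionality reduction; a share near 0.5 says both axes carry about equal weight and nothing can be thrown away.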

Wednesday, May 6, 2009

External link: My Manhattan Project

This is an excellent article in New York Magazine about a software engineer on Wall Street.

Some quotes and thoughts.

> "Over time, the users of any software are inured to the intricate nature of what they are doing."

Well put. This is the heart of all software successes and failures. Software is the perfect tool to lie to others and lie to ourselves with. It is the ultimate obfuscatory tool if you let it be.

Thomas Jefferson fought against a Hamilton-supported economy based on industry and banking. Hamilton was right, of course, but Jefferson had a good point. A detail of that eighteenth century debate that has intrigued me is: If financial instruments were already obfuscated to Jefferson in the 18th century, imagine what they must be like now. This article confirmed my intuition for what must have been going on: software sold and maintained by an external company helped to obfuscate the transactions to everyone involved. Of course the technology should have allowed it to be understood too, but it sounds like some monopolies of thought took hold because it was short-term profitable for them to. It was Jefferson's worries manifest in twenty-first century technology. Maybe there is some law lurking like "The Law of Constant Obfuscation" -- at any given time technology will permit obfuscation to a constant level.

> “Mike,” he told me when denying my request, “can you really look for people dumber than you and then take advantage of them? That’s what trading is all about.”

Ha!

> "I was very good at programming a computer. And that computer, with my software, touched billions of dollars of the firm’s money. Every week. That justified [my salary]. When you’re close to the money, you get the first cut. Oyster farmers eat lots of oysters, don’t they?"

Rationalization is such a powerful force! As is momentum. Feynman has an excellent point in one of his books where he's talking about forgetting why he worked on the Manhattan Project. He joined because, like his collaborators, he found the idea of Hitler having unilateral nuclear power unimaginably scary. But, after Hitler was defeated, he forgot why it was he was working on this project and the momentum of the technical challenge remained. He regretted that he didn't re-evaluate his thinking after VE Day.

Monday, May 4, 2009

External Link: Energy Flux Graph from Lawrence Livermore Nat. Labs



I love this graph from Lawrence Livermore National Laboratory illustrating the flux of energy through the US economy. Some things that surprised me:

1) The amount of energy wasted in the transport of electricity is staggering, slightly more than the total amount of oil imported (in energy equivalent units); technological improvement in that sector would make an enormous contribution.

2) Transportation, as I expected, is woefully inefficient. What I didn't appreciate was the magnitude: the energy wasted by transport is approximately equal to all the coal burned!

3) The residential / commercial waste is surprisingly low. One assumes that some fraction of that waste is insulation and so-forth, but even if you took a big bite out of that with building improvements, you wouldn't make a dent in the big picture. It boils down to this: if one's goal is to reduce waste (which is a very different goal than reducing consumption) then electrical and transport are the obvious primary targets.

Saturday, May 2, 2009

Vaccinate your child or gramps gets it in the stomach!

There seems to be a growing ignorance about vaccination. From my informal queries of friends and acquaintances who have chosen not to immunize their children or who do not get flu vaccines, I have found that few people understand that vaccination is part of a greater social compact, not merely a personal cost/benefit analysis. The effect is called Herd Immunity. When we vaccinate ourselves, and especially our children, we are adding to the communal common defenses. Obviously, everyone would like to have a defensive wall built to protect a community yet everyone would prefer to not contribute. But that's not the way a good society works; we share the costs of doing things that benefit the common good.

Immunization of children is particularly important for two reasons: 1) Children's immune systems respond to vaccination much more effectively than do others, especially the elderly who are the most likely to die of viral diseases such as influenza. 2) Children are responsible for much of the transport of viruses throughout a community owing to their mobility and lack of hygiene.

For example, a controlled experiment conducted in 1968 by the University of Michigan demonstrated that large-scale vaccination of children conferred a 2/3 reduction in influenza illnesses to all age groups. For a nice article on the subject, here's a Slate article from 2008.

I propose an old-fashioned poster campaign to inform about the social benefits of vaccination. Here are a couple of prototype posters I photoshopped up this afternoon. (Apologies to Norman Rockwell!)






(Original photo Adam Quartarolo via WikiCommons)

Wednesday, April 29, 2009

Tree and Vine: An Allegory of Attenuated Parasitism.



The town of Forrest has been around for centuries. It’s the kind of place where sons inherit their father’s businesses and nobody can remember when things were too different. The town has always been so small that it supported only a single shopkeeper; the current proprietor of this humble store is a tall and stable fellow named Woody, the descendant of a long line of tall and stable men just like himself who have worked hard to build and maintain what’s always been a social focus of Forrest.

Small towns like Forrest might seem peaceful to visitors, but internally there are the inevitable gripes, grievances, and grudges. For example, a recent family feud over the inheritance of their grandfather’s property has split Woody from his cousin Trey. As a consequence, Trey has recently opened a competing store directly across the street from Woody thus ending Forrest’s long-established one-shop monopoly. This, as you’d suspect, has been terrible for Woody.

Forrest is not entirely full of hard working capable souls – for example, consider Vinnie the thief. Vinnie, like Woody, is descended from an ancient line of Forrest inhabitants. Vinnie, like Woody, pursues the same occupation as his father and his father’s father. But Vinnie, unlike Woody, isn’t exactly a clone of his ancestors.

Vinnie’s father was a notorious scoundrel. An aggressive thief and burglar, he was nevertheless as dimwitted as he was ruthless. It doesn’t take a genius to know that if you continue to steal from the same store over and over that there might eventually be nothing left to steal. This concept seemed totally lost on Vinnie’s father and as a result he almost caused Woody’s father to close the only store in town.

But, as suggested, Vinnie was not cut out of the same aggressive yet witless stock as his father. Indeed, Vinnie is more bargainer than terrorist -- a theme established early in his life. When Vinnie’s father began to push him into the family business, his father told him: “Go into Woody’s store, show him who you are, break a few things then take what you want and stroll out like you own the place. That’s how it works for guys like us. That’s how it has always worked.”

Young Vinnie tried. He walked into Woody’s store and looked around. He picked up a few items that looked breakable and considered tossing them to the ground. But, soon he became aware of Woody’s suspicious gaze following him around and found himself placing the stock back on the shelf and averting his eyes. Finally, Vinnie decided just to come clean.

“Do you know whose son I am?” Vinnie asked Woody naively.

“Of course.”

“Then how about you just give me a hundred bucks?”

Woody thought about this. A hundred dollars was actually quite a small price to pay compared to the usual cost in damage and theft. But, a hundred dollars for what exactly? A hundred dollars just to make some kid walk away? All things being equal, Woody would just as soon have had neither Vinnie’s small-time extortions nor his father’s grand theft, but that really wasn’t one of the available options and therefore the proposed agreement would be the lesser of two evils.

“I’ll tell you what”, said Woody pulling out the cash, “I’ll give you one hundred dollars a week for doing absolutely nothing as long as you don’t make the same deal with my cousin Trey across the street. This will be your territory, but the store across the street stays your father’s territory. Deal?”

“Deal.” Vinnie said, shaking Woody’s hand three times.

And with that simple verbal contract, an arrangement was made. Each week Vinnie would come in, shake Woody’s hand three times, and earn a hundred dollars.

Over time, their relationship became, if not exactly friendly, at least routine. Little by little they forgot about the initial circumstances of the arrangement and found themselves acting like civil gentlemen considering the issues of the day.

One day, a small force of bandits from a nearby town attempted to invade, seeking to steal supplies and animals. Obviously, both Woody and Vinnie were desperate to repel this invasion and during the crisis all past discord was forgotten. Not surprisingly, between the two of them, Vinnie was the better fighter owing to the weapons and viciousness inherited from his violent family. That’s not to say that Woody didn’t engage the enemy, but violence was clearly Vinnie's comparative advantage.

A few months later, a fire broke out. As before, both Vinnie and Woody had a mutual interest in stopping this mortal threat. While Vinnie pitched in to fight the fire, this time it was Woody – with his access to buckets and hoses – who played the comparatively larger role in extinguishing this mutual threat.

And so it went. As the relationship normalized, they found that their common needs were greater than their distrust and consequently they found more and more ways that it was profitable to depend on each other’s specializations. Vinnie became not only the defender of the neighborhood but also the store’s out-of-town sales representative and Woody paid him a commission on his sales. Meanwhile, Woody’s freed resources meant that he was able to invest more in a nicer shop with more stock to the profitable benefit of both.

Generation after generation inherited the agreement and the benefit of specializing and working together turned out to be great. The paltry hundred dollars became not so much an extortion as just one part of a complex set of mutual exchanges of goods and services. In fact, Vinnie and Woody’s sons didn’t even know why they engaged in this weekly routine of thrice handshakes and an exchange of cash -- maybe it was some sort of ritual of friendship; maybe it had to do with some old debt now long since irrelevant; whatever, it seemed a quaint part of their past. To outsiders, it was hard to imagine the shop running without two employees, and most assumed that it had always been that way and always would.

Monday, April 27, 2009

Geometry of Biological Time, Chapt 2.


Co-tidal map from NASA via Wikicommons. The points of intersection are the "phase singularities" where the tidal phase is undefined.

Slowly making my way through this book. Chapter 2 is about phase singularities -- places where the phase of some oscillation is undefined. The coolest example is the earth's tides. The surface of the earth is a sphere ("S2" in topology speak) and the tides are defined by a phase (S1). So for each point on earth at any given moment there's a tidal phase. But S2->S1 mappings (with certain continuity assumptions) must contain phase singularities -- there must be places where you can't define the phase. Above is a map from NASA showing these places as the intersections of the co-tidal lines. You can think of the tides as sloshing around those points where the sea level doesn't change.
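One way to get a feel for a phase singularity is to walk a closed loop and add up how much the phase changes along the way. For a continuous phase, if the total comes out to a nonzero number of full turns, there must be a point somewhere inside the loop where the phase is undefined. Here is a toy sketch of that bookkeeping. The phase field is an assumption of mine (a flat plane with one singularity at the origin, not real tidal data), and `toy_phase` and `winding_number` are made-up names.

```python
import math

def toy_phase(x, y):
    """Toy phase field with a single singularity at the origin (not tidal data)."""
    return math.atan2(y, x)

def winding_number(phase, cx, cy, radius, steps=360):
    """Net number of full phase turns around a circle centered at (cx, cy)."""
    total = 0.0
    prev = phase(cx + radius, cy)
    for k in range(1, steps + 1):
        angle = 2 * math.pi * k / steps
        cur = phase(cx + radius * math.cos(angle), cy + radius * math.sin(angle))
        # Wrap each small step into [-pi, pi) so the artificial jump in the
        # phase value is not counted as a real change.
        step = (cur - prev + math.pi) % (2 * math.pi) - math.pi
        total += step
        prev = cur
    return round(total / (2 * math.pi))

print("loop around the singularity:", winding_number(toy_phase, 0.0, 0.0, 1.0))
print("loop that misses it:", winding_number(toy_phase, 3.0, 0.0, 1.0))
```

The loop that encloses the origin picks up one full turn of phase while the loop off to the side picks up none, which is the same pattern as co-tidal lines fanning out from a point on the map.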

The chapter is mostly about biological versions of such phase singularities. Detailed examples are given from fruit fly circadian rhythms, but the technical details of the experiments were overwhelming so I didn't fully follow and decided, perhaps unwisely, to plod forward without complete understanding.

Thursday, April 23, 2009

BSTQ - Bull Shit Tolerability Quotient

There are many traits that determine someone's performance in various social settings such as school, work, military, etc. A popular metric for correlation to "success" in such social systems is the "Intelligence Quotient" which purports to measure elements of abstract intelligence. Another metric that has gained popularity is the "Emotional Intelligence Quotient" which purports to measure the ability to perceive and manage emotions in oneself and others. Both of these metrics claim a high correlation to success in the aforementioned social institutions.

I submit that success in roles within said social systems -- student, factory worker, warrior, etc. -- requires a high tolerance of activities such as: implementing poorly articulated tasks, engaging in inane conversations, attending pointless engagements, and other time-wasting activities known informally as "Bull Shit" (BS). The ability to tolerate such BS is a very important trait that is not normally rigorously evaluated.

I propose a simple test to measure an individual's tolerance for BS: a list of increasingly inane questions and pointless tasks is given to the test taker. For example, the test might begin with questions like: "Fill in the blank: Apples are __ed" and end with stupendously pointless tasks such as "Sort these numbers from least to greatest" followed by several hundred ~20 digit numbers and then having the next task say: "Now randomize those same numbers". The Bull Shit Tolerability Quotient (BSTQ) would just ignore the given answers and simply count the number of questions that the test taker was willing to consider before handing the test back in frustration and declaring: "This is Bull Shit!"

If a formal BSTQ test is not available, most standardized academic tests can be used as a reasonable substitute. However, the dynamic range of such generic academic tests to measure BSTQ is low. In other words, only extreme low-scorers of a proper BSTQ test will be measurable via the number of unanswered questions on a standard academic test used as a BSTQ surrogate. Extreme caution must be used when interpreting an academic test as a BSTQ analog -- the test giver may misinterpret the number of unanswered questions as the result of the test taker's low knowledge of the test's subject matter instead of as a spectacularly low BSTQ score.

BSTQ tests can easily be made age independent. For pre-verbal children the test would involve increasingly inane tasks such as matching sets of colored blocks to colored holes and so forth. The test would simply measure how many of these tasks the pre-verbal child could engage in before he or she became irritated or upset with the examiner.

As with IQ and EIQ, I suspect that BSTQ will be correlated to the degree of success within many social endeavors, school in particular; however, I also suspect that there is a substantial fraction of the population that has an inverse correlation between their IQ and their BSTQ scores. Of these, of particular interest are those with high IQ and low BSTQ. I would not be surprised if the population of people rated by their co-workers as "indispensable" is significantly enriched for individuals with a high IQ / low BSTQ score. Finally, I submit that these individuals are severely under-served by the educational system which demands -- indeed glorifies -- extremely high BSTQ, especially among those with high IQ.

A BSTQ evaluation of pre-academic children might identify students who would excel in a non-traditional educational environment where the student is allowed to select their own agendas and tasks. A very low BSTQ coupled with a very high IQ would seem to almost guarantee rebellion if a traditional educational approach is applied. Identifying individuals with exceptionally high IQ scores and exceptionally low BSTQ scores may be a valuable tool to prevent the mis-classification of such students as "trouble makers" and instead correctly classify them as "potential indispensable iconoclasts".


(This idea evolved from lunch discussion with Marvin today, so thanks Marvin!)