Swaroop C H

Super Crunchers

23 Jun 2008

Today, I re-read a book called Super Crunchers: How Anything Can Be Predicted by Ian Ayres.

So what is supercrunching?

Now something is changing. Business and government professionals are relying more and more on databases to guide their decisions. The story of hedge funds is really the story of a new breed of number crunchers - call them Super Crunchers - who have analyzed large datasets to discover empirical correlations between seemingly unrelated things. Want to hedge a large purchase of euros? Turns out you should sell a carefully balanced portfolio of twenty-six other stocks and commodities that might include Wal-Mart stock.

What is Super Crunching? It is statistical analysis that impacts real-world decisions. Super Crunching predictions usually bring together some combination of size, speed and scale. The sizes of datasets are really big - both in the number of observations and in the number of variables. The speed of the analysis is increasing. We often witness the real-time crunching of numbers as the data come hot off the press. And the scale of the impact is sometimes truly huge. This isn't a bunch of egghead academics cranking out provocative journal articles. Super Crunching is done by or for decision makers who are looking for a better way to do things.

This is best explained by the chess example:

We tend to think that the chess grandmaster Garry Kasparov lost to the Deep Blue computer because of IBM's smarter software. That software is really a gigantic database that ranks the power of different positions. The speed of the computer is important, but in large part it was the computer's ability to access a database of 700,000 grandmaster chess games that was decisive. Kasparov's intuitions lost out to data-based decision making.

(emphasis mine)

The book starts off with the example of Orley Ashenfelter, a Princeton economics professor as well as founder and editor of the Journal of Wine Economics, who wanted to apply supercrunching techniques to predict whether a wine from a particular year would be a good wine or not. He ended up with the following equation:

Wine quality = 12.145 + 0.00117 winter rainfall + 0.0614 average growing season temperature - 0.00386 harvest rainfall
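To get a feel for the equation, here is a minimal Python sketch that just plugs numbers into it. The coefficients are the ones quoted above; the function name, the argument names and the sample inputs are my own, and the units are not stated in this post, so treat the inputs as purely hypothetical.

```python
def wine_quality(winter_rainfall, growing_season_temp, harvest_rainfall):
    """Evaluate the quoted equation. Units are an assumption; the post does not state them."""
    return (12.145
            + 0.00117 * winter_rainfall
            + 0.0614 * growing_season_temp
            - 0.00386 * harvest_rainfall)

# Hypothetical inputs, chosen only to show the direction of each effect.
base = wine_quality(600, 17.0, 150)
wetter_harvest = wine_quality(600, 17.0, 250)

print("base year:", base)
print("same year with a wetter harvest:", wetter_harvest)
```

The signs tell the story: more winter rain and a warmer growing season push the score up, while rain at harvest time pulls it down.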

You can imagine the commotion that followed. The wine experts brushed off this theory, along with the idea that numbers could predict wine quality better than they could. After all, "Just as it's more accurate to see the movie, shouldn't it be more accurate to actually taste the wine?"

And yet, the equation did indeed make better predictions, especially with the prediction that 1989 and 1990 wines would be bestsellers.

Orley was able to make this analysis because he had access to data about the weather and the wine quality. Ian explains that there are two ways to get the data: either it already exists (like surveys, census records or simply the transaction logs of companies) or you create it using randomized trials.

The latter idea of creating data with the "flip of a coin" is such a simple yet powerful concept. Techies would be familiar with this already under a different name - "A/B testing".

Let's take the example of JoAnn sewing machines:

So when JoAnn.com was optimizing their website, they decided to take a gamble and include in their testing an unlikely promotion for sewing machines: "Buy two machines and save 10 percent." They didn't expect this test to pan out. After all, how many people need to buy two sewing machines? Much to their amazement, the promotion generated by far the highest returns. "People were pulling their friends together," says Linsly Donnelly, JoAnn.com's chief operating officer. The discount was turning their customers into sales agents. Overall, randomized testing increased its revenue per visitor by a whopping 209 percent.

The key is that:

Randomization also frees the researcher to take control of the questions being asked and to create the information that he/she wants. Data mining on the historic record is limited by what people have actually done.
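For the techies, the "flip of a coin" idea can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the two variants, the purchase probabilities and the order values are invented for the simulation and have nothing to do with JoAnn.com's actual numbers or tooling.

```python
import random

def assign_variant(rng):
    """Flip a fair coin to decide which promotion a visitor sees."""
    return "A" if rng.random() < 0.5 else "B"

def revenue_per_visitor(visits):
    """visits: list of (variant, revenue) pairs -> {variant: mean revenue}."""
    totals, counts = {}, {}
    for variant, revenue in visits:
        totals[variant] = totals.get(variant, 0.0) + revenue
        counts[variant] = counts.get(variant, 0) + 1
    return {v: totals[v] / counts[v] for v in totals}

# Simulated visitors. These probabilities and order values are invented.
rng = random.Random(42)
BUY_PROBABILITY = {"A": 0.02, "B": 0.03}
ORDER_VALUE = {"A": 100.0, "B": 180.0}

visits = []
for _ in range(10_000):
    variant = assign_variant(rng)
    bought = rng.random() < BUY_PROBABILITY[variant]
    visits.append((variant, ORDER_VALUE[variant] if bought else 0.0))

results = revenue_per_visitor(visits)
print(results)
```

Because the coin decides who sees what, the two groups should differ only in the promotion they were shown, so a gap in revenue per visitor can be attributed to the promotion itself. A real test would also check whether the gap is larger than chance alone would produce.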

To realize how valuable this methodology is, let's take the case of Progresa:

But by far the most important recent randomized social experiment of development policy is the Progresa Program for Education Health and Nutrition.

(paraphrased) Zedillo, the Mexican President in 1995, decided that he wanted to have a major effect on Mexico's poverty. Together with the members of his administration, he came up with a unique poverty alleviation program, Progresa.

Progresa is a conditional transfer of cash to poor people. "To get cash," Gertler said, "you had to keep your kids in school. To get the cash you had to get prenatal care if you are pregnant. You had to go for nutrition monitoring. The idea was to break the intergenerational transfer of poverty because children who typically grow up in poverty tend to remain poor."

...

Zedillo's biggest problem was to try to structure Progresa so that it might outlive his presidency... Gertler said "If you have a five-year administration and it takes three years to get a program up and running, then it doesn't have much time to have an impact before the new government comes and closes it".

So starting in 1997, Mexico began a randomized experiment on more than 24,000 households in 506 villages.

...

The Progresa villages almost immediately showed substantial improvements in education and health. Progresa boys attended school 10 percent more than their non-Progresa counterparts. And school enrollment for Progresa girls was 20 percent higher than for the control group.

...

The improvements in health were even more dramatic. The program produced a 12 percent lower incidence of serious illness and a 12.7 percent reduction in hemoglobin measures of anemia. Children in the treated villages were nearly a centimeter taller than their non-Progresa peers. A centimeter of additional growth in such a short time is a big deal as a measure of increased health.

(emphasis mine)

Best of all, the evidence of Progresa being a good thing was so convincing that the new government kept it going but under a different name for political reasons. Zedillo's idea worked. And beautifully.

Ian goes on to show how Don Berwick's campaign similarly prevented an estimated 122,342 hospital deaths in eighteen months. The campaign consisted of just a few simple suggestions, derived from statistics on how deaths occurred, which the participating hospitals implemented. The suggestions included regular hand washing.

Ian cites several real-world examples throughout the book, and the number of times that data crunching beat human expertise is staggering. But Ian says that this does not mean the end of the need for human intervention. Supercrunching can validate ideas, but the ideas and hypotheses themselves have to be formulated by us humans.

He goes on to explain the 2SD rule and Bayes' theorem in layman's terms. Just understanding these two concepts would go a long way in helping anyone decipher statistics.
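As a rough sketch of both ideas: the 2SD rule says that a normally distributed quantity lands within two standard deviations of its mean about 95 percent of the time, and Bayes' theorem tells you how to update a prior belief once new evidence arrives. The function names and all the numbers below are hypothetical ones I picked for illustration, not examples from the book.

```python
from statistics import NormalDist

def two_sd_interval(mean, sd):
    """Range covering mean plus or minus two standard deviations."""
    return (mean - 2 * sd, mean + 2 * sd)

def bayes_posterior(prior, true_positive_rate, false_positive_rate):
    """P(hypothesis | positive evidence) via Bayes' theorem."""
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence

# 2SD rule with a hypothetical mean of 100 and standard deviation of 15.
low, high = two_sd_interval(100, 15)
dist = NormalDist(100, 15)
share_within_2sd = dist.cdf(high) - dist.cdf(low)
print(low, high, share_within_2sd)

# Bayes' theorem with a hypothetical test: 1% base rate,
# 90% true positive rate, 10% false positive rate.
posterior = bayes_posterior(prior=0.01, true_positive_rate=0.9, false_positive_rate=0.1)
print(posterior)
```

The Bayes example is the counterintuitive one: even with a fairly accurate test, a positive result for a rare condition still leaves the probability well below 10 percent, because false positives from the large unaffected group outnumber the true positives.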

All in all, the book was a good, inspiring read. I would highly recommend it to anyone (even non-techies) interested in how computers and databases are changing how decisions are made. These decisions are not limited to websites. As we have seen above, supercrunching is changing everything from how government policy decisions are made to how movie scripts are being written.

The key takeaway for me is that data insights are hard and so is intuition. People who can straddle both will be important in the future. Learning to read the data will mean getting comfortable with statistics, models and even neural networks (as explained in the book).

If you're not patient enough to read the book, you can watch the Google Tech Talk by Ian Ayres. You can also read more of Ian Ayres' supercrunching stories on the Freakonomics blog.



We are drowning in information, while starving for wisdom. The world henceforth will be run by synthesizers, people able to put together the right information at the right time, think critically about it, and make important choices wisely.

-- E. O. Wilson (entomologist and biologist)

Comments

Vinayak Hegde says:

Does this blog post mean that I can now borrow the book finally ? :D

Swaroop says:

@Vinayak Haha, yeah, will pass it to you when we catch up :)

Frank460 says:

I read the "Super Crunchers Book". Got a lot of helpful information.
I wish there was just one chapter possibly devoted to more everyday applications with more detailed calculations. Is there a website that supports current applications and research in number crunching or a place for the average person to get some feedback on how to conduct some experiments?

Swaroop says:

@Frank Have you checked out the O'Reilly book called "Programming Collective Intelligence"?
