Super Crunchers

Today, I re-read a book called Super Crunchers: How Anything Can Be Predicted by Ian Ayres.

So what is supercrunching?

Now something is changing. Business and government professionals are
relying more and more on databases to guide their decisions. The
story of hedge funds is really the story of a new breed of number
crunchers – call them Super Crunchers – who have analyzed large
datasets to discover empirical correlations between seemingly
unrelated things. Want to hedge a large purchase of euros? Turns out
you should sell a carefully balanced portfolio of twenty-six other
stocks and commodities that might include Wal-Mart stock.
What is Super Crunching? It is statistical analysis that impacts
real-world decisions. Super Crunching predictions usually bring
together some combination of size, speed and scale. The sizes of
datasets are really big – both in the number of observations and in
the number of variables. The speed of the analysis is increasing. We
often witness the real-time crunching of numbers as the data come
hot off the press. And the scale of the impact is sometimes truly
huge. This isn’t a bunch of egghead academics cranking out
provocative journal articles. Super Crunching is done by or for
decision makers who are looking for a better way to do things.

This is best explained by the chess example:

We tend to think that the chess grandmaster Garry Kasparov lost to
the Deep Blue computer because of IBM’s smarter software. That
software is really a gigantic database that ranks the power of
different positions. The speed of the computer is important, but in
large part it was the computer’s ability to access a database of
700,000 grandmaster chess games that was decisive. Kasparov’s
intuitions lost out to data-based decision making.
(emphasis mine)

The book starts off with the example of Orley Ashenfelter, a Princeton
economics professor and the founder and editor of the Journal of Wine
Economics, who wanted to apply supercrunching techniques to predict
whether a given year’s wine would turn out to be a good one. He ended
up with the following equation:

Wine quality = 12.145 + 0.00117 × winter rainfall + 0.0614 × average
growing season temperature − 0.00386 × harvest rainfall
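
In code, the equation is just a weighted sum of three weather variables.
Here is a minimal Python sketch; the function name and sample inputs are
mine, and the units (millimetres of rain, degrees Celsius) are an
assumption, since the excerpt doesn’t state them:

```python
# A sketch of Ashenfelter's equation as printed above.
# Units assumed: rainfall in millimetres, temperature in degrees Celsius.

def predict_wine_quality(winter_rainfall, growing_season_temp, harvest_rainfall):
    return (12.145
            + 0.00117 * winter_rainfall
            + 0.0614 * growing_season_temp
            - 0.00386 * harvest_rainfall)

# A wet winter, a warm growing season, and a dry harvest score well:
print(predict_wine_quality(600, 17.5, 100))   # ~13.54
# A cool growing season and a rainy harvest score worse:
print(predict_wine_quality(400, 15.0, 250))   # ~12.57
```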

You can imagine the commotion that followed. The wine experts brushed
off the theory that numbers could predict wine quality better than they
could. After all, “Just as it’s more accurate to see the movie,
shouldn’t it be more accurate to actually taste the wine?”

And yet, the equation did indeed make better predictions, most famously
that the 1989 and 1990 wines would be bestsellers.

Orley was able to make this analysis because he had access to data
about the weather and the wine quality. Ian explains that there are two
ways to get data: either it already exists (surveys, censuses, or
companies’ transaction logs) or you create it yourself using randomized
trials.

The latter idea of creating data with the “flip of a coin” is such
a simple yet powerful concept. Techies will already be familiar with it
under a different name: “A/B testing”.
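
In code, an A/B test is little more than a coin flip per visitor
followed by a comparison of outcomes. Here is a minimal Python sketch;
the groups, traffic, and conversion rates are all invented for
illustration:

```python
import random

# Hypothetical A/B test: randomly assign each visitor to a promotion,
# record sales, then compare conversion rates. All numbers are made up.

results = {"A": {"visitors": 0, "sales": 0},
           "B": {"visitors": 0, "sales": 0}}

# Assumed "true" conversion rates, used only to simulate visitors.
true_rates = {"A": 0.020, "B": 0.032}

for _ in range(10_000):
    group = random.choice(["A", "B"])   # the flip of a coin
    results[group]["visitors"] += 1
    if random.random() < true_rates[group]:
        results[group]["sales"] += 1

for group, r in results.items():
    rate = r["sales"] / r["visitors"]
    print(f"{group}: {r['sales']}/{r['visitors']} visitors converted ({rate:.2%})")
```

Because assignment is random, any sizable difference between the two
groups can be attributed to the promotion itself rather than to who
happened to see it.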

Let’s take the example of JoAnn sewing machines:

So when JoAnn.com was optimizing their website, they decided to take
a gamble and include in their testing an unlikely promotion for
sewing machines: “Buy two machines and save 10 percent.” They didn’t
expect this test to pan out. After all, how many people need to buy
two sewing machines? Much to their amazement, the promotion
generated by far the highest returns. “People were pulling their
friends together,” says Linsly Donnelly, JoAnn.com’s chief operating
officer. The discount was turning their customers into sales agents.
Overall, randomized testing increased its revenue per visitor by
a whopping 209 percent.

The key is that:

Randomization also frees the researcher to take control of the
questions being asked and to create the information that he/she
wants. Data mining on the historic record is limited by what people
have actually done.

To realize how valuable this methodology is, let’s take the case of
Progresa:

But by far the most important recent randomized social experiment of
development policy is the Progresa Program for Education Health and
Nutrition.
(paraphrased) Zedillo, the Mexican President in 1995, decided that
he wanted to have a major effect on Mexico’s poverty and together
with the members of his administration, he came up with a very
unique poverty alleviation program, which is Progresa.
Progresa is a conditional transfer of cash to poor people. “To get
cash,” Gertler said, “you had to keep your kids in school. To get
the cash you had to get prenatal care if you are pregnant. You had
to go for nutrition monitoring. The idea was to break the
intergenerational transfer of poverty because children who typically
grow up in poverty tend to remain poor.”
Zedillo’s biggest problem was to try to structure Progresa so that
it might outlive his presidency… Gertler said “If you have
a five-year administration and it takes three years to get a program
up and running, then it doesn’t have much time to have an impact
before the new government comes and closes it”.
So starting in 1997, Mexico began a randomized experiment on more
than 24,000 households in 506 villages.
The Progresa villages almost immediately showed substantial
improvements in education and health. Progresa boys attended school
10 percent more than their non-Progresa counterparts. And school
enrollment for Progresa girls was 20 percent higher than for the
control group.
The improvements in health were even more dramatic. The program
produced a 12 percent lower incidence of serious illness and a 12.7
percent reduction in hemoglobin measures of anemia. Children in the
treated villages were nearly a centimeter taller than their
non-Progresa peers. A centimeter of additional growth in such
a short time is a big deal as a measure of increased health.
(emphasis mine)

Best of all, the evidence that Progresa worked was so convincing that
the new government kept it going, albeit under a different name for
political reasons. Zedillo’s idea worked. And beautifully.

Ian goes on to demonstrate similarly how Don Berwick’s campaign
prevented an estimated 122,342 hospital deaths in eighteen months. The
campaign consisted of a few simple suggestions, derived from statistics
on how deaths occurred, which the participating hospitals then
implemented. The suggestions included regular washing of hands.

Ian cites several real-world examples throughout the book, and the
number of times that data crunching beats human expertise is
staggering. But, Ian says, this does not mean the end of the need for
human intervention: supercrunching can validate ideas, but the ideas
and hypotheses themselves still have to be formulated by us humans.

He goes on to explain the 2SD rule and Bayes’ theorem in layman’s
terms. Just understanding these two concepts would go a long way in
helping anyone decipher statistics.
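
As a rough illustration of both (mine, not the book’s): the 2SD rule
says that roughly 95 percent of a normally distributed quantity falls
within two standard deviations of its mean, and Bayes’ theorem tells
you how to update a probability in light of new evidence. A minimal
Python sketch with made-up numbers:

```python
import statistics

# 2SD rule: for roughly normal data, ~95% of values fall within
# two standard deviations of the mean. The sample data is invented.
data = [48, 52, 50, 47, 53, 49, 51, 50, 46, 54]
mean = statistics.mean(data)
sd = statistics.stdev(data)
print(f"~95% of values lie between {mean - 2*sd:.1f} and {mean + 2*sd:.1f}")

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
# Classic textbook example: a disease affects 1% of people; the test
# catches 90% of true cases but false-alarms on 5% of healthy people.
p_disease = 0.01
p_pos_given_disease = 0.90
p_pos_given_healthy = 0.05
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive
print(f"P(disease | positive test) = {p_disease_given_pos:.1%}")  # ~15.4%
```

The Bayes example shows why a positive result from a 90-percent-accurate
test can still mean only a 15 percent chance of disease: the base rate
matters.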

All in all, the book was a good, inspiring read. I would highly
recommend it to anyone (even non-techies) interested in how computers
and databases are changing how decisions are made. These decisions are
not limited to websites; as we have seen above, supercrunching is
changing everything from how government policy decisions are made to
how movie scripts are being written.

The key takeaway for me is that data insights are hard, and so is
intuition. People who can straddle both will be important in the
future. Learning to read the data will mean getting comfortable with
statistics, models, and even neural networks (as explained in the
book).

If you’re not patient enough to read the book, you can watch the
Google Tech Talk by Ian Ayres. You can also read more of Ian Ayres’
supercrunching stories on the Freakonomics blog.


We are drowning in information, while starving for wisdom. The world henceforth will be run by synthesizers, people able to put together the right information at the right time, think critically about it, and make important choices wisely.
— E. O. Wilson (entomologist and biologist)