Recently, I finished reading the latest “early access” version of the Big Data Book by Nathan Marz.

What is Big Data

Let’s look it up on Wikipedia:

In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, analysis, and visualization.

So Big Data is relevant to any technical or business person whose company handles large volumes of information and wants to make use of it. Gmail search is one example.

Why this book is awesome

The book has been a fascinating and engaging learning experience for me, for two reasons:

First, it takes a strong, simple “first principles” approach to architecture and scalability problems, as opposed to the confusing (to me), mushrooming complexity of the Big Data world and its habit of treating Hadoop as a panacea.

Second, Nathan Marz was one of only three engineers who built the BackType search engine (the company was acqui-hired by Twitter):

BackType captures online conversations, everything from tweets to blog comments to checkins and Facebook interactions. Its business is aimed at helping marketers and others understand those conversations by measuring them in a lot of ways, which means processing a massive amount of data.

To give you an idea of the scale of its task, it has about 25 terabytes of compressed binary data on its servers, holding over 100 billion individual records. Its API serves 400 requests per second on average, and it has 60 EC2 servers around at all times, scaling up to 150 for peak loads.

It has pulled this off with only seed funding and just three employees: Christopher Golda, Michael Montano and Nathan Marz. They’re all engineers, so there’s not even any sysadmins to take some of the load.

Note: BackType’s (now open-sourced) real-time data processing engine, Storm, powers Twitter’s analytics product and real-time trends, among other things.

Lambda Architecture

A wise man once told me that programming is about managing complexity, and that is exactly why I love Nathan Marz’s approach to Big Data, which is called the “Lambda Architecture”:

There are three layers:

  1. Batch layer
  2. Serving layer