Review: Big Data Book by Nathan Marz

Recently, I finished reading the latest “early access” version of the Big Data Book by Nathan Marz.

What is Big Data

In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, analysis, and visualization.

So Big Data is relevant to any technical or business person whose company deals with lots of information and wants to make use of it. Gmail search is one example.

Why this book is awesome

The book has been a fascinating and engaging read for me, for two reasons:

First, it takes a strong and simple “first principles” approach to an architecture and scalability problem, in contrast to the mushrooming complexity (confusing, to me at least) of the Big Data world, where Hadoop is often treated as a panacea.

Second, Nathan Marz was one of only three engineers who built the BackType search engine (the company was later acqui-hired by Twitter):

BackType captures online conversations, everything from tweets to blog comments to checkins and Facebook interactions. Its business is aimed at helping marketers and others understand those conversations by measuring them in a lot of ways, which means processing a massive amount of data.

To give you an idea of the scale of its task, it has about 25 terabytes of compressed binary data on its servers, holding over 100 billion individual records. Its API serves 400 requests per second on average, and it has 60 EC2 servers around at all times, scaling up to 150 for peak loads.

It has pulled this off with only seed funding and just three employees: Christopher Golda, Michael Montano and Nathan Marz. They’re all engineers, so there’s not even any sysadmins to take some of the load.

Note: Storm, BackType’s (now open-sourced) real-time data processing engine, powers Twitter’s analytics product and real-time trends, among other things.

Lambda Architecture

A wise man once told me that programming is about managing complexity, and that is exactly why I love Nathan Marz’s approach to Big Data, which is called the “Lambda Architecture”:

[Figure: Lambda Architecture]

There are three layers:

Batch layer
- According to Nathan Marz, the fundamental shift when designing a Big Data system is to keep an append-only data set: once data is written it is never altered, and you can only add more to the set. The master data is immutable.
  - This ensures that data is not lost or corrupted by bad code or bad assumptions, which happens more often than we would like to admit.
  - For a more thorough explanation, do watch Pat Helland’s talk “Immutability changes everything”.
- Each granular piece of data should be accompanied by a timestamp and should be uniquely identifiable. Add a nonce if required to ensure that each row is unique, so that inserts into the database can be idempotent.
  - This allows us to view the data as it was at any instant of time.
- Schemas are written using Apache Thrift.
  - This enables us to validate the data before storage.
  - Thrift can generate code for many programming languages, which keeps the system language-agnostic.
- We ask questions of our system by querying precomputed views (called batch views). These are simply aggregates that we generate from our master data, similar to indexes in an RDBMS, for fast lookup.
  - Hadoop MapReduce can be used to precompute these views; large batch computations are what MapReduce is designed for.
  - The batch views are regenerated from the master data continuously, so once a batch view has been generated, the next cycle of computing starts immediately. We’ll see why this is important.
- This layer takes care of the storage of the data.
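
To make the batch layer ideas above concrete, here is a minimal in-memory sketch in Python. It is not code from the book: the names `PageView`, `MasterDataset`, `as_of` and `compute_batch_view` are made up for illustration, and a real system would keep the master data in a distributed file system and compute the views with something like MapReduce. It only shows the shape of the idea: immutable, timestamped facts with a nonce, idempotent appends, and a view that is always recomputed from scratch.

```python
import time
import uuid
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PageView:
    """One immutable fact: a URL was viewed at a point in time (hypothetical schema)."""
    url: str
    timestamp: float  # seconds since the epoch
    nonce: str        # makes otherwise identical facts distinguishable


class MasterDataset:
    """Append-only store: facts can be added but never updated or deleted."""

    def __init__(self):
        self._facts = set()

    def append(self, fact):
        # Adding the same fact twice leaves the set unchanged, so a retried
        # write is harmless (idempotent).
        self._facts.add(fact)

    def as_of(self, instant):
        """Return the facts recorded up to `instant`, ignoring anything later."""
        return [f for f in self._facts if f.timestamp <= instant]

    def __len__(self):
        return len(self._facts)


def compute_batch_view(facts):
    """Recompute the view from scratch: page-view counts keyed by URL."""
    return dict(Counter(f.url for f in facts))


master = MasterDataset()
first = PageView("/home", 100.0, uuid.uuid4().hex)
master.append(first)
master.append(first)  # retried write, no duplicate is stored
master.append(PageView("/home", 200.0, uuid.uuid4().hex))
master.append(PageView("/about", 300.0, uuid.uuid4().hex))

view_now = compute_batch_view(master.as_of(time.time()))
view_then = compute_batch_view(master.as_of(150.0))
```

Because nothing is ever overwritten, `view_then` can answer "what did the counts look like at time 150?" long after newer facts have arrived, and a buggy `compute_batch_view` can simply be fixed and rerun over the same master data.
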
Serving layer
- The serving layer is a specialized distributed database that loads in batch views, makes them queryable, and continuously swaps in new versions of a batch view as they’re computed by the batch layer.
- Queries are answered by precomputing query functions into “batch views”, which are indexed for fast random reads.
- This layer takes care of the fast queries on the data.
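
The "swap in new versions" behaviour of the serving layer can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions: `ServingLayer`, `swap_in` and `get` are hypothetical names, and the real thing is a distributed database rather than a dictionary in one process. The point is that a freshly computed batch view replaces the old one wholesale, so readers never see a half-loaded view.

```python
import threading


class ServingLayer:
    """Holds the current batch view and swaps in new versions wholesale."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._view = {}

    def swap_in(self, new_view):
        # Copy first, then replace the reference in one step, so readers see
        # either the old view or the new one, never a mix of both.
        snapshot = dict(new_view)
        with self._lock:
            self._view = snapshot
            self._version += 1
            return self._version

    def get(self, key, default=0):
        # Random read against whichever view is currently loaded.
        with self._lock:
            view = self._view
        return view.get(key, default)


serving = ServingLayer()
serving.swap_in({"/home": 2, "/about": 1})
before = serving.get("/home")

# The batch layer finishes another cycle and hands over a newer view.
serving.swap_in({"/home": 5, "/about": 1, "/pricing": 3})
after = serving.get("/home")
missing = serving.get("/nope")
```

Note that the serving layer never needs random writes here: it only loads complete views and answers reads, which is a large part of why it can stay simple.
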