
Review: Big Data Book by Nathan Marz

Recently, I finished reading the latest “early access” version of the Big Data Book by Nathan Marz.

What is Big Data

Let’s look it up on Wikipedia:

In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, analysis, and visualization.

So, Big Data is relevant to any technical or business person whose company deals with large amounts of information and wants to make use of it; Gmail search is a familiar example.

Why this book is awesome

The book has been a fascinating and engaging read for me for two reasons:

First, it takes a strong and simple “first principles” approach to the architecture and scalability problem, as opposed to the mushrooming complexity (confusing, at least to me) of the Big Data world and its tendency to treat Hadoop as a panacea.

Second, Nathan Marz was one of only three engineers who built the BackType search engine (the company was acqui-hired by Twitter):

BackType captures online conversations, everything from tweets to blog comments to checkins and Facebook interactions. Its business is aimed at helping marketers and others understand those conversations by measuring them in a lot of ways, which means processing a massive amount of data.
To give you an idea of the scale of its task, it has about 25 terabytes of compressed binary data on its servers, holding over 100 billion individual records. Its API serves 400 requests per second on average, and it has 60 EC2 servers around at all times, scaling up to 150 for peak loads.
It has pulled this off with only seed funding and just three employees: Christopher Golda, Michael Montano and Nathan Marz. They’re all engineers, so there’s not even any sysadmins to take some of the load.

Note: BackType’s (now open-sourced) real-time data processing engine Storm powers Twitter’s analytics product and real-time trends, among other things.

Lambda Architecture

A wise man once told me that programming is about managing complexity, and that is exactly why I love Nathan Marz’s approach to Big Data, which he calls the “Lambda Architecture”:

[Figure: Lambda Architecture diagram]

There are three layers, plus a final step that combines their results:

  1. Batch layer
    • The fundamental shift when designing a Big Data system, according to Nathan Marz, is to keep an append-only data set: once data is written, it is never altered; you can only add more to the set. The master data is immutable.
    • Each granular piece of data should be accompanied by a timestamp and should be uniquely identifiable. Add a nonce if required to ensure that each row is unique, so that inserts into the database can be idempotent.
      • This allows us to view the data as it was at any instant in time.
    • Schemas are written using Apache Thrift.
      • This enables us to validate the data before storage.
      • Thrift can generate wrappers for every programming language which makes the system language-agnostic for us.
    • We ask questions of our system by querying the precomputed views (called batch views), which are simply aggregated information generated from the master data, similar to indexes in RDBMSes for fast lookups.
      • Hadoop MapReduce can be used to precompute the batch views; that is exactly what MapReduce is designed for.
      • The batch views are regenerated from the master data continuously, so once a batch view has been generated, the next cycle of computing starts immediately. We’ll see why this is important.
    • This layer takes care of the storage of the data.
  2. Serving layer
    • The serving layer is a specialized distributed database that loads in batch views, makes them queryable, and continuously swaps in new versions of a batch view as they’re computed by the batch layer.
    • Queries are answered from these precomputed batch views, which are indexed for fast random reads.
    • This layer takes care of the fast queries on the data.
  3. Speed layer
    • The speed layer compensates for the last few hours of data, which arrived after the batch layer took its “snapshot” for computation.
    • The speed layer is similar to the batch layer in that it produces views from the data it receives, but the biggest difference is that it applies incremental updates to incoming realtime data instead of recomputing from scratch.
    • Once data makes it through the batch layer into the serving layer, the corresponding results in the realtime views are no longer needed, so those pieces of the realtime view can be discarded.
    • If anything goes wrong, you can discard the state of the entire speed layer and everything will be back to normal within a few hours. This “complexity isolation” property greatly limits the potential negative impact of the speed layer’s complexity.
    • This layer takes care of the queries including real-time data.
  4. Combining Results
    • The last piece of the Lambda Architecture is merging the results from the batch (serving layer) and realtime (speed layer) views to quickly compute query functions.
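
The whole flow can be sketched in miniature in Python, with plain dicts standing in for HDFS, the serving database, and the realtime store. This is an illustrative toy, not code from the book; all names (record_pageview, compute_batch_view, and so on) are my own:

```python
import time
import uuid

# Immutable master dataset: an append-only list of facts.
# Each fact carries a timestamp and a nonce, so every row is unique,
# inserts are idempotent, and the data can be replayed as of any instant.
master_data = []

def record_pageview(url):
    master_data.append({
        "url": url,
        "timestamp": time.time(),
        "nonce": uuid.uuid4().hex,  # guarantees uniqueness of each row
    })

# Batch layer: recompute the batch view from scratch over ALL master data.
# In a real system this would be a Hadoop MapReduce job writing its
# output to the serving layer.
def compute_batch_view(data):
    view = {}
    for fact in data:
        view[fact["url"]] = view.get(fact["url"], 0) + 1
    return view

# Speed layer: incrementally update a realtime view for facts that
# arrive after the batch run started.
realtime_view = {}

def update_realtime_view(fact):
    realtime_view[fact["url"]] = realtime_view.get(fact["url"], 0) + 1

# Query: merge the complete-but-stale batch view with the
# fresh-but-partial realtime view.
def pageviews(url, batch_view):
    return batch_view.get(url, 0) + realtime_view.get(url, 0)
```

A pageview recorded after the batch run shows up only via the realtime view; once the next batch cycle absorbs it, the corresponding realtime state can simply be discarded.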

Some of the benefits of this architecture compared to traditional database systems are:

  • Storage and querying concerns are kept separate. This itself is a big win, IMHO.
  • Human errors in computation can be easily fixed in a few hours because the batch views are regenerated every few hours (time taken depends on your calculations and data size).
  • Scalable (“Scalability is the ability of a system to maintain performance under increased load by adding more resources.”)
  • Real-time view of the data
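
The point about human errors deserves a tiny illustration. Because the master data is never mutated, a bug in view code is fixed by correcting the function and regenerating the view; nothing of value is lost. A minimal sketch, with illustrative data and names that are not from the book:

```python
# Master data: immutable facts. Even while a buggy view was being
# served, this source of truth was never altered.
master_data = [
    {"url": "/home", "bytes": 512},
    {"url": "/home", "bytes": 1024},
]

def buggy_view(data):
    # Human error: counts records instead of summing bytes.
    return {"total_bytes": len(data)}

def fixed_view(data):
    return {"total_bytes": sum(fact["bytes"] for fact in data)}

# Recovery is just the next batch cycle: throw the bad view away and
# regenerate it from the untouched master data.
view = fixed_view(master_data)
```

Contrast this with a mutable database, where a buggy update may have destroyed the only copy of the correct values.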

Understanding this architecture at a high level is one thing; its beauty is really appreciated only by internalizing it through the details in the book.

Note that the book is still a work in progress (MEAP version 7 as of this writing) and has already helped me understand Big Data architectures better than before. I am very much looking forward to the chapters in the latter half of the book.

To figure out if the book is relevant to you, I would recommend watching Nathan Marz’s presentation “Runaway Complexity in Big Data, and a Plan to Stop It” along with the slides.