

For the last year or so, my team at Berkeley — in collaboration with Yahoo Research — has been undertaking an aggressive experiment in programming.  The challenge is to design a radically easier programming model for infrastructure and applications in the next computing platform: The Cloud.  We call this the Berkeley Orders Of Magnitude (BOOM) project: enabling programmers to develop OOM bigger systems in OOM less code.

To kick this off we built something we call BOOM Analytics [link updated to Eurosys10 final version]: a clone of Hadoop and HDFS built largely in Overlog, a declarative language we developed some years back for network protocols.  BOOM Analytics is just as fast and scalable as Hadoop, but radically simpler in its structure.  As a result we were able — with amazingly little effort — to turbocharge our incarnation of the elephant with features that would be enormous upgrades to Hadoop’s Java codebase.  Two of the fanciest are:

  • High availability: Hadoop has a single point of failure at its HDFS master (“name”) node.  BOOM Analytics provides hot-standby name node failover, courtesy of a concise Overlog implementation of MultiPaxos.
  • Scale-out: Hadoop has two scalability chokepoints at its master nodes (the HDFS name node and the MapReduce job tracker), each of which must run on a single box.  BOOM Analytics provides name node scale-out via data partitioning of the namespace.  When the name node fills up, you just buy more machines and repartition.  And this composes with the previous feature: each partition enjoys high availability via MultiPaxos.
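To make the scale-out idea concrete, here is a minimal sketch in Python of a namespace partitioned by a stable hash of the file path, with a "buy more machines and repartition" step.  Everything here — the `NamePartition` and `PartitionedNamespace` names, hashing on the full path, and rehash-everything repartitioning — is an illustrative assumption, not BOOM's actual Overlog implementation or partitioning policy.

```python
# Illustrative sketch only: hash-partitioning a file-system namespace
# across multiple name-node partitions.  The class names, the choice to
# hash the full path, and the naive rehash-on-repartition are assumptions
# for illustration, not BOOM's actual design.

import hashlib

class NamePartition:
    """One shard of the namespace: maps file paths to block lists."""
    def __init__(self):
        self.files = {}  # path -> list of block ids

    def create(self, path, blocks):
        self.files[path] = blocks

    def lookup(self, path):
        return self.files.get(path)

class PartitionedNamespace:
    """Routes each path to a partition via a stable hash of the path."""
    def __init__(self, num_partitions):
        self.partitions = [NamePartition() for _ in range(num_partitions)]

    def _index(self, path):
        digest = hashlib.md5(path.encode()).hexdigest()
        return int(digest, 16) % len(self.partitions)

    def create(self, path, blocks):
        self.partitions[self._index(path)].create(path, blocks)

    def lookup(self, path):
        return self.partitions[self._index(path)].lookup(path)

    def repartition(self, new_num_partitions):
        """'Buy more machines': rebuild with more partitions and rehash."""
        new = PartitionedNamespace(new_num_partitions)
        for part in self.partitions:
            for path, blocks in part.files.items():
                new.create(path, blocks)
        return new

ns = PartitionedNamespace(2)
ns.create("/user/alice/data.txt", ["blk_1", "blk_2"])
ns.create("/user/bob/log.txt", ["blk_3"])
bigger = ns.repartition(4)
assert bigger.lookup("/user/alice/data.txt") == ["blk_1", "blk_2"]
```

In the sketch, each partition is an independent metadata store, so each one could in principle be replicated by MultiPaxos exactly as the composition point above suggests; a production design would also avoid rehashing the entire namespace on every repartition.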

The whole effort of building BOOM Analytics took 12 months for 4 PhD students — including the time taken to write an Overlog interpreter from scratch.  The scaleout feature took one grad-student developer a day to implement.  One day.

BOOM Analytics is serious distributed fu.  But it really turned out to be pretty simple with the right programming model.

Why did we do this?  What’s next?  Well, you can read the tech report for details; just a couple of notes here.  First, we have no designs on displacing Hadoop.  BOOM Analytics is intended as an example of the power of declarative programming.  If it has any utility for the Hadoop community, it would probably be as a rapid-prototyping environment for new features.  The other point is that we’re not trying to sell Overlog as the “right” language for anything.  In fact, a key motivation for building BOOM Analytics was to gain some practical experience in order to design a better language.  We are calling that language Lincoln.  I look forward to talking about it more soon.

