Skip navigation

oscilloHadoop MapReduce is a batch-processing system.  Why?  Because that’s the way Google described their MapReduce implementation.

But it doesn’t have to be that way. Introducing HOP: the Hadoop Online Prototype [updated link to final NSDI ’10 version]. With modest changes to the structure of Hadoop, we were able to convert it from a batch-processing system to an interactive, online system that can provide features like “early returns” from big jobs, and continuous data stream processing, while preserving the simple MapReduce programming and fault tolerance models popularized by Google and Hadoop.  And by the way, it exposes pipeline parallelism that can even make batch jobs finish faster.  This is a project led by Tyson Condie, in collaboration with folks at Berkeley and Yahoo! Research.

Background: Parallel database engines have always been able to stream out results to certain queries while they run.  And a bunch of us did research over the years to leverage that feature to do more things, like online aggregation (for “early returns” from inherently batch jobs), and continuous queries (for processing infinitely streaming data inputs, producing infinitely streaming data outputs).  So why not do the same for MapReduce engines?  Why are they limited to batch processing?

The natural response is that the Dean/Ghemawat MapReduce design from Google enables high availability and performance reliability at unheard-of scale, via a simple checkpoint/restart fault-tolerance mechanism that requires batch-writing things to disk.  And I have to say, their approach has an elegant simplicity of mechanism, especially as compared to our work on FLuX from the same year. (FLuX = [Fault-tolerant] [Load balancing] [eXchange].  We tackled the same challenges without resorting to batch processing, via a process-pairs mechanism.  But in fairness, FLuX’s ambitions make it much more complex than the D/G approach.)

So in our recent work, we tried to preserve the D/G fault-tolerance model already in Hadoop, while also providing the pipelining of results across tasks and jobs.

HOP is infrastructure.  There’s lots more to do now that it’s out there — from systems work (e.g. adaptive dataflow processing and optimization, efficient streaming aggregates, intelligent scheduling…) to algorithmics (sampling, robust estimators and confidence intervals, anytime machine learning techniques…).  We hope to get the HOP stuff into the Hadoop distribution, but in the meantime get in touch with us if you’re eager to work with the code.

Advertisements

2 Comments

  1. More coverage in O’Reilly Radar

  2. Paper accepted to NSDI 2011! Will post camera-ready version when available.


8 Trackbacks/Pingbacks

  1. […] MapReduce Online! (and some gimmes) « Data Beta databeta.wordpress.com/2009/10/18/mapreduce-online – view page – cached * Posted in database, gimmes, map reduce, parallelism, research * Tagged hadoop, HOP, online aggregation, stream queries — From the page […]

  2. By uberVU - social comments on 18 Oct 2009 at 9:48 pm

    Social comments and analytics for this post…

    This post was mentioned on Twitter by neil_conway: Blog post on our pipelined version of Hadoop, which does running approx answers and stream processing (continuous M/R): http://tr.im/CgSE

  3. […] MapReduce Online! (and some gimmes) Data Beta […]

  4. […] who is a Ph.D student in UC Berkeley, accomplishes MapReduce Online. Today, I heard this news from Data Beta. Actually, It is amazing works since the original MapReduce is specialized and designed for only […]

  5. […] 19, 2009 Interesting paper about changes to M-R, in part to enable online processing. Ideas like pipelining and better […]

  6. By Top Posts « WordPress.com on 19 Oct 2009 at 5:43 pm

    […] MapReduce Online! (and some gimmes) Hadoop MapReduce is a batch-processing system.  Why?  Because that’s the way Google described their MapReduce […] […]

  7. By MapReduce Online Comes Out! | The Chung Lab. on 19 Oct 2009 at 7:16 pm

    […] who is a Ph.D student in UC Berkeley accomplishes MapReduce Online. Today, I heard this news from Data Beta. Actually, It is amazing works since the original MapReduce is specialized and designed for only […]

  8. By Props to the Team « Data Beta on 22 Jan 2010 at 6:09 pm

    […] work on Hadoop Online (HOP) was accepted to NSDI 2010.  This was driven by Tyson Condie, with yeoman work from the BOOM gang […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: