Hadoop MapReduce is a batch-processing system. Why? Because that’s the way Google described their MapReduce implementation.
But it doesn’t have to be that way. Introducing HOP: the Hadoop Online Prototype. With modest changes to the structure of Hadoop, we were able to convert it from a batch-processing system to an interactive, online system that can provide features like “early returns” from big jobs, and continuous data stream processing, while preserving the simple MapReduce programming and fault tolerance models popularized by Google and Hadoop. And by the way, it exposes pipeline parallelism that can even make batch jobs finish faster. This is a project led by Tyson Condie, in collaboration with folks at Berkeley and Yahoo! Research.
Background: Parallel database engines have always been able to stream out results to certain queries while they run. And a bunch of us did research over the years to leverage that feature to do more things, like online aggregation (for “early returns” from inherently batch jobs), and continuous queries (for processing infinitely streaming data inputs, producing infinitely streaming data outputs). So why not do the same for MapReduce engines? Why are they limited to batch processing?
The natural response is that the Dean/Ghemawat MapReduce design from Google enables high availability and performance reliability at unheard-of scale, via a simple checkpoint/restart fault-tolerance mechanism that requires batch-writing things to disk. And I have to say, their approach has an elegant simplicity of mechanism, especially as compared to our work on FLuX from the same year. (FLuX = [Fault-tolerant] [Load balancing] [eXchange]. We tackled the same challenges without resorting to batch processing, via a process-pairs mechanism. But in fairness, FLuX’s ambitions make it much more complex than the D/G approach.)
So in our recent work, we tried to preserve the D/G fault-tolerance model already in Hadoop, while also providing the pipelining of results across tasks and jobs.
HOP is infrastructure. There’s lots more to do now that it’s out there — from systems work (e.g. adaptive dataflow processing and optimization, efficient streaming aggregates, intelligent scheduling…) to algorithmics (sampling, robust estimators and confidence intervals, anytime machine learning techniques…). We hope to get the HOP stuff into the Hadoop distribution, but in the meantime get in touch with us if you’re eager to work with the code.
One Comment
More coverage in O’Reilly Radar
7 Trackbacks/Pingbacks
[...] MapReduce Online! (and some gimmes) « Data Beta databeta.wordpress.com/2009/10/18/mapreduce-online – view page – cached * Posted in database, gimmes, map reduce, parallelism, research * Tagged hadoop, HOP, online aggregation, stream queries — From the page [...]
Social comments and analytics for this post…
This post was mentioned on Twitter by neil_conway: Blog post on our pipelined version of Hadoop, which does running approx answers and stream processing (continuous M/R): http://tr.im/CgSE...
[...] MapReduce Online! (and some gimmes) Data Beta [...]
[...] who is a Ph.D student in UC Berkeley, accomplishes MapReduce Online. Today, I heard this news from Data Beta. Actually, It is amazing works since the original MapReduce is specialized and designed for only [...]
[...] 19, 2009 Interesting paper about changes to M-R, in part to enable online processing. Ideas like pipelining and better [...]
[...] MapReduce Online! (and some gimmes) Hadoop MapReduce is a batch-processing system. Why? Because that’s the way Google described their MapReduce [...] [...]
[...] who is a Ph.D student in UC Berkeley accomplishes MapReduce Online. Today, I heard this news from Data Beta. Actually, It is amazing works since the original MapReduce is specialized and designed for only [...]