But it doesn’t have to be that way. Introducing HOP: the Hadoop Online Prototype [updated link to final NSDI ’10 version]. With modest changes to the structure of Hadoop, we were able to convert it from a batch-processing system to an interactive, online system that can provide features like “early returns” from big jobs, and continuous data stream processing, while preserving the simple MapReduce programming and fault tolerance models popularized by Google and Hadoop. And by the way, it exposes pipeline parallelism that can even make batch jobs finish faster. This is a project led by Tyson Condie, in collaboration with folks at Berkeley and Yahoo! Research.
Background: Parallel database engines have always been able to stream out results to certain queries while they run. And a bunch of us did research over the years to leverage that feature to do more things, like online aggregation (for “early returns” from inherently batch jobs), and continuous queries (for processing infinitely streaming data inputs, producing infinitely streaming data outputs). So why not do the same for MapReduce engines? Why are they limited to batch processing?
The natural response is that the Dean/Ghemawat MapReduce design from Google enables high availability and performance reliability at unheard-of scale, via a simple checkpoint/restart fault-tolerance mechanism that requires batch-writing things to disk. And I have to say, their approach has an elegant simplicity of mechanism, especially as compared to our work on FLuX from the same year. (FLuX = [Fault-tolerant] [Load balancing] [eXchange]. We tackled the same challenges without resorting to batch processing, via a process-pairs mechanism. But in fairness, FLuX’s ambitions make it much more complex than the D/G approach.)
So in our recent work, we tried to preserve the D/G fault-tolerance model already in Hadoop, while also providing the pipelining of results across tasks and jobs.
HOP is infrastructure. There’s lots more to do now that it’s out there — from systems work (e.g. adaptive dataflow processing and optimization, efficient streaming aggregates, intelligent scheduling…) to algorithmics (sampling, robust estimators and confidence intervals, anytime machine learning techniques…). We hope to get the HOP stuff into the Hadoop distribution, but in the meantime get in touch with us if you’re eager to work with the code.