One thing I plan to do here is jot down ideas I don’t have time to work on myself. Here’s the first installment in what will hopefully be a running series of “Research Gimme’s”. Anybody who wants to run with this, I’d love to hear what you’re up to.
So…. who’s going to re-examine Online Aggregation in the Hadoop context? Goodness knows it’d be useful. It will require moving Hadoop beyond a slavish implementation of the Google MapReduce paper. That’s got to be a good thing… Here’s the start of the program:
- implement hash-grouping for reduce instead of sort-grouping (yes, this goes against the Google MR assumptions)
- provide an API for outputting “early returns” from reducers
- decide how/when to stream early returns through to the next Map phase. This will require revising the inter-task disk-staging that the Google paper suggested.
- streaming data through M/R tasks will strain the Google/Hadoop model for task recovery. That’s good fun to think about. If you want to consider a fancy/expensive solution, have a look at Mehul Shah’s two FLuX papers [ICDE03] [SIGMOD04]. But there are other points in the design space that would be easier/cheaper and still support streaming and recovery.
No doubt there’s more to it than that — which will make it even more fun to work on. And think of all the multi-hour jobs you could approximate in seconds.
On a similar note, what about adding Stream query support to Hadoop? Would share a lot of the infrastructure. For continuous streams, though, you really do need to think about FT solutions like FLuX.
4 Comments
Interesting ideas. How does hash-grouping work with variable size clusters?
David — Not sure if I’m interpreting the question right — could be “cluster” of machines, or a “cluster” (group) of data.
The Google MapReduce model in Hadoop already hashes across a chosen number of machines in a cluster of computers, which may have varying membership (due to failures, for example). The master node keeps track of machine liveness and chooses a degree of parallelism.
If you’re asking about doing grouping using a hash algorithm rather than sorting, it’s a single-input variant of the classic Hash Join technique. There’s an early paper by Bratbergsengen in VLDB 84 on this. I did some follow-on work in SIGMOD 96 (applied to function caching but usable for Group-BY/Reduce as well).
It seems like this should be a cost-based decision – I smell databases :)
Jeff Hammerbacher points out a relevant demo from VLDB 2008 by Logothetis and Yocum. It is closer to SQL window aggregates than Online Agg, but has some of the right flavor (e.g. pipelining).