One thing I plan to do here is jot down ideas I don’t have time to work on myself. Here’s the first installment in what will hopefully be a running series of “Research Gimmes”. If anybody wants to run with this, I’d love to hear what you’re up to.
So… who’s going to re-examine Online Aggregation in the Hadoop context? Goodness knows it’d be useful. It will require moving Hadoop beyond a slavish implementation of the Google MapReduce paper. That’s got to be a good thing… Here’s the start of the program:
- implement hash-grouping for reduce instead of sort-grouping (yes, this goes against the Google MR assumptions)
- provide an API for outputting “early returns” from reducers
- decide how/when to stream early returns through to the next Map phase. This will require revising the inter-task disk-staging that the Google paper suggested.
- streaming data through M/R tasks will strain the Google/Hadoop model for task recovery. That’s good fun to think about. If you want to consider a fancy/expensive solution, have a look at Mehul Shah’s two FLuX papers [ICDE03] [SIGMOD04]. But there are other points in the design space that would be easier/cheaper and still support streaming and recovery.
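To make the first two bullets concrete, here is a rough sketch in plain Python (not Hadoop code) of a reducer that groups by hash table instead of by sort and hands back “early returns” as it goes. Everything in it is hypothetical: the function name `hash_grouped_reduce`, the `emit_every` parameter, and the choice of a running per-key average as the aggregate are illustrative assumptions, not an existing Hadoop API. The early numbers are only sensible estimates if the records arrive in roughly random order, which is the usual Online Aggregation assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def hash_grouped_reduce(
    pairs: Iterable[Tuple[str, float]],
    emit_every: int = 1000,
) -> Iterator[Tuple[str, Dict[str, float]]]:
    """Hypothetical reducer: hash-group (key, value) pairs and stream estimates.

    Yields ("early", snapshot) after every `emit_every` input records,
    then one ("final", snapshot) once the input is exhausted. Each snapshot
    maps key -> running average of the values seen so far for that key.
    """
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    seen = 0

    for key, value in pairs:
        # Hash-grouping: update the per-key state in place. No sort is
        # needed, so a group never has to wait for the end of the input.
        sums[key] += value
        counts[key] += 1
        seen += 1

        if seen % emit_every == 0:
            # Early return: a snapshot of the current running averages.
            yield "early", {k: sums[k] / counts[k] for k in sums}

    # Final answer over the complete input.
    yield "final", {k: sums[k] / counts[k] for k in sums}


if __name__ == "__main__":
    data = [("a", 1.0), ("b", 10.0), ("a", 3.0), ("b", 20.0), ("a", 5.0)]
    for kind, snapshot in hash_grouped_reduce(data, emit_every=2):
        print(kind, snapshot)
```

The sketch keeps all per-key state in memory, which is exactly the assumption sort-grouping avoids; a real version would need a spilling strategy for large key sets, plus some story for what happens to already-emitted early returns when a task is restarted.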
No doubt there’s more to it than that — which will make it even more fun to work on. And think of all the multi-hour jobs you could approximate in seconds.
On a similar note, what about adding stream query support to Hadoop? It would share a lot of the infrastructure. For continuous streams, though, you really do need to think about fault-tolerance (FT) solutions like FLuX.