One thing I plan to do here is jot down ideas I don’t have time to work on myself. Here’s the first installment in what will hopefully be a running series of “Research Gimme’s”. Anybody who wants to run with this, I’d love to hear what you’re up to.

So… who’s going to re-examine Online Aggregation in the Hadoop context? Goodness knows it’d be useful. It will require moving Hadoop beyond a slavish implementation of the Google MapReduce paper. That’s got to be a good thing… Here’s the start of the program:

  1. implement hash-grouping for reduce instead of sort-grouping (yes, this goes against the Google MR assumptions)
  2. provide an API for outputting “early returns” from reducers
  3. decide how/when to stream early returns through to the next Map phase. This will require revising the inter-task disk-staging that the Google paper suggested.
  4. streaming data through M/R tasks will strain the Google/Hadoop model for task recovery. That’s good fun to think about. If you want to consider a fancy/expensive solution, have a look at Mehul Shah’s two FLuX papers [ICDE03] [SIGMOD04]. But there are other points in the design space that would be easier/cheaper and still support streaming and recovery.
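To make steps 1 and 2 concrete, here’s a minimal sketch (plain Python, not Hadoop code) of a reducer that groups by hashing rather than sorting, and periodically yields “early return” snapshots of its running aggregates. The function and parameter names are illustrative, not an actual API:

```python
# Hypothetical sketch: hash-grouping reduce with early returns.
# Groups arrive in a hash table (no sort pass required), and every
# `emit_every` inputs we emit a snapshot that a downstream phase
# could consume before the input is exhausted.
from collections import defaultdict

def hash_grouping_reduce(pairs, combine, emit_every):
    """Group (key, value) pairs in a hash table; yield snapshots of
    the running aggregates every `emit_every` inputs, then the final."""
    groups = defaultdict(list)  # in-memory; a real impl would spill to disk
    for i, (key, value) in enumerate(pairs, start=1):
        groups[key].append(value)
        if i % emit_every == 0:
            # early return: an approximate answer over the data seen so far
            yield {k: combine(vs) for k, vs in groups.items()}
    yield {k: combine(vs) for k, vs in groups.items()}  # final answer

# Example: running sums over a small stream of (word, count) pairs
stream = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
snapshots = list(hash_grouping_reduce(stream, sum, emit_every=2))
# snapshots[0] == {"a": 1, "b": 2}; snapshots[-1] == {"a": 9, "b": 6}
```

The interesting systems work is everything this toy elides: spilling the hash table, tagging snapshots with statistical confidence (as in the original Online Aggregation work), and deciding when to push them downstream (step 3).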

No doubt there’s more to it than that — which will make it even more fun to work on. And think of all the multi-hour jobs you could approximate in seconds.

On a similar note, what about adding stream query support to Hadoop? It would share a lot of the infrastructure. For continuous streams, though, you really do need to think about fault-tolerance solutions like FLuX.
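The stream-query flavor is easy to see in miniature: instead of one final answer per group, a continuous query maintains an answer over a moving window. A toy sliding-window aggregate (illustrative names, no Hadoop involved):

```python
# Hypothetical sketch of a continuous windowed aggregate: emit the
# sum of the last `window` values after each new arrival, the kind
# of always-current answer a streaming reduce would maintain.
from collections import deque

def windowed_sum(stream, window):
    buf = deque(maxlen=window)  # old values fall off automatically
    for x in stream:
        buf.append(x)
        yield sum(buf)

assert list(windowed_sum([1, 2, 3, 4], window=2)) == [1, 3, 5, 7]
```

Unlike the batch case, this computation never “finishes,” which is exactly why recovery-by-rerunning-the-task stops being a viable fault-tolerance story.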


  1. Interesting ideas. How does hash-grouping work with variable size clusters?

  2. David — Not sure if I’m interpreting the question right — could be “cluster” of machines, or a “cluster” (group) of data.

    The Google MapReduce model in Hadoop already hashes across a chosen number of machines in a cluster of computers, which may have varying membership (due to failures, for example). The master node keeps track of machine liveness and chooses a degree of parallelism.

    If you’re asking about doing grouping using a hash algorithm rather than sorting, it’s a single-input variant of the classic Hash Join technique. There’s an early paper by Bratbergsengen in VLDB ’84 on this. I did some follow-on work in SIGMOD ’96 (applied to function caching, but usable for GROUP BY/Reduce as well).

  3. It seems like this should be a cost-based decision – I smell databases :)

  4. Jeff Hammerbacher points out a relevant demo from VLDB 2008 by Logothetis and Yocum. It is closer to SQL window aggregates than Online Agg, but has some of the right flavor (e.g. pipelining).
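As an addendum to comment 2: the contrast between sort-grouping and hash-grouping fits in a few lines. Both produce the same groups; hashing just drops the ordering requirement (a toy illustration, not Hadoop code):

```python
# Sort-grouping scans sorted input and cuts a group at each key change;
# hash-grouping buckets rows directly, with no sort pass.
from collections import defaultdict
from itertools import groupby

def sort_group(pairs):
    out = {}
    for key, run in groupby(sorted(pairs), key=lambda kv: kv[0]):
        out[key] = [v for _, v in run]
    return out

def hash_group(pairs):
    out = defaultdict(list)
    for key, value in pairs:
        out[key].append(value)
    return dict(out)

rows = [("x", 2), ("y", 1), ("x", 7)]
assert sort_group(rows) == hash_group(rows) == {"x": [2, 7], "y": [1]}
```

The payoff for online aggregation is that the hash table holds usable partial aggregates at every point in the scan, whereas sort-grouping can’t finish any group until the sort completes.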
