
The first of two invited posts at GigaOm is up.  These are not researchy; they’re intended to be informative to a broad audience.  They describe the state of affairs in data parallelism and some of the reasons why this is an increasingly hot topic.

This started out as an exercise for Greenplum, a company I advise that sells a massively parallel DBMS based on PostgreSQL.  I’ve been helping them with their recent launch of a MapReduce interface to their system.  That’s been an interesting project. I’ll write about it more soon.

Along the way, they asked if I’d write a blog post for them about parallelism, SQL and MapReduce to put things into perspective.  I sat down to write a few paragraphs on the subject and ended up with a seven-page essay.  That was too long for a blog post, so I turned it into a Tech Report (a.k.a. a white paper, in industrial terms).  We excerpted it for GigaOm to run in a couple of posts.  The original is more nuanced and playful, but hey, blogging isn’t about seven-page essays.  I’ll try to control myself here too, and stick with a few paragraphs per post.  And if that causes me to write more tech reports, so be it: I’ll link them in.


3 Comments

  1. It would be great if you could document equivalent concepts to MapReduce that have been explored in academia.

  2. I’m not sure how to define equivalent, but any discussion of parallel dataflow software should start with Gray and DeWitt’s CACM survey on parallel databases. You can work backwards from there through the Gamma and Bubba projects. More recently there’s Dryad from Microsoft Research. I keep meaning to read the Clustera work from Wisconsin. And there’s been a variety of academic work around Hadoop itself in the last year or two, including both internals issues and languages (Pig/JAQL/etc). The Hadoop-centric stuff should be relatively easy to find with a web search.

    MapReduce is not so far from data streams. Our work on FLuX for Fault-tolerant and Load-balanced parallel data streams came out about the same time as Google’s MapReduce, and it’s interesting to compare them. FLuX grew out of earlier work by the Arpaci-Dusseaus on River and NOW-Sort.

  3. There’s also the Active Data Repository project (http://www.cs.umd.edu/projects/hpsl/chaos/ResearchAreas/adr/) that proposed using a MapReduce-like paradigm to represent certain computations on multi-dimensional scientific data.

