I sat at Berkeley CS faculty lunch this past week with Brian Harvey and Dan Garcia, two guys who think hard about teaching computing to undergraduates. I was waxing philosophical about how we need to get data-centric thinking embedded deep into the initial CS courses—not just as an application of traditional programming, but as a key frame of reference for how to think about computing per se.
Dan pointed out that he and Brian and others took steps in this direction years ago at Berkeley, by introducing MapReduce and Hadoop in our initial 61A course. I have argued elsewhere that this is a Good Thing, because it gets people used to the kind of disorderly thinking needed for scaling distributed and data-centric systems.
But as a matter of both pedagogy and system design, I have begun to think that Google’s MapReduce model is not healthy for beginning students. The basic issue is that Google’s narrow MapReduce API conflates logical semantics (define a function over all items in a collection) with an expensive physical implementation (utilize a parallel barrier). As it happens, many common cluster-wide operations over a collection of items do not require a barrier even though they may require all-to-all communication. But there’s no way to tell the API whether a particular Reduce method has that property, so the runtime always does the most expensive thing imaginable in distributed coordination: global synchronization.
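To make the conflation concrete, here is a minimal sketch (function names are mine, not Hadoop's) contrasting the Reduce signature, which hands the reducer a fully materialized collection of values after a barrier, with the same logic written as an incremental fold that could consume values as they stream in:

```python
# Hadoop-style Reduce: the runtime delivers ALL values for a key at once,
# after a global barrier -- the API cannot express anything weaker.
def reduce_all_at_once(key, values):
    return sum(values)

# The same logic as an incremental fold: each value can be folded in as it
# arrives over the network, no barrier required. But the narrow Reduce API
# gives the runtime no way to know this is safe, so it synchronizes anyway.
def fold_step(accumulator, value):
    return accumulator + value
```

The point is not that summation is interesting; it is that the two versions are semantically identical, yet only the second exposes the incremental structure a streaming runtime could exploit.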
From an architectural point of view, a good language for parallelism should expose pipelining, and MapReduce hides it. Brian suggested I expand on this point somewhere so people could talk about it. So here we go.
The canonical example here is the relational join, which pairs up matching objects from two different input streams. One of the early observations in shared-nothing parallel databases was that parallel joins can work in a fully streaming fashion even though they require all-to-all communication (Wilschut and Apers’ Pipelining Hash Join). This is impossible to implement in stock Hadoop. Which means, for example, that the NSA—apparently a big Hadoop user—can’t use MapReduce to match live feeds of suspicious activity reports with a repository of intelligence on suspicious people. (Whether you think it’s good or bad that the NSA can’t use Hadoop for this task may depend on your politics regarding national security, open source, or both.)
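For readers who have not seen it, the Wilschut/Apers idea is easy to sketch. The symmetric (pipelining) hash join keeps a hash table per input and probes the opposite table as each tuple arrives, so matches stream out immediately with no build/probe barrier. A minimal single-node sketch (class and method names are mine):

```python
from collections import defaultdict

class SymmetricHashJoin:
    """Pipelining hash join: emits matches as tuples arrive on either input,
    with no barrier between a 'build' phase and a 'probe' phase."""
    def __init__(self):
        self.left = defaultdict(list)   # hash table on join key, left input
        self.right = defaultdict(list)  # hash table on join key, right input

    def on_left(self, key, value):
        self.left[key].append(value)
        # Probe the other side immediately: results stream out right away.
        return [(key, value, r) for r in self.right[key]]

    def on_right(self, key, value):
        self.right[key].append(value)
        return [(key, l, value) for l in self.left[key]]
```

In the parallel version, each node runs one of these on its hash partition of the key space; the all-to-all communication routes tuples to partitions, but nothing ever waits for an input to be exhausted.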
That example is easy to understand, but maybe too easy to dismiss as the special case of “stream processing”. Sure, Hadoop wasn’t designed for that, so maybe you should dismiss my point as saying that somebody should just build a streaming edition of Hadoop for special uses. (Oh yeah, somebody did that.)
A more telling pedagogical example is the standard discussion of implementing PageRank in MapReduce. What an awesome lecture: the two most famous pieces of Google technology in action at once! The thing is, at a logical level PageRank is a simple self-join query with transitive closure: it pairs up items in the Nodes table with their neighbors in that same table, and outputs new weights until a stopping condition is reached. The matching of nodes and neighbors needs to be hash-partitioned across the network, but there is no reason to require the entire cluster to do a barrier synchronization at each iteration boundary. The subtext of this lecture’s awesomeness is to deeply implant a misleading (and potentially very inefficient) idea about the need for synchronized execution in parallel linear algebra and query processing tasks. That sounds like a bad way to teach.
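To pin down what the per-iteration barrier buys (and doesn't), here is a toy synchronous PageRank step on a made-up three-node graph. The step is logically a self-join of the rank table with the neighbor table; the MapReduce framing ends each such step with a cluster-wide barrier, even though the join itself needs only hash-partitioned all-to-all routing of rank contributions:

```python
# Toy graph (hypothetical): node -> list of out-neighbors.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

def pagerank_iteration(ranks, graph, d=0.85):
    """One synchronous step: join each node with its neighbors and
    redistribute rank. In MapReduce terms, the inner loop is the Map
    (emit contributions) and the += is the Reduce (sum per node)."""
    n = len(graph)
    new = {node: (1 - d) / n for node in graph}
    for node, neighbors in graph.items():
        share = d * ranks[node] / len(neighbors)
        for nbr in neighbors:
            new[nbr] += share
    return new

ranks = {node: 1 / len(graph) for node in graph}
for _ in range(20):   # iterate to (near) fixpoint
    ranks = pagerank_iteration(ranks, graph)
```

Nothing in the contribution-summing step requires that every node finish iteration *k* before any node starts on iteration *k+1*; asynchronous formulations of this computation exist, which is exactly the option the MapReduce framing hides.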
And it goes deeper. At a very basic level, asynchronous parallel computation is fundamentally a matter of streaming rendezvous (join) between event channels and buffers. We converted this observation into a first-class feature of the Bloom language for distributed systems. When you write event-handling code in Bloom, you express it as the parallel join of a partitioned event stream with partitioned program state. Whether or not this metaphor makes sense to you (yet!), I assure you that it is fundamentally what goes on under the covers of any asynchronous distributed system: the matching of streams of requests or responses with stored state. By teaching students the Google MapReduce model (as represented in Hadoop), we teach them that distributed joins need to block, and therefore cannot be the basis of the kind of streaming computation that servers must do by their very nature. That is not only bad, it’s paradoxical: underneath the Hadoop implementation is a message handling dataflow that effectively does streaming joins.
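If the rendezvous metaphor feels abstract, a tiny sketch may help (this is my own illustrative Python, not Bloom): a key-value server's event loop is nothing more than a streaming join between a stream of requests and a table of stored state, matching on the key.

```python
# Hypothetical sketch: a server loop viewed as a streaming join between a
# request stream and partitioned program state, matching on the key.
state = {}  # stored state: key -> value

def handle(request):
    """Each arriving event rendezvouses (joins) with state on its key."""
    op, key, *rest = request
    if op == "put":
        state[key] = rest[0]
        return ("ack", key)
    elif op == "get":            # the join: request.key = state.key
        return ("val", key, state.get(key))

responses = [handle(r) for r in [("put", "x", 42), ("get", "x")]]
```

The join here never blocks waiting for "all" the requests, because there is no such thing: the request stream is unbounded, which is precisely why a barrier-based join cannot model it.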
So there—at Brian’s request I wrote down some cautionary words about using Hadoop in school. Lest somebody read this as my declaration of allegiance to the “get off my lawn” side of the AntiNoSQLdisestablishmentarianism argument, let me clarify my view on a few key things:
- Google MapReduce and the Hadoop open-source implementation opened the floodgates for discussing big data and parallelism in the core of the computing curriculum. This has been a critical step in moving computing education into the era of abundant computing and data resources. Bully for that.
- I do understand that Google put barriers into MapReduce for fault tolerance purposes. I also get that fault tolerance is important when scaling out parallel dataflows (we did some work on this) and that Google uses a lot of machines at once. But as a practical matter, many Hadoop users run on a small handful of nodes and don’t need Google-scale fault tolerance features. More to the point, when we teach computing, we shouldn’t focus on only one design point; we should focus on the fundamentals. This is a case where the fundamentals get taught in a tangled way that only makes sense at extreme scale.
- For the record, as languages go I like MapReduce about as much as I like SQL. That is to say I would invite them both to my birthday party. But I wouldn’t invite either to the prom. My sweetheart here is Bloom.
If you’ve been following our Bloom work, you know where this discussion is coming from: the CALM Theorem says that the real reason to use coordination is to manage non-monotonic reasoning. Many (most?) Reduce functions in the wild are in fact monotonic w.r.t. subsets of their inputs, but the standard Reduce API defies monotonicity analysis. And introducing barriers into monotonic computation is a waste of “time”, in the temporal logic sense of the word.
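A small illustration of the distinction (my own toy examples, not from the CALM papers): a reducer is monotonic if running it on a subset of its input yields a subset of its output, so partial results can stream out early; a non-monotonic reducer may have to retract earlier answers when more input arrives, and that is what genuinely requires a barrier.

```python
def monotone_reduce(key, values):
    """Emit every value above a threshold. Monotone: more input can only
    add outputs, never retract them, so results can stream out with no
    barrier."""
    return {v for v in values if v > 10}

def nonmonotone_reduce(key, values):
    """Emit the values tied for the maximum. Non-monotone: a late-arriving
    larger value retracts earlier outputs, so the output is only final
    after ALL input has arrived -- a barrier in disguise."""
    m = max(values)
    return {v for v in values if v == m}
```

The tragedy is that both functions present the identical `(key, values)` signature to a MapReduce runtime, which therefore has no choice but to treat the first as conservatively as the second.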