I sat at a Berkeley CS faculty lunch this past week with Brian Harvey and Dan Garcia, two guys who think hard about teaching computing to undergraduates. I was waxing philosophical about how we need to get data-centric thinking embedded deep into the initial CS courses—not just as an application of traditional programming, but as a key frame of reference for how to think about computing per se.
Dan pointed out that he and Brian and others took steps in this direction years ago at Berkeley, by introducing MapReduce and Hadoop in our initial 61A course. I have argued elsewhere that this is a Good Thing, because it gets people used to the kind of disorderly thinking needed for scaling distributed and data-centric systems.
But as a matter of both pedagogy and system design, I have begun to think that Google’s MapReduce model is not healthy for beginning students. The basic issue is that Google’s narrow MapReduce API conflates logical semantics (define a function over all items in a collection) with an expensive physical implementation (utilize a parallel barrier). As it happens, many common cluster-wide operations over a collection of items do not require a barrier even though they may require all-to-all communication. But there’s no way to tell the API whether a particular Reduce method has that property, so the runtime always does the most expensive thing imaginable in distributed coordination: global synchronization.
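To make the conflation concrete, here is a garden-variety sum reducer written against Hadoop's org.apache.hadoop.mapreduce API (a sketch for illustration, not course code). The signature hands the function every value for a key at once, after the shuffle has completed; there is no way to say anything weaker, such as "this aggregate can be folded in as values arrive."

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A garden-variety sum reducer. reduce() is not invoked until every map task has
// finished and the shuffle/sort for this partition is complete. The signature
// gives the runtime no hint that this particular aggregate could have been
// folded in incrementally as values streamed across the network.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
```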
From an architectural point of view, a good language for parallelism should expose pipelining, and MapReduce hides it. Brian suggested I expand on this point somewhere so people could talk about it. So here we go.
The canonical example here is the relational join, which pairs up matching objects from two different input streams. One of the early observations in shared-nothing parallel databases was that parallel joins can work in a fully streaming fashion even though they require all-to-all communication (Wilschut and Apers’ Pipelining Hash Join). This is impossible to implement in stock Hadoop. Which means, for example, that the NSA—apparently a big Hadoop user—can’t use MapReduce to match live feeds of suspicious activity reports with a repository of intelligence on suspicious people. (Whether you think it’s good or bad that the NSA can’t use Hadoop for this task may depend on your politics regarding national security, open source, or both.)
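For concreteness, here is a minimal single-node sketch of the symmetric (pipelining) hash join idea in plain Java; my illustration, not code from the paper. Each arriving tuple is inserted into its own side's hash table and immediately probes the other side, so matches stream out long before either input is finished. In a cluster, each node runs one of these over its hash partition of the join key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-site sketch of a symmetric ("pipelining") hash join. Tuples from either
// input may arrive in any order; each arrival probes the other side's hash table
// and emits matches right away. No barrier, no waiting for either input to end.
public class SymmetricHashJoin<K, L, R> {

  public interface MatchHandler<L, R> {
    void onMatch(L left, R right);
  }

  private final Map<K, List<L>> leftTable = new HashMap<>();
  private final Map<K, List<R>> rightTable = new HashMap<>();
  private final MatchHandler<L, R> handler;

  public SymmetricHashJoin(MatchHandler<L, R> handler) {
    this.handler = handler;
  }

  // A tuple arrives on the left input stream.
  public void insertLeft(K key, L tuple) {
    leftTable.computeIfAbsent(key, k -> new ArrayList<>()).add(tuple);
    for (R match : rightTable.getOrDefault(key, List.of())) {
      handler.onMatch(tuple, match);   // emit results as soon as they exist
    }
  }

  // A tuple arrives on the right input stream.
  public void insertRight(K key, R tuple) {
    rightTable.computeIfAbsent(key, k -> new ArrayList<>()).add(tuple);
    for (L match : leftTable.getOrDefault(key, List.of())) {
      handler.onMatch(match, tuple);
    }
  }
}
```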
That example is easy to understand, but maybe too easy to dismiss as the special case of “stream processing”. Sure, Hadoop wasn’t designed for that, so maybe you should dismiss my point as saying that somebody should just build a streaming edition of Hadoop for special uses. (Oh yeah, somebody did that.)
A more telling pedagogical example is the standard discussion of implementing PageRank in MapReduce. What an awesome lecture: the two most famous pieces of Google technology in action at once! The thing is, at a logical level PageRank is a simple self-join query with transitive closure: it pairs up items in the Nodes table with their neighbors in that same table, and outputs new weights until a stopping condition is reached. The matching of nodes and neighbors needs to be hash-partitioned across the network, but there is no reason to require the entire cluster to do a barrier synchronization at each iteration boundary. The subtext of this lecture’s awesomeness is to deeply implant a misleading (and potentially very inefficient) idea about the need for synchronized execution in parallel linear algebra and query processing tasks. That sounds like a bad way to teach.
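To see the logical structure without the MapReduce scaffolding, here is a sketch of a single PageRank iteration in plain Java (my illustration; a parallel implementation would hash-partition the ranks and edges by node id). It is nothing more than a join of the edge list with the current ranks on the source node, followed by a grouped sum on the destination node.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One PageRank iteration, written to expose its logical shape: join edges with
// the current ranks on the source node, then group by destination and sum.
// (Dangling-node mass is ignored for brevity.) Nothing here demands that an
// entire cluster hit a barrier before any partition begins the next iteration;
// that requirement comes from the MapReduce execution model, not from PageRank.
public class PageRankStep {
  static final double DAMPING = 0.85;

  // edges: src -> list of dst (every src also appears in ranks); ranks: node -> current rank
  static Map<String, Double> step(Map<String, List<String>> edges,
                                  Map<String, Double> ranks) {
    int n = ranks.size();
    Map<String, Double> next = new HashMap<>();
    for (String node : ranks.keySet()) {
      next.put(node, (1 - DAMPING) / n);            // teleport term
    }
    for (Map.Entry<String, List<String>> e : edges.entrySet()) {
      double share = DAMPING * ranks.get(e.getKey()) / e.getValue().size();
      for (String dst : e.getValue()) {             // "join" edges with ranks on src
        next.merge(dst, share, Double::sum);        // "group by" dst and sum
      }
    }
    return next;
  }
}
```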
And it goes deeper. At a very basic level, asynchronous parallel computation is fundamentally a matter of streaming rendezvous (join) between event channels and buffers. We converted this observation into a first-class feature of the Bloom language for distributed systems. When you write event-handling code in Bloom, you express it as the parallel join of a partitioned event stream with partitioned program state. Whether or not this metaphor makes sense to you (yet!), I assure you that it is fundamentally what goes on under the covers of any asynchronous distributed system: the matching of streams of requests or responses with stored state. By teaching students the Google MapReduce model (as represented in Hadoop), we teach them that distributed joins need to block, and therefore cannot be the basis of the kind of streaming computation that servers must do by their very nature. That is not only bad, it’s paradoxical: underneath the Hadoop implementation is a message handling dataflow that effectively does streaming joins.
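Here is what that claim looks like in plain Java (not Bloom, and deliberately a caricature): a server's event loop is a streaming join of the request stream against partitioned state, with requests buffered until their matching state shows up. Squint and it is the symmetric hash join above, with one input playing the role of long-lived program state.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.BiConsumer;

// Caricature of "event handling is a streaming join": requests are matched
// against stored state on a key; requests that arrive before their state exists
// are buffered and flushed when the state appears.
public class RequestStateJoin<K, V, R> {
  private final Map<K, V> state = new HashMap<>();
  private final Map<K, Queue<R>> pending = new HashMap<>();
  private final BiConsumer<R, V> respond;   // what to do with each match

  public RequestStateJoin(BiConsumer<R, V> respond) {
    this.respond = respond;
  }

  // An event arrives on the request stream.
  public void onRequest(K key, R request) {
    V value = state.get(key);
    if (value != null) {
      respond.accept(request, value);                 // match immediately
    } else {
      pending.computeIfAbsent(key, k -> new ArrayDeque<>()).add(request);
    }
  }

  // An update arrives on the state side.
  public void onStateUpdate(K key, V value) {
    state.put(key, value);
    Queue<R> waiting = pending.remove(key);
    if (waiting != null) {
      waiting.forEach(r -> respond.accept(r, value)); // flush buffered requests
    }
  }
}
```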
So there—at Brian’s request I wrote down some cautionary words about using Hadoop in school. Lest somebody read this as my declaration of allegiance to the “get off my lawn” side of the AntiNoSQLdisestablishmentarianism argument, let me clarify my view on a few key things:
- Google MapReduce and the Hadoop open-source implementation opened the floodgates for discussing big data and parallelism in the core of the computing curriculum. This has been a critical step in moving computing education into the era of abundant computing and data resources. Bully for that.
- I do understand that Google put barriers into MapReduce for fault tolerance purposes. I also get that fault-tolerance is important in scale-out of parallel dataflows (we did some work on this) and that Google uses a lot of machines at once. But as a practical matter, many Hadoop users run on a small handful of nodes and don’t need Google-scale fault tolerance features. More to the point, when we teach computing, we shouldn’t focus on only one design point; we should focus on the fundamentals. This is a case where the fundamentals get taught in a tangled way that only makes sense at extreme scale.
- For the record, as languages go I like MapReduce about as much as I like SQL. That is to say I would invite them both to my birthday party. But I wouldn’t invite either to the prom. My sweetheart here is Bloom.
If you’ve been following our Bloom work, you know where this discussion is coming from: the CALM Theorem says that the real reason to use coordination is to manage non-monotonic reasoning. Many (most?) Reduce functions in the wild are in fact monotonic w.r.t. subsets of their inputs, but the standard Reduce API defies monotonicity analysis. And introducing barriers into monotonic computation is a waste of “time”, in the temporal-logic sense of the word.
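A tiny illustration of the monotonicity point (mine, not from the CALM papers): a reduce that asks whether any value exceeds a threshold can emit its answer the moment one such value arrives, since later input can never retract it; a reduce that reports a final count cannot safely emit anything until it knows the input is complete, which is to say, until a barrier.

```java
import java.util.Iterator;

// Two "reduce" functions over a stream of values, one monotone and one not.
public class MonotoneVsNot {

  // Monotone: once some value exceeds the threshold, further input can never
  // change the answer from true back to false, so it is safe to emit early.
  static boolean anyOverThreshold(Iterator<Integer> values, int threshold) {
    while (values.hasNext()) {
      if (values.next() > threshold) {
        return true;        // answer is final the moment it is known
      }
    }
    return false;           // only safe once the input is known to be complete
  }

  // Non-monotone as a final answer: any count emitted early could be invalidated
  // by one more arriving value, so the result must wait for end-of-input.
  static int count(Iterator<Integer> values) {
    int n = 0;
    while (values.hasNext()) {
      values.next();
      n++;
    }
    return n;
  }
}
```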
10 Comments
Joe, I don’t understand why Hadoop/MR is the only representative considered from a whole class of useful constructs. There are other alternatives that have much more flexibility, e.g. Dryad, which can teach the benefits of massive parallelism without the rather artificial constraints of MR.
Ben, I didn’t mean to suggest it was the only representative to consider. I’d invite Dryad to my birthday too!
The fact is that MR is at the center of the education discussion for a bunch of reasons, including (a) it has orders of magnitude more users, and (b) it connects directly to many intro courses that use Scheme or related single-node functional languages. Also Google invested time and money some years ago into getting MR curriculum and Hadoop into schools.
I’m just saying that I have some second thoughts about that phenomenon. I didn’t mean to propose or exclude any specific alternatives.
Oh absolutely, I get that. I’m not favoring any of these systems, but just wondering why MR was a preferred tool for teaching. Not that I get the full depth of your post, but from what I did grok, I think many of your points focused on global barriers, and some of the other constraints MR placed on the model by virtue of its original design. But I don’t see those characteristics as fundamental (or even necessary, as Dryad shows). So hypothetically, if there was an open source implementation of Dryad, would that be a more attractive candidate? Just curious :-) I think teaching distributed computing tools early is a great idea, and I agree that it definitely matters which model young CS minds are shown as their introduction into this space…
Ben: I was one of the students that came up with the initial version of the MapReduce curriculum for the 61 series, so I might be able to provide some insight on why we went with MapReduce.
The reason we found MapReduce particularly attractive for 61A is that it was not a huge cognitive leap from Scheme to MapReduce. The students were already very familiar with map and accumulate operations over lists by that point, and really MapReduce is just those operations writ large. We also knew that we wanted the students to be able to play with data parallel computation from within the Scheme interpreter, and Hadoop provided a reasonably straightforward way to do that.
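For anyone who didn't take the course, the local version of that shape is just a map followed by a fold; here is a Java analogue (rather than the Scheme we used) purely for illustration. The cluster version partitions both the list and the fold across machines.

```java
import java.util.List;

// "MapReduce is map and accumulate writ large": the same map-then-fold shape
// students see over Scheme lists, here over a Java list.
public class LocalMapAccumulate {
  public static void main(String[] args) {
    List<String> words = List.of("map", "reduce", "scheme", "hadoop");
    int totalChars = words.stream()
        .map(String::length)         // map: word -> length
        .reduce(0, Integer::sum);    // accumulate: fold the lengths with +
    System.out.println(totalChars);  // prints 21
  }
}
```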
We tried to stay away from the MapReduce _implementation_ as much as possible, though, for the reasons already mentioned and several others. I think the main thing we wanted students to take away from the unit on MapReduce in 61A is that the functional concepts that they were becoming familiar with in the course were actually being deployed to do real work at large scales and weren’t just a pedagogical toy used to teach would-be computer scientists about recursion. I don’t know if Brian or Dan will agree with me on that though.
Completely agreed that additional programming models and fundamental program structures in this space need to be taught alongside MR. It is a little frightening what people will try to shoehorn into the MapReduce model just because Hadoop exists. For many, particularly in industry, Big Data == Hadoop because they don’t know about anything else, and that’s fundamentally limiting.
Joe: Bravo, by the way, for open-sourcing Bloom as quickly as you have. Hadoop’s open-source nature is the key to its success, and having alternative models and frameworks that can compete with it in that respect is really important.
Well, I guess it boils down to what you want to teach.
Overall, the fault-tolerance features are central to the map/reduce paradigm, and if they are not important to you then you should seriously consider other options for a class.
“But as a practical matter, many Hadoop users run on a small handful of nodes and don’t need Google-scale fault tolerance features. ”
I found this confusing. As a practical matter, if you’re running Hadoop on a small handful of nodes, aren’t you probably better off not running Hadoop? Certainly there is a break-even point with small clusters, where the overhead of the cluster actually makes things run slower than if you ran them on a single SMP machine.
Also remember that a lot of people are running map/reduce on cloud providers like EC2 and Rackspace. Those clouds have failure characteristics that are very different from physical hardware.
Maybe teach an alternate computing paradigm on the Hadoop stack in addition to map/reduce, like, say, Giraph? That would be an interesting compare-and-contrast, since Giraph uses a bulk-synchronous parallel model.
Yes! Your first sentence is right in line with some of my concerns. I don’t think the fault tolerance facet of the problem should drive the way we teach the fundamentals to first-year students.
Obviously FT is an important topic for an upper-level course (though the Ghemawat/Dean approach is overemphasized in a lot of today’s discussion). And the BSP model is another good point in the design space to look at with advanced students. But that’s not an intro-to-CS kind of discussion.
WRT small Hadoop clusters in practice — lots of people do it. They might be better off with something else, but most of the appropriate somethings either cost money or are not as mature.
Yes, lots of people do it. That doesn’t mean it makes sense. There are a lot of free, mature equivalents for low-volume data processing. They do it because Hadoop is new and shiny and they are looking for an excuse to use it. It’s a common antipattern.
I like many of the broader points and the way they’ve been made here.
But there is a rather easy way to specify that a reduce can be done recursively (that the computation is commutative and associative, for example), which in turn translates to barrier-free execution. I would venture to say that most real data-parallel engines do that already…
Barrier punching beyond that, as in “can work in a fully streaming fashion even though they require all-to-all communication”, is a somewhat more complex trade-off, in part because of the difficulty of recovering from partial failures and the cost of reserving resources well ahead of when they are actually needed.
You make a good point. My remaining concern is that asserting commutativity/associativity as you describe is dangerous: you’re promising the compiler something about your code that the compiler can’t check, and you’re asserting not only that you have successfully achieved that property, but also that it will continue to hold even as other developers evolve your code over time. That’s generally not good software engineering.
So I guess my point is that it’s preferable to teach a language (or API) that is checkable in this respect. MapReduce isn’t such a language.
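To be concrete about the kind of unchecked promise I mean: Hadoop's version of that hook is the combiner, and registering one is exactly such an assertion, invisible to any checker. A minimal sketch of the registration (not a complete job driver):

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;

public class WordCountDriver {
  public static void main(String[] args) throws IOException {
    Job job = Job.getInstance();
    // Registering a reducer class as a combiner is the programmer's unchecked
    // promise that the aggregate is commutative and associative, so partial
    // aggregation may be applied to subsets of the values before the shuffle.
    job.setReducerClass(SumReducer.class);   // e.g., the sum reducer sketched in the post
    job.setCombinerClass(SumReducer.class);  // same class reused as the combiner
  }
}
```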
There are in-memory implementations of the map/reduce idea for different platforms. I’ve seen one in Ruby and one for Qt (the cross-platform framework that runs on Windows, Unix, and Nokia phones). You can’t buy a single-processor general-purpose machine any more, not even a laptop. I’ve done it in maybe 10 simple Java classes (github, LanceNorskog, littlemr).
A good computer science education these days should include a few different parallel-programming techniques: map/reduce or another purely data-parallel style, an object-passing style, and a message-passing style like MPI. And the student should preferably implement the guts of such a thing.
College is not about learning a skill, it is about learning to think. Once you understand two different answers to a problem, it becomes easy to think of a third answer.
Thanks for your blog- I’m learning a lot from other posts.