I sat at Berkeley CS faculty lunch this past week with Brian Harvey and Dan Garcia, two guys who think hard about teaching computing to undergraduates. I was waxing philosophical about how we need to get data-centric thinking embedded deep into the initial CS courses—not just as an application of traditional programming, but as a key frame of reference for how to think about computing per se.
Dan pointed out that he and Brian and others took steps in this direction years ago at Berkeley, by introducing MapReduce and Hadoop in our initial 61A course. I have argued elsewhere that this is a Good Thing, because it gets people used to the kind of disorderly thinking needed for scaling distributed and data-centric systems.
But as a matter of both pedagogy and system design, I have begun to think that Google’s MapReduce model is not healthy for beginning students. The basic issue is that Google’s narrow MapReduce API conflates logical semantics (define a function over all items in a collection) with an expensive physical implementation (utilize a parallel barrier). As it happens, many common cluster-wide operations over a collection of items do not require a barrier even though they may require all-to-all communication. But there’s no way to tell the API whether a particular Reduce method has that property, so the runtime always does the most expensive thing imaginable in distributed coordination: global synchronization.
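To make the distinction concrete, here is a minimal single-process sketch of the same logical reduce (a per-key sum) written two ways. The function names and toy data are hypothetical illustrations, not part of Google's or Hadoop's API, and a real cluster would still need to move data between machines in both cases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Pair = Tuple[str, int]


def barrier_reduce(pairs: Iterable[Pair]) -> Dict[str, int]:
    """Group every value first, then reduce each group."""
    groups = defaultdict(list)
    for key, value in pairs:  # nothing is emitted during this loop
        groups[key].append(value)
    # Implicit barrier: reducing starts only after all input has been seen.
    return {key: sum(values) for key, values in groups.items()}


def pipelined_reduce(pairs: Iterable[Pair]) -> Iterator[Dict[str, int]]:
    """Fold each pair into a running total and yield a snapshot each time."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value  # sum is associative and commutative
        yield dict(totals)    # a partial answer is available immediately


pairs = [("a", 1), ("b", 2), ("a", 3)]
final_barrier = barrier_reduce(pairs)
snapshots = list(pipelined_reduce(pairs))
```

Both versions end with the same totals, but only the second can hand partial results downstream while input is still arriving. That works here because summation does not care about arrival order; a reduce that needs every value at once (a median, say) would not fold this way.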
From an architectural point of view, a good language for parallelism should expose pipelining, and MapReduce hides it. Brian suggested I expand on this point somewhere so people could talk about it. So here we go.
Read More »
We were happy to find out this week that our BOOM project and Bloom language have been selected by Technology Review magazine as one of the TR10, their “annual list of the emerging technologies that will have the biggest impact on our world.” This was news to us: we knew they were going to run an article, but weren’t aware of the TR10 distinction. Pretty neat.
I’ve been getting a lot of questions since the article launched about the project and language. So while folks are paying attention, here’s a quick FAQ to answer what the project is all about and its status.
Read More »
Hadoop MapReduce is a batch-processing system. Why? Because that’s the way Google described their MapReduce implementation.
But it doesn’t have to be that way. Introducing HOP: the Hadoop Online Prototype [updated link to final NSDI ’10 version]. With modest changes to the structure of Hadoop, we were able to convert it from a batch-processing system to an interactive, online system that can provide features like “early returns” from big jobs, and continuous data stream processing, while preserving the simple MapReduce programming and fault tolerance models popularized by Google and Hadoop. And by the way, it exposes pipeline parallelism that can even make batch jobs finish faster. This is a project led by Tyson Condie, in collaboration with folks at Berkeley and Yahoo! Research.
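As a rough illustration of what “early returns” could look like, the sketch below computes a running mean over chunks of input and reports an estimate after each chunk, along with the fraction of input consumed so far. This is not HOP's code or API; the function name, the chunking scheme, and the data are assumptions made up for the example.

```python
from typing import Iterable, Iterator, List, Tuple


def mean_with_early_returns(
    chunks: Iterable[List[float]], total_chunks: int
) -> Iterator[Tuple[float, float]]:
    """Yield (fraction_of_input_seen, running_mean) after each chunk."""
    count = 0
    total = 0.0
    for index, chunk in enumerate(chunks, start=1):
        for value in chunk:
            count += 1
            total += value
        if count:  # skip the estimate until at least one value has arrived
            yield index / total_chunks, total / count


chunks = [[1.0, 2.0], [3.0], [4.0, 6.0]]
results = list(mean_with_early_returns(chunks, total_chunks=len(chunks)))
```

Each yielded pair is an answer a user could look at before the job finishes; the last one is the exact result a batch run would have produced.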
Read More »
For the last year or so, my team at Berkeley — in collaboration with Yahoo Research — has been undertaking an aggressive experiment in programming. The challenge is to design a radically easier programming model for infrastructure and applications in the next computing platform: The Cloud. We call this the Berkeley Orders Of Magnitude (BOOM) project: enabling programmers to develop OOM bigger systems in OOM less code.
To kick this off we built something we call BOOM Analytics [link updated to Eurosys10 final version]: a clone of Hadoop and HDFS built largely in Overlog, a declarative language we developed some years back for network protocols. BOOM Analytics is just as fast and scalable as Hadoop, but radically simpler in its structure. As a result we were able — with amazingly little effort — to turbocharge our incarnation of the elephant with features that would be enormous upgrades to Hadoop’s Java codebase. Two of the fanciest are: Read More »
I just heard through the Berkeley grapevine about the BashReduce effort at Last.fm: MapReduce in 126 lines of bash script! Awesome. I’m sure it doesn’t do X, Y and Z. So ask yourself: do you need X? Y? Z? Maybe instead you want V and W. Maybe you should roll your own tool.
Makes you think.
Was intrigued last week by the confluence of two posts:
- Owen O’Malley and Arun Murthy of Yahoo’s Hadoop team posted about sorting a petabyte using Hadoop on 3,800 nodes.
- Curt Monash posted that eBay hosts a 6.5 petabyte Greenplum database on 96 nodes.
Both impressive. But wildly different hardware deployments. Why?? It’s well known that Hadoop is tuned for availability not efficiency. But does it really need 40x the number of machines as eBay’s Greenplum cluster? How did smart folks end up with such wildly divergent numbers?
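For reference, the back-of-the-envelope arithmetic behind that 40x figure, using only the numbers from the two posts. The decimal petabyte-to-terabyte conversion is an assumption, and the per-node figures compare a sort benchmark with a hosted database, so treat them as a rough framing of the question rather than a like-for-like measurement.

```python
TB_PER_PB = 1000  # assumption: decimal units

hadoop_nodes = 3800
hadoop_petabytes = 1.0       # data sorted in the Hadoop benchmark
greenplum_nodes = 96
greenplum_petabytes = 6.5    # data hosted in eBay's Greenplum database

# Ratio of machine counts: a little under 40x.
node_ratio = hadoop_nodes / greenplum_nodes

# Terabytes handled per node in each deployment.
hadoop_tb_per_node = hadoop_petabytes * TB_PER_PB / hadoop_nodes
greenplum_tb_per_node = greenplum_petabytes * TB_PER_PB / greenplum_nodes

print(f"node ratio: {node_ratio:.1f}x")
print(f"Hadoop: {hadoop_tb_per_node:.2f} TB/node")
print(f"Greenplum: {greenplum_tb_per_node:.1f} TB/node")
```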
Read More »
Update: VLDB slides posted [pptx] [pdf]
It’s been a busy month pushing out papers. I’ll cover some of them here over the next few days.
The first one I’ll mention is MAD Skills: New Analysis Practices for Big Data (link updated to VLDB version). The paper does a few controversial things (if you’re the kind of person who finds data management a source of controversy):
- It takes on “data warehousing” and “business intelligence” as outmoded, low-tech approaches to getting value out of Big Data. Instead, it advocates a “Magnetic, Agile, Deep” (MAD) approach to data, that shifts the locus of power from what Brian Dolan calls the “DBA priesthood” to the statisticians and analysts who actually like to crunch the numbers. This is a good thing, on many fronts.
- It describes a state-of-the-art parallel data warehouse that sits on 800TB of disk, using 40 dual-processor dual-core Sun Thumper boxes.
- It presents a set of general-purpose, hardcore, massively parallel statistical methods for big data. They’re expressed in SQL (OMG!) but could be easily translated to MapReduce if that’s your bag.
- It argues for a catholic (small-c) approach to programming Big Data, including SQL & MapReduce, Java & R, Python & Perl, etc. If you already have a parallel database, it just shouldn’t be that hard to support all those things in a single engine.
- It advocates a similarly catholic approach to storage. Use your parallel filesystem, or your traditional database tables, or your compressed columnstore formats, or what have you. These should not be standalone “technologies”, they are great features that should — no, will — get added to existing parallel data systems. (C’mon, you know it’s true… )
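To illustrate the point in the list above about statistical methods moving between SQL and MapReduce, here is a small sketch of simple linear regression computed from per-partition sums. It is not taken from the paper; the method choice, function names, and data are assumptions for illustration. The same sums could be written as SUM aggregates in SQL.

```python
from functools import reduce
from typing import List, Tuple

Row = Tuple[float, float]
# Sufficient statistics: n, sum(x), sum(y), sum(x*x), sum(x*y)
Stats = Tuple[float, float, float, float, float]


def map_partition(rows: List[Row]) -> Stats:
    """Map step: summarize one partition with five sums."""
    return (
        float(len(rows)),
        sum(x for x, _ in rows),
        sum(y for _, y in rows),
        sum(x * x for x, _ in rows),
        sum(x * y for x, y in rows),
    )


def merge(a: Stats, b: Stats) -> Stats:
    """Reduce step: add the sums element-wise (order does not matter)."""
    return tuple(p + q for p, q in zip(a, b))  # type: ignore[return-value]


def fit(stats: Stats) -> Tuple[float, float]:
    """Solve the least-squares line y = slope * x + intercept."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Two partitions of points lying on y = 2x + 1.
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
slope, intercept = fit(reduce(merge, map(map_partition, partitions)))
```

Each partition only ever ships five numbers, which is why this style of computation parallelizes well in either a parallel database or a MapReduce engine.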
Read More »
At HPTS 2001 I gave a quick seat-of-the-pants talk called We Lose, which argued that database software and research wasn’t targeting the hacker community, and therefore was dooming itself to irrelevance. This thing — which I cooked up in about 10 minutes — still gets me a bunch of feedback. (The talk included a pitch for an easy-to-use dataflow framework that could harness textual data from files, as part of our original Telegraph work. MapReduce anyone?)
This issue is decidedly back on the table as different approaches are being explored for Cloud development platforms. So I gave a similar pitch at CIDR this year, to try and get the data-centric experts to work on the most important piece of the Cloud: the programming model. I’m hoping this time some folks other than us will bite.