Category Archives: trends
It’s been a busy month pushing out papers. I’ll cover some of them here over the next few days.
The first one I’ll mention is MAD Skills: New Analysis Practices for Big Data (link updated to VLDB version). The paper does a few controversial things (if you’re the kind of person who finds data management a source of controversy):
- It takes on “data warehousing” and “business intelligence” as outmoded, low-tech approaches to getting value out of Big Data. Instead, it advocates a “Magnetic, Agile, Deep” (MAD) approach to data that shifts the locus of power from what Brian Dolan calls the “DBA priesthood” to the statisticians and analysts who actually like to crunch the numbers. This is a good thing, on many fronts.
- It describes a state-of-the-art parallel data warehouse that sits on 800TB of disk, using 40 dual-processor dual-core Sun Thumper boxes.
- It presents a set of general-purpose, hardcore, massively parallel statistical methods for big data. They’re expressed in SQL (OMG!) but could be easily translated to MapReduce if that’s your bag.
- It argues for a catholic (small-c) approach to programming Big Data, including SQL & MapReduce, Java & R, Python & Perl, etc. If you already have a parallel database, it just shouldn’t be that hard to support all those things in a single engine.
- It advocates a similarly catholic approach to storage. Use your parallel filesystem, or your traditional database tables, or your compressed columnstore formats, or what have you. These should not be standalone “technologies”, they are great features that should — no, will — get added to existing parallel data systems. (C’mon, you know it’s true… )
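To make the SQL-to-MapReduce point a bit more concrete, here is a minimal sketch of a data-parallel statistic. It is not code from the paper: the function names and the toy partitions are made up for illustration, and plain Python lists stand in for the segments of a parallel database. Each partition is summarized independently (the map step), and the small summaries are then merged (the reduce step), which is roughly what a parallel engine does for `SELECT count(*), sum(x), sum(x*x)`.

```python
from functools import reduce


def summarize_partition(values):
    """Map step: reduce one partition to (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def merge_summaries(a, b):
    """Reduce step: merging two summaries is just element-wise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(partitions):
    """Compute the mean and population variance across all partitions."""
    n, total, total_sq = reduce(
        merge_summaries, map(summarize_partition, partitions), (0, 0.0, 0.0)
    )
    if n == 0:
        raise ValueError("no data in any partition")
    mean = total / n
    # The sum-of-squares formula is simple but can lose precision on
    # large or badly scaled data; it is used here only to keep the sketch short.
    variance = total_sq / n - mean * mean
    return mean, variance


# Hypothetical data, split across three "segments".
partitions = [[1.0, 2.0], [3.0, 4.0], [5.0]]
mean, variance = mean_and_variance(partitions)
print(mean, variance)
```

Because the merge step is associative and commutative, the partitions can be summarized in any order and on any number of machines, which is why the same computation fits either a SQL aggregate or a MapReduce job.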
I attended the Greenplum customer advisory board meeting this week, including a public briefing in San Francisco for analysts and potential customers. The Greenplum folks asked me to speak at the briefing about parallelism and analytics in the large, outside the scope of Greenplum per se. I cooked up a little slide deck for the occasion on why and whither parallelism and analytics. A familiar story about how the future is parallel, and the practical future is dataflow parallelism. (Familiar yes, but with some nice Flickr clip-art and approachable analogies to explain it.)
The big aha moment occurred for me during our panel discussion, which included Luke Lonergan from Greenplum, Roger Magoulas from O’Reilly, and Brian Dolan from Fox Interactive Media (which runs MySpace among other web properties).
At HPTS 2001 I gave a quick seat-of-the-pants talk called We Lose, which argued that database software and research weren’t targeting the hacker community, and were therefore dooming themselves to irrelevance. This thing, which I cooked up in about 10 minutes, still gets me a bunch of feedback. (The talk included a pitch for an easy-to-use dataflow framework that could harness textual data from files, as part of our original Telegraph work. MapReduce, anyone?)
This issue is decidedly back on the table as different approaches are being explored for Cloud development platforms. So I gave a similar pitch at CIDR this year, to try and get the data-centric experts to work on the most important piece of the Cloud: the programming model. I’m hoping this time some folks other than us will bite.
On Wednesday morning there was a very interesting discussion of power management in datacenters and data management software.
James Hamilton — formerly of Microsoft and now at Amazon — gave a killer keynote talk on power in datacenters. He traces the power from the provider through all the steps involved in powering machines and cooling. He compares all the power-related costs to the costs of the machines, networking and other costs of running a datacenter, and points out the places where efficiency can be found. A few things amazed me:
If you don’t know about CIDR, it’s a conference founded by David DeWitt, Jim Gray and Mike Stonebraker 8 years back, for data systems work and other data-centric topics that would have a tough time getting accepted to the more hidebound conferences like SIGMOD and VLDB. It’s usually a good conference — everybody stays in one room and pays attention, and it tends to attract a smart mix of old hands and young upstarts who really dig in and engage. I ran the program this year, and was really happy with the papers and speakers we got. Only have time to comment on a few things here.
One more post on MapReduce and parallel SQL, this time for the folks at O’Reilly Radar.
Just for the record, I think MapReduce is fine, but not especially interesting technology. The thing is, the “teachable moment” it presents is really great stuff, because it is bringing people toward data-centric parallel programming. So it’s good for the data-centric research business in general, and especially for data-centric approaches to parallelism.
I.e. chum in the water for our research on Lincoln…
The first of two invited posts at GigaOm is up. These are not researchy; they’re intended to be informative to a broad audience. They describe the state of affairs in data parallelism, and some of the reasons why this is an increasingly hot topic.
This started out as an exercise for Greenplum, a company I advise that sells a massively parallel DBMS based on PostgreSQL. I’ve been helping them with their recent launch of a MapReduce interface to their system. That’s been an interesting project. I’ll write about it more soon.
Along the way, they asked if I’d write a blog post for them about parallelism, SQL and MapReduce to put things into perspective. I sat down to write a few paragraphs on the subject and ended up with a seven-page essay. That was too long for a blog post, so I just turned it into a Tech Report (a.k.a. a white paper in industrial terms). We excerpted it for GigaOm to run in a couple of posts. The original is more nuanced and playful, but hey, blogging isn’t about 7-page essays. I’ll try to control myself here too, and stick with a few paragraphs per post. And if that causes me to write more tech reports, so be it: I’ll link them in.
- invitations to guest blog at CCCBlog and GigaOM
- enthusiasm to write about our new Lincoln and BOOM research projects
- voting with my feet: Open Notebook Science over Double Blind Reviewing
In defense of my Internet cred, I’m not totally late to the game. Back in undergrad days, I spent a lot of time writing Usenet articles: in fact I founded (i.e. collected votes for) the jazz/blues newsgroup rec.music.bluenote, which was the first Internet discussion group on the topic. It had a pretty great run in its day. Blogging isn’t much different.
But I’ll stick to technical computer science topics here — a bit of trendwatching, along with updates and thoughts on the research we’re doing in my group at Berkeley, and what we’re seeing from colleagues and friends in research and in the field. I’ll kick that off soon.
Undoubtedly lots of this will be beta quality. That’s the point, right?