It’s been a busy month pushing out papers. I’ll cover some of them here over the next few days.
The first one I’ll mention is MAD Skills: New Analysis Practices for Big Data (link updated to VLDB version). The paper does a few controversial things (if you’re the kind of person who finds data management a source of controversy):
- It takes on “data warehousing” and “business intelligence” as outmoded, low-tech approaches to getting value out of Big Data. Instead, it advocates a “Magnetic, Agile, Deep” (MAD) approach to data that shifts the locus of power from what Brian Dolan calls the “DBA priesthood” to the statisticians and analysts who actually like to crunch the numbers. This is a good thing, on many fronts.
- It describes a state-of-the-art parallel data warehouse that sits on 800TB of disk, using 40 dual-processor dual-core Sun Thumper boxes.
- It presents a set of general-purpose, hardcore, massively parallel statistical methods for big data. They’re expressed in SQL (OMG!) but could be easily translated to MapReduce if that’s your bag.
- It argues for a catholic (small-c) approach to programming Big Data, including SQL & MapReduce, Java & R, Python & Perl, etc. If you already have a parallel database, it just shouldn’t be that hard to support all those things in a single engine.
- It advocates a similarly catholic approach to storage. Use your parallel filesystem, or your traditional database tables, or your compressed columnstore formats, or what have you. These should not be standalone “technologies”, they are great features that should — no, will — get added to existing parallel data systems. (C’mon, you know it’s true… )
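To make the SQL-to-MapReduce point a bit more concrete, here is a minimal sketch (not taken from the paper) of one simple statistical method, a least-squares line fit, written as a per-partition "map" followed by an associative "reduce". The function names and the toy data are hypothetical, and plain Python lists stand in for the partitions of a parallel table. In SQL the same shape would be a handful of `SUM()` aggregates in one query.

```python
from functools import reduce

def partition_stats(rows):
    """Map step: summarize one partition of (x, y) pairs as running sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def merge_stats(a, b):
    """Reduce step: sums from two partitions combine by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(partitions):
    """Fit y = slope * x + intercept from the merged sums."""
    zero = (0.0, 0.0, 0.0, 0.0, 0.0)
    n, sx, sy, sxx, sxy = reduce(merge_stats, map(partition_stats, partitions), zero)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical toy data lying on y = 2x + 1, split across three "segments".
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
slope, intercept = fit_line(partitions)
print(slope, intercept)
```

Because the merge is just addition, it does not matter how the rows are split or in what order the partial results arrive, which is why this kind of computation moves so easily between a parallel SQL aggregate and a MapReduce job.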
I started to write the paper because it was just too cool what Brian Dolan was doing with Greenplum at Fox Interactive Media (parent company of MySpace.com), e.g., writing Support Vector Machines in SQL and running them over dozens of TB of data. Brian was a great sport about taking his real-world experience and good ideas and putting them down on paper for others to read. Along the way I learned a lot about the data architecture ideas he’s been cooking with Mark Dunlap, which are a real thumb in the eye of the warehouse orthodoxy, and make eminent good sense in today’s world. Finally, it was nice to get to write about the good things that Jeff Cohen and Caleb Welton have been doing at Greenplum to cut through the hype and shrink the distance between SQL and MapReduce. I’m hoping those guys will have time to sit down one of these days and patiently write up how they’ve done it … it’s really very elegant.
And it still warms my heart that it’s Postgres code underneath all that. Time to resurrect the xfunc code!