It’s been a busy month pushing out papers. I’ll cover some of them here over the next few days.
The first one I’ll mention is MAD Skills: New Analysis Practices for Big Data (link updated to VLDB version). The paper does a few controversial things (if you’re the kind of person who finds data management a source of controversy):
- It takes on “data warehousing” and “business intelligence” as outmoded, low-tech approaches to getting value out of Big Data. Instead, it advocates a “Magnetic, Agile, Deep” (MAD) approach to data that shifts the locus of power from what Brian Dolan calls the “DBA priesthood” to the statisticians and analysts who actually like to crunch the numbers. This is a good thing, on many fronts.
- It describes a state-of-the-art parallel data warehouse that sits on 800TB of disk, using 40 dual-processor dual-core Sun Thumper boxes.
- It presents a set of general-purpose, hardcore, massively parallel statistical methods for big data. They’re expressed in SQL (OMG!) but could be easily translated to MapReduce if that’s your bag.
- It argues for a catholic (small-c) approach to programming Big Data, including SQL & MapReduce, Java & R, Python & Perl, etc. If you already have a parallel database, it just shouldn’t be that hard to support all those things in a single engine.
- It advocates a similarly catholic approach to storage. Use your parallel filesystem, or your traditional database tables, or your compressed columnstore formats, or what have you. These should not be standalone “technologies”; they are great features that should (no, will) get added to existing parallel data systems. (C’mon, you know it’s true…)
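
To make the SQL-versus-MapReduce point a bit more concrete, here is a minimal sketch in Python of how a data-parallel statistical method can reduce to per-partition sums that are then added together. It is not code from the paper: the example (a one-variable least-squares line fit), the function names `map_partition`, `combine`, and `fit_line`, and the toy data are all assumptions chosen for illustration. In SQL the same computation would be a handful of `SUM` aggregates over a table; here each list stands in for the rows held by one node.

```python
from functools import reduce

def map_partition(rows):
    """Map step: collapse one partition of (x, y) rows into sufficient statistics."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Reduce step: the statistics are plain sums, so partitions merge by addition."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(partitions):
    """Fit y = slope * x + intercept from the merged sums."""
    n, sx, sy, sxx, sxy = reduce(combine, map(map_partition, partitions))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical toy data: two "nodes", each holding part of the table.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
slope, intercept = fit_line(partitions)
print(slope, intercept)
```

Because the merge step is just addition, it does not matter how the rows are split across nodes or in what order the partial results arrive, which is why the same method fits either a parallel SQL aggregate or a MapReduce job.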