Update: VLDB slides posted [pptx] [pdf]

It’s been a busy month pushing out papers. I’ll cover some of them here over the next few days.

The first one I’ll mention is MAD Skills: New Analysis Practices for Big Data (link updated to VLDB version).  The paper does a few controversial things (if you’re the kind of person who finds data management a source of controversy):

  • It takes on “data warehousing” and “business intelligence” as outmoded, low-tech approaches to getting value out of Big Data. Instead, it advocates a “Magnetic, Agile, Deep” (MAD) approach to data that shifts the locus of power from what Brian Dolan calls the “DBA priesthood” to the statisticians and analysts who actually like to crunch the numbers.  This is a good thing on many fronts.
  • It describes a state-of-the-art parallel data warehouse that sits on 800TB of disk, using 40 dual-processor dual-core Sun Thumper boxes.
  • It presents a set of general-purpose, hardcore, massively parallel statistical methods for big data.  They’re expressed in SQL (OMG!) but could be easily translated to MapReduce if that’s your bag.  (A toy sketch of the flavor appears just after this list.)
  • It argues for a catholic (small-c) approach to programming Big Data, including SQL & MapReduce, Java & R, Python & Perl, etc.  If you already have a parallel database, it just shouldn’t be that hard to support all those things in a single engine.
  • It advocates a similarly catholic approach to storage.  Use your parallel filesystem, or your traditional database tables, or your compressed columnstore formats, or what have you.  These should not be standalone “technologies”; they are great features that should — no, will — get added to existing parallel data systems.  (C’mon, you know it’s true… )
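
To give a flavor of that third bullet, here’s a toy sketch (mine, not the paper’s) that fits ordinary least squares for y = a + b*x using nothing but plain SQL aggregates over a hypothetical points(x, y) table. Every term is a simple aggregate, so a parallel engine computes the whole thing in one data-parallel pass:

    -- Toy sketch, not the paper's code: OLS over a hypothetical
    -- table points(x float8, y float8).  Standard closed forms:
    --   b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    --   a = (Sy*Sxx - Sx*Sxy) / (n*Sxx - Sx^2)
    SELECT (count(*) * sum(x*y) - sum(x) * sum(y))
           / (count(*) * sum(x*x) - sum(x) * sum(x)) AS slope,
           (sum(y) * sum(x*x) - sum(x) * sum(x*y))
           / (count(*) * sum(x*x) - sum(x) * sum(x)) AS intercept
    FROM   points;

The fancier methods in the paper follow the same pattern: reduce the math to data-parallel pieces, and let the engine sweat the parallelism.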

I started to write the paper because it was just too cool what Brian Dolan was doing with Greenplum at Fox Interactive Media (parent company of MySpace.com) — e.g., writing Support Vector Machines in SQL and running them over dozens of TB of data.  Brian was a great sport about taking his real-world experience and good ideas and putting them down on paper for others to read.  Along the way I learned a lot about the data architecture ideas he’s been cooking with Mark Dunlap, which are a real thumb in the eye of the warehouse orthodoxy, and make eminently good sense in today’s world.  Finally, it was nice to get to write about the good things that Jeff Cohen and Caleb Welton have been doing at Greenplum to cut through the hype and shrink the distance between SQL and MapReduce.  I’m hoping those guys will have time to sit down one of these days and patiently write up how they’ve done it … it’s really very elegant.

And it still warms my heart that it’s Postgres code underneath all that.  Time to resurrect the xfunc code!

5 Comments

  1. Thanks for posting the paper draft up. I’m trying to wrap my head around this new way of thinking about using DBs.

    But I have a question on the paper. Why, in the bootstrap example, do you exclude repeated keys with the DISTINCT keyword? I understood from everything I’ve read about the bootstrap that the point is to do random draws with replacement, which means you *don’t* want that distinct operator in there.

    (Wish we had the budget for a Greenplum solution here at UCI, but we’ll probably just have to roll our own Hadoop/Couchdb/PostgreSQL Frankenstein monster.)

    James

  2. Hi James. You’re right that bootstrapping works with replacement. The DISTINCT comes in to help us simulate a single instance of what you call a “random draw” and the paper calls a “subsample”. Think of a random draw as sticking your hand in a jar and picking out a fistful of jellybeans. In bootstrapping we look at the fistful, compute a summary statistic, then put the jellybeans back and repeat, smoothing out all those summary statistics. So we sample WITH replacement across draws, as you say.

    The DISTINCT in the query ensures that we don’t simulate a single fistful that magically has the same jellybean twice. We need to worry about this because we’re simulating an entire draw by iteratively sampling jellybeans one at a time. So we do that WITHOUT replacement.

    The way the query is written, we get what we want: random well-formed draws with replacement across draws. We ensure that the pair of (sample, subsample) is unique, which amounts to saying that a given jellybean can appear in a given draw at most once.
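
    To make that concrete, here’s a minimal sketch of the shape of such a query (table and column names are invented for illustration, not lifted from the paper): propose random (draw, row) pairs, dedupe within a draw via DISTINCT, then compute one statistic per draw and smooth across draws.

        -- Sketch only: T(row_id, x) has 1,000,000 rows; simulate
        -- 1,000 draws ("fistfuls") of up to 100 rows each.
        WITH design AS (
          -- propose random (draw_id, row_id) pairs
          SELECT draw_id,
                 (floor(random() * 1000000) + 1)::int AS row_id
          FROM   generate_series(1, 1000) AS draw_id,
                 generate_series(1, 100)  AS trial
        ),
        draws AS (
          -- DISTINCT: no jellybean appears twice within one fistful;
          -- the same row may still appear in different draws
          SELECT DISTINCT draw_id, row_id FROM design
        ),
        per_draw AS (
          SELECT draw_id, avg(x) AS stat   -- one statistic per fistful
          FROM   draws JOIN T USING (row_id)
          GROUP  BY draw_id
        )
        SELECT avg(stat) AS bootstrap_estimate  -- smooth across draws
        FROM   per_draw;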

    Joe

  3. PS: Greenplum (like a lot of DB vendors) has a free download for evaluation use.

  4. Hmm, respectfully I disagree with you. According to Chernick’s book “Bootstrap Methods”, p. 9 (all the other refs on bootstrap methods were checked out of the library!):

    1. Generate a sample with replacement from the empirical distribution

    2. Compute θ*, the value of theta-hat obtained by using the bootstrap sample

    3. Repeat steps 1 and 2 k times.

    So according to this, the *subsample* is what should be drawn with replacement. Of course the replacement also happens between draws (these are independent trials), but each variable in each subsample should have a uniform probability of 1/n.

    Similarly in Venables and Ripley’s MASS, 3rd ed., p. 142, they reiterate that “the new samples consist of an integer number of copies of each of the original data points, and so will normally have ties.” If you prevent the ties in each subsample, then you aren’t correctly sampling with replacement. Most of the examples I’ve read are on smaller data sets, where each subsample is of size n, so the meaning of “replacement” is pretty clear: if you don’t do the replacement in the sample, you just get a random permutation of the n original observations. Admittedly, with a subsample size of 3, and if n is on the order of millions or billions, you’re not going to get very many repeats!
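
    In query terms (reusing the invented names from the sketch above), I’d expect the ties to be kept rather than DISTINCTed away, e.g. by carrying a multiplicity count and weighting the per-draw statistic with it:

        -- Sketch of the with-replacement variant: preserve ties as a
        -- count of copies and weight the per-draw mean accordingly.
        WITH design AS (
          SELECT draw_id,
                 (floor(random() * 1000000) + 1)::int AS row_id
          FROM   generate_series(1, 1000) AS draw_id,
                 generate_series(1, 100)  AS trial
        ),
        draws AS (
          SELECT draw_id, row_id, count(*) AS copies   -- ties kept
          FROM   design
          GROUP  BY draw_id, row_id
        ),
        per_draw AS (
          SELECT draw_id,
                 sum(x * copies) / sum(copies) AS stat  -- weighted mean
          FROM   draws JOIN T USING (row_id)
          GROUP  BY draw_id
        )
        SELECT avg(stat) AS bootstrap_estimate
        FROM   per_draw;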

    (But I’m not reviewing the paper, so of course you can ignore me).

    And I missed the free evaluation download. Thanks for the tip.

    James

  5. Good! I respectfully accept your disagreement for the schooling that it is :-) We’ll be sure to fix and ack you in a future version of the paper.


5 Trackbacks/Pingbacks

  1. By Bootstrap in a view « Contour Line on 03 Apr 2009 at 11:35 am

    […] Inspired by this post, I am playing around with implementing bootstrapping various statistics as a view in couchdb.  I […]

  2. […] Greenplum adviser Joe Hellerstein’s pitch for agile data warehousing […]

  3. By EMC += Greenplum « Data Beta on 06 Jul 2010 at 9:52 pm

    […] recent months, following their involvement in the MAD Skills work and their Chorus collaboration platform, Greenplum began discussions with academics and […]

  4. By The End of the Data Warehouse « BI Monitor on 31 Oct 2012 at 2:33 pm

    […] When I ran product at Greenplum, we understood this reality. Working with brilliant folks like Joe Hellerstein (UC Berkeley) and Brian Dolan (then at Fox Interactive), the team developed practices to navigate around the outmoded approaches of the past. Joe coined the name ‘MAD Skills’ (Magnetic, Agile and Deep). […]
