MADlib is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha to its first beta release, version 0.20beta. Hats off to the MADlib team!
Forget your previous associations with low-tech SQL analytics, including so-called “business intelligence”, “OLAP”, “data cubes” and the like. This is the real deal: statistical and machine learning methods running at scale within the database, massively parallel, close to the data. Much of the code is written in SQL (a language that doesn’t get enough credit as a basis for parallel statistics), with key extensions in C/C++ for performance, and the occasional bit of Python glue code. The suite of methods in the beta includes:
- standard statistical methods like multivariate linear and logistic regression
- supervised learning methods including support-vector machines, naive Bayes, and decision trees
- unsupervised methods including k-means clustering, association rules and Latent Dirichlet Allocation
- descriptive statistics and data profiling, including one-pass Flajolet-Martin and CountMin sketch methods (my personal contributions to the library) to compute distinct counts, range-counts, quantiles, various types of histograms, and frequent-value identification
- statistical support routines, including an efficient sparse vector library, array operations, and conjugate gradient optimization.
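To give a flavor of the sketch methods mentioned above: a Flajolet-Martin sketch estimates distinct counts in one pass over the data. Below is a toy single-machine sketch in Python, purely my own illustration (not MADlib’s implementation); the function names and the choice of MD5 as a salted hash are assumptions for the example.

```python
import hashlib

def fm_estimate(items, num_hashes=64):
    """Toy one-pass Flajolet-Martin distinct-count estimate.

    For each of num_hashes salted hash functions, track the maximum
    number of trailing zero bits seen over all items; averaging the
    per-hash maxima tames the variance of a single hash function.
    """
    max_tz = [0] * num_hashes
    for item in items:
        for i in range(num_hashes):
            h = int(hashlib.md5(f"{i}:{item}".encode()).hexdigest(), 16)
            # count of trailing zero bits in h (h is never 0 in practice)
            tz = (h & -h).bit_length() - 1
            max_tz[i] = max(max_tz[i], tz)
    mean_r = sum(max_tz) / num_hashes
    return 2 ** mean_r / 0.77351  # 0.77351 is the standard FM correction factor
```

Note why this parallelizes so naturally: each node can scan its own partition and keep its own `max_tz` array, and two sketches combine by elementwise max; only the tiny arrays need to move between nodes, never the data.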
More methods are planned for future releases. Myself, I’m working with Daisy Wang on merging her SQL-based Conditional Random Fields and Bayesian inference implementations into the library for an upcoming release, to support sophisticated text processing.
I don’t usually post about business deals on my blog. But today’s acquisition of Greenplum by EMC is too close to home not to comment. I’ve been involved as a technical advisor at Greenplum for almost three years, and joined the EMC technical advisory board this spring, so I have some interest in the deal.
Below is my take on things from the technical side. Note that I’m not privy to any private information about the deal, and I’m generally more interested in the tech than the finance. No need to try and read financial tea leaves here — there aren’t any. This is a computer scientist’s view of the technology implications. Here goes:
I was intrigued last week by the confluence of two posts:
- Owen O’Malley and Arun Murthy of Yahoo’s Hadoop team posted about sorting a petabyte using Hadoop on 3,800 nodes.
- Curt Monash posted that eBay hosts a 6.5 petabyte Greenplum database on 96 nodes.
Both impressive. But wildly different hardware deployments. Why? It’s well known that Hadoop is tuned for availability, not efficiency. But does it really need 40x as many machines as eBay’s Greenplum cluster? How did smart folks end up with such wildly divergent numbers?
Update: VLDB slides posted [pptx] [pdf]
It’s been a busy month pushing out papers. I’ll cover some of them here over the next days.
The first one I’ll mention is MAD Skills: New Analysis Practices for Big Data (link updated to VLDB version). The paper does a few controversial things (if you’re the kind of person who finds data management a source of controversy):
- It takes on “data warehousing” and “business intelligence” as outmoded, low-tech approaches to getting value out of Big Data. Instead, it advocates a “Magnetic, Agile, Deep” (MAD) approach to data that shifts the locus of power from what Brian Dolan calls the “DBA priesthood” to the statisticians and analysts who actually like to crunch the numbers. This is a good thing, on many fronts.
- It describes a state-of-the-art parallel data warehouse that sits on 800TB of disk, using 40 dual-processor dual-core Sun Thumper boxes.
- It presents a set of general-purpose, hardcore, massively parallel statistical methods for big data. They’re expressed in SQL (OMG!) but could be easily translated to MapReduce if that’s your bag.
- It argues for a catholic (small-c) approach to programming Big Data, including SQL & MapReduce, Java & R, Python & Perl, etc. If you already have a parallel database, it just shouldn’t be that hard to support all those things in a single engine.
- It advocates a similarly catholic approach to storage. Use your parallel filesystem, or your traditional database tables, or your compressed columnstore formats, or what have you. These should not be standalone “technologies”; they are great features that should, and will, get added to existing parallel data systems. (C’mon, you know it’s true… )
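For a taste of what “massively parallel statistical methods expressed in SQL” means mechanically, here is a hypothetical Python sketch of my own (not code from the paper): simple linear regression recast as per-partition sufficient statistics that combine like a parallel SQL aggregate. All names here are illustrative assumptions.

```python
from functools import reduce

def partial_stats(xs, ys):
    """Per-partition sufficient statistics for simple linear regression:
    (count, sum x, sum y, sum x^2, sum x*y)."""
    return (len(xs), sum(xs), sum(ys),
            sum(x * x for x in xs),
            sum(x * y for x, y in zip(xs, ys)))

def merge_stats(a, b):
    """Combine two partitions' statistics. Commutative and associative,
    so partitions can be reduced in any order, like a SQL aggregate."""
    return tuple(u + v for u, v in zip(a, b))

def solve(stats):
    """Finalize: closed-form least-squares slope and intercept."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n
```

The same three-part shape (per-partition transition, merge, finalize) is exactly how a parallel database runs a user-defined aggregate, which is why these methods translate so readily between SQL and MapReduce.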
I’m increasingly believing my own story that data-centric programming is the future of parallel computing at the high end. I’m starting to hear it echoed back at me from real people.
I attended the Greenplum customer advisory board meeting this week, including a public briefing in San Francisco for analysts and potential customers. The Greenplum folks asked me to speak at the briefing about parallelism and analytics in the large, outside the scope of Greenplum per se. I cooked up a little slide deck for the occasion on why and whither parallelism and analytics. A familiar story about how the future is parallel, and the practical future is dataflow parallelism. (Familiar, yes, but with some nice Flickr clip-art and approachable analogies to explain it.)
The big aha moment occurred for me during our panel discussion, which included Luke Lonergan from Greenplum, Roger Magoulas from O’Reilly, and Brian Dolan from Fox Interactive Media (which runs MySpace among other web properties).