MADlib is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha to its first beta release, version 0.20beta. Hats off to the MADlib team!
Forget your previous associations with low-tech SQL analytics, including so-called “business intelligence”, “OLAP”, “data cubes” and the like. This is the real deal: statistical and machine learning methods running at scale within the database, massively parallel, close to the data. Much of the code is written in SQL (a language that doesn’t get enough credit as a basis for parallel statistics), with key extensions in C/C++ for performance, and the occasional bit of Python glue code. The suite of methods in the beta includes:
- standard statistical methods like multivariate linear and logistic regression
- supervised learning methods including support-vector machines, naive Bayes, and decision trees
- unsupervised methods including k-means clustering, association rules and Latent Dirichlet Allocation
- descriptive statistics and data profiling, including one-pass Flajolet-Martin and CountMin sketch methods (my personal contributions to the library) to compute distinct counts, range-counts, quantiles, various types of histograms, and frequent-value identification
- statistical support routines including an efficient sparse vector library, array operations, and conjugate gradient optimization.
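To give a flavor of the sketch methods in that list, here is a minimal Count-Min sketch in Python. It is an illustration of the general technique only, not MADlib's code or API: the class name, the `width` and `depth` defaults, and the choice of a salted BLAKE2 hash are all assumptions made for this example. The point to notice is that two sketches built on different partitions of the data can be merged by simple addition, which is what makes the structure a good fit for one-pass, parallel computation.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch (hypothetical names, not MADlib's API)."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Derive a different hash per row by salting with the row number.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(r, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is an upper bound on the true count that never undercounts.
        return min(self.rows[r][self._index(r, item)] for r in range(self.depth))

    def merge(self, other):
        # Sketches over disjoint partitions combine by element-wise addition.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch dimensions must match")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.rows[r] = [a + b for a, b in zip(self.rows[r], other.rows[r])]
        return merged


# Two "segments" each summarize their own slice of the data in one pass.
left, right = CountMinSketch(), CountMinSketch()
for word in ["apple", "apple", "pear"]:
    left.add(word)
for word in ["apple", "fig", "fig"]:
    right.add(word)
combined = left.merge(right)
```

Estimates from `combined` are never lower than the true counts over both slices; they can be higher when unrelated items collide in every row, and a wider sketch makes that less likely.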
More methods are planned for future releases. Myself, I’m working with Daisy Wang on merging her SQL-based Conditional Random Fields and Bayesian inference implementations into the library for an upcoming release, to support sophisticated text processing.
MADlib 0.20beta was designed as a drop-in addition to the popular open-source PostgreSQL database. The routines also all run massively parallel in the Greenplum database, which inherits the extensibility interfaces of PostgreSQL. (Non-commercial users can get the parallelism for free—no crippleware!—via Greenplum’s Community Edition.)
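For readers wondering how a statistical method can run massively parallel inside a database, one common pattern is to express it as an aggregate whose state can be built per partition and then merged. The Python sketch below illustrates that pattern for simple one-variable least squares. It is an assumption-laden illustration, not MADlib's implementation: the function names `transition`, `merge`, and `final` are hypothetical, and the real library handles the multivariate case.

```python
# State is a tuple of sufficient statistics: (n, sum_x, sum_y, sum_xx, sum_xy).
EMPTY = (0, 0.0, 0.0, 0.0, 0.0)


def transition(state, x, y):
    """Fold one row into the running state (one pass, constant memory)."""
    n, sx, sy, sxx, sxy = state
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)


def merge(a, b):
    """Combine states computed independently on two partitions."""
    return tuple(u + v for u, v in zip(a, b))


def final(state):
    """Turn the merged state into (slope, intercept)."""
    n, sx, sy, sxx, sxy = state
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Pretend the rows live on two segments; each segment aggregates locally.
rows = [(x, 2 * x + 1) for x in range(10)]
partitions = [rows[:4], rows[4:]]
states = []
for part in partitions:
    state = EMPTY
    for x, y in part:
        state = transition(state, x, y)
    states.append(state)

slope, intercept = final(merge(states[0], states[1]))
```

Because `merge` is just addition, the answer is the same however the rows are split across partitions, and only the small state tuples need to move between nodes.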
Obviously the first beta release is still experimental code. But we’ve come a long way since our alpha announcement at the Strata conference last February in both breadth and usability. For me, the most exciting aspect of the beta release is that we’re about ready to grow our developer community beyond the initial committers at Berkeley and EMC-Greenplum. If you’re interested in adding methods to MADlib—or in developing ports for other DBMS platforms (something I’d love to see happen!)—please get in touch.
And hats off to EMC-Greenplum for putting significant development resources behind this open-source effort. I started this project at Greenplum before they were acquired, and have been happy to see EMC embrace it and push it further. As I discussed at Strata last year, I think this is a very healthy direction for open-source software development.