MADlib is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha to its first beta release, version 0.20beta. Hats off to the MADlib team!

Forget your previous associations with low-tech SQL analytics, including so-called “business intelligence”, “olap”, “data cubes” and the like. This is the real deal: statistical and machine learning methods running at scale within the database, massively parallel, close to the data. Much of the code is written in SQL (a language that doesn’t get enough credit as a basis for parallel statistics), with key extensions in C/C++ for performance, and the occasional Python glue code. The suite of methods in the beta includes:

  • standard statistical methods like multi-variate linear and logistic regressions,
  • supervised learning methods including support-vector machines, naive Bayes, and decision trees
  • unsupervised methods including k-means clustering, association rules and Latent Dirichlet Allocation
  • descriptive statistics and data profiling, including one-pass Flajolet-Martin and CountMin sketch methods (my personal contributions to the library) to compute distinct counts, range-counts, quantiles, various types of histograms, and frequent-value identification
  • statistical support routines including an efficient sparse vector library and array operations, and conjugate gradient optimization.
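
For readers who haven't met sketches before, here is a minimal Python illustration of the CountMin idea behind the frequent-value and range-count estimates in the list above. To be clear, this is not MADlib's code (which lives in SQL and C); the class name, method names, and default sizes are made up for illustration, and the hashing scheme is just one convenient choice. The point is that each sketch is a small fixed-size array built in one pass, and that two sketches built over different partitions of the data can be combined by simple addition, which is what makes the approach a natural fit for parallel aggregation.

```python
import hashlib


class CountMinSketch:
    """Illustrative CountMin sketch; names and defaults are hypothetical."""

    def __init__(self, width=1024, depth=4):
        self.width = width    # counters per row
        self.depth = depth    # number of independent hash rows
        self.rows = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        # One hash function per row, derived by salting BLAKE2b with the row index.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # One pass over the data: each item increments one counter in every row.
        for r in range(self.depth):
            self.rows[r][self._bucket(r, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is an upper bound on the true count that is usually close to it.
        return min(self.rows[r][self._bucket(r, item)] for r in range(self.depth))

    def merge(self, other):
        # Sketches built on separate partitions combine by element-wise addition.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same dimensions")
        for r in range(self.depth):
            for c in range(self.width):
                self.rows[r][c] += other.rows[r][c]
        return self
```

A sketch built per data segment and then merged holds exactly the same counters as a single sketch built over the whole stream, so `estimate` never undercounts either way.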

More methods are planned for future releases.  Myself, I’m working with Daisy Wang on merging her SQL-based Conditional Random Fields and Bayesian inference implementations into the library for an upcoming release, to support sophisticated text processing.

MADlib 0.20beta was designed as a drop-in addition to the popular open-source PostgreSQL database.  The routines also all run massively parallel in the Greenplum database, which inherits the extensibility interfaces of PostgreSQL. (Non-commercial users can get the parallelism for free—no crippleware!—via Greenplum’s Community Edition.)

Obviously the first beta release is still experimental code.  But we’ve come a long way in both breadth and usability since our alpha announcement at the Strata conference last February. For me, the most exciting aspect of the beta release is that we’re about ready to grow our developer community beyond the initial committers at Berkeley and EMC-Greenplum.  If you’re interested in adding methods to MADlib, or in developing ports for other DBMS platforms (something I’d love to see happen!), please get in touch.

And hats off to EMC-Greenplum for putting significant development resources behind this open-source effort.  I started this project at Greenplum before they were acquired, and have been happy to see EMC embrace it and push it further.  As I discussed at Strata last year, I think this is a very healthy direction for open-source software development.
