When the folks at ACM SIGMOD asked me to be a guest blogger this month, I figured I should highlight the most community-facing work I’m involved with. So I wrote up a discussion of MADlib and the fact that this open-source in-database analytics library is now open to community contributions. (A bunch of us recently wrote a paper on the design and use of MADlib, which made my writing job a bit easier.) I’m optimistic about MADlib closing a gap between algorithm researchers and working data scientists, using familiar SQL as a vector for adoption on both fronts.
I kicked off MADlib as a part-time consulting project for Greenplum during my sabbatical in 2010-2011. As I built out the first two methods (FM and CountMin sketches) and an installer, Greenplum started assembling a team of their own engineers and data scientists to overlap with and eventually replace me when I returned to campus. They also developed a roadmap of additional methods that their customers wanted in the field. Eighteen months later, Greenplum now contributes the bulk of the labor, management and expertise for the project, and has built bridges to leading academics as well.
While I’ve encouraged Greenplum’s investment all along, it has been equally important to me that MADlib avoid becoming Yet Another Proprietary Library for Analytics. There are a bunch of those libraries provided by DBMS vendors and third parties; they are all-but-invisible to researchers, and are immutable black boxes to data scientists in the field. So while some of them may be quite good (who knows?!) they don’t create or benefit from a network effect in the technical community. Given the current excitement about Big Data and the scarcity of data scientists, that kind of isolation could be deadly to both adoption and evolution.
As a result of those considerations, MADlib is open source, and I’ve insisted all along that it be maintained on at least one open-source platform: PostgreSQL. I’m also genuinely eager to see MADlib ports to other massively-parallel commercial DBMSs. This will require some work, but it’s tractable and well-scoped engineering work, and I hope that incentives will arise to get it done and expand the scope of the community.
Finally, I’m happy to see academics like Chris Ré, Daisy Wang and their students starting to contribute to the project. I hope their research will help out data scientists in the field, and generate the kind of user feedback that is so hard to get in academia. I also hope that other researchers will join them and get involved. I think it’s great experience to write meaty research code that actually ships, and I hope MADlib can be a mechanism to encourage more of that experience in the community.