The big news around here today is the public announcement of Trifacta, a company I’ve been quietly cooking over the last few months with colleagues Jeff Heer and Sean Kandel of Stanford. Trifacta is taking on an important and satisfying challenge: to build a new generation of user-centric data management software that is beautiful, powerful, and eminently useful.
Before I talk more about the background let me say this: We Are Hiring. We’re looking for people with passion and talent in Interaction Design, Data Visualization, Databases, Distributed Systems, Languages, and Machine Learning. We’re looking for folks who want to reach across specialties, and work together to build integrated, rich, and deeply satisfying software. We’ve got top-shelf funding and a sun-soaked office in the heart of SOMA in San Francisco, and we’re building a company with clear, tangible value. It’s early days and the fun is ahead. If you ever considered joining a data startup, this is the one. Get in touch.
Read More »
When the folks at ACM SIGMOD asked me to be a guest blogger this month, I figured I should highlight the most community-facing work I’m involved with. So I wrote up a discussion of MADlib, and that the fact that this open-source in-database analytics library is now open to community contributions. (A bunch of us recently wrote a paper on the design and use of MADlib, which made my writing job a bit easier.) I’m optimistic about MADlib closing a gap between algorithm researchers and working data scientists, using familiar SQL as a vector for adoption on both fronts.
Read More »
MADlib is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha, to the first beta release version, 0.20beta. Hats off the MADlib team!
Forget your previous associations with low-tech SQL analytics, including so-called “business intelligence”, “olap”, “data cubes” and the like. This is the real deal: statistical and machine learning methods running at scale within the database, massively parallel, close to the data. Much of the code is written in SQL (a language that doesn’t get enough credit as a basis for parallel statistics), with key extensions in C/C++ for performance, and the occasional Python glue code. The suite of methods in the beta includes:
- standard statistical methods like multi-variate linear and logistic regressions,
- supervised learning methods including support-vector machines, naive Bayes, and decision trees
- unsupervised methods including k-means clustering, association rules and Latent Dirichlet Allocation
- descriptive statistics and data profiling, including one-pass Flajolet-Martin and CountMin sketch methods (my personal contributions to the library) to compute distinct counts, range-counts, quantiles, various types of histograms, and frequent-value identification
- statistical support routines including an efficient sparse vector library and array operations, and conjugate gradiant optimization.
More methods are planned for future releases. Myself, I’m working with Daisy Wang on merging her SQL-based Conditional Random Fields and Bayesian inference implementations into the library for an upcoming release, to support sophisticated text processing.
Read More »
[Update 1/15/2010: this paper was awarded Best Student Paper at ICDE 2010! Congrats to Kuang, Harr and Neil on the well-deserved recognition!]
[Update 11/5/2009: the first paper on Usher will appear in the ICDE 2010 conference.]
Data quality is a big, ungainly problem that gets too little attention in computing research and the technology press. Databases pick up “bad data” – errors, omissions, inconsistencies of various kinds — all through their lifecycle, from initial data entry/acquisition through data transformation and summarization, and through integration of multiple sources.
While writing a survey for the UN on the topic of quantitative data cleaning, I got interested in the dirty roots of the problem: data entry. This led to our recent work on Usher [11/5/2009: Link updated to final version], a toolkit for intelligent data entry forms, led by Kuang Chen.
Read More »
Relational databases are for structured data, right? And free text lives in the world of keyword search?
Another paper we recently finished up was on Declarative Information Extraction in a Probabilistic Database System. In a nutshell (as my buddy Minos is wont to say), this is about
- automatically converting free text into structured data,
- using the state of the art machine learning technique (Conditional Random Fields), which is
- coded up in a few lines of SQL that integrates with the rest of your query processing.
This is Daisy Wang‘s baby, and it’s really cool. She’s achieved a convergence where free text, relational data and statistical models all come together in an elegant and very practical way.
Read More »