I often hear that many of the leading data analysts in the field have PhDs in physics or biology or the like, rather than computer science. Computer scientists are typically interested in methods; physical scientists are interested in data.
Another thing I often hear is that a large fraction of analysts’ time (some say the majority) goes to data preparation and cleaning: transforming formats, rearranging nesting structures, removing outliers, and so on. (If you think this is easy, you’ve never had a stack of ad hoc Excel spreadsheets to load into a stat package or database!)
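To make those chores concrete, here is a minimal Python sketch of the kind of hand-written cleanup script an analyst ends up producing. Everything in it is made up for illustration: the spreadsheet contents, the column names, the helper functions, and the outlier rule (a median-absolute-deviation cutoff with an arbitrary threshold). It is not how DataWrangler works; it is the sort of one-off code such a tool aims to make unnecessary.

```python
import csv
import io
import statistics

# Hypothetical export of an ad hoc spreadsheet: a title row, a blank line,
# a region name written only on the first row of each group, and numbers
# stored as text with thousands separators.
RAW = """\
Quarterly sales (draft)

Region,Month,Sales
East,Jan,"1,200"
,Feb,"1,350"
,Mar,"98,000"
West,Jan,"1,100"
,Feb,"1,250"
,Mar,"1,300"
"""


def load_rows(text):
    """Skip the preamble, then parse the remaining lines as CSV records."""
    lines = text.splitlines()
    start = next(i for i, line in enumerate(lines) if line.startswith("Region,"))
    return list(csv.DictReader(io.StringIO("\n".join(lines[start:]))))


def fill_down(rows, column):
    """Copy the last non-empty value of `column` into the blank cells below it."""
    last = None
    filled = []
    for row in rows:
        if row[column]:
            last = row[column]
        filled.append({**row, column: last})
    return filled


def parse_sales(rows):
    """Turn text such as "1,200" into the integer 1200."""
    return [{**row, "Sales": int(row["Sales"].replace(",", ""))} for row in rows]


def drop_outliers(rows, column, k=5.0):
    """Drop rows more than k median absolute deviations from the median.

    The threshold k is an arbitrary choice for this sketch. If the MAD is
    zero, every row that differs from the median is dropped.
    """
    values = [row[column] for row in rows]
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [row for row in rows if abs(row[column] - med) <= k * mad]


rows = parse_sales(fill_down(load_rows(RAW), "Region"))
clean = drop_outliers(rows, "Sales")

for row in clean:
    print(row["Region"], row["Month"], row["Sales"])
```

Each step is trivial on its own, but every new spreadsheet tends to need its own variation on this script, and that is where the time goes.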
Putting these together, something is very wrong: high-powered people are wasting most of their time doing low-function work. And the challenge of improving this state of affairs has fallen through the cracks between the analysts and the computer scientists.
DataWrangler, which I demo’d today at the O’Reilly Strata Conference, is a new tool we’re developing to address this problem. It is an intelligent visual data transformation tool that lets users reshape, transform and clean data in an intuitive way that surprises most people who’ve worked with data. As you manipulate data in a grid layout, the tool automatically infers information both about the data and about your intentions for transforming it. It’s hard to describe, but the lead researcher on the project, Stanford PhD student Sean Kandel, has a quick video up on the DataWrangler homepage that shows how it works. Sean has put DataWrangler live on the site as well.
Tackling these problems fundamentally requires a hybrid technical strategy. Under the covers, DataWrangler is a heady mix of second-order logic, machine learning methods, and human-computer interaction design methodology. We wrote a research paper about it that will appear in this year’s SIGCHI.
If you’re interested in this space, also have a look at Shankar Raman’s very prescient Potter’s Wheel work from a decade ago, the PADS project at AT&T and Princeton, recent research from Sumit Gulwani at Microsoft Research, and David Huynh’s most excellent Google Refine. All good stuff!