I often hear that many of the leading data analysts in the field have PhDs in physics or biology or the like, rather than computer science.  Computer scientists are typically interested in methods; physical scientists are interested in data.

Another thing I often hear is that a large fraction of the time spent by analysts — some say the majority of time — involves data preparation and cleaning: transforming formats, rearranging nesting structures, removing outliers, and so on.  (If you think this is easy, you’ve never had a stack of ad hoc Excel spreadsheets to load into a stat package or database!)
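To make that concrete, here is a minimal Python sketch of the kind of hand-written cleanup script this work usually turns into. Everything in it is an assumption for illustration: the spreadsheet contents in `RAW`, the helper names `clean` and `drop_outliers`, and the median-based outlier rule are all made up, and none of it reflects how any particular tool does the job.

```python
import csv
import io
import statistics

# Hypothetical ad hoc spreadsheet export: a title row, a blank row,
# a region label that appears once per group, and numbers stored as text.
RAW = """Sales report 2010,,
,,
Region,Month,Units
West,Jan,"1,200"
,Feb,"1,150"
,Mar,"98,000"
East,Jan,900
,Feb,n/a
,Mar,950
"""


def clean(raw_text):
    """Turn the messy export into a list of flat records."""
    rows = list(csv.reader(io.StringIO(raw_text)))
    # Drop rows in which every cell is empty.
    rows = [r for r in rows if any(cell.strip() for cell in r)]
    # Treat the first fully populated row as the header (an assumption).
    header_index = next(i for i, r in enumerate(rows) if all(c.strip() for c in r))
    header = [c.strip() for c in rows[header_index]]
    records = []
    last_region = None
    for row in rows[header_index + 1:]:
        region, month, units = (c.strip() for c in row)
        # Fill down: a blank region cell inherits the value above it.
        region = region or last_region
        last_region = region
        try:
            units_value = int(units.replace(",", ""))
        except ValueError:
            continue  # skip cells such as "n/a" that are not numbers
        records.append(dict(zip(header, (region, month, units_value))))
    return records


def drop_outliers(records, key="Units", k=5.0):
    """Drop records more than k median absolute deviations from the median."""
    values = [r[key] for r in records]
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return list(records)  # no spread to judge against; keep everything
    return [r for r in records if abs(r[key] - median) <= k * mad]


if __name__ == "__main__":
    for record in drop_outliers(clean(RAW)):
        print(record)
```

Even for nine lines of input, the script has to hard-code guesses about where the header is, which column needs filling down, and what counts as an outlier. Each new spreadsheet tends to need a different set of guesses.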

Putting these together, something is very wrong: high-powered people are wasting most of their time doing low-function work. And the challenge of improving this state of affairs has fallen through the cracks between the analysts and the computer scientists.

DataWrangler is a new tool we’re developing to address this problem, which I demo’d today at the O’Reilly Strata Conference.  DataWrangler is an intelligent visual data transformation tool that lets users reshape, transform and clean data in an intuitive way that surprises most people who’ve worked with data.  As you manipulate data in a grid layout, the tool automatically infers information both about the data, and about your intentions for transforming the data.  It’s hard to describe, but the lead researcher on the project — Stanford PhD student Sean Kandel — has a quick video up on the DataWrangler homepage that shows how it works.  Sean has put DataWrangler live on the site as well.

Tackling these problems fundamentally requires a hybrid technical strategy.  Under the covers, DataWrangler is a heady mix of second-order logic, machine learning methods, and human-computer interaction design methodology.   We wrote a research paper about it that will appear in this year’s SIGCHI.

If you’re interested in this space, also have a look at Shankar Raman’s very prescient Potter’s Wheel work from a decade ago, the PADS project at AT&T and Princeton, recent research from Sumit Gulwani at Microsoft Research, and David Huynh’s most excellent Google Refine.  All good stuff!

3 Trackbacks/Pingbacks

  1. [...] paper explaining how the tool works. Joseph M. Hellerstein explains the origins of the project in a blog post: Another thing I often hear is that a large fraction of the time spent by analysts — some say [...]

  3. By Strata 2011 | My Blog on 08 Nov 2012 at 1:02 am

    [...] and Deep Approach to Scalable Analytics – the hotness of this talk seemed related more to the DataWrangler tool (for cleansing data) than the MAD library (scalable analytics engine running inside Postgres) [...]
