[Update 11/5/2009: the first paper on Usher will appear in the ICDE 2010 conference.]
Data quality is a big, ungainly problem that gets too little attention in computing research and the technology press. Databases pick up “bad data” (errors, omissions, inconsistencies of various kinds) all through their lifecycle, from initial data entry/acquisition through data transformation and summarization, and through integration of multiple sources.
While writing a survey for the UN on the topic of quantitative data cleaning, I got interested in the dirty roots of the problem: data entry. This led to our recent work on Usher [11/5/2009: Link updated to final version], a toolkit for intelligent data entry forms, led by Kuang Chen.
Out of the gate, it seemed to me that traditional forms with integrity constraints are a classic source of user frustration (don’t you hate those little red stars asking you to resubmit?), and that they rest on old-fashioned deterministic reasoning that should be swept aside by statistical methods. But before diving into the math, I started to look around for what I assumed would be the extensive HCI literature on the design of forms. Guess what: there’s next to nothing written on the topic in the HCI community! Form design has apparently been considered just too boring to bother with for the last few decades, even though it’s a nearly ubiquitous human-computer interaction, and most forms you run into are just terrible.
As a researcher, when you find a universal problem that nobody is working on, you have yourself a golden opportunity.
Kuang jumped on this, and it jibed perfectly with his interest in healthcare informatics in the developing world. After shadowing health workers in Tanzania, he came to the conclusion that smarter forms could make a material difference in the quality of the information available to medical and public health workers in developing regions. The niceties of “Total Quality Management” and other business-school mantras run up against a lot of hard realities in third-world NGOs on shoestring budgets.
In collaboration with Harr Chen at MIT (an old Seattle buddy of Kuang’s), Tapan Parikh at Berkeley’s iSchool (Tap and I are co-advising Kuang’s thesis), and fellow Berkeley Ph.D. student Neil Conway, Kuang started developing Usher: a toolkit that uses Bayesian statistics and human-computer interaction principles to deliver intelligent forms that can improve the quality of data entry.
The first paper on Usher is under submission. It describes how Usher learns models of data and data entry personnel, and uses those models to dynamically decide which question to ask next and which questions to re-ask. The paper also begins a longer-term discussion on the design and evaluation of smart UI widgets that encourage better data entry.
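To give a flavor of what "decide which question to ask next and which to re-ask" can mean, here is a minimal sketch in Python. It is not Usher's implementation: it swaps in a simple smoothed count model over a handful of made-up records, and every name in it (`RECORDS`, `conditional`, `next_question`, `questions_to_reask`, the field names, the smoothing constant and the threshold) is a hypothetical chosen for illustration. The assumed heuristics are to ask next the unanswered field the model is most uncertain about (highest entropy given the answers so far), and to re-ask any answer that looks improbable given the other answers.

```python
import math
from collections import Counter

# Hypothetical past submissions; fields and values are invented for illustration.
RECORDS = [
    {"region": "north", "clinic": "A", "sex": "F"},
    {"region": "north", "clinic": "A", "sex": "M"},
    {"region": "north", "clinic": "A", "sex": "F"},
    {"region": "south", "clinic": "B", "sex": "M"},
    {"region": "south", "clinic": "B", "sex": "F"},
    {"region": "south", "clinic": "C", "sex": "M"},
]
FIELDS = ["region", "clinic", "sex"]


def conditional(field, given, records=RECORDS, alpha=0.5):
    """Smoothed estimate of P(field | given) from the records matching `given`."""
    domain = sorted({r[field] for r in records})
    matches = [r for r in records if all(r[k] == v for k, v in given.items())]
    counts = Counter(r[field] for r in matches)
    total = len(matches) + alpha * len(domain)
    return {value: (counts[value] + alpha) / total for value in domain}


def entropy(dist):
    """Shannon entropy, in bits, of a {value: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def next_question(answers, fields=FIELDS):
    """Pick the unanswered field with the highest entropy given the answers so far."""
    remaining = [f for f in fields if f not in answers]
    return max(remaining, key=lambda f: entropy(conditional(f, answers)))


def questions_to_reask(answers, threshold=0.25):
    """Flag answers whose probability, given all the other answers, is below threshold."""
    flagged = []
    for field, value in answers.items():
        others = {k: v for k, v in answers.items() if k != field}
        if conditional(field, others).get(value, 0.0) < threshold:
            flagged.append(field)
    return flagged
```

With this toy data, an empty form starts with `clinic`, since its three possible values make it the least predictable field. An entry that pairs `region="south"` with `clinic="A"` gets both of those fields flagged, because the counts alone cannot say which of the two was mistyped. A real system would need a far richer model than raw counts, including some model of the people doing the entry.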
This agenda is a great fusion of databases, machine learning and human-computer interaction, in service of a pressing practical need. And I’m delighted to have Kuang and Tap driving it toward the service of ICTD, an area I really want to understand better.
