Relational databases are for structured data, right? And free text lives in the world of keyword search? Not anymore. How about:
- automatically converting free text into structured data,
- using a state-of-the-art machine learning technique (Conditional Random Fields), which is
- coded up in a few lines of SQL that integrates with the rest of your query processing.
This is Daisy Wang’s baby, and it’s really cool. She’s achieved a convergence where free text, relational data, and statistical models all come together in an elegant and very practical way.
The general problem of Information Extraction is to pull labeled phrases (names, addresses, etc.) out of free text. The basic idea of our approach (which seems to be kicking around a lot!) is to model everything you can as first-class relational data in a database. In this case, “everything” includes free text, statistical model parameters, and the ongoing state of an algorithm.
First, building on the BayesStore work on Probabilistic Databases, Daisy noted that the statistical correlation model for Conditional Random Fields can be naturally stored in database tables. Then, she played the traditional trick of converting free text into an inverted index — i.e., a table with the schema (docId, position, word). Finally, she put together the fact that the famous Viterbi Inference algorithm is a simple dynamic programming technique, so the data structure for the algorithm (the DP table) can be stored as a relation, and the logic of the algorithm can be written as a (recursive) SQL query.
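To make that concrete, here is a minimal sketch in Python with sqlite3. The table, column, and label names are mine, and the scoring function is an invented stand-in for real learned CRF parameters: free text gets shredded into a (doc_id, pos, word) relation, and Viterbi fills a dynamic-programming table keyed on (position, label) cells — exactly the structure the recursive SQL query materializes.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tokens (doc_id INTEGER, pos INTEGER, word TEXT)")

# Shred free text into the inverted-index relation (doc_id, pos, word).
text = "2003 Jaguar XJ8 for 9500 dollars"
db.executemany("INSERT INTO tokens VALUES (1, ?, ?)",
               list(enumerate(text.split())))

# A toy stand-in for CRF factors: score(prev_label, label, word).
# These weights are invented for illustration, not learned parameters.
LABELS = ["CAR", "PRICE", "OTHER"]
def score(prev, label, word):
    s = 0.0
    if word.isdigit():
        s += 2.0 if label == "PRICE" else -1.0
    if word.istitle() or word.isupper():
        s += 1.5 if label == "CAR" else 0.0
    if prev == label:
        s += 0.5  # mild preference for label continuity
    return s

# Viterbi is just dynamic programming over (pos, label) cells; each cell
# holds the best score so far and the label path that achieved it.
words = [w for (_, w) in db.execute(
    "SELECT pos, word FROM tokens WHERE doc_id = 1 ORDER BY pos")]
V = [{lab: (score(None, lab, words[0]), [lab]) for lab in LABELS}]
for w in words[1:]:
    V.append({lab: max(
        ((V[-1][p][0] + score(p, lab, w), V[-1][p][1] + [lab])
         for p in LABELS), key=lambda t: t[0]) for lab in LABELS})
best_score, labeling = max(V[-1].values(), key=lambda t: t[0])
print(list(zip(words, labeling)))
```

The DP rows here are ordinary tuples, which is the point: each row depends only on the previous row, so the whole loop can be expressed as a recursive SQL query over a relation of (pos, label, score, path) facts.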
The result? Every word in a document is automatically given a “label” (think of an XML-style “tag”) that identifies its structure. This means that items in text (including phrases) can be extracted as People, Companies, Addresses, Calendar Events, Celebrities, etc. And you can integrate queries on your structured data with queries over these tags — all in SQL. So you can, for example, write ad-hoc SQL queries that find prices and dates inside classified ads for Jaguar cars, without getting confused about animals or football teams. Since it’s free-form SQL, you can adapt that query to compute averages or standard deviations of prices, trends of prices over time, etc. As you wish.
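As a sketch of the kind of query this enables (again with sqlite3; the schema and labels are mine, and the label assignments are filled in by hand here rather than produced by inference): joining the token relation to its labels lets plain SQL average prices in Jaguar-the-car ads while skipping Jaguar-the-animal documents.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE tokens (doc_id INTEGER, pos INTEGER, word TEXT);
CREATE TABLE labels (doc_id INTEGER, pos INTEGER, label TEXT);
""")

# Three toy documents; in the real system the labels table would be
# populated by Viterbi inference, not by hand.
docs = {
    1: [("Jaguar", "CAR"), ("XJ8", "CAR"), ("9500", "PRICE")],
    2: [("Jaguar", "ANIMAL"), ("cub", "ANIMAL"), ("photo", "OTHER")],
    3: [("Jaguar", "CAR"), ("XK", "CAR"), ("10500", "PRICE")],
}
for doc_id, toks in docs.items():
    for pos, (word, label) in enumerate(toks):
        db.execute("INSERT INTO tokens VALUES (?, ?, ?)", (doc_id, pos, word))
        db.execute("INSERT INTO labels VALUES (?, ?, ?)", (doc_id, pos, label))

# Average asking price across ads where "Jaguar" is labeled as a car;
# the label join filters out the documents about the animal.
(avg_price,) = db.execute("""
    SELECT AVG(CAST(t.word AS REAL))
    FROM tokens t
    JOIN labels l ON t.doc_id = l.doc_id AND t.pos = l.pos
    WHERE l.label = 'PRICE'
      AND t.doc_id IN (
        SELECT t2.doc_id
        FROM tokens t2
        JOIN labels l2 ON t2.doc_id = l2.doc_id AND t2.pos = l2.pos
        WHERE t2.word = 'Jaguar' AND l2.label = 'CAR')
""").fetchone()
print(avg_price)
```

Because the labels are just another relation, swapping `AVG` for `STDDEV`, grouping by month, or joining against your own structured tables is ordinary SQL, no extraction-specific API required.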
There are a bunch of things I like about this. First and foremost, it extends a nice line of research on declarative information extraction that friends and colleagues have been pursuing for some time at places like Wisconsin, Yahoo!, IBM, and IIT Bombay, and puts it simultaneously on a strong statistical footing and a grounded, reusable database footing. Second, it takes a lot of the relatively exotic research work that folks have been doing on probabilistic databases and shows how to roll it out for a key application in a way that can have immediate practical impact. This stuff can be used in vanilla PostgreSQL right away (well, in the beta of v8.4), without buying into a new probabilistic data model or language extension, or embracing reasoning with uncertainty. (Though if you do embrace that stuff, you now have a robust probabilistic interpretation to work with, including full marginal inference.) Along the way are various feats of derring-do, including the fact that the SQL implementation in PostgreSQL runs as fast as or faster than the handwritten open-source implementation of Viterbi, plus various query optimization tricks that make queries over the text labels run more quickly than they would without DBMS integration.