Relational databases are for structured data, right? And free text lives in the world of keyword search?

Well.  

Another paper we recently finished up was on Declarative Information Extraction in a Probabilistic Database System.  In a nutshell (as my buddy Minos is wont to say), this is about

  1. automatically converting free text into structured data,
  2. using a state-of-the-art machine learning technique (Conditional Random Fields), which is 
  3. coded up in a few lines of SQL that integrates with the rest of your query processing.

This is Daisy Wang’s baby, and it’s really cool.  She’s achieved a convergence where free text, relational data, and statistical models all come together in an elegant and very practical way.

The general problem of Information Extraction is to pull labeled phrases (names, addresses, etc.) out of free text.  The basic idea of our approach (which seems to be kicking around a lot!) is to model everything you can as first-class relational data in a database.  In this case, “everything” includes free text, statistical model parameters, and the ongoing state of an algorithm.

First, building on the BayesStore work on Probabilistic Databases, Daisy noted that the statistical correlation model for Conditional Random Fields can be stored naturally in database tables.  Then, she played the traditional trick of converting free text into an inverted index — i.e., a table with the schema (docId, position, word).  Finally, she observed that the famous Viterbi inference algorithm is a simple dynamic-programming technique, so the algorithm's data structure (the DP table) can be stored as a relation, and the logic of the algorithm can be written as a (recursive) SQL query.
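To make those two tricks concrete, here is a toy sketch in Python with SQLite.  The data, the two-label tag set, and the scores are all made up for illustration — this is not the paper's actual CRF model — but it shows free text landing in a (docId, pos, word) relation and the Viterbi DP table landing in a relation of its own:

```python
import sqlite3

# Free text stored as an inverted index: schema (docId, pos, word).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (docId INTEGER, pos INTEGER, word TEXT)")
conn.execute("CREATE TABLE viterbi (pos INTEGER, label TEXT, score REAL)")

doc = "2001 Jaguar XJ8 for 8500 obo"
words = doc.split()
conn.executemany("INSERT INTO tokens VALUES (1, ?, ?)", enumerate(words))

# Hypothetical CRF-style scores: digit strings look like prices,
# and labels are "sticky" (same-label transitions score higher).
labels = ["price", "other"]

def emit(word, label):
    if label == "price":
        return 2.0 if word.isdigit() else 0.0
    return 1.0

def trans(prev, cur):
    return 0.5 if prev == cur else 0.0

# Fill the DP table: V[t][l] = best score of any label path ending in l at t.
V = [{l: emit(words[0], l) for l in labels}]
for t in range(1, len(words)):
    V.append({l: max(V[t - 1][p] + trans(p, l) for p in labels) + emit(words[t], l)
              for l in labels})

# The DP state itself is just another relation, queryable in SQL.
conn.executemany("INSERT INTO viterbi VALUES (?, ?, ?)",
                 [(t, l, V[t][l]) for t in range(len(words)) for l in labels])

top = conn.execute(
    "SELECT label FROM viterbi WHERE pos = ? ORDER BY score DESC LIMIT 1",
    (len(words) - 1,)).fetchone()[0]
print(top)
```

In the paper the DP recurrence itself is expressed as a recursive SQL query over these relations; here the Python loop stands in for that recursion to keep the sketch short.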

The result?  Every word in a document is automatically given a “label” (think of an XML-style “tag”) that identifies its structure.  This means that items in text (including phrases) can be extracted as People, Companies, Addresses, Calendar Events, Celebrities, etc.  And you can integrate queries on your structured data with queries over these tags — all in SQL.  So you can, for example, write ad-hoc SQL queries that find prices and dates inside classified ads for Jaguar cars, without getting confused about animals or football teams.  Since it’s free-form SQL, you can adapt that query to compute averages or standard deviations of prices, trends of prices over time, etc.  As you wish.
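As a sketch of what that Jaguar query might look like, here is a toy example in Python with SQLite.  The table names, label set, and data are all hypothetical — the point is just that once tokens carry labels, ordinary SQL joins and aggregates can mix structured predicates (ad category) with text-derived ones (make, price):

```python
import sqlite3

# Hypothetical schema: labeled tokens plus a structured ads table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE labeled_tokens (docId INTEGER, pos INTEGER, word TEXT, label TEXT);
CREATE TABLE ads (docId INTEGER, category TEXT);
INSERT INTO ads VALUES (1, 'cars'), (2, 'cars'), (3, 'pets');
INSERT INTO labeled_tokens VALUES
  (1, 0, 'Jaguar', 'make'),   (1, 1, '8500', 'price'),
  (2, 0, 'Jaguar', 'make'),   (2, 1, '9500', 'price'),
  (3, 0, 'jaguar', 'animal'), (3, 1, '200',  'price');
""")

# Average asking price for Jaguar *cars*, ignoring the animal.
avg_price = conn.execute("""
  SELECT AVG(CAST(p.word AS REAL))
  FROM labeled_tokens m
  JOIN labeled_tokens p ON p.docId = m.docId AND p.label = 'price'
  JOIN ads a ON a.docId = m.docId
  WHERE m.label = 'make' AND m.word = 'Jaguar' AND a.category = 'cars'
""").fetchone()[0]
print(avg_price)  # 9000.0
```

Swapping AVG for STDEV, or grouping by a date label, gives the other ad-hoc analyses mentioned above with one-line changes to the query.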

There are a bunch of things I like about this.  First and foremost, it extends a nice line of research on declarative information extraction that friends and colleagues have been working on for some time at places like Wisconsin, Yahoo!, IBM, and IIT Bombay, and puts that work simultaneously on a strong statistical footing and a grounded, reusable database footing.

Second, it takes a lot of the relatively exotic research work that folks have been doing on probabilistic databases and shows how to roll it out for a key application in a way that can have immediate practical impact.  This stuff can be used in vanilla PostgreSQL right away (well, in the beta of v8.4), without buying into a new probabilistic data model or language extension, or embracing reasoning with uncertainty.  (Though if you do embrace that stuff, you now have a robust probabilistic interpretation to work with, including full marginal inference.)

Along the way are various feats of derring-do, including the fact that the SQL version in PostgreSQL runs as fast as or faster than the handwritten open-source implementation of Viterbi, plus various query optimization tricks that make queries on top of the text labels run more quickly than they would if you hadn't integrated into the DBMS.

