
Category Archives: research


For the last year or so, my team at Berkeley — in collaboration with Yahoo Research — has been undertaking an aggressive experiment in programming.  The challenge is to design a radically easier programming model for infrastructure and applications in the next computing platform: The Cloud.  We call this the Berkeley Orders Of Magnitude (BOOM) project: enabling programmers to develop OOM bigger systems in OOM less code.

To kick this off we built something we call BOOM Analytics [link updated to Eurosys10 final version]: a clone of Hadoop and HDFS built largely in Overlog, a declarative language we developed some years back for network protocols.  BOOM Analytics is just as fast and scalable as Hadoop, but radically simpler in its structure.  As a result we were able — with amazingly little effort — to turbocharge our incarnation of the elephant with features that would be enormous upgrades to Hadoop’s Java codebase.  Two of the fanciest are: Read More »

Was intrigued last week by the confluence of two posts:

  • Owen O’Malley and Arun Murthy of Yahoo’s Hadoop team posted about sorting a petabyte using Hadoop on 3,800 nodes.
  • Curt Monash posted that eBay hosts a 6.5 petabyte Greenplum database on 96 nodes

Both impressive.  But wildly different hardware deployments. Why??  It’s well known that Hadoop is tuned for availability, not efficiency.  But does it really need 40x as many machines as eBay’s Greenplum cluster?  How did smart folks end up with such wildly divergent numbers?
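The 40x figure is just the ratio of the two cluster sizes reported in the posts above:

```python
# Cluster sizes as reported in the two posts cited above.
hadoop_nodes = 3800      # Yahoo's petabyte-sort Hadoop cluster
greenplum_nodes = 96     # eBay's Greenplum cluster

ratio = hadoop_nodes / greenplum_nodes
print(f"Hadoop deployment uses {ratio:.1f}x as many machines")  # ~39.6x
```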

Read More »


[Update 1/15/2010: this paper was awarded Best Student Paper at ICDE 2010!  Congrats to Kuang, Harr and Neil on the well-deserved recognition!]

[Update 11/5/2009: the first paper on Usher will appear in the ICDE 2010 conference.]

Data quality is a big, ungainly problem that gets too little attention in computing research and the technology press. Databases pick up “bad data” — errors, omissions, inconsistencies of various kinds — all through their lifecycle, from initial data entry/acquisition through data transformation and summarization, and through integration of multiple sources.

While writing a survey for the UN on the topic of quantitative data cleaning, I got interested in the dirty roots of the problem: data entry. This led to our recent work on Usher [11/5/2009: Link updated to final version], a toolkit for intelligent data entry forms, led by Kuang Chen.
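To give the basic flavor of intelligent data entry: a form can learn from past submissions and flag improbable new values for re-confirmation. The toy sketch below does this with a bare frequency threshold; Usher itself learns a much richer probabilistic model across fields, so treat this as illustration only (the class name and threshold are mine, not Usher's).

```python
from collections import Counter

class FieldChecker:
    """Toy sketch: flag improbable entries for a single form field
    based on the empirical distribution of past submissions.
    (Usher learns a richer model across fields; this is only the
    basic intuition.)"""

    def __init__(self, past_values, threshold=0.05):
        self.counts = Counter(past_values)
        self.total = len(past_values)
        self.threshold = threshold

    def is_suspicious(self, value):
        # Unseen or rare values get flagged for re-confirmation.
        prob = self.counts[value] / self.total
        return prob < self.threshold

# 90 entries of "CA", 9 of "NY", one likely typo "ZZ":
checker = FieldChecker(["CA"] * 90 + ["NY"] * 9 + ["ZZ"])
print(checker.is_suspicious("CA"))  # False: common value
print(checker.is_suspicious("ZZ"))  # True: probability 0.01
```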

Read More »

Relational databases are for structured data, right? And free text lives in the world of keyword search?


Another paper we recently finished up was on Declarative Information Extraction in a Probabilistic Database System.  In a nutshell (as my buddy Minos is wont to say), this is about

  1. automatically converting free text into structured data,
  2. using a state-of-the-art machine learning technique (Conditional Random Fields), which is 
  3. coded up in a few lines of SQL that integrates with the rest of your query processing.

This is Daisy Wang‘s baby, and it’s really cool.  She’s achieved a convergence where free text, relational data and statistical models all come together in an elegant and very practical way.  
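To give a flavor of the machinery involved: inference in a linear-chain CRF boils down to Viterbi decoding over per-token label scores. Here's a toy Python sketch of that decoding step, with made-up scores (in the paper, the real learned scores live in tables and this computation is expressed in SQL inside the database, not in application code):

```python
def viterbi(tokens, labels, emit, trans):
    """Most likely label sequence under a linear-chain model.
    emit[(token, label)] and trans[(prev_label, label)] are log-scores;
    in a real CRF these come from learned feature weights.
    Missing entries default to a large penalty (-10.0)."""
    best = {l: emit.get((tokens[0], l), -10.0) for l in labels}
    back = []  # per-position backpointers
    for tok in tokens[1:]:
        new_best, choices = {}, {}
        for l in labels:
            prev = max(labels, key=lambda p: best[p] + trans.get((p, l), -10.0))
            new_best[l] = (best[prev] + trans.get((prev, l), -10.0)
                           + emit.get((tok, l), -10.0))
            choices[l] = prev
        back.append(choices)
        best = new_best
    # Trace the best path backwards from the highest-scoring final label.
    last = max(labels, key=lambda l: best[l])
    path = [last]
    for choices in reversed(back):
        path.append(choices[path[-1]])
    return list(reversed(path))

labels = ["PER", "ORG", "O"]
emit = {("Daisy", "PER"): 2.0, ("joined", "O"): 2.0, ("Berkeley", "ORG"): 2.0}
trans = {("PER", "O"): 1.0, ("O", "ORG"): 1.0}
print(viterbi(["Daisy", "joined", "Berkeley"], labels, emit, trans))
# -> ['PER', 'O', 'ORG']
```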

Read More »


Update: VLDB slides posted [pptx] [pdf]

It’s been a busy month pushing out papers. I’ll cover some of them here over the next days.

The first one I’ll mention is MAD Skills: New Analysis Practices for Big Data (link updated to VLDB version).  The paper does a few controversial things (if you’re the kind of person who finds data management a source of controversy):

  • It takes on “data warehousing” and “business intelligence” as outmoded, low-tech approaches to getting value out of Big Data. Instead, it advocates a “Magnetic, Agile, Deep” (MAD) approach to data that shifts the locus of power from what Brian Dolan calls the “DBA priesthood” to the statisticians and analysts who actually like to crunch the numbers.  This is a good thing, on many fronts.
  • It describes a state-of-the-art parallel data warehouse that sits on 800TB of disk, using 40 dual-processor dual-core Sun Thumper boxes.
  • It presents a set of general-purpose, hardcore, massively parallel statistical methods for big data.  They’re expressed in SQL (OMG!) but could be easily translated to MapReduce if that’s your bag.
  • It argues for a catholic (small-c) approach to programming Big Data, including SQL & MapReduce, Java & R, Python & Perl, etc.  If you already have a parallel database, it just shouldn’t be that hard to support all those things in a single engine.
  • It advocates a similarly catholic approach to storage.  Use your parallel filesystem, or your traditional database tables, or your compressed columnstore formats, or what have you.  These should not be standalone “technologies”, they are great features that should — no, will — get added to existing parallel data systems.  (C’mon, you know it’s true… )
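The “parallel statistics in SQL” trick largely comes down to algebraic aggregates: each segment computes a small partial state, and the states merge associatively. Here's a hedged Python sketch of that pattern for mean and variance (the paper expresses this as SQL aggregates running over database segments; the function names here are mine):

```python
from functools import reduce

def partial(chunk):
    # Per-segment state: (count, sum, sum of squares).
    n = len(chunk)
    return (n, sum(chunk), sum(x * x for x in chunk))

def merge(a, b):
    # Partial states combine by elementwise addition,
    # so segments can be aggregated in any order.
    return tuple(x + y for x, y in zip(a, b))

def finalize(state):
    n, s, ss = state
    mean = s / n
    return mean, ss / n - mean * mean  # population variance

chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]   # data spread across "segments"
mean, var = finalize(reduce(merge, map(partial, chunks)))
print(mean, var)   # 3.0 2.0
```

The same count/sum/sum-of-squares state is exactly what a parallel database ships between nodes for these aggregates, which is why such methods parallelize so cheaply, in SQL or in MapReduce.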

Read More »


The Cloud

At HPTS 2001 I gave a quick seat-of-the-pants talk called We Lose, which argued that database software and research weren’t targeting the hacker community, and were therefore dooming themselves to irrelevance.  This thing — which I cooked up in about 10 minutes — still gets me a bunch of feedback.  (The talk included a pitch for an easy-to-use dataflow framework that could harness textual data from files, as part of our original Telegraph work.  MapReduce anyone?)
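That pitch, a simple dataflow framework over textual files, is essentially what MapReduce later delivered. The canonical example, sketched in plain Python rather than any particular framework:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit a (word, 1) pair for each word in a line of text.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Group pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
print(counts["the"])  # 2
```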


This issue is decidedly back on the table as different approaches are being explored for Cloud development platforms. So I gave a similar pitch at CIDR this year, to try and get the data-centric experts to work on the most important piece of the Cloud: the programming model.  I’m hoping this time some folks other than us will bite.

On Wednesday morning there was a very interesting discussion of power management in datacenters and data management software.

James Hamilton — formerly of Microsoft and now at Amazon — gave a killer keynote talk on power in datacenters.  He traces power from the utility provider through all the steps involved in powering and cooling the machines.  He compares all the power-related costs to the costs of machines, networking and everything else involved in running a datacenter, and points out where efficiency gains can be found.  A few things amazed me:

Read More »
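Hamilton's analysis is, at heart, an amortized cost model: capital costs spread over equipment lifetime, set against the ongoing bill for power and cooling. A toy sketch of the shape of that arithmetic, with entirely hypothetical placeholder numbers (see his talk for real ones):

```python
# All figures below are hypothetical placeholders, NOT Hamilton's numbers.
server_cost = 2000.0          # purchase price per server ($)
server_lifetime_months = 36   # amortization period
power_draw_kw = 0.25          # average draw per server (kW)
pue = 1.7                     # power usage effectiveness (cooling overhead)
price_per_kwh = 0.07          # utility price ($/kWh)

hours_per_month = 730
monthly_server = server_cost / server_lifetime_months
monthly_power = power_draw_kw * pue * hours_per_month * price_per_kwh

print(f"server amortization: ${monthly_server:.2f}/month")
print(f"power + cooling:     ${monthly_power:.2f}/month")
```

Plugging in different assumptions moves the balance between capital and power costs around, which is exactly where the talk hunts for efficiency.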

CIDR 2009 is shaping up to be as interesting as I had hoped.  

If you don’t know about CIDR, it’s a conference founded by David DeWitt, Jim Gray and Mike Stonebraker 8 years back, for data systems work and other data-centric topics that would have a tough time getting accepted to the more hidebound conferences like SIGMOD and VLDB.  It’s usually a good conference — everybody stays in one room and pays attention, and it tends to attract a smart mix of old hands and young upstarts who really dig in and engage.  I ran the program this year, and was really happy with the papers and speakers we got.  Only have time to comment on a few things here.

Read More »

Some of my colleagues and I have a pretty nifty idea that we’ve submitted to SIGMOD 2009.  But I’m not gonna tell you what it is.  Because my friends at SIGMOD won’t let me.

Blogs democratize publication and reduce delays and friction in scientific dialog. Right?  A small step toward Open Notebook Science.  Too bad SIGMOD has moved in the opposite direction with double blind reviewing, bottling up ideas for months at a time. Actually,  they’d really prefer I didn’t tell you about other ideas I’m playing with that I might submit next year.  Then you might guess who I am!  As if my work didn’t give me away.

Read More »