Category Archives: database

Papers are being solicited for the ACM’s new symposium on cloud computing (SOCC) — and they’re due pretty soon, January 15. Both research and industrial papers are welcome. The folks involved (present company excepted) are really strong, and we expect to have some very interesting invited speakers as well. Interesting enough to entice folks to Indianapolis!

The best parts of these smaller symposia are the give-and-take of people in the room talking about each other’s work. So send in your best ideas and plan to come.

More info including the call for papers at http://research.microsoft.com/socc2010

oscilloHadoop MapReduce is a batch-processing system.  Why?  Because that’s the way Google described their MapReduce implementation.

But it doesn’t have to be that way. Introducing HOP: the Hadoop Online Prototype. With modest changes to the structure of Hadoop, we were able to convert it from a batch-processing system to an interactive, online system that can provide features like “early returns” from big jobs, and continuous data stream processing, while preserving the simple MapReduce programming and fault tolerance models popularized by Google and Hadoop.  And by the way, it exposes pipeline parallelism that can even make batch jobs finish faster.  This is a project led by Tyson Condie, in collaboration with folks at Berkeley and Yahoo! Research.

Read More »

Agreement Protocol

Headline: We now have a robust declarative implementation of MultiPaxos with leader election, which is radically simpler than most existing implementations.  It’s compact, suprisingly readable (as Paxos implementations go!) and live.  It forms a key part of our Boom Analytics implementation of a high-availability Hadoop File System.

Maybe more interesting are the lessons we learned about how distributed protocols and declarative languages go together, and the design patterns that emerged.  We’re using this to ground the design of our new language, code-name Lincoln.  A paper on the topic is being presented this Wednesday at NetDB 2009, after SOSP.

Read More »

lightning

For the last year or so, my team at Berkeley — in collaboration with Yahoo Research — has been undertaking an aggressive experiment in programming.  The challenge is to design a radically easier programming model for infrastructure and applications in the next computing platform: The Cloud.  We call this the Berkeley Orders Of Magnitude (BOOM) project: enabling programmers to develop OOM bigger systems in OOM less code.

To kick this off we built something we call BOOM Analytics [link updated to new version]: a clone of Hadoop and HDFS built largely in Overlog, a declarative language we developed some years back for network protocols.  BOOM Analytics is just as fast and scalable as Hadoop, but radically simpler in its structure.  As a result we were able — with amazingly little effort — to turbocharge our incarnation of the elephant with features that would be enormous upgrades to Hadoop’s Java codebase.  Two of the fanciest are: Read More »

428397739_e5ac735923_bWas intrigued last week by the confluence of two posts:

  • Owen O’Malley and Arun Murthy of Yahoo’s Hadoop team posted about sorting a petabyte using Hadoop on 3,800 nodes.
  • Curt Monash posted that eBay hosts a 6.5 petabyte Greenplum database on 96 nodes

Both impressive.  But wildly different hardware deployments. Why??  It’s well known that Hadoop is tuned for availability not efficiency.  But does it really need 40x the number of machines as eBay’s Greenplum cluster?  How did smart folks end up with such wildly divergent numbers?

Read More »

1463574952_dd400430e5

[Update 11/5/2009: the first paper on Usher will appear in the ICDE 2010 conference.]

Data quality is a big, ungainly problem that gets too little attention in computing research and the technology press. Databases pick up “bad data” – errors, omissions, inconsistencies of various kinds — all through their lifecycle, from initial data entry/acquisition through data transformation and summarization, and through integration of multiple sources.

While writing a survey for the UN on the topic of quantitative data cleaning, I got interested in the dirty roots of the problem: data entry. This led to our recent work on Usher [11/5/2009: Link updated to final version], a toolkit for intelligent data entry forms, led by Kuang Chen.

Read More »

1463574952_dd400430e5_m2Relational databases are for structured data, right? And free text lives in the world of keyword search?

Well.  

Another paper we recently finished up was on Declarative Information Extraction in a Probabilistic Database System.  In a nutshell (as my buddy Minos is wont to say), this is about

  1. automatically converting free text into structured data,
  2. using the state of the art machine learning technique (Conditional Random Fields), which is 
  3. coded up in a few lines of SQL that integrates with the rest of your query processing.

This is Daisy Wang’s baby, and it’s really cool.  She’s achieved a convergence where free text, relational data and statistical models all come together in an elegant and very practical way.  

Read More »

1463574952_dd400430e5

Update: VLDB slides posted [pptx] [pdf]

It’s been a busy month pushing out papers. I’ll cover some of them here over the next days.

The first one I’ll mention is MAD Skills: New Analysis Practices for Big Data (link updated to VLDB version).  The paper does a few controversial things (if you’re the kind of person who finds data management a source of controversy):

  • It takes on “data warehousing” and “business intelligence” as outmoded, low-tech approaches to getting value out of Big Data. Instead, it advocates a “Magnetic, Agile, Deep” (MAD) approach to data, that shifts the locus of power from what Brian Dolan calls the “DBA priesthood” to the statisticians and analysts who actually like to crunch the numbers.  This is a good thing, on many fronts.
  • It describes a state-of-the-art parallel data warehouse that sits on 800TB of disk, using 40 dual-processor dual-core Sun Thumper boxes.
  • It presents a set of general-purpose, hardcore, massively parallel statistical methods for big data.  They’re expressed in SQL (OMG!) but could be easily translated to MapReduce if that’s your bag.
  • It argues for a catholic (small-c) approach to programming Big Data, including SQL & MapReduce, Java & R, Python & Perl, etc.  If you already have a parallel database, it just shouldn’t be that hard to support all those things in a single engine.
  • It advocates a similarly catholic approach to storage.  Use your parallel filesystem, or your traditional database tables, or your compressed columnstore formats, or what have you.  These should not be standalone “technologies”, they are great features that should — no, will — get added to existing parallel data systems.  (C’mon, you know it’s true… )

Read More »

I’m increasingly believing my own story that data-centric programming is the future of parallel computing at the high end. I’m starting to hear it echoed back at me from real people.

I attended the Greenplum customer advisory board meeting this week, including a public briefing in San Francisco for analysts and potential customers.  The Greenplum folks asked me to speak at the briefing about parallelism and analytics in the large, outside the scope of Greenplum per se.  I cooked up a little slide deck for the occasion on why and whither parallelism and analytics.  A familiar story about how the future is parallel, and the practical future is dataflow parallelism. (Familiar yes, but with some nice Flickr clip-art and approachable analogies to explain it.)

The big aha moment occured for me during our panel discussion, which included Luke Lonergan from Greenplum, Roger Magoulas from O’Reilly, and Brian Dolan from Fox Interactive Media (which runs MySpace among other web properties).  

Read More »

 

The Cloud

At HPTS 2001 I gave a quick seat-of-the-pants talk called We Lose, which argued that database software and research wasn’t targeting the hacker community, and therefore was dooming itself to irrelevance.  This thing — which I cooked up in about 10 minutes — still gets me a bunch of feedback.  (The talk included a pitch for an easy-to-use dataflow framework that could harness textual data from files, as part of our original Telegraph work.  MapReduce anyone?)

 

This issue is decidedly back on the table as different approaches are being explored for Cloud development platforms. So I gave a similar pitch at CIDR this year, to try and get the data-centric experts to work on the most important piece of the Cloud: the programming model.  I’m hoping this time some folks other than us will bite.