Papers are being solicited for the ACM’s new symposium on cloud computing (SOCC) — and they’re due pretty soon, January 15. Both research and industrial papers are welcome. The folks involved (present company excepted) are really strong, and we expect to have some very interesting invited speakers as well. Interesting enough to entice folks to Indianapolis!

The best part of these smaller symposia is the give-and-take of people in the room talking about each other’s work. So send in your best ideas and plan to come.

More info including the call for papers at http://research.microsoft.com/socc2010

Hadoop MapReduce is a batch-processing system.  Why?  Because that’s the way Google described their MapReduce implementation.

But it doesn’t have to be that way. Introducing HOP: the Hadoop Online Prototype. With modest changes to the structure of Hadoop, we were able to convert it from a batch-processing system to an interactive, online system that can provide features like “early returns” from big jobs, and continuous data stream processing, while preserving the simple MapReduce programming and fault tolerance models popularized by Google and Hadoop.  And by the way, it exposes pipeline parallelism that can even make batch jobs finish faster.  This is a project led by Tyson Condie, in collaboration with folks at Berkeley and Yahoo! Research.
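To make the "early returns" idea concrete, here is a minimal single-process sketch in Python. It is not HOP's code: the function names (`map_words`, `online_reduce`) and the `snapshot_every` parameter are hypothetical, and generators stand in for the pipelined map-to-reduce data flow. The point is only that a reducer fed incrementally can publish running snapshots before the job finishes.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def map_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) pairs one at a time instead of materializing them."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def online_reduce(
    pairs: Iterable[Tuple[str, int]], snapshot_every: int
) -> Iterator[Tuple[str, Dict[str, int]]]:
    """Reduce step that consumes pairs as they arrive.

    Yields an ("early", counts) snapshot after every `snapshot_every` pairs,
    then a single ("final", counts) result once the input is exhausted.
    """
    counts: Counter = Counter()
    seen = 0
    for key, value in pairs:
        counts[key] += value
        seen += 1
        if seen % snapshot_every == 0:
            yield "early", dict(counts)
    yield "final", dict(counts)


# Toy input (an assumption for illustration only).
lines = ["the quick brown fox", "the lazy dog", "the end"]
results = list(online_reduce(map_words(lines), snapshot_every=4))

for kind, snapshot in results:
    print(kind, snapshot)
```

Each early snapshot is an estimate based on the input seen so far; only the final result is exact. A real system also has to deal with partitioning, spilling, and fault tolerance, none of which this sketch attempts.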

Thanks to Boon Thau Loo and Stefan Saroiu for a very interesting workshop on Networking Meets Databases (NetDB), and especially for inviting a high-octane panel to debate the success and directions of Declarative Networking.

The panel members included:

  • Fred Baker, Cisco
  • Joe Hellerstein, Berkeley
  • Eddie Kohler, UCLA and Meraki
  • Arvind Krishnamurthy, U Washington
  • Petros Maniatis, Intel Research
  • Timothy Roscoe, ETH Zurich

Butler Lampson made numerous comments from the audience, and given his insight and stature was viewed by most as something of an additional panelist.

I was happy to see a very vigorous debate!  Lots of interesting points made, no punches pulled.  My slides are posted here, and include an ad hoc manifesto for how to move forward.

Agreement Protocol

Headline: We now have a robust declarative implementation of MultiPaxos with leader election, which is radically simpler than most existing implementations.  It’s compact, surprisingly readable (as Paxos implementations go!) and live.  It forms a key part of our BOOM Analytics implementation of a high-availability Hadoop File System.

Maybe more interesting are the lessons we learned about how distributed protocols and declarative languages go together, and the design patterns that emerged.  We’re using this to ground the design of our new language, code-named Lincoln.  A paper on the topic is being presented this Wednesday at NetDB 2009, after SOSP.

For the last year or so, my team at Berkeley — in collaboration with Yahoo Research — has been undertaking an aggressive experiment in programming.  The challenge is to design a radically easier programming model for infrastructure and applications in the next computing platform: The Cloud.  We call this the Berkeley Orders Of Magnitude (BOOM) project: enabling programmers to develop OOM bigger systems in OOM less code.

To kick this off we built something we call BOOM Analytics [link updated to new version]: a clone of Hadoop and HDFS built largely in Overlog, a declarative language we developed some years back for network protocols.  BOOM Analytics is just as fast and scalable as Hadoop, but radically simpler in its structure.  As a result we were able — with amazingly little effort — to turbocharge our incarnation of the elephant with features that would be enormous upgrades to Hadoop’s Java codebase.  Two of the fanciest are: Read More »

 

I just heard through the Berkeley grapevine about the BashReduce effort at Last.fm: MapReduce in 126 lines of bash script! Awesome. I’m sure it doesn’t do X, Y and Z. So ask yourself: do you need X? Y? Z? Maybe instead you want V and W. Maybe you should roll your own tool.  

 

Makes you think.

Was intrigued last week by the confluence of two posts:

  • Owen O’Malley and Arun Murthy of Yahoo’s Hadoop team posted about sorting a petabyte using Hadoop on 3,800 nodes.
  • Curt Monash posted that eBay hosts a 6.5 petabyte Greenplum database on 96 nodes.

Both impressive.  But wildly different hardware deployments. Why??  It’s well known that Hadoop is tuned for availability, not efficiency.  But does it really need 40x as many machines as eBay’s Greenplum cluster?  How did smart folks end up with such wildly divergent numbers?
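For the record, here is the back-of-the-envelope arithmetic behind that "40x", as a small Python sketch. The node counts and data sizes come from the two posts above; the 1 PB = 1000 TB conversion and the variable names are my own assumptions. Keep in mind that one figure is a sort benchmark and the other a hosted database, so this compares deployments, not performance.

```python
TB_PER_PB = 1000  # assumption: decimal petabytes

# Figures quoted in the two posts.
hadoop_nodes, hadoop_pb = 3800, 1.0       # petabyte sort on Hadoop
greenplum_nodes, greenplum_pb = 96, 6.5   # eBay's Greenplum database

# How many times more machines the Hadoop deployment uses.
node_ratio = hadoop_nodes / greenplum_nodes

# Data volume handled per node in each deployment.
hadoop_tb_per_node = hadoop_pb * TB_PER_PB / hadoop_nodes
greenplum_tb_per_node = greenplum_pb * TB_PER_PB / greenplum_nodes

print(f"node ratio:         {node_ratio:.1f}x")
print(f"Hadoop TB/node:     {hadoop_tb_per_node:.2f}")
print(f"Greenplum TB/node:  {greenplum_tb_per_node:.1f}")
```

So the machine-count gap is a bit under 40x, and the gap in data per node is far larger than that.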

[Update 11/5/2009: the first paper on Usher will appear in the ICDE 2010 conference.]

Data quality is a big, ungainly problem that gets too little attention in computing research and the technology press. Databases pick up “bad data” (errors, omissions, inconsistencies of various kinds) all through their lifecycle, from initial data entry/acquisition through data transformation and summarization, and through integration of multiple sources.

While writing a survey for the UN on the topic of quantitative data cleaning, I got interested in the dirty roots of the problem: data entry. This led to our recent work on Usher [11/5/2009: Link updated to final version], a toolkit for intelligent data entry forms, led by Kuang Chen.

Relational databases are for structured data, right? And free text lives in the world of keyword search?

Well.  

Another paper we recently finished up was on Declarative Information Extraction in a Probabilistic Database System.  In a nutshell (as my buddy Minos is wont to say), this is about

  1. automatically converting free text into structured data,
  2. using a state-of-the-art machine learning technique (Conditional Random Fields), which is
  3. coded up in a few lines of SQL that integrates with the rest of your query processing.

This is Daisy Wang’s baby, and it’s really cool.  She’s achieved a convergence where free text, relational data and statistical models all come together in an elegant and very practical way.  

Update: VLDB slides posted [pptx] [pdf]

It’s been a busy month pushing out papers. I’ll cover some of them here over the next few days.

The first one I’ll mention is MAD Skills: New Analysis Practices for Big Data (link updated to VLDB version).  The paper does a few controversial things (if you’re the kind of person who finds data management a source of controversy):

  • It takes on “data warehousing” and “business intelligence” as outmoded, low-tech approaches to getting value out of Big Data. Instead, it advocates a “Magnetic, Agile, Deep” (MAD) approach to data that shifts the locus of power from what Brian Dolan calls the “DBA priesthood” to the statisticians and analysts who actually like to crunch the numbers.  This is a good thing, on many fronts.
  • It describes a state-of-the-art parallel data warehouse that sits on 800TB of disk, using 40 dual-processor dual-core Sun Thumper boxes.
  • It presents a set of general-purpose, hardcore, massively parallel statistical methods for big data.  They’re expressed in SQL (OMG!) but could be easily translated to MapReduce if that’s your bag.
  • It argues for a catholic (small-c) approach to programming Big Data, including SQL & MapReduce, Java & R, Python & Perl, etc.  If you already have a parallel database, it just shouldn’t be that hard to support all those things in a single engine.
  • It advocates a similarly catholic approach to storage.  Use your parallel filesystem, or your traditional database tables, or your compressed columnstore formats, or what have you.  These should not be standalone “technologies”, they are great features that should — no, will — get added to existing parallel data systems.  (C’mon, you know it’s true… )
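To give a flavor of the "statistics in SQL" point in the list above, here is a minimal sketch of simple linear regression where the database does all the data-touching work. It is not the paper's code: Python's built-in `sqlite3` stands in for a parallel database, and the `points` table and its four rows are made up for illustration. The SQL computes a handful of sums, and the closed-form least-squares fit falls out of those.

```python
import sqlite3

# In-memory database with a hypothetical table of (x, y) observations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(1, 3), (2, 5), (3, 7), (4, 9)],  # toy data lying on y = 2x + 1
)

# One pass over the data: the sufficient statistics for a least-squares line.
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM points"
).fetchone()

# Closed-form ordinary least squares for y = slope * x + intercept.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

print(f"y = {slope:.2f} * x + {intercept:.2f}")
conn.close()
```

The reason this style scales is that `COUNT` and `SUM` can be computed per partition and then added together, so the same query shape works whether the table lives on one node or many. The same per-partition sums could be written as a map step with a summing reduce.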
