Category Archives: map reduce

oscilloHadoop MapReduce is a batch-processing system.  Why?  Because that’s the way Google described their MapReduce implementation.

But it doesn’t have to be that way. Introducing HOP: the Hadoop Online Prototype. With modest changes to the structure of Hadoop, we were able to convert it from a batch-processing system to an interactive, online system that can provide features like “early returns” from big jobs, and continuous data stream processing, while preserving the simple MapReduce programming and fault tolerance models popularized by Google and Hadoop.  And by the way, it exposes pipeline parallelism that can even make batch jobs finish faster.  This is a project led by Tyson Condie, in collaboration with folks at Berkeley and Yahoo! Research.

Read More »

lightning

For the last year or so, my team at Berkeley — in collaboration with Yahoo Research — has been undertaking an aggressive experiment in programming.  The challenge is to design a radically easier programming model for infrastructure and applications in the next computing platform: The Cloud.  We call this the Berkeley Orders Of Magnitude (BOOM) project: enabling programmers to develop OOM bigger systems in OOM less code.

To kick this off we built something we call BOOM Analytics [link updated to new version]: a clone of Hadoop and HDFS built largely in Overlog, a declarative language we developed some years back for network protocols.  BOOM Analytics is just as fast and scalable as Hadoop, but radically simpler in its structure.  As a result we were able — with amazingly little effort — to turbocharge our incarnation of the elephant with features that would be enormous upgrades to Hadoop’s Java codebase.  Two of the fanciest are: Read More »

 

stripped down VW

I just heard through the Berkeley grapevine about the BashReduce effort at Last.fm: MapReduce in 126 lines of bash script! Awesome. I’m sure it doesn’t do X, Y and Z. So ask yourself: do you need X? Y? Z? Maybe instead you want V and W. Maybe you should roll your own tool.  

 

Makes you think.

428397739_e5ac735923_bWas intrigued last week by the confluence of two posts:

  • Owen O’Malley and Arun Murthy of Yahoo’s Hadoop team posted about sorting a petabyte using Hadoop on 3,800 nodes.
  • Curt Monash posted that eBay hosts a 6.5 petabyte Greenplum database on 96 nodes

Both impressive.  But wildly different hardware deployments. Why??  It’s well known that Hadoop is tuned for availability not efficiency.  But does it really need 40x the number of machines as eBay’s Greenplum cluster?  How did smart folks end up with such wildly divergent numbers?

Read More »

1463574952_dd400430e5

Update: VLDB slides posted [pptx] [pdf]

It’s been a busy month pushing out papers. I’ll cover some of them here over the next days.

The first one I’ll mention is MAD Skills: New Analysis Practices for Big Data (link updated to VLDB version).  The paper does a few controversial things (if you’re the kind of person who finds data management a source of controversy):

  • It takes on “data warehousing” and “business intelligence” as outmoded, low-tech approaches to getting value out of Big Data. Instead, it advocates a “Magnetic, Agile, Deep” (MAD) approach to data, that shifts the locus of power from what Brian Dolan calls the “DBA priesthood” to the statisticians and analysts who actually like to crunch the numbers.  This is a good thing, on many fronts.
  • It describes a state-of-the-art parallel data warehouse that sits on 800TB of disk, using 40 dual-processor dual-core Sun Thumper boxes.
  • It presents a set of general-purpose, hardcore, massively parallel statistical methods for big data.  They’re expressed in SQL (OMG!) but could be easily translated to MapReduce if that’s your bag.
  • It argues for a catholic (small-c) approach to programming Big Data, including SQL & MapReduce, Java & R, Python & Perl, etc.  If you already have a parallel database, it just shouldn’t be that hard to support all those things in a single engine.
  • It advocates a similarly catholic approach to storage.  Use your parallel filesystem, or your traditional database tables, or your compressed columnstore formats, or what have you.  These should not be standalone “technologies”, they are great features that should — no, will — get added to existing parallel data systems.  (C’mon, you know it’s true… )

Read More »

 

The Cloud

At HPTS 2001 I gave a quick seat-of-the-pants talk called We Lose, which argued that database software and research wasn’t targeting the hacker community, and therefore was dooming itself to irrelevance.  This thing — which I cooked up in about 10 minutes — still gets me a bunch of feedback.  (The talk included a pitch for an easy-to-use dataflow framework that could harness textual data from files, as part of our original Telegraph work.  MapReduce anyone?)

 

This issue is decidedly back on the table as different approaches are being explored for Cloud development platforms. So I gave a similar pitch at CIDR this year, to try and get the data-centric experts to work on the most important piece of the Cloud: the programming model.  I’m hoping this time some folks other than us will bite.