Category Archives: greenplum

Was intrigued last week by the confluence of two posts:

  • Owen O’Malley and Arun Murthy of Yahoo’s Hadoop team posted about sorting a petabyte using Hadoop on 3,800 nodes.
  • Curt Monash posted that eBay hosts a 6.5 petabyte Greenplum database on 96 nodes.

Both impressive.  But wildly different hardware deployments. Why??  It’s well known that Hadoop is tuned for availability, not efficiency.  But does it really need 40x as many machines as eBay’s Greenplum cluster?  How did smart folks end up with such wildly divergent numbers?


Update: VLDB slides posted [pptx] [pdf]

It’s been a busy month pushing out papers. I’ll cover some of them here over the next few days.

The first one I’ll mention is MAD Skills: New Analysis Practices for Big Data (link updated to VLDB version).  The paper does a few controversial things (if you’re the kind of person who finds data management a source of controversy):

  • It takes on “data warehousing” and “business intelligence” as outmoded, low-tech approaches to getting value out of Big Data. Instead, it advocates a “Magnetic, Agile, Deep” (MAD) approach to data that shifts the locus of power from what Brian Dolan calls the “DBA priesthood” to the statisticians and analysts who actually like to crunch the numbers.  This is a good thing, on many fronts.
  • It describes a state-of-the-art parallel data warehouse that sits on 800TB of disk, using 40 dual-processor dual-core Sun Thumper boxes.
  • It presents a set of general-purpose, hardcore, massively parallel statistical methods for big data.  They’re expressed in SQL (OMG!) but could be easily translated to MapReduce if that’s your bag.
  • It argues for a catholic (small-c) approach to programming Big Data, including SQL & MapReduce, Java & R, Python & Perl, etc.  If you already have a parallel database, it just shouldn’t be that hard to support all those things in a single engine.
  • It advocates a similarly catholic approach to storage.  Use your parallel filesystem, or your traditional database tables, or your compressed columnstore formats, or what have you.  These should not be standalone “technologies”, they are great features that should — no, will — get added to existing parallel data systems.  (C’mon, you know it’s true… )


I’m increasingly believing my own story that data-centric programming is the future of parallel computing at the high end. I’m starting to hear it echoed back at me from real people.

I attended the Greenplum customer advisory board meeting this week, including a public briefing in San Francisco for analysts and potential customers.  The Greenplum folks asked me to speak at the briefing about parallelism and analytics in the large, outside the scope of Greenplum per se.  I cooked up a little slide deck for the occasion on why and whither parallelism and analytics.  A familiar story about how the future is parallel, and the practical future is dataflow parallelism. (Familiar yes, but with some nice Flickr clip-art and approachable analogies to explain it.)

The big aha moment occurred for me during our panel discussion, which included Luke Lonergan from Greenplum, Roger Magoulas from O’Reilly, and Brian Dolan from Fox Interactive Media (which runs MySpace among other web properties).
