- Owen O’Malley and Arun Murthy of Yahoo’s Hadoop team posted about sorting a petabyte using Hadoop on 3,800 nodes.
- Curt Monash posted that eBay hosts a 6.5-petabyte Greenplum database on 96 nodes.
Both impressive. But wildly different hardware deployments. Why? It's well known that Hadoop is tuned for availability, not efficiency. But does it really need 40x as many machines as eBay's Greenplum cluster? How did smart folks end up with such wildly divergent numbers?
Now there's some modest percentage of quibbling to do about the numbers per se. Greenplum uses compression, so that 6.5 petabytes lives on "only" 4.5 petabytes of storage. Also, the storage is not full of base data: Monash quotes 70% compression from Greenplum. So fine: there are really only 1.95 petabytes of actual eBay bits living on those 4.5 petabytes of Greenplum disks.
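For concreteness, here is that quibble arithmetic as a tiny Python sketch. Reading "70% compression" to mean the compressed data is 30% of its original size is my assumption:

```python
logical_pb = 6.5         # database size as Monash reports it
physical_pb = 4.5        # disk capacity it lives on
compression = 0.70       # fraction of bytes squeezed out (assumed reading)

stored_pb = logical_pb * (1 - compression)
print(f"{stored_pb:.2f} PB of actual bits on disk")        # 1.95 PB
print(f"{stored_pb / physical_pb:.0%} of disk capacity")   # ~43%
```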
Still, 40x? I mean, Java is slower than C, and Hadoop is slower than Postgres, etc. But 40x?
Presumably a large part of this is the difference between the Hadoop philosophy of using whitebox PCs (4 disks per 2 quad-core CPUs) and Greenplum's preference for dense servers like Sun's Thor (the SunFire X4540: 48 disks per 2 quad-core CPUs).
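A back-of-envelope comparison of spindle counts (mine, using just the disk-per-box figures above) suggests density alone eats most of the gap:

```python
hadoop_nodes, hadoop_disks_each = 3800, 4    # whitebox PCs
gp_nodes, gp_disks_each = 96, 48             # SunFire X4540 "Thor" servers

print(f"{hadoop_nodes / gp_nodes:.0f}x more nodes")    # ~40x
disk_ratio = (hadoop_nodes * hadoop_disks_each) / (gp_nodes * gp_disks_each)
print(f"{disk_ratio:.1f}x more disks")                 # ~3.3x
```

Counted in spindles rather than boxes, the gap is closer to 3x than 40x, which still leaves a factor of ten or so to explain.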
Perhaps another part has to do with different fault tolerance approaches. Hadoop (as per the Google MapReduce paper) is wildly pessimistic, checkpointing the output of every single Map or Reduce stage to disk before reading it right back in. (I describe this to my undergrads as the "regurgitation approach" to fault tolerance.) By contrast, classic MPP database approaches (like Graefe's famous Exchange operator) are wildly optimistic and pipeline everything, requiring restarts of deep dataflow pipelines in the case of even a single fault.
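To make the contrast concrete, here is a toy two-stage dataflow in plain Python (not Hadoop or MPP code). The pessimistic version regurgitates the intermediate result through disk so a downstream fault could restart from the file; the optimistic version pipelines everything and keeps no restartable state:

```python
import os
import pickle
import tempfile

def stage_a(records):
    for r in records:          # a stand-in for a Map-like stage
        yield r * 2

def stage_b(records):
    for r in records:          # a stand-in for a Reduce-like stage
        yield r + 1

def run_pessimistic(records):
    """Regurgitation: checkpoint stage A's output to disk, read it right back."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        pickle.dump(list(stage_a(records)), f)   # materialize stage A
    with open(path, "rb") as f:
        staged = pickle.load(f)                  # a fault below restarts here
    os.remove(path)
    return list(stage_b(staged))

def run_optimistic(records):
    """Exchange-style pipelining: no intermediate state; any fault means rerun."""
    return list(stage_b(stage_a(records)))

assert run_pessimistic(range(5)) == run_optimistic(range(5))
```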
Chicken or Egg? The Google MapReduce pessimistic fault model requires way more machines, but the more machines you have, the more likely you are to see a fault, which makes you pessimistic... Even so, 40x?
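The feedback loop is easy to quantify if you assume node failures are independent; the per-node failure rate below is made up purely for illustration:

```python
# P(job sees at least one fault) = 1 - (1 - p)**n for n independent nodes.
p = 0.001   # hypothetical chance a given node fails during one job
for n in (96, 3800):
    print(f"{n:>5} nodes: {1 - (1 - p) ** n:.0%} chance of a fault")
# prints roughly 9% for 96 nodes, 98% for 3800 nodes
```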
I talked briefly with Owen O'Malley at Yahoo about cluster sizing, and with Greenplum's resident expert Tim Heath as well (GP wooed Tim away from Yahoo, actually). Both gave rational explanations for why they ended up with what they use, and neither seemed doctrinaire or agitated about the issue one way or the other: both Hadoop and Greenplum are getting their jobs done, and both have happy reference stories.
Still, I wonder from a general point of view: how much hardware should be thrown at these problems? What's the sweet spot between optimism and pessimism in software fault tolerance, given the hardware, operational, and energy costs required to support each choice? So far all I hear are casual opinions; there's science to do here (as both Owen and Tim agreed).
As a side note, Mehul Shah has had some important things to say on this topic in the past:
- His work on FLuX presents an alternative fault tolerance and load balancing approach: fully pipelined dataflow, but with process pairs rather than checkpointing. It’s much trickier to build, and not necessarily cheaper in hardware overhead, but an intriguing alternative to regurgitation.
- His more recent JouleSort benchmark tries to factor energy cost into the sorting machismo story. Unfortunately, nobody else has risen to the challenge, and it’s a single-node benchmark for now.
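For calibration, JouleSort's figure of merit is (as I understand it) records sorted per joule, where energy is just average power draw times elapsed time. The numbers below are invented purely to show the arithmetic:

```python
# Hypothetical sort run: all three inputs are made-up illustration values.
records = 1_000_000_000          # e.g., a sort of a billion 100-byte records
watts, seconds = 250.0, 1800.0   # average power draw and elapsed time
joules = watts * seconds
print(f"{records / joules:,.0f} records sorted per joule")   # ~2,222
```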
So there are a number of good research gimmes here, with big potential impact on practice:
- Predictive Snapshots for Dataflows: It sounds wise to play the Google regurgitation game only when the cost of staging to disk is worth the expected benefit of enabling restart. Can't this be predicted reasonably well, so that the choice between pipelining and snapshotting is made judiciously? (A minimal sketch of the decision rule follows this list.)
- TCO metrics for Analytics hardware in modern datacenters: What is the right way to measure cost for these deployments, including energy consumption, rackspace, management, and so on?
- Energy-centric scalable benchmarks: Combine the best of JouleSort, PennySort, and the petabyte-scale work going on in the field, and get people to compete on the right metric for big data. Scaling up would mean extending JouleSort and PennySort to include the purchase and energy costs of components like network switches within and across racks.
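On the first bullet, here is a minimal sketch of the snapshot decision rule, assuming we can estimate three quantities per stage. The numeric inputs are stand-ins, and producing good estimates of them is exactly the open research question:

```python
def should_snapshot(staging_cost_s, recompute_cost_s, p_fault_downstream):
    """Checkpoint a stage's output only if the expected recomputation time
    saved on a fault exceeds the certain cost of staging to disk."""
    return p_fault_downstream * recompute_cost_s > staging_cost_s

# Example: 60s of disk staging guarding an hour of upstream work.
print(should_snapshot(60.0, 3600.0, 0.05))   # True: 180s expected savings
print(should_snapshot(60.0, 3600.0, 0.01))   # False: 36s expected savings
```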