Was intrigued last week by the confluence of two posts:
- Owen O’Malley and Arun Murthy of Yahoo’s Hadoop team posted about sorting a petabyte using Hadoop on 3,800 nodes.
- Curt Monash posted that eBay hosts a 6.5 petabyte Greenplum database on 96 nodes
Both impressive. But wildly different hardware deployments. Why?? It’s well known that Hadoop is tuned for availability not efficiency. But does it really need 40x the number of machines as eBay’s Greenplum cluster? How did smart folks end up with such wildly divergent numbers?