
Was intrigued last week by the confluence of two posts:

  • Owen O’Malley and Arun Murthy of Yahoo’s Hadoop team posted about sorting a petabyte using Hadoop on 3,800 nodes.
  • Curt Monash posted that eBay hosts a 6.5 petabyte Greenplum database on 96 nodes.

Both impressive.  But wildly different hardware deployments. Why??  It’s well known that Hadoop is tuned for availability, not efficiency.  But does it really need 40x as many machines as eBay’s Greenplum cluster?  How did smart folks end up with such wildly divergent numbers?

Now there’s some modest percentage of quibbling to do about the numbers per se — Greenplum uses compression, so that 6.5 petabytes lives on “only” 4.5 petabytes of storage.  Also, the storage is not full of base data; Monash quotes 70% compression from Greenplum.  So fine — there’s only 1.95 petabytes of raw eBay bits living on those 4.5 petabytes of Greenplum disks.

Still, 40x?  I mean, Java is slower than C, and Hadoop is slower than Postgres, etc.  But 40x?

Presumably a large part of this is the difference between the Hadoop philosophy of using whitebox PCs (4 disks for 2 quad-core CPUs), and Greenplum’s use of dense servers like the Sun Thor (SunFire X4540: 48 disks for 2 quad-core CPUs).

Perhaps another part has to do with different fault tolerance approaches. Hadoop (as per the Google MapReduce paper) is wildly pessimistic, checkpointing the output of every single Map or Reduce stage to disks, before reading it right back in. (I describe this to my undergrads as the “regurgitation approach” to fault tolerance.)  By contrast, classic MPP database approaches (like Graefe‘s famous Exchange operator) are wildly optimistic and pipeline everything, requiring restarts of deep dataflow pipelines in the case of even a single fault.
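
To get a feel for how far apart those two philosophies can land, here is a toy cost model. Everything in it is made up for illustration (the stage times, the failure rates, and the simplification that a failed attempt wastes its full cost), so read it as a sketch of the tradeoff rather than a model of either system:

    // Toy model of an n-stage dataflow with per-stage run time t (minutes) and
    // per-stage failure probability p per attempt.
    //   Pessimistic (MapReduce-style): materialize after every stage at cost c,
    //     so a failure only re-runs the stage that failed.
    //   Optimistic (Exchange-style): pipeline everything with no staging cost,
    //     so any failure restarts the whole pipeline.
    // For simplicity, a failed attempt is charged its full cost.
    public class RegurgitationVsPipelining {

        // Expected time of a unit of work of length w retried until it succeeds:
        // geometric number of attempts, each succeeding with probability (1 - pFail).
        static double expectedTime(double w, double pFail) {
            return w / (1.0 - pFail);
        }

        // Pessimistic: n independent stages, each paying the checkpoint cost c.
        static double pessimistic(int n, double t, double c, double p) {
            return n * expectedTime(t + c, p);
        }

        // Optimistic: the whole pipeline succeeds only if all n stages do, so the
        // unit of work is n*t and its failure probability is 1 - (1-p)^n.
        static double optimistic(int n, double t, double p) {
            double pPipelineFails = 1.0 - Math.pow(1.0 - p, n);
            return expectedTime(n * t, pPipelineFails);
        }

        public static void main(String[] args) {
            int n = 10;        // stages in the dataflow
            double t = 60.0;   // minutes of work per stage
            double c = 6.0;    // minutes to stage output to disk and read it back
            for (double p : new double[] {0.001, 0.01, 0.05, 0.2}) {
                System.out.printf("p=%.3f  pessimistic=%.1f min  optimistic=%.1f min%n",
                        p, pessimistic(n, t, c, p), optimistic(n, t, p));
            }
        }
    }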

Chicken or Egg?  The Google MapReduce pessimistic fault model requires way more machines, but the more machines you have, the more likely you are to see a fault, which will make you pessimistic….  Even so, 40x?

I talked briefly to Owen O’Malley at Yahoo about cluster sizing, and have talked with Greenplum’s resident expert, Tim Heath, as well (GP wooed Tim away from Yahoo, actually).   Both gave rational explanations for why they ended up with what they use.  And neither seemed doctrinaire or agitated about this issue one way or the other — both Hadoop and Greenplum are getting their jobs done, and have happy reference stories. 

Still I wonder from a general point of view: how much hardware should be thrown at these problems?  What’s the sweet spot between optimism and pessimism in the software fault tolerance, given the hardware/operational/energy cost to support it?  So far all I hear are casual opinions — there’s science to do here (as both Owen and Tim agreed).

As a side note, Mehul Shah has had some important things to say on this note in the past:

  • His work on FLuX presents an alternative fault tolerance and load balancing approach: fully pipelined dataflow, but with process pairs rather than checkpointing.  It’s much trickier to build, and not necessarily cheaper in hardware overhead, but an intriguing alternative to regurgitation.
  • His more recent JouleSort benchmark tries to factor energy cost into the sorting machismo story. Unfortunately, nobody else has risen to the challenge, and it’s a single-node benchmark for now.

So a number of good research gimmes here with big potential impact on practice:

  1. Predictive Snapshots for Dataflows: It sounds wise to only play the Google regurgitation game when the cost of staging to disk is worth the expected benefit of enabling restart.  Can’t this be predicted reasonably well, so that the choice of pipelining or snapshotting is done judiciously?
  2. TCO metrics for Analytics hardware in modern datacenters: What is the right way to measure cost for these deployments, including energy consumption, rackspace, management, and so on?
  3. Energy-centric scalable benchmarks: Combine the best of JouleSort, PennySort and the Petabyte-scale work going on in the field, and get people to compete for the right metric on big data.  Scaling those benchmarks up would mean extending JouleSort and PennySort to include the purchase and energy costs of components like network switches within and across racks; a toy version of such a combined score follows this list.
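
To make the third item a bit more concrete, here is the flavor of combined score I have in mind. Every number below is an invented placeholder rather than a measurement of any real cluster:

    // Back-of-envelope scoring in the spirit of JouleSort/PennySort, extended to
    // a whole cluster (nodes plus in-rack and cross-rack switches).
    public class ClusterSortScore {
        public static void main(String[] args) {
            double records       = 1.0e10;       // records sorted in the run
            double runtimeHours  = 16.0;         // wall-clock time of the run
            int    nodes         = 3800;
            double wattsPerNode  = 350.0;        // average draw under load
            double wattsSwitches = 20000.0;      // in-rack + cross-rack switches
            double clusterPrice  = 12000000.0;   // purchase price in dollars
            double amortYears    = 3.0;          // straight-line amortization
            double dollarsPerKWh = 0.10;

            // Energy for the run, and the slice of purchase price it "uses up".
            double joules  = (nodes * wattsPerNode + wattsSwitches) * runtimeHours * 3600.0;
            double kwh     = joules / 3.6e6;
            double capex   = clusterPrice * runtimeHours / (amortYears * 365 * 24);
            double dollars = capex + kwh * dollarsPerKWh;

            System.out.printf("records/joule  = %.2f%n", records / joules);
            System.out.printf("records/dollar = %.0f%n", records / dollars);
        }
    }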

24 Comments

  1. Hi Joe,

    While more selective checkpointing makes sense for multi-job pipelines, I think MapReduce’s materialize-all-maps approach makes sense for large single MapReduce jobs like Petasort. The problem in Petasort is that you have 3600 nodes running maps, and part of each map output must be sent to each of 3600 reducers. If you didn’t materialize the outputs and a single reducer failed, you’d have to re-run *all* the maps to rebuild the part of the output that had been sent to that reducer. Having a single failure trigger a re-scan of the entire data set is unacceptable. You can actually think of the map-side materialization as building an index to avoid this re-scan.

    • When you say “you” have 3600 nodes running map, who is “you”? It’s folks running Hadoop on 3600 nodes. This is the chicken and egg problem I was talking about. If you sort on 96 nodes, do you need to land outputs on disk? If you don’t land outputs on disk, you may need fewer nodes. Etc.

      I’m not saying I know the right answer here, what I’m saying is that the choice of the answer shouldn’t frame the question. The question is: given a data set, a time bound, a budget for hardware and operations, how do you make it all spin?

        Matei (May 14, 2009 at 1:24 pm):

        You can’t sort a petabyte without accessing disk multiple times for each byte of input unless you have a petabyte of RAM. Even the Yahoo cluster had only 57 TB of RAM. I agree that the right hardware depends on the workload, and I’m not sure that Yahoo’s is optimal, but I think petasort is simply a “bigger” computational problem than typical data warehouse queries. This should be taken into account in comparing the hardware choices and the corresponding software choices.

      • Hi Matei: Both clusters have comparable numbers of disks. The question is on the 40x factor in CPUs, networking hardware, power supplies, etc.

        As to what a “typical data warehouse” query is, I don’t think I know at this scale. That eBay database might be the biggest operational data warehouse on the planet, and if the folks at FIM are any indication, people who scale that big are doing some crazy stuff.

  2. Do you have a sense for how many queries at eBay require sorting the petabyte databases? It seems that OLAP-style queries against either Greenplum or Vertica are running against databases that are intelligently laid out on disk, so that a single table (or column) scan suffices. In that case, it helps to have lots of disks, and throwing extra CPUs at the problem won’t help. In the case of the sorting benchmark, the CPU is a lot more important for running comparison operators, and you do more than one pass over the data in-core, so perhaps throwing more CPUs at the problem speeds things up?

    I assume that having a sense of the workload immensely helps eBay avoid a lot of the heavier sort-based query plans, which might explain some of the disparity. You’re absolutely right that it would be nice to see an academic comparison of the breakdown of these reasons. Thanks for an interesting entry!

  3. There is a huge difference between having a petabyte in your system and running a single sort to process it. We’ve had single clusters with 3 petabytes of triple-replicated compressed user data on them for a long time. A single job to sort a petabyte is much harder. Finishing the sort quickly is even harder. The advantage of Hadoop is that we not only can store large datasets, we can combine them with compute power. We’ve taken representative queries and a fixed hardware budget ($, not nodes) and compared a wide range of solutions, including Greenplum and Hadoop. Hadoop was not only much faster, it could process more data without blowing up.

  4. A lot of good discussion here about workload. I agree wholeheartedly — you can’t decide how much hardware you need without understanding your offered workload.

    Adam, you point out that workload is expressed by application needs, and sometimes you can satisfy those needs with less resource consumption if you’re clever. I totally agree. (An extreme version of that argument is to use sampling whenever you can.) Of course any of those smarts can be achieved via a DBMS or via MapReduce, with differing amounts of effort from the user.

    To your point, Owen: Gray’s sort benchmarks are a stress test, and I get that. But they were in part a test of whether you’ve built a balanced system. Historically, for example, they exposed that at one time Sun had over-engineered CPUs and under-engineered I/O busses. What can we conclude along those lines about the system that Yahoo! has cobbled together? Where are the bottlenecks, where are the fat spots, and is the software lean enough to expose those? Maybe a shift to PennySort and JouleSort would help us learn more about that, rather than about sheer scale in # of machines (which likely isn’t the figure of merit in this decade).

    BTW, according to Monash, eBay is storing clickstreams. So they’re probably doing some fairly holistic queries (and presumably a variety of queries) over that data. I know nothing about the GP deployment at eBay other than what I read on the web, but I’m sure it’s not a case like an email service where each shard is independent.

    In the end, this conversation is still in a realm without metrics. One thing we know for sure is that eBay and Yahoo disagreed here by an order of magnitude on the right ratio of disks to processors. 10x is a big enough number to make us stop and ask questions. Are these really both rational decisions based on different workloads? What benchmarks would help us decide if that’s true?

    PS, if somebody from eBay or Greenplum wants to weigh in with info on how that cluster gets used, would be fun to hear it.

  5. a little surprised that no one brought up that hadoop doesn’t need to checkpoint to disk. map outputs could be stored in an in-memory fs. (I think the earlier terasort run ran things this way – Owen would know best).

    this is one of the things that’s struck me as a little false in a lot of debates (people claiming that hadoop doesn’t do pipelining) – with checkpointing to an in-memory fs it comes awfully close to that. stock tmpfs may not have the best paging characteristics (could easily cause thrashing of the application) – and there might be room for a better in-memory+disk buffering system.

    • Hey Joydeep — Interesting point. I wonder what the perf difference is between an explicit producer/consumer pipeline with well-chosen buffer sizes, and a producer/consumer pipeline that buffers everything in an in-memory filesystem. Maybe not so very much?

      Even with that, for complex processing Hadoop will stage reduce results to HDFS before re-reading in the next Map. That must happen a lot in Hive settings, right?

      Clearly all this stuff is changeable and can be studied, especially in an open-source setting!

      • it’s true – Hive stages to HDFS for multi-job runs. We haven’t explored this yet – but this can be altered as well (a rough sketch of both knobs is below):
        – HDFS replication settings for intermediate data can be lowered.
        – an HDFS instance can be created over an in-memory fs for staging intermediate data (in parallel to the one that exists over the on-disk instance).
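
        roughly, with the 0.20-era JobConf API, those two knobs look like the sketch below – untested, the ramdisk path is invented, mapred.local.dir is really a cluster-wide TaskTracker setting rather than a per-job one, and whether a per-job dfs.replication is honored for a job’s output files may vary by version:

            import org.apache.hadoop.mapred.JobConf;

            public class LowDurabilityIntermediates {
                public static void main(String[] args) {
                    JobConf conf = new JobConf();   // old (0.20-era) MapReduce API

                    // (1) Map-side intermediate outputs: mapred.local.dir is where
                    //     spill files land; pointing it at a tmpfs/ramdisk mount keeps
                    //     them in RAM. (Normally set cluster-wide in mapred-site.xml;
                    //     shown here only to name the knob.)
                    conf.set("mapred.local.dir", "/mnt/ramdisk/mapred/local");

                    // (2) Inter-job intermediate outputs (e.g. between Hive stages):
                    //     lower the HDFS replication for the files this job writes,
                    //     trading durability for less disk and network traffic.
                    conf.setInt("dfs.replication", 1);

                    // ... set mapper/reducer/paths and submit as usual ...
                }
            }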

    • @Joydeep: All sorts we’ve ever entered into sortbenchmark.org have had map-outputs materialized to disk. We have had some interesting conversations around using RAM-disks for map-outputs and so on…

      • Seems like it’d be simpler/wiser to tweak the MapReduce code to take a flag for a different output mode, rather than fool the triply-replicated distributed storage system into unknowingly forgoing persistence. :-) Especially since Hive knows when it has a multi-job query and could pass that info down…

      • @jmh – We were exploring putting intermediate map-outputs into a RAM-disk, not the outputs of jobs (reduces) between pipeline stages. It clearly works for Hive/Pig to put intermediate job-outputs (part of a pipeline) on HDFS with a lower replication of 1 or 2.

        PS: I can’t seem to reply to replies here… *smile*

  6. As Owen said, there is a huge difference in the use cases between the 3,800 node and 96 node situations mentioned here.

    To make the node count comparison fair, you would probably want to compare the time it takes to load the data into Hadoop/Greenplum in addition to the time to execute the job/query. This would take into account all of the time Greenplum spent indexing the data, which Hadoop does at execution time.

    The reason for the extra nodes is speed; and as Owen mentions, because of the focus on this particular use case, Hadoop would probably dominate in the comparison I described.

    • @stu: Not sure I understand where you are going with this. Greenplum Database is pretty well unmatched for load performance — see recent press and blog discussion of our ‘Scatter/Gather Streaming’ technology where we’re getting 4TB/hr into a 40 node system at FIM, and linear scalability with each new node added.

      For this kind of use case, Greenplum doesn’t have any need for indexes. These are generally only needed when you really want to do single-row lookups.

      The data can be streamed in and processed/queried immediately. In fact we do ELT-style operations where we can aggregate or perform arbitrary parallel operations (e.g. MapReduce jobs) on the way in without first staging it to disk.

  7. Going back to the original intent of the post, I guess the problem could be generalized to a DAG scheduling problem with a failure model for tasks.

    Here’s an example of work in this area:
    “Matching and Scheduling Algorithms for Minimizing Execution Time and Failure Probability of Applications in Heterogeneous Computing,” by Atakan Dogan and Füsun Özgüner.

    Link: http://hpcnl.ece.ohio-state.edu/pdfs/match_sched_alg.pdf

  8. Well.. the problem you speak of is a little different than the one in the paper – however it involves a similar tradeoff :)

    The generalization of the problem you speak of would involve pipelined DAGs (or maybe even a simple pipeline). The scheduling choices there would involve materializing intermediate outputs vs restarting the pipeline. The twist here is that some operations (such as sort) break the pipeline anyway.

  9. At Yale, we’ve been playing around with Hadoop vs parallel databases recently (we’re currently working on a hybrid that differs from what GP/Aster/Hive do). One thing we noticed with Hadoop is that it is shockingly CPU inefficient. When running a very basic selection query over a large dataset on a laptop with just one disk, we were shocked to see the CPU chugging along at 100% utilization (this query ought to have been disk bottlenecked). It seems that Hadoop’s runtime parsing of input data is partly to blame (along with various issues with Java handling string data). Hadoop’s poor CPU efficiency might be partly to blame for the large differences in the number of CPUs in the different clusters (yahoo vs ebay).

    Wrt pipelining vs materialization (the premise of the original blog post, and of Joydeep’s comment): a Yale undergraduate who took the database architecture and implementation class I teach every spring played around with storing Map output in an in-memory fs. He found this had limited effect on performance; however, this is partly because it is hard to create queries where Hadoop’s pessimistic checkpointing (to borrow Joe’s language) is the performance bottleneck. Some might argue that “sort” is a little bit unusual in that the Map task does zero filtering (the output size is the same as the input size).

    • Materialization of temps may or may not affect throughput, depending on bottlenecks. But it certainly affects resource/energy consumption…

    • Just found this interesting discussion today. We developed another open source system called Sector/Sphere (sector.sf.net), which, despite many differences, is similar to Hadoop in principle. Sector uses the pipelining strategy, is implemented in C++, and has a simpler storage system. Various benchmarks (including sort) show that Sector can be 2 – 4 times faster than Hadoop.

  10. Parsing strings, especially for numbers, is expensive in any language (just try loading a CSV file in any DBMS). It shouldn’t be surprising that it is slow in Hadoop. However, Hadoop doesn’t force you to use plain text files for all data — if you look at how projects like Hive are used, they use compressed SequenceFiles, and even column-oriented compression now. Hadoop lets you choose the storage format that is most suitable for your data.

    The reason you see a lot of people talk about using Hadoop on plain text data is that many data sets come into a system that way and are not queried often enough to justify converting to anything else. A perfect example is web search. Every night, you might crawl a number of web pages and end up with raw text data. You then scan these pages to update an index, and never do anything else with them. It wouldn’t make sense to convert the data to some optimized binary format first if you can just run your analysis on the raw text files. Other web backend applications, such as spam detection, have the same “run-once” property (you never check a new email twice to see whether it’s spam). For these “run-once” applications, Hadoop provides a convenient means to run over raw data and get your answer quickly without wasting time on building indexes, packing the data into your database’s favorite storage format, etc. From my understanding, Greenplum and others are starting to support similar “run over raw data” functionality.
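
    For what it’s worth, configuring a job to read and write block-compressed SequenceFiles with the old mapred API looks roughly like the sketch below. It’s untested and from memory; the paths and key/value classes are made up, and exact class names may differ across Hadoop versions:

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.DefaultCodec;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.SequenceFileInputFormat;
        import org.apache.hadoop.mapred.SequenceFileOutputFormat;

        public class SequenceFileJob {
            public static void main(String[] args) throws Exception {
                JobConf conf = new JobConf(SequenceFileJob.class);

                // Read binary SequenceFiles instead of parsing text at runtime.
                conf.setInputFormat(SequenceFileInputFormat.class);

                // Write block-compressed SequenceFiles rather than plain text.
                conf.setOutputFormat(SequenceFileOutputFormat.class);
                conf.setOutputKeyClass(Text.class);
                conf.setOutputValueClass(LongWritable.class);
                FileOutputFormat.setCompressOutput(conf, true);
                FileOutputFormat.setOutputCompressorClass(conf, DefaultCodec.class);
                SequenceFileOutputFormat.setOutputCompressionType(
                        conf, SequenceFile.CompressionType.BLOCK);

                // Made-up paths; mapper/reducer classes omitted for brevity.
                FileInputFormat.setInputPaths(conf, new Path("/data/clicks-seq"));
                FileOutputFormat.setOutputPath(conf, new Path("/out/clicks-agg"));

                JobClient.runJob(conf);
            }
        }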

  11. I agree 100% with what you say about tradeoffs for one-time use vs. repeated use. Still, I was surprised to see Hadoop saturate the CPU for such a simple task (even if the data wasn’t in optimized storage format).

    • Most of the commercial databases have free downloads; would be easy enough to benchmark CSV load times…
