<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Diverging views on Big Data density, and some gimmes</title>
	<atom:link href="http://databeta.wordpress.com/2009/05/14/bigdata-node-density/feed/" rel="self" type="application/rss+xml" />
	<link>http://databeta.wordpress.com/2009/05/14/bigdata-node-density/</link>
	<description>on computing and data .. in permanent beta</description>
	<lastBuildDate>Wed, 21 Oct 2009 05:02:21 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<item>
		<title>By: Yunhong Gu</title>
		<link>http://databeta.wordpress.com/2009/05/14/bigdata-node-density/#comment-67</link>
		<dc:creator>Yunhong Gu</dc:creator>
		<pubDate>Wed, 24 Jun 2009 00:15:54 +0000</pubDate>
		<guid isPermaLink="false">http://databeta.wordpress.com/?p=140#comment-67</guid>
		<description>Just found this interesting discussion today. We developed another open source system called Sector/Sphere (sector.sf.net), which, despite many differences, is similar to Hadoop in principle. Sector uses the pipelining strategy, is implemented in C++, and has a simpler storage system. Various benchmarks (including sort) show that Sector can be 2 - 4 times faster than Hadoop.</description>
		<content:encoded><![CDATA[<p>Just found this interesting discussion today. We developed another open source system called Sector/Sphere (sector.sf.net), which, despite many differences, is similar to Hadoop in principle. Sector uses the pipelining strategy, is implemented in C++, and has a simpler storage system. Various benchmarks (including sort) show that Sector can be 2 &#8211; 4 times faster than Hadoop.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: The Three Sexy Skills of Data Geeks : Dataspora Blog</title>
		<link>http://databeta.wordpress.com/2009/05/14/bigdata-node-density/#comment-58</link>
		<dc:creator>The Three Sexy Skills of Data Geeks : Dataspora Blog</dc:creator>
		<pubDate>Wed, 27 May 2009 10:02:14 +0000</pubDate>
		<guid isPermaLink="false">http://databeta.wordpress.com/?p=140#comment-58</guid>
		<description>[...] the samurai of data geeks are capable of parallelizing storage and computation with tools like 96-nodes of Postgres, snow and RMPI, Hadoop and Mapreduce, and on Amazon EC2 to [...]</description>
		<content:encoded><![CDATA[<p>[...] the samurai of data geeks are capable of parallelizing storage and computation with tools like 96-nodes of Postgres, snow and RMPI, Hadoop and Mapreduce, and on Amazon EC2 to [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jmh</title>
		<link>http://databeta.wordpress.com/2009/05/14/bigdata-node-density/#comment-57</link>
		<dc:creator>jmh</dc:creator>
		<pubDate>Fri, 22 May 2009 04:28:14 +0000</pubDate>
		<guid isPermaLink="false">http://databeta.wordpress.com/?p=140#comment-57</guid>
		<description>Most of the commercial databases have free downloads; would be easy enough to benchmark CSV load times...</description>
		<content:encoded><![CDATA[<p>Most of the commercial databases have free downloads; would be easy enough to benchmark CSV load times&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jmh</title>
		<link>http://databeta.wordpress.com/2009/05/14/bigdata-node-density/#comment-56</link>
		<dc:creator>jmh</dc:creator>
		<pubDate>Fri, 22 May 2009 04:22:03 +0000</pubDate>
		<guid isPermaLink="false">http://databeta.wordpress.com/?p=140#comment-56</guid>
		<description>Materialization of temps may or may not affect throughput, depending on bottlenecks.  But it certainly affects resource/energy consumption...</description>
		<content:encoded><![CDATA[<p>Materialization of temps may or may not affect throughput, depending on bottlenecks.  But it certainly affects resource/energy consumption&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Abadi</title>
		<link>http://databeta.wordpress.com/2009/05/14/bigdata-node-density/#comment-55</link>
		<dc:creator>Daniel Abadi</dc:creator>
		<pubDate>Fri, 22 May 2009 03:34:09 +0000</pubDate>
		<guid isPermaLink="false">http://databeta.wordpress.com/?p=140#comment-55</guid>
		<description>I agree 100% with what you say about tradeoffs for one-time use vs. repeated use. Still, I was surprised to see Hadoop saturate the CPU for such a simple task (even if the data wasn&#039;t in optimized storage format).</description>
		<content:encoded><![CDATA[<p>I agree 100% with what you say about tradeoffs for one-time use vs. repeated use. Still, I was surprised to see Hadoop saturate the CPU for such a simple task (even if the data wasn&#8217;t in optimized storage format).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matei</title>
		<link>http://databeta.wordpress.com/2009/05/14/bigdata-node-density/#comment-54</link>
		<dc:creator>Matei</dc:creator>
		<pubDate>Fri, 22 May 2009 01:56:07 +0000</pubDate>
		<guid isPermaLink="false">http://databeta.wordpress.com/?p=140#comment-54</guid>
		<description>Parsing strings, especially for numbers, is expensive in any language (just try loading a CSV file in any DBMS). It shouldn&#039;t be surprising that it is slow in Hadoop. However, Hadoop doesn&#039;t force you to use plain text files for all data -- if you look at how projects like Hive are used, they use compressed SequenceFiles, and even column-oriented compression now. Hadoop lets you choose the storage format that is most suitable for your data.

The reason you see a lot of people talk about using Hadoop on plain text data is that many data sets come into a system that way and are not queried often enough to justify converting to anything else. A perfect example is web search. Every night, you might crawl a number of web pages and end up with raw text data. You then scan these pages to update an index, and never do anything else with them. It wouldn&#039;t make sense to convert the data to some optimized binary format first if you can just run your analysis on the raw text files. Other web backend applications, such as spam detection, have the same &quot;run-once&quot; property (you never check a new email twice to see whether it&#039;s spam). For these &quot;run-once&quot; applications, Hadoop provides a convenient means to run over raw data and get your answer quickly without wasting time on building indexes, packing the data into your database&#039;s favorite storage format, etc. From my understanding, Greenplum and others are starting to support similar &quot;run over raw data&quot; functionality.</description>
		<content:encoded><![CDATA[<p>Parsing strings, especially for numbers, is expensive in any language (just try loading a CSV file in any DBMS). It shouldn&#8217;t be surprising that it is slow in Hadoop. However, Hadoop doesn&#8217;t force you to use plain text files for all data &#8212; if you look at how projects like Hive are used, they use compressed SequenceFiles, and even column-oriented compression now. Hadoop lets you choose the storage format that is most suitable for your data.</p>
<p>The reason you see a lot of people talk about using Hadoop on plain text data is that many data sets come into a system that way and are not queried often enough to justify converting to anything else. A perfect example is web search. Every night, you might crawl a number of web pages and end up with raw text data. You then scan these pages to update an index, and never do anything else with them. It wouldn&#8217;t make sense to convert the data to some optimized binary format first if you can just run your analysis on the raw text files. Other web backend applications, such as spam detection, have the same &#8220;run-once&#8221; property (you never check a new email twice to see whether it&#8217;s spam). For these &#8220;run-once&#8221; applications, Hadoop provides a convenient means to run over raw data and get your answer quickly without wasting time on building indexes, packing the data into your database&#8217;s favorite storage format, etc. From my understanding, Greenplum and others are starting to support similar &#8220;run over raw data&#8221; functionality.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Abadi</title>
		<link>http://databeta.wordpress.com/2009/05/14/bigdata-node-density/#comment-53</link>
		<dc:creator>Daniel Abadi</dc:creator>
		<pubDate>Wed, 20 May 2009 21:15:10 +0000</pubDate>
		<guid isPermaLink="false">http://databeta.wordpress.com/?p=140#comment-53</guid>
		<description>At Yale, we&#039;ve been playing around with Hadoop vs parallel databases recently (we&#039;re currently working on a hybrid that differs from what GP/Aster/Hive do). One thing we noticed with Hadoop is that it is shockingly CPU inefficient. When running a very basic selection query over a large dataset on a laptop with just one disk, we were shocked to see the CPU chugging along at 100% utilization (this query ought to have been disk bottlenecked). It seems that Hadoop&#039;s runtime parsing of input data is partly to blame (along with various issues with Java handling string data). Hadoop&#039;s poor CPU efficiency might be partly to blame for the large differences in the number of CPUs in the different clusters (yahoo vs ebay).

Wrt pipelining vs materialization, the premise of the original blog post, and Joydeep&#039;s comment, a Yale undergraduate who took the database architecture and implementation class I teach every spring played around with storing Map output in an in-memory fs. He found this had limited effect on performance; however, this is partly because it is hard to create queries where Hadoop&#039;s pessimistic checkpointing (to borrow Joe&#039;s language) is the performance bottleneck. Some might argue that &quot;sort&quot; is a little bit unusual in that the Map task does zero filtering (the output size is the same as the input size).</description>
		<content:encoded><![CDATA[<p>At Yale, we&#8217;ve been playing around with Hadoop vs parallel databases recently (we&#8217;re currently working on a hybrid that differs from what GP/Aster/Hive do). One thing we noticed with Hadoop is that it is shockingly CPU inefficient. When running a very basic selection query over a large dataset on a laptop with just one disk, we were shocked to see the CPU chugging along at 100% utilization (this query ought to have been disk bottlenecked). It seems that Hadoop&#8217;s runtime parsing of input data is partly to blame (along with various issues with Java handling string data). Hadoop&#8217;s poor CPU efficiency might be partly to blame for the large differences in the number of CPUs in the different clusters (yahoo vs ebay).</p>
<p>Wrt pipelining vs materialization, the premise of the original blog post, and Joydeep&#8217;s comment, a Yale undergraduate who took the database architecture and implementation class I teach every spring played around with storing Map output in an in-memory fs. He found this had limited effect on performance; however, this is partly because it is hard to create queries where Hadoop&#8217;s pessimistic checkpointing (to borrow Joe&#8217;s language) is the performance bottleneck. Some might argue that &#8220;sort&#8221; is a little bit unusual in that the Map task does zero filtering (the output size is the same as the input size).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Siva</title>
		<link>http://databeta.wordpress.com/2009/05/14/bigdata-node-density/#comment-50</link>
		<dc:creator>Siva</dc:creator>
		<pubDate>Tue, 19 May 2009 23:54:12 +0000</pubDate>
		<guid isPermaLink="false">http://databeta.wordpress.com/?p=140#comment-50</guid>
		<description>Well.. the problem you speak of is a little different than the one in the paper - however it involves a similar tradeoff :)

The generalization of the problem you speak of would involve pipelined DAGs (or maybe even a simple pipeline). The scheduling choices there would involve materializing intermediate outputs vs restarting the pipeline. The twist here is that some operations (such as sort) break the pipeline anyway.</description>
		<content:encoded><![CDATA[<p>Well.. the problem you speak of is a little different than the one in the paper &#8211; however it involves a similar tradeoff :)</p>
<p>The generalization of the problem you speak of would involve pipelined DAGs (or maybe even a simple pipeline). The scheduling choices there would involve materializing intermediate outputs vs restarting the pipeline. The twist here is that some operations (such as sort) break the pipeline anyway.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Siva</title>
		<link>http://databeta.wordpress.com/2009/05/14/bigdata-node-density/#comment-49</link>
		<dc:creator>Siva</dc:creator>
		<pubDate>Tue, 19 May 2009 23:45:43 +0000</pubDate>
		<guid isPermaLink="false">http://databeta.wordpress.com/?p=140#comment-49</guid>
		<description>Going back to the original intent of the post, I guess the problem could be generalized to a DAG scheduling problem with a failure model for tasks.

Here&#039;s an example of work in this area:
Matching and Scheduling Algorithms for Minimizing Execution Time and 
Failure Probability of Applications in Heterogeneous Computing 
Atakan Dogan, Student Member, IEEE, and Füsun Özgüner, Member, IEEE

Link: http://hpcnl.ece.ohio-state.edu/pdfs/match_sched_alg.pdf</description>
		<content:encoded><![CDATA[<p>Going back to the original intent of the post, I guess the problem could be generalized to a DAG scheduling problem with a failure model for tasks.</p>
<p>Here&#8217;s an example of work in this area:<br />
Matching and Scheduling Algorithms for Minimizing Execution Time and<br />
Failure Probability of Applications in Heterogeneous Computing<br />
Atakan Dogan, Student Member, IEEE, and Füsun Özgüner, Member, IEEE</p>
<p>Link: <a href="http://hpcnl.ece.ohio-state.edu/pdfs/match_sched_alg.pdf" rel="nofollow">http://hpcnl.ece.ohio-state.edu/pdfs/match_sched_alg.pdf</a></p>
]]></content:encoded>
	</item>
</channel>
</rss>
