<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Data Beta</title>
	<atom:link href="http://databeta.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://databeta.wordpress.com</link>
	<description>on computing and data .. in permanent beta</description>
	<lastBuildDate>Tue, 10 Jan 2012 18:48:48 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='databeta.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Data Beta</title>
		<link>http://databeta.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://databeta.wordpress.com/osd.xml" title="Data Beta" />
	<atom:link rel='hub' href='http://databeta.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Quantifying Eventual Consistency via PBS</title>
		<link>http://databeta.wordpress.com/2012/01/10/pbs-quantifying-eventual-consistency-of-replicas/</link>
		<comments>http://databeta.wordpress.com/2012/01/10/pbs-quantifying-eventual-consistency-of-replicas/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 18:29:53 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[boom]]></category>
		<category><![CDATA[consistency]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[bailis]]></category>
		<category><![CDATA[cassandra]]></category>
		<category><![CDATA[datastax]]></category>
		<category><![CDATA[eventual consistency]]></category>
		<category><![CDATA[linkedin]]></category>
		<category><![CDATA[pbs]]></category>
		<category><![CDATA[quorum systems]]></category>
		<category><![CDATA[replica consistency]]></category>
		<category><![CDATA[venkataraman]]></category>
		<category><![CDATA[yammer]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=528</guid>
		<description><![CDATA[If you follow this blog, you know that my BOOM group has spent a lot of time in the past couple years formalizing eventual consistency (EC) for distributed programs, via the CALM theorem and practical tools for analyzing Bloom programs. In recent months, my student Peter Bailis and his teammate Shivaram Venkataraman took a different [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&amp;blog=5435607&amp;post=528&amp;subd=databeta&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a href="http://databeta.files.wordpress.com/2012/01/92755998_b14771616a.jpg"><img class="alignright  wp-image-549" title="Copy" src="http://databeta.files.wordpress.com/2012/01/92755998_b14771616a.jpg?w=350&#038;h=263" alt="Copy" width="350" height="263" /></a>If you follow this blog, you know that my <a href="http://boom.cs.berkeley.edu">BOOM</a> group has spent a lot of time in the past couple years formalizing eventual consistency (EC) for distributed programs, via the <a title="CALM theorem post" href="http://databeta.wordpress.com/2010/10/28/the-calm-conjecture-reasoning-about-consistency/">CALM theorem</a> and <a title="Budplot and Budviz" href="https://github.com/bloom-lang/bud/blob/master/docs/visualizations.md">practical tools</a> for analyzing <a href="http://bloom-lang.org">Bloom</a> programs.</p>
<p>In recent months, my student <a href="http://www.cs.berkeley.edu/~pbailis/">Peter Bailis</a> and his teammate <a href="http://amplab.cs.berkeley.edu/author/shivaram/">Shivaram Venkataraman</a> took a different tack on the whole EC analysis problem, which they call <strong>PBS: Probabilistically Bounded Staleness</strong>. The results are interesting, and extremely relevant to current practice.  (See, for example, the very nice <a href="http://www.datastax.com/dev/blog/your-ideal-performanceconsistency-tradeoff">blog post</a> by folks at <a href="http://www.datastax.com">DataStax</a>.)</p>
<p>Many people today deal with EC in the specific context of replica consistency, particularly in distributed NoSQL-style Key-Value Stores (KVSs). It is typical to configure these stores with so-called &#8220;partial&#8221; quorum replication, to get a comfortable mix of low latency with reasonable availability. The term &#8220;partial&#8221; signifies that you are not guaranteed consistency of writes by these configurations &#8212; at best they guarantee a form of eventual consistency of final writes, but readers may well read stale data along the way. Lots of people are deploying these configurations in the field, but there&#8217;s little information on how often the approach messes up, and how badly.</p>
<p>Jumping off from earlier <a title="Malkhi  et al on Probabilistic Quorums" href="http://dl.acm.org/citation.cfm?id=259458&amp;dl=">theoretical work on probabilistic quorum systems</a>, Peter and Shivaram answered two natural questions about how these systems should perform in current practice:</p>
<ol>
<li><strong>How many versions ago?</strong>  In expectation, if you do a read in a partial-quorum KVS, how many versions behind are you? Peter and Shivaram answer this one definitively, via a closed-form mathematical analysis.</li>
<li><strong>How stale on the (wall-)clock?</strong>  In expectation, if you do a read in a partial-quorum KVS, how out-of-date will your version be in terms of wall-clock time? Answering this one requires modeling a read/write workload in wall-clock time, as well as system parameters like replica propagation (&#8220;anti-entropy&#8221;). Peter and Shivaram address this with a Monte Carlo model, and run the model with parameters grounded in real-world performance numbers generously provided by two of our most excellent colleagues: <a href="http://twitter.com/#!/strlen">Alex Feinberg</a> at <a href="http://www.linkedin.com">LinkedIn</a> and <a href="http://codahale.com/">Coda Hale</a> at <a href="http://www.yammer.com">Yammer</a> (both of whom also guest-lectured in my <a href="http://programthecloud.github.com/">Programming the Cloud</a> course last fall).  Peter and Shivaram validated their models in practice using <a href="http://cassandra.apache.org/">Cassandra</a>, a widely used KVS.</li>
</ol>
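<p>To make the first question concrete, here is a minimal sketch of the kind of closed-form calculation involved. It assumes N replicas, with read quorums of size R and write quorums of size W drawn uniformly at random and independently for each operation; the function names are hypothetical, and the tech report has the precise statement and its assumptions. The calculation ignores any replica propagation after a write returns.</p>

```python
from math import comb


def p_nonintersect(n, r, w):
    """Probability that a read quorum of size r misses every replica in a
    write quorum of size w, both drawn uniformly at random from n replicas."""
    # Count the read quorums that fit entirely inside the n - w replicas
    # that the write did not reach, out of all possible read quorums.
    return comb(n - w, r) / comb(n, r)


def p_k_stale(n, r, w, k):
    """Probability that a read misses all of the last k writes, assuming
    each write picked its quorum independently (an assumption of this sketch)."""
    return p_nonintersect(n, r, w) ** k


print(p_nonintersect(3, 1, 1))   # partial quorum: R + W is at most N
print(p_k_stale(3, 1, 1, 3))     # chance of being at least 3 versions behind
print(p_nonintersect(3, 2, 2))   # strict quorum: R + W exceeds N
```

<p>With N = 3 and R = W = 1, a read misses the latest write two times out of three under these assumptions, but the chance of missing the last three writes drops to 8/27. With R = W = 2 the quorums always overlap and the probability is zero.</p>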
<p>On the whole, PBS shows that being sloppy about consistency doesn&#8217;t bite you often or badly &#8212; especially if you&#8217;re in a single datacenter and you use SSDs. But things get more complex with magnetic disks, garbage collection delays (grr), and wide-area replication.</p>
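<p>The wall-clock question lends itself to simulation. What follows is a toy Monte Carlo sketch and not the model from the paper: it assumes a write reaches each replica after an independent exponential delay, commits once W replicas have it, and is followed t time units later by a read that asks R replicas chosen uniformly at random. The function name, the delay distribution, and every parameter are assumptions made for illustration.</p>

```python
import random


def stale_read_probability(n, r, w, t, mean_delay=1.0, trials=20000, seed=0):
    """Estimate the chance that a read issued t time units after a write
    commits sees none of the replicas that already hold that write."""
    rng = random.Random(seed)
    stale = 0
    for _ in range(trials):
        # Time at which the write lands on each replica (toy assumption:
        # independent exponential delays with the given mean).
        arrival = [rng.expovariate(1.0 / mean_delay) for _ in range(n)]
        # The write commits when the w-th fastest replica has it.
        commit_time = sorted(arrival)[w - 1]
        read_time = commit_time + t
        # The read contacts r replicas chosen uniformly at random.
        contacted = rng.sample(range(n), r)
        # The read is stale only if every contacted replica is still
        # waiting for the write when the read happens.
        if all(arrival[i] > read_time for i in contacted):
            stale += 1
    return stale / trials


for wait in (0.0, 1.0, 5.0):
    print(wait, stale_read_probability(3, 1, 1, wait))
```

<p>Even in this crude model the qualitative story shows up: with R = W = 1 a read immediately after commit is stale roughly two thirds of the time, and the probability falls quickly as t grows relative to the propagation delay. Slower or heavier-tailed delays, which is where disks, garbage collection, and wide-area links come in, stretch that decay out.</p>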
<p>Interested in more detail?  You can check out two things:</p>
<ul>
<li>The paper (currently under submission to a conference) is <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-4.pdf">available as a Berkeley Tech Report</a>.</li>
<li>Peter put up a <a href="http://www.cs.berkeley.edu/~pbailis/projects/pbs/">web-based version of the Monte Carlo simulation</a> that allows you to specify quorum parameters and workload parameters, and observe the tradeoff between those parameters and the probability of various levels of staleness.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2012/01/10/pbs-quantifying-eventual-consistency-of-replicas/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://databeta.files.wordpress.com/2012/01/92755998_b14771616a.jpg" medium="image">
			<media:title type="html">Copy</media:title>
		</media:content>
	</item>
		<item>
		<title>Maxim-ization</title>
		<link>http://databeta.wordpress.com/2011/09/22/maxim-ization/</link>
		<comments>http://databeta.wordpress.com/2011/09/22/maxim-ization/#comments</comments>
		<pubDate>Fri, 23 Sep 2011 00:50:35 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[research]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=522</guid>
		<description><![CDATA[For a class I&#8217;m teaching, I&#8217;d like to collect a list of favorite &#8220;maxims&#8221; or &#8220;aphorisms&#8221; for computer systems. I&#8217;d be very grateful if you would add your favorites below to the comments, preferably with a link to a source that either introduces or references the maxim.  It&#8217;s OK to agree or disagree with the [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&amp;blog=5435607&amp;post=522&amp;subd=databeta&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a href="http://databeta.files.wordpress.com/2011/09/maxim.jpg"><img class="alignright size-full wp-image-523" title="maxim" src="http://databeta.files.wordpress.com/2011/09/maxim.jpg?w=510" alt=""   /></a>For a class I&#8217;m teaching, I&#8217;d like to collect a list of favorite &#8220;maxims&#8221; or &#8220;aphorisms&#8221; for computer systems.</p>
<p>I&#8217;d be very grateful if you would add your favorites below to the comments, preferably with a link to a source that either introduces or references the maxim.  It&#8217;s OK to agree or disagree with the maxim.</p>
<p>I&#8217;d enjoy seeing people&#8217;s support/critiques for these below as well &#8212; may merit more focused posts another day.</p>
<p>Examples:</p>
<ul>
<li>The <a href="http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf">End-to-End Argument</a>, Saltzer/Reed/Clark, ICDCS &#8217;81.</li>
<li>Many examples in Lampson&#8217;s &#8220;<a href="http://research.microsoft.com/en-us/um/people/blampson/33-Hints/Acrobat.pdf">Hints for Computer System Design</a>&#8221;, SOSP &#8217;83.</li>
<li>&#8220;All problems in computer science can be solved by another level of indirection&#8221;. &#8212; also <a href="http://www.dmst.aueb.gr/dds/pubs/inbook/beautiful_code/html/Spi07g.html">attributed to Butler Lampson</a></li>
</ul>
<p>What else?</p>
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2011/09/22/maxim-ization/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://databeta.files.wordpress.com/2011/09/maxim.jpg" medium="image">
			<media:title type="html">maxim</media:title>
		</media:content>
	</item>
		<item>
		<title>Is Teaching MapReduce Healthy for Students?</title>
		<link>http://databeta.wordpress.com/2011/09/15/is-teaching-mapreduce-healthy-for-students/</link>
		<comments>http://databeta.wordpress.com/2011/09/15/is-teaching-mapreduce-healthy-for-students/#comments</comments>
		<pubDate>Fri, 16 Sep 2011 04:47:13 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[bloom]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[map reduce]]></category>
		<category><![CDATA[parallelism]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=497</guid>
		<description><![CDATA[I sat at Berkeley CS faculty lunch this past week with Brian Harvey and Dan Garcia, two guys who think hard about teaching computing to undergraduates.  I was waxing philosophical about how we need to get data-centric thinking embedded deep into the initial CS courses—not just as an application of traditional programming, but as a [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&amp;blog=5435607&amp;post=497&amp;subd=databeta&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><img class="alignright" title="Another Mother logo" src="http://www.anothermother.org/wp-content/themes/AnotherMother/images/masthead_poster.gif" alt="Hadoop is not healthy for children and other living things." width="232" height="261" />I sat at Berkeley CS faculty lunch this past week with <a href="http://www.cs.berkeley.edu/~bh/">Brian Harvey</a> and <a href="http://www.cs.berkeley.edu/~ddgarcia/">Dan Garcia</a>, two guys who think hard about teaching computing to undergraduates.  I was waxing philosophical about how we need to get data-centric thinking embedded deep into the initial CS courses—not just as an application of traditional programming, but as a key frame of reference for how to think about computing per se.</p>
<p>Dan pointed out that he and Brian and others took steps in this direction years ago at Berkeley, by <a title="Brian Harvey teaches MapReduce" href="http://www.youtube.com/watch?v=mVXpvsdeuKU">introducing MapReduce and Hadoop in our initial 61A course</a>.  I have <a href="http://radar.oreilly.com/2008/11/the-commoditization-of-massive.html">argued elsewhere</a> that this is a Good Thing, because it gets people used to the kind of disorderly thinking needed for scaling distributed and data-centric systems.</p>
<p>But as a matter of both pedagogy and system design, I have begun to think that Google&#8217;s MapReduce model is not healthy for beginning students.  The basic issue is that Google&#8217;s narrow MapReduce API conflates <em>logical semantics</em> (define a function over all items in a collection) with an expensive <em>physical implementation</em> (utilize a parallel barrier). As it happens, many common cluster-wide operations over a collection of items do not require a barrier even though they may require all-to-all communication.  But there&#8217;s no way to tell the API whether a particular Reduce method has that property, so the runtime always does the most expensive thing imaginable in distributed coordination: global synchronization.</p>
<p>From an architectural point of view, <em>a good language for parallelism should expose pipelining</em>, and MapReduce hides it. Brian suggested I expand on this point somewhere so people could talk about it.  So here we go.</p>
<p><span id="more-497"></span></p>
<p>The canonical example here is the relational join, which pairs up matching objects from two different input streams.  One of the early observations in shared-nothing parallel databases was that parallel joins can work in a fully streaming fashion even though they require all-to-all communication (<a title="Wilschut Apers" href="http://scholar.google.com/scholar?q=Dataflow+Query+Execution+in+a+Parallel+Main-Memory+Environment+(1991)">Wilschut and Apers&#8217; Pipelining Hash Join</a>).  This is impossible to implement in stock Hadoop.  Which means, for example, that the NSA—apparently a big Hadoop user—can&#8217;t use MapReduce to match live feeds of suspicious activity reports with a repository of intelligence on suspicious people.  (Whether you think it&#8217;s good or bad that the NSA can&#8217;t use Hadoop for this task may depend on your politics regarding national security, open source, or both.)</p>
<p>That example is easy to understand, but maybe too easy to dismiss as the special case of &#8220;stream processing&#8221;.  Sure, Hadoop wasn&#8217;t designed for that, so maybe you should dismiss my point as saying that somebody should just build a streaming edition of Hadoop for special uses.  (Oh yeah, <a href="http://code.google.com/p/hop/">somebody did that</a>.)</p>
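<p>For readers who have not seen a pipelining join, here is a minimal single-process sketch in the spirit of the symmetric hash join mentioned above. It is an illustration under stated assumptions and not code from any of the systems discussed: the function name, the event format of ("L" or "R", record) pairs, and the sample records are all hypothetical. In a parallel setting each node would run this logic on its own hash partition of the keys.</p>

```python
def symmetric_hash_join(events, key_left, key_right):
    """Join two interleaved streams, yielding each match as soon as both
    halves have arrived. `events` is an iterable of (side, record) pairs
    in arrival order, where side is "L" or "R"."""
    left_table = {}    # key -> left records seen so far
    right_table = {}   # key -> right records seen so far
    for side, record in events:
        if side == "L":
            k = key_left(record)
            # Remember the new record, then probe the other side's table.
            left_table.setdefault(k, []).append(record)
            for match in right_table.get(k, ()):
                yield (record, match)
        else:
            k = key_right(record)
            right_table.setdefault(k, []).append(record)
            for match in left_table.get(k, ()):
                yield (match, record)


# Hypothetical interleaved feed: reports on the left, dossiers on the right.
feed = [
    ("L", ("alice", "report-1")),
    ("R", ("bob", "dossier-B")),
    ("R", ("alice", "dossier-A")),
    ("L", ("bob", "report-2")),
]
first_field = lambda rec: rec[0]
for pair in symmetric_hash_join(feed, first_field, first_field):
    print(pair)
```

<p>Each arriving record is inserted into its own side&#8217;s table and immediately probed against the other side&#8217;s table, so output starts flowing before either input ends. Nothing in the loop waits for a barrier.</p>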
<p>A more telling pedagogical example is the standard discussion of implementing PageRank in MapReduce.  What an awesome lecture: the two most famous pieces of Google technology in action at once!  The thing is, at a logical level PageRank is a simple self-join query with transitive closure: it pairs up items in the Nodes table with their neighbors in that same table, and outputs new weights until a stopping condition is reached. The matching of nodes and neighbors needs to be hash-partitioned across the network, but there is no reason to require the entire cluster to do a barrier synchronization at each iteration boundary.  The subtext of this lecture&#8217;s awesomeness is to deeply implant a misleading (and potentially very inefficient) idea about the need for synchronized execution in parallel linear algebra and query processing tasks.  That sounds like a bad way to teach.</p>
<p>And it goes deeper.  At a very basic level, <em>asynchronous parallel computation is fundamentally a matter of streaming rendezvous (join)</em> between event channels and buffers.  We converted this observation into a first-class feature of the <a href="http://bloom-lang.org">Bloom</a> language for distributed systems. When you write event-handling code in Bloom, you express it as the parallel join of a partitioned event stream with partitioned program state. Whether or not this metaphor makes sense to you (yet!), I assure you that it is fundamentally what goes on under the covers of any asynchronous distributed system: the matching of streams of requests or responses with stored state.  By teaching students the Google MapReduce model (as represented in Hadoop), we teach them that distributed joins need to block, and therefore cannot be the basis of the kind of streaming computation that servers must do by their very nature.  That is not only bad, it&#8217;s paradoxical: underneath the Hadoop implementation is a message handling dataflow that effectively does streaming joins.</p>
<p>So there—at Brian&#8217;s request I wrote down some cautionary words about using Hadoop in school.  Lest somebody read this as my declaration of allegiance to the &#8220;<a href="http://www.cs.washington.edu/homes/billhowe/mapreduce_a_major_step_backwards.html#comment-687">get off my lawn</a>&#8221; side of the AntiNoSQLdisestablishmentarianism argument, let me clarify my view on a few key things:</p>
<ul>
<li>Google MapReduce and the Hadoop open-source implementation opened the floodgates for discussing big data and parallelism in the core of the computing curriculum.  This has been a critical step in moving computing education into the era of abundant computing and data resources.  Bully for that.</li>
<li>I do understand that Google put barriers into MapReduce for fault tolerance purposes.  I also get that fault-tolerance is important in scale-out of parallel dataflows (we did <a title="Flux Fault Tolerance" href="http://db.cs.berkeley.edu/papers/sigmod04-fluxft.pdf">some work</a> on this) and that Google uses a lot of machines at once.  But as a practical matter, many Hadoop users run on a small handful of nodes and don&#8217;t need Google-scale fault tolerance features.  More to the point, when we teach computing, we shouldn&#8217;t focus on only one design point; we should focus on the fundamentals.  This is a case where the fundamentals get taught in a tangled way that only makes sense at extreme scale.</li>
<li>For the record, as languages go I like MapReduce about as much as I like SQL. That is to say I would invite them both to my birthday party. But I wouldn&#8217;t invite either to the prom.  My sweetheart here is <a href="http://bloom-lang.org">Bloom</a>.</li>
</ul>
<p>If you&#8217;ve been following our Bloom work, you know where this discussion is coming from: the <a href="http://databeta.wordpress.com/2010/10/28/the-calm-conjecture-reasoning-about-consistency/">CALM Theorem</a> says that the real reason to use coordination is to manage non-monotonic reasoning.  Many (most?) Reduce functions in the wild are in fact monotonic w.r.t. subsets of their inputs, but the standard Reduce API defies monotonicity analysis.  And introducing barriers for monotonic computation is a <a href="https://databeta.wordpress.com/2010/12/03/the-cron-principle/">waste of &#8220;time&#8221;</a>, in the temporal-logic sense of the word.</p>
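<p>As a small illustration of a monotonic Reduce, consider &#8220;which keys have a total of at least some threshold?&#8221; The sketch below is hypothetical and is written in plain Python, not Bloom or Hadoop. It assumes every increment is non-negative, so once a key crosses the threshold no later input can retract that answer and the key can be emitted immediately.</p>

```python
def keys_reaching(pairs, threshold):
    """Yield each key as soon as its running total reaches `threshold`.

    Assumes non-negative increments, which is what makes the answer
    monotonic: more input can only add keys, never remove them."""
    totals = {}
    emitted = set()
    for key, amount in pairs:
        totals[key] = totals.get(key, 0) + amount
        if key not in emitted and totals[key] >= threshold:
            emitted.add(key)
            yield key  # safe to emit now; no need to wait for end of input


stream = [("a", 1), ("b", 2), ("a", 1), ("b", 1), ("a", 5)]
print(list(keys_reaching(stream, 2)))
```

<p>Contrast this with a question like &#8220;which keys have a total of exactly 2?&#8221;, where a later record can invalidate an earlier answer. That one is non-monotonic, and it is the kind of Reduce that really does need to know its input is complete.</p>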
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2011/09/15/is-teaching-mapreduce-healthy-for-students/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://www.anothermother.org/wp-content/themes/AnotherMother/images/masthead_poster.gif" medium="image">
			<media:title type="html">Another Mother logo</media:title>
		</media:content>
	</item>
		<item>
		<title>MADlib goes beta!  Serious in-database analytics</title>
		<link>http://databeta.wordpress.com/2011/07/10/madlib-goes-beta-serious-in-database-analytics/</link>
		<comments>http://databeta.wordpress.com/2011/07/10/madlib-goes-beta-serious-in-database-analytics/#comments</comments>
		<pubDate>Sun, 10 Jul 2011 21:08:12 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[greenplum]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[MADlib]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[MAD]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=488</guid>
		<description><![CDATA[MADlib is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha, to the first beta release version, 0.20beta. Hats off to the MADlib team! Forget your previous associations with low-tech SQL analytics, including so-called &#8220;business intelligence&#8221;, &#8220;olap&#8221;, &#8220;data cubes&#8221; and [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&amp;blog=5435607&amp;post=488&amp;subd=databeta&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a title="MADlib" href="http://madlib.net"><img class="alignright size-medium wp-image-489" title="MADlib logo" src="http://databeta.files.wordpress.com/2011/07/screen-shot-2011-07-10-at-4-27-26-pm.png?w=300&#038;h=110" alt="" width="300" height="110" /></a><a title="MADlib" href="http://madlib.net">MADlib</a> is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha, to the first beta release version, 0.20beta. Hats off to the MADlib team!</p>
<p>Forget your previous associations with low-tech SQL analytics, including so-called &#8220;business intelligence&#8221;, &#8220;olap&#8221;, &#8220;data cubes&#8221; and the like. This is the real deal: statistical and machine learning methods running at scale within the database, massively parallel, close to the data. Much of <a title="MADlib GitHub repo" href="https://github.com/madlib/madlib">the code</a> is written in SQL (a language that doesn&#8217;t get enough credit as a basis for parallel statistics), with key extensions in C/C++ for performance, and the occasional Python glue code. The suite of methods in the beta includes:</p>
<ul>
<li>standard statistical methods like multi-variate linear and logistic regressions,</li>
<li>supervised learning methods including support-vector machines, naive Bayes, and decision trees</li>
<li>unsupervised methods including k-means clustering, association rules and Latent Dirichlet Allocation</li>
<li>descriptive statistics and data profiling, including one-pass Flajolet-Martin and CountMin sketch methods (my personal contributions to the library) to compute distinct counts, range-counts, quantiles, various types of histograms, and frequent-value identification</li>
<li>statistical support routines including an efficient sparse vector library and array operations, and conjugate gradient optimization.</li>
</ul>
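<p>To give a feel for the one-pass sketch methods in that list, here is a minimal Flajolet-Martin style distinct-count sketch. It is a textbook-style illustration and not MADlib&#8217;s implementation: the function names, the use of SHA-256 as the hash, and the single-bitmap design are all assumptions made for brevity.</p>

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis


def _hash64(item):
    """Map an item to a 64-bit integer (SHA-256 is an arbitrary choice here)."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fm_bitmap(items):
    """One pass over the data: set bit i if some item's hash has exactly
    i trailing zero bits. Duplicates hash identically, so they are free."""
    bitmap = 0
    for item in items:
        h = _hash64(item)
        if h == 0:
            continue  # no 1 bit to locate; vanishingly rare
        trailing_zeros = 0
        while h % 2 == 0:
            h //= 2
            trailing_zeros += 1
        bitmap |= 2 ** trailing_zeros
    return bitmap


def fm_estimate(bitmap):
    """Estimate the distinct count from the position of the lowest unset bit."""
    r = 0
    while (bitmap >> r) % 2 == 1:
        r += 1
    return (2 ** r) / PHI


users = ["user%d" % i for i in range(1000)]
print(fm_estimate(fm_bitmap(users * 3)))  # repeats do not change the sketch
```

<p>Two bitmaps built on different partitions of the data combine with a bitwise OR, which is why sketches like this fit a massively parallel database so well: each segment makes one pass over its local rows and only the tiny bitmaps travel. A single bitmap gives a noisy estimate, so practical versions typically average over many of them.</p>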
<p>More methods are planned for future releases.  Myself, I&#8217;m working with Daisy Wang on merging <a title="Daisy Wang's latest SIGMOD paper" href="http://www.cs.berkeley.edu/~daisyw/sigmod11.pdf">her SQL-based Conditional Random Fields and Bayesian inference</a> implementations into the library for an upcoming release, to support sophisticated text processing.</p>
<p><span id="more-488"></span>MADlib 0.20beta was designed as a drop-in addition to the popular open-source PostgreSQL database.  The routines also all run massively parallel in the Greenplum database, which inherits the extensibility interfaces of PostgreSQL. (Non-commercial users can get the parallelism for free—no crippleware!—via Greenplum&#8217;s <a title="Greenplum Community Edition" href="http://www.greenplum.com/products/community-edition">Community Edition</a>.)</p>
<p>Obviously the first beta release is still experimental code.  But we&#8217;ve come a long way since our alpha announcement at the <a title="Strata 2011" href="http://strataconf.com/strata2011/">Strata conference</a> last February in both breadth and usability. For me, the most exciting aspect of the beta release is that we&#8217;re about ready to grow our developer community beyond the initial committers at Berkeley and EMC-Greenplum.  If you&#8217;re interested in adding methods to MADlib—or in developing ports for other DBMS platforms (something I&#8217;d love to see happen!)—please <a title="MADlib user forum" href="http://groups.google.com/group/madlib-user-forum">get in touch</a>.</p>
<p>And hats off to EMC-Greenplum for putting significant development resources behind this open-source effort.  I started this project at Greenplum before they were acquired, and have been happy to see EMC embrace it and push it further.  As <a title="O'Reilly Interview at Strata" href="http://www.youtube.com/watch?v=lSvI2UXCVHQ">I discussed at Strata last year</a>, I think this is a very healthy direction for open-source software development.</p>
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2011/07/10/madlib-goes-beta-serious-in-database-analytics/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://databeta.files.wordpress.com/2011/07/screen-shot-2011-07-10-at-4-27-26-pm.png?w=300" medium="image">
			<media:title type="html">MADlib logo</media:title>
		</media:content>
	</item>
		<item>
		<title>CACM Article on Jim Gray Search</title>
		<link>http://databeta.wordpress.com/2011/06/27/cacm-article-on-jim-gray-search/</link>
		<comments>http://databeta.wordpress.com/2011/06/27/cacm-article-on-jim-gray-search/#comments</comments>
		<pubDate>Mon, 27 Jun 2011 21:16:34 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[cloud]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[CACM]]></category>
		<category><![CDATA[crisis response]]></category>
		<category><![CDATA[crowdsourcing]]></category>
		<category><![CDATA[image processing]]></category>
		<category><![CDATA[Jim Gray]]></category>
		<category><![CDATA[social computing]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[Tenacious]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=475</guid>
		<description><![CDATA[The recent July 2011 issue of Communications of the ACM includes our article on the technical aspects of the search for Jim Gray&#8217;s boat Tenacious.  This was a hard article to write, for both technical and personal reasons. It took far too long to finish, so at some point it was time to just pack it in (at [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&amp;blog=5435607&amp;post=475&amp;subd=databeta&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><img class="alignright" title="CACM July, 2011" src="http://portalparts.acm.org/1970000/1965724/cover/cover_full.jpg" alt="" width="132" height="171" />The recent July 2011 issue of <a title="CACM article on Jim Gray search" href="http://cacm.acm.org/magazines/2011/7/109892-searching-for-jim-gray/fulltext">Communications of the ACM</a> includes our article on the technical aspects of the search for Jim Gray&#8217;s boat Tenacious.  This was a hard article to write, for both technical and personal reasons. It took far too long to finish, so at some point it was time to just pack it in. That was when the CACM folks informed us it had to be cut in length by half, which delayed things further.  (The longer version is up as a <a title="Berkeley Tech Report on Jim Gray Search" href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-142.html">Berkeley tech report</a>.)</p>
<p>Meanwhile, some of the experience is even more relevant to current technology trends than it was 4 years ago, so hopefully folks interested in social computing, software engineering, image processing, crisis response, and other related areas will find something of use in there.</p>
<p>For those of you whose work is represented (or underrepresented) by the article, my apologies for its shortcomings.  I still don&#8217;t have the full picture of what happened&#8212;nobody does, really.  As a result I decided not to use personal names of volunteers in general, so as to avoid attributing credit unevenly. I know the result seems oddly impersonal.  Setting the tone of the article was as hard as capturing the content.</p>
<p>Meanwhile, I encourage you to add corrections and perspective to the article in the comment box at the end of the CACM link above. Comments are welcome here too, but they might not get as well-viewed or -archived.</p>
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2011/06/27/cacm-article-on-jim-gray-search/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://portalparts.acm.org/1970000/1965724/cover/cover_full.jpg" medium="image">
			<media:title type="html">CACM July, 2011</media:title>
		</media:content>
	</item>
		<item>
		<title>bud: bloom under development</title>
		<link>http://databeta.wordpress.com/2011/04/08/bud-bloom-under-development/</link>
		<comments>http://databeta.wordpress.com/2011/04/08/bud-bloom-under-development/#comments</comments>
		<pubDate>Fri, 08 Apr 2011 08:41:11 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[bloom]]></category>
		<category><![CDATA[boom]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[bud]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=438</guid>
		<description><![CDATA[Today was a big day in the BOOM group: we launched the alpha version of Bud: Bloom Under Development. If you&#8217;re new to this blog, Bloom is our new programming language for cloud computing and other distributed systems settings. Bud is the first fully-functional release of Bloom, implemented as a DSL in Ruby. I&#8217;ve written a [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&amp;blog=5435607&amp;post=438&amp;subd=databeta&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a href="http://bloom-lang.net"><img class="alignright" title="Bloom logo" src="http://www.bloom-lang.net/wp-content/uploads/2011/04/bloomlogoleft2.png" alt="" width="333" height="67" /></a>Today was a big day in the BOOM group: we launched the alpha version of <a title="Bud" href="http://www.bloom-lang.net/bud/">Bud: Bloom Under Development</a>. If you&#8217;re new to this blog, Bloom is our new programming language for cloud computing and other distributed systems settings. Bud is the first fully-functional release of Bloom, implemented as a DSL in Ruby.</p>
<p>I&#8217;ve written a lot about Bloom in <a href="http://boom.cs.berkeley.edu/papers.html">research papers</a> and on the new <a href="http://bloom-lang.net">Bloom website</a>, and I have lots to say about distributed programming that I won&#8217;t recap. Instead, I want to focus here on the tangible: working code. If you&#8217;re looking for something serious, check out the <a href="https://github.com/bloom-lang/bud/blob/master/docs/bfs.md">walkthrough of the bfs distributed filesystem</a>, a GFS clone. But to get the flavor, consider the following <em>two lines of code, </em>which implement what you might consider to be &#8220;hello, world&#8221; for distributed systems: a chat server.<br />
<span><br />
<blockquote><code><font size="2" color="white"> nodelist <span style="color:#fbde2d;">&lt;= </span>connect.<span style="color:#fbde2d;">payloads</span><br />
mcast <span style="color:#fbde2d;">&lt;~ </span>(mcast <span style="color:#fbde2d;">*</span> nodelist).<span style="color:#fbde2d;">pairs</span> { |m,n| [n.key, m.val] }</font></code></p></blockquote>
<p></span>That&#8217;s it.</p>
<p>The first line says &#8220;if you get a message on a channel called &#8216;connect&#8217;, remember the payload in a table called &#8216;nodelist&#8217;&#8221;.  The second says &#8220;if you get a message on the &#8216;mcast&#8217; channel, then forward its contents to each address stored in &#8216;nodelist&#8217;&#8221;. That&#8217;s all that&#8217;s needed for a bare-bones chat server.  Nice, right?</p>
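<p>To make the two rules concrete, here is a rough plain-Ruby sketch of the same logic. It does not use Bud at all: the class name <code>ChatServerSketch</code>, the <code>tick</code> method and the message layout (a destination address paired with a payload) are assumptions made up for illustration, and the sketch assumes the server handles one batch of messages per timestep.</p>

```ruby
# Plain-Ruby stand-in for the two Bloom rules. This is NOT the Bud API;
# every name here is hypothetical.
class ChatServerSketch
  attr_reader :nodelist

  def initialize
    @nodelist = {} # client address to nickname; entries are only ever added
  end

  # One timestep. Each message is [destination, payload].
  # Returns the [client_address, text] pairs to send out on 'mcast'.
  def tick(connect_msgs, mcast_msgs)
    # Rule 1: remember the payload of every 'connect' message.
    connect_msgs.each do |_server, payload|
      addr, nick = payload
      @nodelist[addr] = nick
    end

    # Rule 2: pair every 'mcast' message with every known client.
    mcast_msgs.flat_map do |_server, val|
      @nodelist.keys.map { |addr| [addr, val] }
    end
  end
end

server = ChatServerSketch.new
server.tick([["srv:1", ["a:1", "alice"]], ["srv:1", ["b:1", "bob"]]], [])
out = server.tick([], [["srv:1", "hello"]])
out.each { |addr, text| puts "send #{text.inspect} to #{addr}" }
```

<p>The nested loop in the second rule is what the join of <code>mcast</code> and <code>nodelist</code> expresses declaratively in the Bloom version.</p>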
<p><span id="more-438"></span><br />
The chat client code bootstraps by connecting to the server:</p>
<blockquote><p><code><font size="2" color="white">connect <span style="color:#fbde2d;">&lt;~</span> [[@server, [ip_port, @nick]]] </font></code></p></blockquote>
<p>and then runs the following logic:</p>
<blockquote><p><code><font size="2" color="white">mcast <span style="color:#fbde2d;">&lt;~</span> <span style="color:#fbde2d;">stdio do</span> |s|<br />
 &nbsp;&nbsp;[@server,<br />
 &nbsp;&nbsp;&nbsp;[ip_port, @nick, <span style="color:#4466ff;">Time</span>.<span style="color:#fbde2d;">new</span>.strftime(<span style="color:#00ff00;">"%I:%M.%S"</span>), s.line]]<br />
<span style="color:#fbde2d;">end</span><br />
<span style="color:#fbde2d;">stdio &lt;~</span> mcast { |m| [pretty_print(m.val)] }</font></code></p></blockquote>
<p>The first statement of this batch takes input from the terminal (&#8216;stdio&#8217;), formats it, and sends it to the server on the &#8216;mcast&#8217; channel.  The second statement takes messages from the &#8216;mcast&#8217; channel (forwarded by the server) and prints them on the terminal.</p>
<div>Hopefully this little (but working) example gives you a taste of why I&#8217;m excited about Bloom: I think it captures the essence of distributed programming in a clean, readable way.  <a href="https://github.com/bloom-lang/bud/tree/master/examples/chat">The full implementation</a> is a bit longer, but most of the extra code is simple Ruby boilerplate.
</div>
<p></p>
<div>It&#8217;s been a blast designing the language and coding up the Bud prototype with my <a href="https://github.com/bloom-lang">most excellent teammates</a> over the last 9 months.  Hats off, gents.</div>
<p></p>
<div>In subsequent posts I&#8217;ll highlight some of the tools we ship with Bud that help with the really hard stuff: using the CALM principle to reason about the consistency and non-determinism of your Bloom code.</div>
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2011/04/08/bud-bloom-under-development/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://www.bloom-lang.net/wp-content/uploads/2011/04/bloomlogoleft2.png" medium="image">
			<media:title type="html">Bloom logo</media:title>
		</media:content>
	</item>
		<item>
		<title>DataWrangler</title>
		<link>http://databeta.wordpress.com/2011/02/02/datawrangler/</link>
		<comments>http://databeta.wordpress.com/2011/02/02/datawrangler/#comments</comments>
		<pubDate>Thu, 03 Feb 2011 06:03:24 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[database]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[data cleaning]]></category>
		<category><![CDATA[hci]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=427</guid>
		<description><![CDATA[I often hear that many of the leading data analysts in the field have PhDs in physics or biology or the like, rather than computer science.  Computer scientists are typically interested in methods; physical scientists are interested in data. Another thing I often hear is that a large fraction of the time spent by analysts [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&amp;blog=5435607&amp;post=427&amp;subd=databeta&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/7202153@N03/4198114139/"><img class="size-medium wp-image-429  alignright" title="Wrangler" src="http://databeta.files.wordpress.com/2011/02/wrangler.jpg?w=300&#038;h=286" alt="" width="300" height="286" /></a></p>
<p>I often hear that many of the leading data analysts in the field have PhDs in physics or biology or the like, rather than computer science.  Computer scientists are typically interested in <em>methods</em>; physical scientists are interested in <em>data</em>.</p>
<p>Another thing I often hear is that a large fraction of the time spent by analysts — some say the majority of time — involves data preparation and cleaning: transforming formats, rearranging nesting structures, removing outliers, and so on.  (If you think this is easy, you&#8217;ve never had a stack of ad hoc Excel spreadsheets to load into a stat package or database!)</p>
<p>Putting these together, something is very wrong:  high-powered people are wasting most of their time doing low-function work.  And the challenge of improving this state of affairs has fallen in the cracks between the analysts and computer scientists.</p>
<p><a title="DataWrangler" href="http://vis.stanford.edu/wrangler">DataWrangler</a> is a new tool we&#8217;re developing to address this problem, which I demo&#8217;d today at the <a title="Strata" href="http://strataconf.com">O&#8217;Reilly Strata Conference</a>.  DataWrangler is an intelligent visual data transformation tool that lets users reshape, transform and clean data in an intuitive way that surprises most people who&#8217;ve worked with data.  As you manipulate data in a grid layout, the tool automatically infers information both about the data, and about your intentions for transforming the data.  It&#8217;s hard to describe, but the lead researcher on the project &#8212; Stanford PhD student Sean Kandel &#8212; has a quick video up on the <a title="Wrangler" href="http://vis.stanford.edu/wrangler">DataWrangler homepage</a> that shows how it works.  Sean has put DataWrangler live on the site as well.</p>
<p>Tackling these problems fundamentally requires a hybrid technical strategy.  Under the covers, DataWrangler is a heady mix of second-order logic, machine learning methods, and human-computer interaction design methodology.   We wrote <a title="Wrangler paper" href="http://vis.stanford.edu/papers/wrangler">a research paper</a> about it that will appear in this year&#8217;s SIGCHI.</p>
<p>If you&#8217;re interested in this space, also have a look at Shankar Raman&#8217;s very prescient <a title="Potter's Wheel" href="http://control.cs.berkeley.edu/abc/">Potter&#8217;s Wheel</a> work from a decade ago, the <a title="PADS" href="http://www.padsproj.org/">PADS</a> project at AT&amp;T and Princeton, <a title="Gulwani POPL '11" href="http://research.microsoft.com/en-us/um/people/sumitg/pubs/popl11-synthesis.pdf">recent research</a> from Sumit Gulwani at Microsoft Research, and David Huynh&#8217;s most excellent <a title="Google Refine" href="http://code.google.com/p/google-refine/">Google Refine</a>.  All good stuff!</p>
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2011/02/02/datawrangler/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://databeta.files.wordpress.com/2011/02/wrangler.jpg?w=300" medium="image">
			<media:title type="html">Wrangler</media:title>
		</media:content>
	</item>
		<item>
		<title>The CRON Principle: To Hell with Distributed Clocks</title>
		<link>http://databeta.wordpress.com/2010/12/03/the-cron-principle/</link>
		<comments>http://databeta.wordpress.com/2010/12/03/the-cron-principle/#comments</comments>
		<pubDate>Fri, 03 Dec 2010 14:53:43 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[bloom]]></category>
		<category><![CDATA[boom]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[causality]]></category>
		<category><![CDATA[CRON]]></category>
		<category><![CDATA[distributed clocks]]></category>
		<category><![CDATA[monotonicity]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=380</guid>
		<description><![CDATA[In today&#8217;s episode of the Twilight Zone, a young William Shatner stumbles into a time machine and travels back into the past. Cornered in a dark alley, he is threatened by a teenage hooligan waving a loaded pistol. A tussle ensues, and in trying to wrest the gun from his assailant, Shatner fires, killing him [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&amp;blog=5435607&amp;post=380&amp;subd=databeta&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/34517490@N00/2743877537/"><img class="alignright size-medium wp-image-392" title="clocks" src="http://databeta.files.wordpress.com/2010/12/clocks.jpg?w=234&#038;h=240" alt="" width="234" height="240" /></a>In today&#8217;s episode of the Twilight Zone, a young William Shatner stumbles into a time machine and travels back into the past. Cornered in a dark alley, he is threatened by a teenage hooligan waving a loaded pistol. A tussle ensues, and in trying to wrest the gun from his assailant, Shatner fires, killing him dead. Examining the contents of the dead youth&#8217;s wallet, Bill comes to a shocking conclusion: <em>he has just killed his own grandfather</em>. Tight focus: Shatner howling soundlessly as he stares at his own hand flickering in and out of view.</p>
<p>Shatner? Or Not(Shatner)? Having now changed history, he could not have been born, meaning he could not have traveled back in time and changed history, meaning he was indeed born, meaning&#8230;?</p>
<p>You see where this goes.  It&#8217;s the old <a title="grandfather paradox" href="http://en.wikipedia.org/wiki/Grandfather_paradox" target="_blank">grandfather paradox</a>, a hoary chestnut of SciFi and AI.  Personally I side with Captain Kirk: <a title="Kirk quote" href="http://www.imdb.com/title/tt0708469/quotes" target="_blank">I don&#8217;t like mysteries. They give me a bellyache.</a> But whether or not you think a discussion of &#8220;p if Not(p)&#8221; is <a title="NYTimes on Paradoxical Truth" href="http://opinionator.blogs.nytimes.com/2010/11/28/paradoxical-truth/?scp=1&amp;sq=paradox&amp;st=cse">news that&#8217;s fit to print</a>, it is something to avoid in your software.  This is particularly tricky in distributed programming, where multiple machines have different clock settings, and those clocks may even turn backward on occasion. The theory of Distributed Systems is built on the notion of <a title="Lamport Clocks" href="http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks" target="_blank">Causality</a>, which enables programmers and programs to avoid doing unusual things like executing instructions in orders that could not have been specified by the program that generated them. Causality is established by distributed clock protocols. These protocols are often used to enforce causal orderings&#8211;i.e. to make machines wait for messages. And waiting for messages, as we know, is bad.</p>
<p>So I&#8217;m here to tell you today that Causality is overrated, and we can often skip the wait. To hell with distributed clocks: time travel can be fine.  In many cases it&#8217;s even fine to change history. Here&#8217;s the thing: Causality is Required Only to control Non-monotonicity. I call this the CRON principle.</p>
<p><span id="more-380"></span></p>
<p>Let&#8217;s agree for purposes of discussion that p and Not(p) is not meaningful, and our Twilight Zone episode the usual logical nonsense dressed up with campy <a title="Schmacting definition" href="http://www.doubletongued.org/index.php/citations/schmacting_1/" target="_blank">schmacting</a>. Suppose we change the plot so that Shatner goes back in time and does something positive, in the logical sense of monotonicity: generating additional information without negating any information. For example, suppose he falls in love with a young woman in the past, they marry, and have a child. Then one day, to his horror, he realizes that his wife is his own grandmother! Distasteful, I agree, but logically monotonic! If you accidentally mate with your ancestors, it doesn&#8217;t matter (from a logical point of view) whether you use a time machine to do it. This is one direction of CRON: Purely monotonic programs will have consistent outcomes even in the face of &#8220;non-causal&#8221; orderings.</p>
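<p>A small plain-Ruby sketch of that claim, under the assumption that a monotonic program can be modeled as set union over incoming messages. The fact strings are made up; the point is only that every delivery order ends in the same state.</p>

```ruby
require 'set'

# Each message only adds facts; nothing is ever retracted.
# The facts themselves are hypothetical placeholders.
messages = [
  Set['fact_a'],
  Set['fact_b', 'fact_c'],
  Set['fact_a', 'fact_d']
]

# Deliver the messages in every possible order and record the final state.
final_states = messages.permutation.map do |order|
  order.reduce(Set.new) { |state, msg| state | msg }
end

# Monotonic accumulation: all delivery orders agree on the outcome.
all_agree = final_states.all? { |state| state == final_states.first }
puts "orders tried: #{final_states.size}, all agree: #{all_agree}"
```

<p>Set union is commutative, associative and idempotent, which is why reordering (or even redelivering) messages cannot change the result here.</p>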
<p>That is a big deal. It means that there is a large class of distributed programs where causal orderings&#8212;a foundational topic in distributed systems theory&#8212;are not required. Note that <a href="http://en.wikipedia.org/wiki/FO_(complexity)#Least_Fixed_Point_is_P" target="_blank">monotonic fixpoint logic (if you allow an ordered domain like the integers) can express <em>all polynomial-time programs</em></a>. And after all, who would run an exponential program in a distributed system?  So you might justifiably argue that distributed systems have been &#8220;wasting time&#8221; all these years for no good reason.</p>
<p>But let&#8217;s be real: not all polynomial-time programs are equally good, and it&#8217;s handy for programmers to be able to express non-monotonic things like &#8230; say &#8230; updating a variable (or shooting somebody evil). So when can we allow that to happen without enforcing the forward march of time?</p>
<p>Answer: when the consequences of the non-monotonic expression do not entail its antecedents. I.e. when the ability to shoot does not depend on the fate of the victim. Now in the physical world, we might worry a la chaos theory (or <a title="City on the Edge of Forever" href="http://en.wikipedia.org/wiki/The_City_on_the_Edge_of_Forever" target="_blank">Edith Keeler</a>) that everything is sensitive to the fate of everything else. In computer programs, though, we can plausibly guarantee the property we want here in many cases. We might be able to prove for a given program that, effectively, victims will never be ancestors of any shooter. When we can guarantee that&#8211;statically or dynamically&#8211;we can admit the non-monotonicity into the program without requiring an implementation of distributed clocks and waiting.</p>
<p>So when do we need distributed clocks?  Only when&#8211;exactly when&#8211;the grandfather paradox would arise in their absence. Causality is Required Only for controlling cycles in Non-monotonicity.</p>
<p>(This discussion comes out of my PODS talk on the <a href="http://databeta.wordpress.com/2010/06/04/the-declarative-imperative-pods-2010/">Declarative Imperative</a> from last June.  <a title="Bill Marczak" href="http://www.eecs.berkeley.edu/~wrm/" target="_blank">Bill Marczak</a>, <a title="Peter Alvaro" href="http://www.cs.berkeley.edu/~palvaro/" target="_blank">Peter Alvaro</a> and I are in the midst of writing up a formal proof. Input welcome!)</p>
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2010/12/03/the-cron-principle/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://databeta.files.wordpress.com/2010/12/clocks.jpg?w=292" medium="image">
			<media:title type="html">clocks</media:title>
		</media:content>
	</item>
		<item>
		<title>The CALM Conjecture: Reasoning about Consistency</title>
		<link>http://databeta.wordpress.com/2010/10/28/the-calm-conjecture-reasoning-about-consistency/</link>
		<comments>http://databeta.wordpress.com/2010/10/28/the-calm-conjecture-reasoning-about-consistency/#comments</comments>
		<pubDate>Thu, 28 Oct 2010 17:08:58 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[bloom]]></category>
		<category><![CDATA[boom]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[ACID]]></category>
		<category><![CDATA[CALM]]></category>
		<category><![CDATA[eventual consistency]]></category>
		<category><![CDATA[logic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[paxos]]></category>
		<category><![CDATA[points of order]]></category>
		<category><![CDATA[static analysis]]></category>
		<category><![CDATA[transactions]]></category>
		<category><![CDATA[two-phase commit]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=372</guid>
		<description><![CDATA[12/16/2010: final version of CALM/Bloom paper for CIDR now posted Conventional Wisdom: In large distributed systems, perfect data consistency is too expensive to guarantee in general. &#8220;Eventually consistent&#8221; approaches are often a better choice, since temporary inconsistencies work out in most cases. Consistency mechanisms (transactions, quorums, etc.) should be reserved for infrequent, small-scale, mission-critical tasks. [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&amp;blog=5435607&amp;post=372&amp;subd=databeta&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><strong><a href="http://www.flickr.com/photos/25678284@N03/3840006969/"><img class="alignright size-medium wp-image-373" title="Calm Stones" src="http://databeta.files.wordpress.com/2010/10/calmstones.jpg?w=200&#038;h=300" alt="" width="200" height="300" /></a></strong><strong></strong></p>
<p><em>12/16/2010: </em><a title="CALM/Bloom paper" href="http://db.cs.berkeley.edu/papers/cidr11-bloom.pdf"><em>final version of CALM/Bloom paper</em></a><em> for CIDR now posted</em></p>
<p><strong>Conventional Wisdom:<br />
</strong>In large distributed systems, perfect data consistency is too expensive to guarantee in general. &#8220;Eventually consistent&#8221; approaches are often a better choice, since temporary inconsistencies work out in most cases. Consistency mechanisms (transactions, quorums, etc.) should be reserved for infrequent, small-scale, mission-critical tasks.</p>
<p>Most computer systems designers agree on this at some level (once you get past the NoSQL vs. ACID sloganeering). But like lots of well-intentioned design maxims, it&#8217;s not so easy to translate into practice &#8212; all kinds of unavoidable tactical questions pop up:</p>
<p><strong>Questions:</strong></p>
<ul>
<li>Exactly where in my multifaceted system is eventual consistency &#8220;good enough&#8221;?</li>
<li>How do I know that my &#8220;mission-critical&#8221; software isn&#8217;t tainted by my &#8220;best effort&#8221; components?</li>
<li>How do I maintain my design maxim as software evolves? For example, how can the junior programmer in year n of a project reason about whether their piece of the code maintains the system&#8217;s overall consistency requirements?</li>
</ul>
<p>If you think you have answers to those questions, I&#8217;d love to hear them. And then I&#8217;ll raise the stakes, because I have a better challenge for you: can you write down your answers in an algorithm?</p>
<p><strong>Challenge:</strong><br />
Write a program checker that will either &#8220;bless&#8221; your code&#8217;s inconsistency as provably acceptable, or identify the locations of unacceptable consistency bugs.</p>
<p>The CALM Conjecture is my initial answer to that challenge.</p>
<p><span id="more-372"></span>CALM stands for &#8220;Consistency As Logical Monotonicity&#8221;. The idea is that the eventually consistent programs are exactly those that can be expressed in monotonic logic, where conclusions, once derived, are never retracted as more input arrives. By contrast, distributed non-monotonicity (e.g. destructive state modification, aggregation or &#8220;reduce&#8221;) can and must be resolved via distributed coordination logic (e.g. two-phase commit, Paxos). The idea that &#8220;temporary inconsistencies work out&#8221; amounts to ensuring that the data in question is contained within a properly protected monotonic component of the system.</p>
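<p>To make the distinction concrete, here is a minimal Python sketch (not Bloom, and not our checker). It assumes a toy replica that receives the same three messages in every possible delivery order; the message strings and both function names are hypothetical, chosen only for illustration. A monotonic accumulation (set union) should reach the same final state regardless of order, whereas a destructive update should not.</p>

```python
from itertools import permutations


def monotonic_union(messages):
    """Monotonic: the state is a set that only ever grows."""
    seen = set()
    for message in messages:
        seen.add(message)
    return frozenset(seen)


def destructive_overwrite(messages):
    """Non-monotonic: each message destroys the previous state."""
    state = None
    for message in messages:
        state = message
    return state


# Hypothetical messages, standing in for updates sent to one replica.
messages = ["add:book", "add:pen", "add:lamp"]

# Collect the distinct final states over every delivery order.
union_results = {monotonic_union(order) for order in permutations(messages)}
overwrite_results = {destructive_overwrite(order) for order in permutations(messages)}

print("distinct final states, union:", len(union_results))
print("distinct final states, overwrite:", len(overwrite_results))
```

<p>Under these assumptions the union reaches a single final state across all six delivery orders, while the overwrite ends in three different states, one per possible last message. That order sensitivity is what coordination has to resolve.</p>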
<p>Presuming that CALM can be proven with sufficient generality (restricted formal versions of it are obvious, but I&#8217;m quite sure there&#8217;s more that can be done), the next step is to translate this idea into a useful program checker. Like many other verification tasks, this is only tractable in a sufficiently high-level language &#8212; in this case, one where the constructs of the language can be translated into an underlying logic. Bloom is our language along these lines, and we have the initial program checks in place. Given a Bloom program, we can do the following:</p>
<ol>
<li>bless programs as monotonic, and hence safe to run coordination-free</li>
<li>identify non-monotonic &#8220;Points of Order&#8221; in Bloom programs. These can be resolved by either of the following (which we intend to automate):
<ul>
<li>add coordination logic (e.g. quorum consensus) to enforce the ordering, or</li>
<li>augment the program to tag downstream data as &#8220;tainted&#8221; with potential inconsistency</li>
</ul>
</li>
<li>visualize the Points of Order in a dependency graph, to help programmers reason about restructuring their code for more efficient consistency enforcement.</li>
</ol>
<p>We wrote up our initial ideas on this topic in a <del>short <a title="CALM submission to CIDR 2010" href="http://db.cs.berkeley.edu/jmh/calm-cidr-short.pdf">submission to CIDR 2010</a></del> <a title="CALM/Bloom paper" href="http://db.cs.berkeley.edu/papers/cidr11-bloom.pdf">CIDR 2011 paper</a>, including an intro to the current state of the Bloom language, and an example of analyzing a replicated shopping-cart application. This follows from the discussion in my <a title="The Declarative Imperative" href="http://www.sigmod.org/publications/sigmod-record/1003/p05.article.hellerstein.pdf">companion paper</a> to my <a title="PODS 2010 Keynote video" href="http://hosted.mediasite.com/mediasite/Viewer/?peid=123584ea3d4141ea8169b97d5e454d331d">PODS keynote talk</a> [<a title="PODS 2010 keynote PDF" href="http://db.cs.berkeley.edu/jmh/talks/podskeynote10.pdf">slides</a>].</p>
<p>I&#8217;d love feedback on these ideas, which are still a work in progress.</p>
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2010/10/28/the-calm-conjecture-reasoning-about-consistency/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://databeta.files.wordpress.com/2010/10/calmstones.jpg?w=200" medium="image">
			<media:title type="html">Calm Stones</media:title>
		</media:content>
	</item>
		<item>
		<title>Testing for Failure in the Cloud: FATE and DESTINI</title>
		<link>http://databeta.wordpress.com/2010/10/15/testing-for-failure-in-the-cloud-fate-and-destini/</link>
		<comments>http://databeta.wordpress.com/2010/10/15/testing-for-failure-in-the-cloud-fate-and-destini/#comments</comments>
		<pubDate>Fri, 15 Oct 2010 15:08:00 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[database]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=363</guid>
		<description><![CDATA[By now it&#8217;s a truism in cloud computing and internet infrastructure that component failure happens frequently. It&#8217;s simple statistics: (lots of components) * (a small failure rate per component) = a high component failure rate across the collection. People now routinely architect distributed systems for this reality. That&#8217;s necessary but not sufficient: you also need correct [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&amp;blog=5435607&amp;post=363&amp;subd=databeta&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/36179943@N00/33279565/"><img class="alignright size-medium wp-image-365" title="Boarding Pass to Fate and Destini" src="http://databeta.files.wordpress.com/2010/10/titanic.jpg?w=168&#038;h=240" alt="" width="168" height="240" /></a>By now it&#8217;s a truism in cloud computing and internet infrastructure that component failure happens frequently. It&#8217;s simple statistics: (lots of components) * (a small failure rate per component) = a high component failure rate across the collection. People now routinely architect distributed systems for this reality.</p>
<p>That&#8217;s necessary but not sufficient: you also need correct failure-handling protocols and faithful implementations.</p>
<p>So, do today&#8217;s popular distributed systems handle component failure well? Or, taking the longer view: what kinds of tools will help the engineers who build those systems ensure that they handle component failure well?</p>
<p>My postdoc <a title="Haryadi Gunawi" href="http://www.eecs.berkeley.edu/~haryadi/">Haryadi Gunawi</a> and his team have taken some big steps to answer these questions, and written them up in their report on <a title="FATE and DESTINI TR" href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-127.html">FATE and DESTINI: A Framework for Cloud Recovery Testing</a>. They take a systematic combinatorial approach to generating faults (FATE) and a formal approach to specifying correctness (DESTINI) that grows out of our work on declarative languages. Upshot: research that produces real tools, which help developers find (and then fix) real failure-handling bugs, including 16 new bug reports to HDFS (7 design bugs and 9 implementation bugs). Pretty nice, given the intricacies of failure-recovery protocols.</p>
<p><span id="more-363"></span>Haryadi conceived of and drove this project, and <a title="SQCK" href="http://www.cs.wisc.edu/adsl/Publications/sqck-osdi08.html">like his earlier work fixing the sorry state of file system checkers</a>, it takes good clean research designs and uses them to improve substantially on real-world practice.</p>
<p>Haryadi started the project like a social scientist, combing through the Jira issue-tracker reports for HDFS and classifying the recovery bugs. Then he set about generalizing issues, designing techniques, and building tools. After that he used the Jira reports to make sure his tools were getting good coverage (they found the already-reported bugs automatically) and generating tangible benefits (surfacing new bugs). He then transitioned from his first experimental setting (HDFS) to two new ones: Cassandra and ZooKeeper. Results there are preliminary but look promising.</p>
<p>In the end, the whole package feels simple, sensible, and useful. That is exactly because of the way he combined a grounding in practice with the elegance of his research ideas. Now that the initial results are written up, Haryadi is beginning discussions with Yahoo, Facebook, Cloudera and others to get this stuff out in the field.</p>
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2010/10/15/testing-for-failure-in-the-cloud-fate-and-destini/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://databeta.files.wordpress.com/2010/10/titanic.jpg?w=210" medium="image">
			<media:title type="html">Boarding Pass to Fate and Destini</media:title>
		</media:content>
	</item>
	</channel>
</rss>
