<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Data Beta</title>
	<atom:link href="http://databeta.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://databeta.wordpress.com</link>
	<description>on computing and data .. in permanent beta</description>
	<lastBuildDate>Mon, 08 Apr 2013 02:23:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='databeta.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Data Beta</title>
		<link>http://databeta.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://databeta.wordpress.com/osd.xml" title="Data Beta" />
	<atom:link rel='hub' href='http://databeta.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Announcing Trifacta</title>
		<link>http://databeta.wordpress.com/2012/10/04/trifactalaunch/</link>
		<comments>http://databeta.wordpress.com/2012/10/04/trifactalaunch/#comments</comments>
		<pubDate>Thu, 04 Oct 2012 11:30:30 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[startups]]></category>
		<category><![CDATA[trifacta]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=595</guid>
		<description><![CDATA[The big news around here today is the public announcement of Trifacta, a company I&#8217;ve been quietly cooking over the last few months with colleagues Jeff Heer and Sean Kandel of Stanford. Trifacta is taking on an important and satisfying challenge: to build a new generation of user-centric data management software that is beautiful, powerful, and eminently [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=595&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://trifacta.com"><img class="alignright  wp-image-600" title="Trifacta" src="http://databeta.files.wordpress.com/2012/10/trifacta.png?w=188&#038;h=184" alt="" width="188" height="184" /></a>The big news around here today is the public announcement of <a href="http://trifacta.com">Trifacta</a>, a company I&#8217;ve been quietly cooking over the last few months with colleagues <a href="http://jheer.org">Jeff Heer</a> and <a href="http://www.stanford.edu/~skandel/">Sean Kandel</a> of Stanford. Trifacta is taking on an important and satisfying challenge: to build a new generation of user-centric data management software that is beautiful, powerful, and eminently useful.</p>
<p>Before I talk more about the background let me say this: <a href="http://trifacta.com/jobs">We Are Hiring</a>. We&#8217;re looking for people with passion and talent in Interaction Design, Data Visualization, Databases, Distributed Systems, Languages, and Machine Learning. We&#8217;re looking for folks who want to reach across specialties, and work together to build integrated, rich, and deeply satisfying software. We&#8217;ve got top-shelf funding and a sun-soaked office in the heart of SOMA in San Francisco, and we&#8217;re building a company with clear, tangible value. It&#8217;s early days and the fun is ahead. If you ever considered joining a data startup, this is the one. <a href="http://trifacta.com/jobs">Get in touch</a>.</p>
<p><span id="more-595"></span></p>
<p>The genesis of the company goes back many years. I&#8217;ve known Jeff since his days as a grad student at Berkeley. Since then, of course, he&#8217;s made a huge splash both in the HCI research community and in the open source world via tools like <a href="http://d3js.org">D3.js</a> and <a href="http://mbostock.github.com/protovis/">Protovis</a>. After serving on Jeff&#8217;s thesis committee and watching him move to a faculty position at Stanford, I was determined to continue working with him.</p>
<p>As part of the <a href="http://deepresearch.org">d^p project</a>, Jeff and I began co-advising Sean Kandel. Sean has proven to be a research monster, with an unbroken string of accepted papers during 3 short years at Stanford. He also brought his experience as a financial analyst to the table, and a thirst for entrepreneurship.</p>
<p>Trifacta is the next phase of this collaboration. The name refers to the 3 issues that need to be addressed simultaneously to achieve this vision: People, Data, and Computation.  We&#8217;re entering a period in computing history in which Data has become ubiquitous and Computation has turned into an inexpensive commodity. At this point, People are the last bottleneck to useful data analysis &#8212; both because human skills don&#8217;t improve with Moore&#8217;s Law, and because the growth in the other resources has increased the demand for people to deal with data and do analysis.</p>
<p>Another day I will write more about Trifacta&#8217;s initial products, but suffice it to say for now that our mission is to build software that ameliorates the human bottlenecks in analysis. Our goal is both to remove data drudgery from the workflows of experienced data scientists, and to enable self-service solutions for end-users who otherwise would not be able to take advantage of new sources of data. It&#8217;s an inherently interdisciplinary problem, with interaction and visualization at its core, but with unique demands on systems and inference algorithms to drive an intelligent user experience over large volumes of data. And of course there&#8217;s also a role for the kind of high-level languages for scalable systems that I&#8217;ve been focused on for years with <a href="http://bloom-lang.net">Bloom</a> and the like.</p>
<p>If you made it this far, please wander over to the <a href="http://trifacta.com">Trifacta website</a> for more info. We&#8217;d love to follow up.</p>
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2012/10/04/trifactalaunch/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://databeta.files.wordpress.com/2012/10/trifacta.png" medium="image">
			<media:title type="html">Trifacta</media:title>
		</media:content>
	</item>
		<item>
		<title>Bill Marczak&#8217;s work on Bahraini citizen surveillance</title>
		<link>http://databeta.wordpress.com/2012/09/08/bill-marczaks-work-on-bahraini-citizen-surveillance/</link>
		<comments>http://databeta.wordpress.com/2012/09/08/bill-marczaks-work-on-bahraini-citizen-surveillance/#comments</comments>
		<pubDate>Sun, 09 Sep 2012 04:59:35 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[students]]></category>
		<category><![CDATA[bahrain]]></category>
		<category><![CDATA[marczak]]></category>
		<category><![CDATA[press]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=575</guid>
		<description><![CDATA[Bill Marczak, a PhD student in my group, does interesting research on algebraic programming languages, which I hope to describe in more detail here soon. But Bill has recently received significant attention for work he did in his spare time&#8212;a dramatically successful cyber-espionage effort to expose government misuse of commercial surveillance software in Bahrain, the nation [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=575&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<div class="wp-caption alignright" style="width: 346px"><a href="http://www.nytimes.com/2012/08/31/technology/finspy-software-is-tracking-political-dissidents.html"><img src="http://graphics8.nytimes.com/images/2012/08/31/business/Hackjp2/Hackjp2-articleLarge.jpg" alt="" width="336" height="207" /></a><p class="wp-caption-text">Bill Marczak, right, in NY Times</p></div>
<p><a href="http://www.nytimes.com/2012/08/31/technology/finspy-software-is-tracking-political-dissidents.html">Bill Marczak</a>, a PhD student in my group, does interesting research on algebraic programming languages, which I hope to describe in more detail here soon.</p>
<p>But Bill has recently received significant attention for work he did in his spare time&#8212;a dramatically successful cyber-espionage effort to expose government misuse of commercial surveillance software in Bahrain, the nation where Bill attended high school. The story picked up major press coverage in venues including <a href="http://www.bloomberg.com/news/2012-07-25/cyber-attacks-on-activists-traced-to-finfisher-spyware-of-gamma.html">Bloomberg</a> and the <a href="http://www.nytimes.com/2012/08/31/technology/finspy-software-is-tracking-political-dissidents.html">New York Times</a>, which also ran a more detailed article in their <a href="http://bits.blogs.nytimes.com/2012/08/31/how-two-amateur-sleuths-looked-for-finspy-software/">Bits Blog</a>.</p>
<p>I&#8217;m always happy to see the press pick up on my students&#8217; work, but this one is special.</p>
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2012/09/08/bill-marczaks-work-on-bahraini-citizen-surveillance/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://graphics8.nytimes.com/images/2012/08/31/business/Hackjp2/Hackjp2-articleLarge.jpg" medium="image" />
	</item>
		<item>
		<title>An open letter to Matt</title>
		<link>http://databeta.wordpress.com/2012/06/20/an-open-letter-to-matt/</link>
		<comments>http://databeta.wordpress.com/2012/06/20/an-open-letter-to-matt/#comments</comments>
		<pubDate>Wed, 20 Jun 2012 10:25:02 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[blog]]></category>
		<category><![CDATA[research]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=577</guid>
		<description><![CDATA[Matt Welsh of Google—formerly of Harvard, Berkeley and Cornell—is a deservedly well-read blogger in the computing community.  He&#8217;s also somebody I&#8217;ve admired since his early days in grad school as a smart, authentic person. Matt&#8217;s been working through his transition from Harvard Professor to Googler in public over the last year or so, and it&#8217;s [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=577&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><em><a href="http://www.flickr.com/photos/estherase/34575328/"><img class="alignright  wp-image-578" title="smilefrown" src="http://databeta.files.wordpress.com/2012/06/smilefrown.jpg?w=320&#038;h=240" alt="" width="320" height="240" /></a><a title="Volatile and Decentralized blog" href="http://matt-welsh.blogspot.com">Matt Welsh</a> of Google—formerly of Harvard, Berkeley and Cornell—is a deservedly well-read blogger in the computing community.  He&#8217;s also somebody I&#8217;ve admired since his early days in grad school as a smart, authentic person.</em></p>
<p><em>Matt&#8217;s been working through his transition from Harvard Professor to Googler in public over the last year or so, and it&#8217;s been interesting to watch what he says, and the discussion it provokes.  <a href="http://matt-welsh.blogspot.com/2012/06/startup-university.html">His latest post</a> was a little more acid than usual though, with respect to the value of academic computer science.  My response got pretty long, and in the end I figured it&#8217;d be better to toss it up in my own space.</em></p>
<p>Matt:</p>
<p>Rather than run down work you don&#8217;t like—including maybe your own prior work, as assessed on one of your dark days—think about the academic work over the last 50 years that you admire the hell out of. I know you could name a few heroes. I bet a bunch of your blog&#8217;s readers could get together and name a whole lot more. Now imagine the university system hadn&#8217;t been around and reasonably well-funded at the time, because it was considered &#8220;inefficient when it comes to producing real products that shape the world&#8221;.   It&#8217;s sad to consider.</p>
<p>Here&#8217;s another thing you and your readers should consider: Forget efficiency. At least, forget it on the timescale you measure in your current job. Instead, aspire to do work that is as groundbreaking and important as the best work in the history of the field. And at the same time, inspire generations of brilliant students to do work that is even better—better than your very best. That&#8217;s what great universities are for, Matt. Remember? Sure you do. And yes—it&#8217;s goddamn audacious. As well it should be.</p>
<p><span id="more-577"></span></p>
<p>Now, can you get up and be actively audacious every day? Nobody can, not even the Greats. So we take a portfolio approach at all levels: lots of universities, lots of research grants, lots of individual faculty and students, and—for some individuals—lots of different career phases to see where your changing interests and skills fit best. It&#8217;s good to move around. But all the while, you have to respect the portfolio, Matt. It&#8217;s been working fabulously over the long term. And it&#8217;s done so on a tiny fraction of the budget of corporate America, a tiny fraction of the tax base.</p>
<p>The funny thing about your timing here is that the short-term view is actually really rosy right now—times are <em>great</em> for academic/industrial fluidity in computing. Companies and VCs and pundits are flocking to academic leaders for ideas and guidance. Academics are more aware than ever about what&#8217;s going on in industry, and some of the really good ones have taken the plunge into companies to do the tech transfer. Times are really good, even on the small timescale.</p>
<p>Can things get better? Sure they can. Is the university research model perfect? Of course not. We should have a portfolio of experiments there too, including new models for tech transfer and collaboration between academia, industry, and venture funding. (I <a href="http://databeta.wordpress.com/2012/04/09/madlib-sigmod/">blogged about one of my experiments</a> recently). So by all means go back to your post and distill down some of the constructive parts.  Spend time on some of them.  You&#8217;re not alone in wanting to experiment with these models and improve the pipeline, and there&#8217;s a good constructive conversation to have.</p>
<p>Meanwhile, stay positive! Life is good. Research is good! You have a big platform there (and I don&#8217;t mean all those computers you can log into). Use it for good. Remember your heroes. Emulate and inspire.</p>
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2012/06/20/an-open-letter-to-matt/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://databeta.files.wordpress.com/2012/06/smilefrown.jpg" medium="image">
			<media:title type="html">smilefrown</media:title>
		</media:content>
	</item>
		<item>
		<title>Metablogging MADlib opening @ SIGMOD</title>
		<link>http://databeta.wordpress.com/2012/04/09/madlib-sigmod/</link>
		<comments>http://databeta.wordpress.com/2012/04/09/madlib-sigmod/#comments</comments>
		<pubDate>Mon, 09 Apr 2012 15:55:31 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[MADlib]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[research]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=559</guid>
		<description><![CDATA[When the folks at ACM SIGMOD asked me to be a guest blogger this month, I figured I should highlight the most community-facing work I&#8217;m involved with.  So I wrote up a discussion of MADlib and the fact that this open-source in-database analytics library is now open to community contributions. (A bunch of us recently wrote [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=559&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://databeta.files.wordpress.com/2011/07/screen-shot-2011-07-10-at-4-27-26-pm.png"><img class="alignright size-full wp-image-489" title="MADlib logo" src="http://databeta.files.wordpress.com/2011/07/screen-shot-2011-07-10-at-4-27-26-pm.png?w=510" alt=""   /></a>When the folks at ACM SIGMOD asked me to be a guest blogger this month, I figured I should highlight the most community-facing work I&#8217;m involved with.  So I wrote up a <a href="http://wp.sigmod.org/?p=344">discussion of MADlib</a> and the fact that this open-source in-database analytics library is now open to community contributions. (A bunch of us recently wrote a <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-38.html">paper on the design and use of MADlib</a>, which made my writing job a bit easier.) I&#8217;m optimistic about MADlib closing a gap between algorithm researchers and working data scientists, using familiar SQL as a vector for adoption on both fronts.</p>
<p><span id="more-559"></span>I kicked off MADlib as a part-time consulting project for Greenplum during my sabbatical in 2010-2011.  As I built out the first two methods (FM and CountMin sketches) and an installer, Greenplum started assembling a team of their own engineers and data scientists to overlap with and eventually replace me when I returned to campus.  They also developed a roadmap of additional methods that their customers wanted in the field.  Eighteen months later, Greenplum now contributes the bulk of the labor, management and expertise for the project, and has built bridges to leading academics as well.</p>
<p>While I&#8217;ve encouraged Greenplum&#8217;s investment all along, it has been equally important to me that MADlib avoid becoming Yet Another Proprietary Library for Analytics.  There are a bunch of those libraries provided by DBMS vendors and third parties; they are all-but-invisible to researchers, and are immutable black boxes to data scientists in the field. So while some of them may be quite good (who knows?!) they don&#8217;t create or benefit from a network effect in the technical community.  Given the current <a href="http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation">excitement about Big Data and scarcity of data scientists</a>, that kind of isolation could be deadly to both adoption and evolution.</p>
<p>As a result of those considerations, MADlib is open source, and I&#8217;ve insisted all along that it be maintained on at least one open-source platform: PostgreSQL.  I&#8217;m also genuinely eager to see MADlib ports to other massively-parallel commercial DBMSs.  This will require some work, but it&#8217;s tractable and well-scoped engineering work, and I hope that incentives will arise  to get it done and expand the scope of the community.</p>
<p>Finally, I&#8217;m happy to see academics like <a href="http://pages.cs.wisc.edu/~chrisre/">Chris Ré</a>, <a href="http://www.cise.ufl.edu/~daisyw/">Daisy Wang</a> and their students starting to contribute to the project.  I hope their research will help out data scientists in the field, and generate the kind of user feedback that is so hard to get in academia. I also hope that other researchers will join them and get involved.  I think it&#8217;s great experience to write meaty research code that actually ships, and I hope MADlib can be a mechanism to encourage more of that experience in the community.</p>
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2012/04/09/madlib-sigmod/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://databeta.files.wordpress.com/2011/07/screen-shot-2011-07-10-at-4-27-26-pm.png" medium="image">
			<media:title type="html">MADlib logo</media:title>
		</media:content>
	</item>
		<item>
		<title>Quantifying Eventual Consistency via PBS</title>
		<link>http://databeta.wordpress.com/2012/01/10/pbs-quantifying-eventual-consistency-of-replicas/</link>
		<comments>http://databeta.wordpress.com/2012/01/10/pbs-quantifying-eventual-consistency-of-replicas/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 18:29:53 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[boom]]></category>
		<category><![CDATA[consistency]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[bailis]]></category>
		<category><![CDATA[cassandra]]></category>
		<category><![CDATA[datastax]]></category>
		<category><![CDATA[eventual consistency]]></category>
		<category><![CDATA[linkedin]]></category>
		<category><![CDATA[pbs]]></category>
		<category><![CDATA[quorum systems]]></category>
		<category><![CDATA[replica consistency]]></category>
		<category><![CDATA[venkataraman]]></category>
		<category><![CDATA[yammer]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=528</guid>
		<description><![CDATA[If you follow this blog, you know that my BOOM group has spent a lot of time in the past couple years formalizing eventual consistency (EC) for distributed programs, via the CALM theorem and practical tools for analyzing Bloom programs. In recent months, my student Peter Bailis and his teammate Shivaram Venkataraman took a different [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=528&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://databeta.files.wordpress.com/2012/01/92755998_b14771616a.jpg"><img class="alignright  wp-image-549" title="Copy" src="http://databeta.files.wordpress.com/2012/01/92755998_b14771616a.jpg?w=350&#038;h=263" alt="Copy" width="350" height="263" /></a>If you follow this blog, you know that my <a href="http://boom.cs.berkeley.edu">BOOM</a> group has spent a lot of time in the past couple years formalizing eventual consistency (EC) for distributed programs, via the <a title="CALM theorem post" href="http://databeta.wordpress.com/2010/10/28/the-calm-conjecture-reasoning-about-consistency/">CALM theorem</a> and <a title="Budplot and Budviz" href="https://github.com/bloom-lang/bud/blob/master/docs/visualizations.md">practical tools</a> for analyzing <a href="http://bloom-lang.org">Bloom</a> programs.</p>
<p>In recent months, my student <a href="http://www.cs.berkeley.edu/~pbailis/">Peter Bailis</a> and his teammate <a href="http://amplab.cs.berkeley.edu/author/shivaram/">Shivaram Venkataraman</a> took a different tack on the whole EC analysis problem which they call <strong>PBS: Probabilistically Bounded Staleness</strong>. The results are interesting, and extremely relevant to current practice.  (See, for example, the very nice <a href="http://www.datastax.com/dev/blog/your-ideal-performanceconsistency-tradeoff">blog post</a> by folks at <a href="http://www.datastax.com">DataStax</a>).</p>
<p>Many people today deal with EC in the specific context of replica consistency, particularly in distributed NoSQL-style Key-Value Stores (KVSs). It is typical to configure these stores with so-called &#8220;partial&#8221; quorum replication, to get a comfortable mix of low latency with reasonable availability. The term &#8220;partial&#8221; signifies that you are not guaranteed consistency of writes by these configurations &#8212; at best they guarantee a form of eventual consistency of final writes, but readers may well read stale data along the way. Lots of people are deploying these configurations in the field, but there&#8217;s little information on how often the approach messes up, and how badly.</p>
<p>Jumping off from earlier <a title="Malkhi  et al on Probabilistic Quorums" href="http://dl.acm.org/citation.cfm?id=259458&amp;dl=">theoretical work on probabilistic quorum systems</a>, Peter and Shivaram answered two natural questions about how these systems should perform in current practice:</p>
<ol>
<li><strong>How many versions ago?</strong>  In expectation, if you do a read in a partial-quorum KVS, how many versions behind are you? Peter and Shivaram answer this one definitively, via a closed-form mathematical analysis.</li>
<li><strong>How stale on the (wall-)clock?</strong>  In expectation, if you do a read in a partial-quorum KVS, how out-of-date will your version be in terms of wall-clock time? Answering this one requires modeling a read/write workload in wall-clock time, as well as system parameters like replica propagation (&#8220;anti-entropy&#8221;). Peter and Shivaram address this with a Monte Carlo model, and run the model with parameters grounded in real-world performance numbers generously provided by two of our most excellent colleagues: <a href="http://twitter.com/#!/strlen">Alex Feinberg</a> at <a href="http://www.linkedin.com">LinkedIn</a> and <a href="http://codahale.com/">Coda Hale</a> at <a href="http://www.yammer.com">Yammer</a> (both of whom also guest-lectured in my <a href="http://programthecloud.github.com/">Programming the Cloud</a> course last fall).  Peter and Shivaram validated their models in practice using <a href="http://cassandra.apache.org/">Cassandra</a>, a widely-used KVS.</li>
</ol>
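<p>To make the first question concrete, here is a minimal Python sketch of the kind of closed-form calculation involved. It is an illustration under simplifying assumptions, not code from the paper: each write is assumed to land on <code>w</code> of the <code>n</code> replicas chosen uniformly at random, each read contacts <code>r</code> replicas chosen the same way, and no anti-entropy runs in between. The function name <code>stale_read_probability</code> and its parameters are hypothetical.</p>

```python
from math import comb


def stale_read_probability(n, r, w, k=1):
    """Probability that a read misses all of the last k committed versions.

    Illustrative assumptions: every write lands on w of the n replicas
    chosen uniformly at random, every read contacts r replicas chosen the
    same way, and no anti-entropy runs between operations.
    """
    # A read misses one version when none of its r replicas is among the
    # w replicas that hold it: C(n - w, r) / C(n, r).
    # math.comb returns 0 when r exceeds n - w, so overlapping (strict)
    # quorums correctly yield probability 0.
    miss_one = comb(n - w, r) / comb(n, r)
    # With independent writes, missing k versions in a row multiplies out.
    return miss_one ** k


if __name__ == "__main__":
    # Three replicas with read and write quorums of one: a partial quorum.
    print(stale_read_probability(3, 1, 1))
    print(stale_read_probability(3, 1, 1, k=3))
    # Strict quorum (r + w exceeds n): reads always see the latest write.
    print(stale_read_probability(3, 2, 2))
```

<p>Under these assumptions the chance of reading badly stale data falls off geometrically in the number of versions, which is one way to see why partial quorums can behave well in practice. The wall-clock question needs latency distributions as input and is not captured by this sketch.</p>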
<p>On the whole, PBS shows that being sloppy about consistency doesn&#8217;t bite you often or badly &#8212; especially if you&#8217;re in a single datacenter and you use SSDs. But things get more complex with magnetic disks, garbage collection delays (grr), and wide-area replication.</p>
<p>Interested in more detail?  You can check out two things:</p>
<ul>
<li>The paper (currently under submission to a conference) is <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-4.pdf">available as a Berkeley Tech Report</a>.</li>
<li>Peter put up a <a href="http://www.cs.berkeley.edu/~pbailis/projects/pbs/">web-based version of the Monte Carlo simulation</a> that allows you to specify quorum parameters and workload parameters, and observe the tradeoff between those parameters and the probability of various levels of staleness.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2012/01/10/pbs-quantifying-eventual-consistency-of-replicas/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://databeta.files.wordpress.com/2012/01/92755998_b14771616a.jpg" medium="image">
			<media:title type="html">Copy</media:title>
		</media:content>
	</item>
		<item>
		<title>Maxim-ization</title>
		<link>http://databeta.wordpress.com/2011/09/22/maxim-ization/</link>
		<comments>http://databeta.wordpress.com/2011/09/22/maxim-ization/#comments</comments>
		<pubDate>Fri, 23 Sep 2011 00:50:35 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[research]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=522</guid>
		<description><![CDATA[For a class I&#8217;m teaching, I&#8217;d like to collect a list of favorite &#8220;maxims&#8221; or &#8220;aphorisms&#8221; for computer systems. I&#8217;d be very grateful if you would add your favorites below to the comments, preferably with a link to a source that either introduces or references the maxim.  It&#8217;s OK to agree or disagree with the [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=522&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://databeta.files.wordpress.com/2011/09/maxim.jpg"><img class="alignright size-full wp-image-523" title="maxim" src="http://databeta.files.wordpress.com/2011/09/maxim.jpg?w=510" alt=""   /></a>For a class I&#8217;m teaching, I&#8217;d like to collect a list of favorite &#8220;maxims&#8221; or &#8220;aphorisms&#8221; for computer systems.</p>
<p>I&#8217;d be very grateful if you would add your favorites below to the comments, preferably with a link to a source that either introduces or references the maxim.  It&#8217;s OK to agree or disagree with the maxim.</p>
<p>I&#8217;d enjoy seeing people&#8217;s support/critiques for these below as well &#8212; may merit more focused posts another day.</p>
<p>Examples:</p>
<ul>
<li>The <a href="http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf">End-to-End Argument</a>, Saltzer/Reed/Clark, ICDCS &#8217;81.</li>
<li>Many examples in Lampson&#8217;s &#8220;<a href="http://research.microsoft.com/en-us/um/people/blampson/33-Hints/Acrobat.pdf">Hints for Computer System Design</a>&#8221;, SOSP &#8217;83.</li>
<li>&#8220;All problems in computer science can be solved by another level of indirection&#8221;. &#8212; also <a href="http://www.dmst.aueb.gr/dds/pubs/inbook/beautiful_code/html/Spi07g.html">attributed to Butler Lampson</a></li>
</ul>
<p>What else?</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/databeta.wordpress.com/522/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/databeta.wordpress.com/522/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=522&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2011/09/22/maxim-ization/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://databeta.files.wordpress.com/2011/09/maxim.jpg" medium="image">
			<media:title type="html">maxim</media:title>
		</media:content>
	</item>
		<item>
		<title>Is Teaching MapReduce Healthy for Students?</title>
		<link>http://databeta.wordpress.com/2011/09/15/is-teaching-mapreduce-healthy-for-students/</link>
		<comments>http://databeta.wordpress.com/2011/09/15/is-teaching-mapreduce-healthy-for-students/#comments</comments>
		<pubDate>Fri, 16 Sep 2011 04:47:13 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[bloom]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[map reduce]]></category>
		<category><![CDATA[parallelism]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=497</guid>
		<description><![CDATA[I sat at Berkeley CS faculty lunch this past week with Brian Harvey and Dan Garcia, two guys who think hard about teaching computing to undergraduates.  I was waxing philosophical about how we need to get data-centric thinking embedded deep into the initial CS courses—not just as an application of traditional programming, but as a [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=497&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><img class="alignright" title="Another Mother logo" src="http://www.anothermother.org/wp-content/themes/AnotherMother/images/masthead_poster.gif" alt="Hadoop is not healthy for children and other living things." width="232" height="261" />I sat at Berkeley CS faculty lunch this past week with <a href="http://www.cs.berkeley.edu/~bh/">Brian Harvey</a> and <a href="http://www.cs.berkeley.edu/~ddgarcia/">Dan Garcia</a>, two guys who think hard about teaching computing to undergraduates.  I was waxing philosophical about how we need to get data-centric thinking embedded deep into the initial CS courses—not just as an application of traditional programming, but as a key frame of reference for how to think about computing per se.</p>
<p>Dan pointed out that he and Brian and others took steps in this direction years ago at Berkeley, by <a title="Brian Harvey teaches MapReduce" href="http://www.youtube.com/watch?v=mVXpvsdeuKU">introducing MapReduce and Hadoop in our initial 61A course</a>.  I have <a href="http://radar.oreilly.com/2008/11/the-commoditization-of-massive.html">argued elsewhere</a> that this is a Good Thing, because it gets people used to the kind of disorderly thinking needed for scaling distributed and data-centric systems.</p>
<p>But as a matter of both pedagogy and system design, I have begun to think that Google&#8217;s MapReduce model is not healthy for beginning students.  The basic issue is that Google&#8217;s narrow MapReduce API conflates <em>logical semantics</em> (define a function over all items in a collection) with an expensive <em>physical implementation</em> (use a parallel barrier). As it happens, many common cluster-wide operations over a collection of items do not require a barrier even though they may require all-to-all communication.  But there&#8217;s no way to tell the API whether a particular Reduce method has that property, so the runtime always does the most expensive thing imaginable in distributed coordination: global synchronization.</p>
<p>From an architectural point of view, <em>a good language for parallelism should expose pipelining</em>, and MapReduce hides it. Brian suggested I expand on this point somewhere so people could talk about it.  So here we go.</p>
<p><span id="more-497"></span></p>
<p>The canonical example here is the relational join, which pairs up matching objects from two different input streams.  One of the early observations in shared-nothing parallel databases was that parallel joins can work in a fully streaming fashion even though they require all-to-all communication (<a title="Wilschut Apers" href="http://scholar.google.com/scholar?q=Dataflow+Query+Execution+in+a+Parallel+Main-Memory+Environment+(1991)">Wilschut and Apers&#8217; Pipelining Hash Join</a>).  This is impossible to implement in stock Hadoop.  Which means, for example, that the NSA—apparently a big Hadoop user—can&#8217;t use MapReduce to match live feeds of suspicious activity reports with a repository of intelligence on suspicious people.  (Whether you think it&#8217;s good or bad that the NSA can&#8217;t use Hadoop for this task may depend on your politics regarding national security, open source, or both.)</p>
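<p>To make the pipelining idea concrete, here is a minimal single-process sketch in plain Ruby of a symmetric hash join: every arriving tuple is stored in its own side&#8217;s hash table and immediately probed against the other side&#8217;s table, so matches stream out without waiting for either input to finish. The class name, method names, and sample data are hypothetical, invented for illustration. This is not code from the Wilschut and Apers paper or from Hadoop, and it ignores partitioning across machines.</p>
<pre>
```ruby
# Symmetric (pipelining) hash join, single-process sketch.
# Names here are hypothetical; nothing below comes from Hadoop or the paper.
class PipeliningHashJoin
  def initialize
    # One hash table per input; each key maps to the values seen so far.
    @tables = {
      left: Hash.new { |h, k| h[k] = [] },
      right: Hash.new { |h, k| h[k] = [] }
    }
  end

  # side is :left or :right. Returns the join results this tuple produces,
  # each as [key, left_value, right_value].
  def insert(side, key, value)
    other = (side == :left) ? :right : :left
    @tables[side][key].push(value)
    # Probe the opposite table; fetch avoids creating empty entries.
    @tables[other].fetch(key, []).map do |match|
      (side == :left) ? [key, value, match] : [key, match, value]
    end
  end
end

join = PipeliningHashJoin.new
results = []
results.concat(join.insert(:left, "alice", "report-1"))   # nothing to match yet
results.concat(join.insert(:right, "alice", "dossier-A")) # matches report-1
results.concat(join.insert(:left, "alice", "report-2"))   # matches dossier-A
results.concat(join.insert(:right, "bob", "dossier-B"))   # no left tuple for bob
```
</pre>
<p>Each call returns its matches right away, which is the property a barrier between the two inputs would take away.</p>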
<p>That example is easy to understand, but maybe too easy to dismiss as the special case of &#8220;stream processing&#8221;.  Sure, Hadoop wasn&#8217;t designed for that, so maybe you should dismiss my point as saying that somebody should just build a streaming edition of Hadoop for special uses.  (Oh yeah, <a href="http://code.google.com/p/hop/">somebody did that</a>.)</p>
<p>A more telling pedagogical example is the standard discussion of implementing PageRank in MapReduce.  What an awesome lecture: the two most famous pieces of Google technology in action at once!  The thing is, at a logical level PageRank is a simple self-join query with transitive closure: it pairs up items in the Nodes table with their neighbors in that same table, and outputs new weights until a stopping condition is reached. The matching of nodes and neighbors needs to be hash-partitioned across the network, but there is no reason to require the entire cluster to do a barrier synchronization at each iteration boundary.  The subtext of this lecture&#8217;s awesomeness is to deeply implant a misleading (and potentially very inefficient) idea about the need for synchronized execution in parallel linear algebra and query processing tasks.  That sounds like a bad way to teach.</p>
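<p>As a rough illustration of the &#8220;PageRank is a join&#8221; view, the sketch below writes one iteration as a join of the edge list with the current ranks, followed by a grouped sum. It is plain Ruby and runs one step at a time in a single process, so it only shows the logical shape of the query and says nothing about how a cluster would schedule it. The function name, the damping value of 0.85, and the three-node graph are assumptions made for the example; ranks must be Floats, and every node is assumed to appear in the ranks table.</p>
<pre>
```ruby
# One PageRank iteration as "join, then group-by and sum".
# edges: array of [src, dst] pairs; ranks: hash of node to Float rank.
# The damping default of 0.85 is a conventional choice, assumed here.
def pagerank_step(edges, ranks, damping: 0.85)
  out_degree = Hash.new(0)
  edges.each { |src, _dst| out_degree[src] += 1 }

  # Join: pair each edge with the rank of its source node.
  contributions = edges.map { |src, dst| [dst, ranks[src] / out_degree[src]] }

  # Group by destination and sum, starting every node at the base weight.
  base = (1.0 - damping) / ranks.size
  new_ranks = {}
  ranks.each_key { |node| new_ranks[node] = base }
  contributions.each { |dst, share| new_ranks[dst] += damping * share }
  new_ranks
end

# A three-node cycle, where the uniform ranking is already a fixed point.
edges = [["a", "b"], ["b", "c"], ["c", "a"]]
ranks = { "a" => 1.0 / 3, "b" => 1.0 / 3, "c" => 1.0 / 3 }
10.times { ranks = pagerank_step(edges, ranks) }
```
</pre>
<p>Nothing in the join or the sum refers to an iteration boundary; the lockstep loop at the bottom is a convenience of this sketch.</p>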
<p>And it goes deeper.  At a very basic level, <em>asynchronous parallel computation is fundamentally a matter of streaming rendezvous (join)</em> between event channels and buffers.  We converted this observation into a first-class feature of the <a href="http://bloom-lang.org">Bloom</a> language for distributed systems. When you write event-handling code in Bloom, you express it as the parallel join of a partitioned event stream with partitioned program state. Whether or not this metaphor makes sense to you (yet!), I assure you that it is fundamentally what goes on under the covers of any asynchronous distributed system: the matching of streams of requests or responses with stored state.  By teaching students the Google MapReduce model (as represented in Hadoop), we teach them that distributed joins need to block, and therefore cannot be the basis of the kind of streaming computation that servers must do by their very nature.  That is not only bad, it&#8217;s paradoxical: underneath the Hadoop implementation is a message handling dataflow that effectively does streaming joins.</p>
<p>So there—at Brian&#8217;s request I wrote down some cautionary words about using Hadoop in school.  Lest somebody read this as my declaration of allegiance to the &#8220;<a href="http://www.cs.washington.edu/homes/billhowe/mapreduce_a_major_step_backwards.html#comment-687">get off my lawn</a>&#8221; side of the AntiNoSQLdisestablishmentarianism argument, let me clarify my view on a few key things:</p>
<ul>
<li>Google MapReduce and the Hadoop open-source implementation opened the floodgates for discussing big data and parallelism in the core of the computing curriculum.  This has been a critical step in moving computing education into the era of abundant computing and data resources.  Bully for that.</li>
<li>I do understand that Google put barriers into MapReduce for fault tolerance purposes.  I also get that fault-tolerance is important in scale-out of parallel dataflows (we did <a title="Flux Fault Tolerance" href="http://db.cs.berkeley.edu/papers/sigmod04-fluxft.pdf">some work</a> on this) and that Google uses a lot of machines at once.  But as a practical matter, many Hadoop users run on a small handful of nodes and don&#8217;t need Google-scale fault tolerance features.  More to the point, when we teach computing, we shouldn&#8217;t focus on only one design point; we should focus on the fundamentals.  This is a case where the fundamentals get taught in a tangled way that only makes sense at extreme scale.</li>
<li>For the record, as languages go I like MapReduce about as much as I like SQL. That is to say I would invite them both to my birthday party. But I wouldn&#8217;t invite either to the prom.  My sweetheart here is <a href="http://bloom-lang.org">Bloom</a>.</li>
</ul>
<p>If you&#8217;ve been following our Bloom work, you know where this discussion is coming from: the <a href="http://databeta.wordpress.com/2010/10/28/the-calm-conjecture-reasoning-about-consistency/">CALM Theorem</a> says that the real reason to use coordination is to manage non-monotonic reasoning.  Many (most?) Reduce functions in the wild are in fact monotonic w.r.t. subsets of their inputs, but the standard Reduce API defies monotonicity analysis.  And introducing barriers for monotonic computation is a <a href="https://databeta.wordpress.com/2010/12/03/the-cron-principle/">waste of &#8220;time&#8221;</a>, in the temporal logical sense of the word.</p>
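<p>A small example of what a monotonic Reduce looks like, as a sketch in plain Ruby rather than Bloom or Hadoop code: the accumulated set of distinct values only ever grows, so batches can be folded in as they arrive, in any order, and a threshold test that has turned true can never turn false again. The class and method names are hypothetical. By contrast, a test such as &#8220;exactly k distinct values&#8221; is non-monotonic and really would have to wait for all the input.</p>
<pre>
```ruby
require "set"

# A monotonic "reduce": the set of distinct values only grows, so any
# partial result is a safe lower bound on the final one.
# Class and method names are hypothetical, for illustration only.
class StreamingDistinct
  attr_reader :seen

  def initialize
    @seen = Set.new
  end

  # Fold in one batch from any partition, in any order.
  def absorb(batch)
    @seen.merge(batch)
    self
  end

  # Monotonic threshold test: once true, more input cannot make it false.
  def at_least?(k)
    @seen.size >= k
  end
end

batches = [[1, 2], [2, 3], [4]]
in_order = batches.reduce(StreamingDistinct.new) { |acc, b| acc.absorb(b) }
reversed = batches.reverse.reduce(StreamingDistinct.new) { |acc, b| acc.absorb(b) }

# After only the first batch, the threshold of 2 is already decided.
early = StreamingDistinct.new.absorb(batches.first)
decided_early = early.at_least?(2)
```
</pre>
<p>Both arrival orders end with the same set, and the early answer needed no barrier.</p>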
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/databeta.wordpress.com/497/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/databeta.wordpress.com/497/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=497&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2011/09/15/is-teaching-mapreduce-healthy-for-students/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://www.anothermother.org/wp-content/themes/AnotherMother/images/masthead_poster.gif" medium="image">
			<media:title type="html">Another Mother logo</media:title>
		</media:content>
	</item>
		<item>
		<title>MADlib goes beta!  Serious in-database analytics</title>
		<link>http://databeta.wordpress.com/2011/07/10/madlib-goes-beta-serious-in-database-analytics/</link>
		<comments>http://databeta.wordpress.com/2011/07/10/madlib-goes-beta-serious-in-database-analytics/#comments</comments>
		<pubDate>Sun, 10 Jul 2011 21:08:12 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[greenplum]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[MADlib]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[MAD]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=488</guid>
		<description><![CDATA[MADlib is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha to the first beta release, version 0.20beta. Hats off to the MADlib team! Forget your previous associations with low-tech SQL analytics, including so-called &#8220;business intelligence&#8221;, &#8220;olap&#8221;, &#8220;data cubes&#8221; and [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=488&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a title="MADlib" href="http://madlib.net"><img class="alignright size-medium wp-image-489" title="MADlib logo" src="http://databeta.files.wordpress.com/2011/07/screen-shot-2011-07-10-at-4-27-26-pm.png?w=300&#038;h=110" alt="" width="300" height="110" /></a><a title="MADlib" href="http://madlib.net">MADlib</a> is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha to the first beta release, version 0.20beta. Hats off to the MADlib team!</p>
<p>Forget your previous associations with low-tech SQL analytics, including so-called &#8220;business intelligence&#8221;, &#8220;olap&#8221;, &#8220;data cubes&#8221; and the like. This is the real deal: statistical and machine learning methods running at scale within the database, massively parallel, close to the data. Much of <a title="MADlib GitHub repo" href="https://github.com/madlib/madlib">the code</a> is written in SQL (a language that doesn&#8217;t get enough credit as a basis for parallel statistics), with key extensions in C/C++ for performance, and the occasional Python glue code. The suite of methods in the beta includes:</p>
<ul>
<li>standard statistical methods like multivariate linear and logistic regressions,</li>
<li>supervised learning methods including support-vector machines, naive Bayes, and decision trees,</li>
<li>unsupervised methods including k-means clustering, association rules, and Latent Dirichlet Allocation,</li>
<li>descriptive statistics and data profiling, including one-pass Flajolet-Martin and CountMin sketch methods (my personal contributions to the library) to compute distinct counts, range-counts, quantiles, various types of histograms, and frequent-value identification,</li>
<li>statistical support routines including an efficient sparse vector library, array operations, and conjugate gradient optimization.</li>
</ul>
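<p>For readers who have not met the sketch methods in that list, here is a minimal CountMin sketch in plain Ruby. It is an illustration of the data structure only, not MADlib&#8217;s implementation (which lives in the database, in SQL and C). The class name, the width and depth values, and the use of CRC32 salted with the row index as a stand-in for a family of independent hash functions are all assumptions made for this example.</p>
<pre>
```ruby
require "zlib"

# Count-Min sketch: a fixed-size grid of counters filled in one pass.
# Estimates can overcount because of collisions but never undercount.
# Illustrative only; this is not MADlib's code.
class CountMinSketch
  def initialize(width, depth)
    @width = width
    @rows = Array.new(depth) { Array.new(width, 0) }
  end

  # Increment one counter per row for this item.
  def add(item, count = 1)
    @rows.each_with_index do |row, i|
      row[bucket(item, i)] += count
    end
  end

  # The minimum across rows is the tightest available upper bound.
  def estimate(item)
    @rows.each_with_index.map { |row, i| row[bucket(item, i)] }.min
  end

  private

  # CRC32 salted with the row index stands in for independent hashes;
  # a careful implementation would use pairwise-independent functions.
  def bucket(item, row_index)
    Zlib.crc32("#{row_index}:#{item}") % @width
  end
end

sketch = CountMinSketch.new(64, 4)
%w[a b a c a].each { |word| sketch.add(word) }
estimate_a = sketch.estimate("a")   # at least 3, the true count
```
</pre>
<p>The memory used is fixed by the width and depth, whatever the number of distinct items, which is what makes this kind of summary attractive for one-pass profiling of large tables.</p>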
<p>More methods are planned for future releases.  Myself, I&#8217;m working with Daisy Wang on merging <a title="Daisy Wang's latest SIGMOD paper" href="http://www.cs.berkeley.edu/~daisyw/sigmod11.pdf">her SQL-based Conditional Random Fields and Bayesian inference</a> implementations into the library for an upcoming release, to support sophisticated text processing.</p>
<p><span id="more-488"></span>MADlib 0.20beta was designed as a drop-in addition to the popular open-source PostgreSQL database.  The routines also all run massively parallel in the Greenplum database, which inherits the extensibility interfaces of PostgreSQL. (Non-commercial users can get the parallelism for free—no crippleware!—via Greenplum&#8217;s <a title="Greenplum Community Edition" href="http://www.greenplum.com/products/community-edition">Community Edition</a>.)</p>
<p>Obviously the first beta release is still experimental code.  But we&#8217;ve come a long way since our alpha announcement at the <a title="Strata 2011" href="http://strataconf.com/strata2011/">Strata conference</a> last February in both breadth and usability. For me, the most exciting aspect of the beta release is that we&#8217;re about ready to grow our developer community beyond the initial committers at Berkeley and EMC-Greenplum.  If you&#8217;re interested in adding methods to MADlib—or in developing ports for other DBMS platforms (something I&#8217;d love to see happen!)—please <a title="MADlib user forum" href="http://groups.google.com/group/madlib-user-forum">get in touch</a>.</p>
<p>And hats off to EMC-Greenplum for putting significant development resources behind this open-source effort.  I started this project at Greenplum before they were acquired, and have been happy to see EMC embrace it and push it further.  As <a title="O'Reilly Interview at Strata" href="http://www.youtube.com/watch?v=lSvI2UXCVHQ">I discussed at Strata last year</a>, I think this is a very healthy direction for open-source software development.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/databeta.wordpress.com/488/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/databeta.wordpress.com/488/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=488&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2011/07/10/madlib-goes-beta-serious-in-database-analytics/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://databeta.files.wordpress.com/2011/07/screen-shot-2011-07-10-at-4-27-26-pm.png?w=300" medium="image">
			<media:title type="html">MADlib logo</media:title>
		</media:content>
	</item>
		<item>
		<title>CACM Article on Jim Gray Search</title>
		<link>http://databeta.wordpress.com/2011/06/27/cacm-article-on-jim-gray-search/</link>
		<comments>http://databeta.wordpress.com/2011/06/27/cacm-article-on-jim-gray-search/#comments</comments>
		<pubDate>Mon, 27 Jun 2011 21:16:34 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[cloud]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[CACM]]></category>
		<category><![CDATA[crisis response]]></category>
		<category><![CDATA[crowdsourcing]]></category>
		<category><![CDATA[image processing]]></category>
		<category><![CDATA[Jim Gray]]></category>
		<category><![CDATA[social computing]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[Tenacious]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=475</guid>
		<description><![CDATA[The recent July 2011 issue of Communications of the ACM includes our article on the technical aspects of the search for Jim Gray&#8217;s boat Tenacious.  This was a hard article to write, for both technical and personal reasons. It took far too long to finish, so at some point it was time to just pack it in (at [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=475&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><img class="alignright" title="CACM July, 2011" src="http://portalparts.acm.org/1970000/1965724/cover/cover_full.jpg" alt="" width="132" height="171" />The recent July 2011 issue of <a title="CACM article on Jim Gray search" href="http://cacm.acm.org/magazines/2011/7/109892-searching-for-jim-gray/fulltext">Communications of the ACM</a> includes our article on the technical aspects of the search for Jim Gray&#8217;s boat Tenacious.  This was a hard article to write, for both technical and personal reasons. It took far too long to finish, so at some point it was time to just pack it in (at which point the CACM folks informed us it had to be cut in length by half, which delayed things further.  The longer version is up as a <a title="Berkeley Tech Report on Jim Gray Search" href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-142.html">Berkeley tech report</a>.)</p>
<p>Meanwhile, some of the experience is even more relevant to current technology trends than it was 4 years ago, so hopefully folks interested in social computing, software engineering, image processing, crisis response, and other related areas will find something of use in there.</p>
<p>For those of you whose work is represented (or underrepresented) by the article, my apologies for its shortcomings.  I still don&#8217;t have the full picture of what happened&#8212;nobody does, really.  As a result I decided not to use personal names of volunteers in general, to avoid attributing credit unevenly. I know the result seems oddly impersonal.  Setting the tone of the article was as hard as capturing the content.</p>
<p>Meanwhile, I encourage you to add corrections and perspective to the article in the comment box at the end of the CACM link above. Comments are welcome here too, but they might not get as well-viewed or -archived.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/databeta.wordpress.com/475/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/databeta.wordpress.com/475/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=475&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2011/06/27/cacm-article-on-jim-gray-search/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://portalparts.acm.org/1970000/1965724/cover/cover_full.jpg" medium="image">
			<media:title type="html">CACM July, 2011</media:title>
		</media:content>
	</item>
		<item>
		<title>bud: bloom under development</title>
		<link>http://databeta.wordpress.com/2011/04/08/bud-bloom-under-development/</link>
		<comments>http://databeta.wordpress.com/2011/04/08/bud-bloom-under-development/#comments</comments>
		<pubDate>Fri, 08 Apr 2011 08:41:11 +0000</pubDate>
		<dc:creator>jmh</dc:creator>
				<category><![CDATA[bloom]]></category>
		<category><![CDATA[boom]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[bud]]></category>

		<guid isPermaLink="false">http://databeta.wordpress.com/?p=438</guid>
		<description><![CDATA[Today was a big day in the BOOM group: we launched the alpha version of Bud: Bloom Under Development. If you&#8217;re new to this blog, Bloom is our new programming language for cloud computing and other distributed systems settings. Bud is the first fully-functional release of Bloom, implemented as a DSL in Ruby. I&#8217;ve written a [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=438&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://bloom-lang.net"><img class="alignright" title="Bloom logo" src="http://www.bloom-lang.net/wp-content/uploads/2011/04/bloomlogoleft2.png" alt="" width="333" height="67" /></a>Today was a big day in the BOOM group: we launched the alpha version of <a title="Bud" href="http://www.bloom-lang.net/bud/">Bud: Bloom Under Development</a>. If you&#8217;re new to this blog, Bloom is our new programming language for cloud computing and other distributed systems settings. Bud is the first fully-functional release of Bloom, implemented as a DSL in Ruby.</p>
<p>I&#8217;ve written a lot about Bloom in <a href="http://boom.cs.berkeley.edu/papers.html">research papers</a> and on the new <a href="http://bloom-lang.net">Bloom website</a>, and I have lots to say about distributed programming that I won&#8217;t recap. Instead, I want to focus here on the tangible: working code. If you&#8217;re looking for something serious, check out the <a href="https://github.com/bloom-lang/bud/blob/master/docs/bfs.md">walkthrough of the bfs distributed filesystem</a>, a GFS clone. But to get the flavor, consider the following <em>two lines of code</em>, which implement what you might consider to be &#8220;hello, world&#8221; for distributed systems: a chat server.</p>
<blockquote><p><code><font size="2" color="white"> nodelist <span style="color:#fbde2d;">&lt;= </span>connect.<span style="color:#fbde2d;">payloads</span><br />
mcast <span style="color:#fbde2d;">&lt;~ </span>(mcast <span style="color:#fbde2d;">*</span> nodelist).<span style="color:#fbde2d;">pairs</span> { |m,n| [n.key, m.val] }</font></code></p></blockquote>
<p>That&#8217;s it.</p>
<p>The first line says &#8220;if you get a message on a channel called &#8216;connect&#8217;, remember the payload in a table called &#8216;nodelist&#8217;&#8221;.  The second says &#8220;if you get a message on the &#8216;mcast&#8217; channel, then forward its contents to each address stored in &#8216;nodelist&#8217;&#8221;. That&#8217;s all that&#8217;s needed for a bare-bones chat server.  Nice, right?</p>
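<p>If the Bloom syntax is unfamiliar, the same two rules can be mimicked in ordinary Ruby. The sketch below is a toy simulation with no Bud runtime and no networking: the class, its method names, and the sample addresses are hypothetical, and the outbox array merely records what would be sent on the &#8216;mcast&#8217; channel.</p>
<pre>
```ruby
# Plain-Ruby simulation of the two server rules; no Bud runtime involved.
# All names and addresses below are hypothetical.
class ToyChatServer
  attr_reader :nodelist, :outbox

  def initialize
    @nodelist = {}  # address to nickname, standing in for the 'nodelist' table
    @outbox = []    # [address, message] pairs that would go out on 'mcast'
  end

  # Rule 1: remember the payload of every 'connect' message.
  def on_connect(address, nick)
    @nodelist[address] = nick
  end

  # Rule 2: forward each 'mcast' message to every stored address.
  def on_mcast(message)
    @nodelist.each_key { |address| @outbox.push([address, message]) }
  end
end

server = ToyChatServer.new
server.on_connect("10.0.0.1:5000", "ann")
server.on_connect("10.0.0.2:5000", "bob")
server.on_mcast("hello")
```
</pre>
<p>The second rule is a join between the incoming message and the stored table, which is exactly what the <code>*</code> operator expresses in the Bloom version.</p>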
<p><span id="more-438"></span><br />
The chat client code bootstraps by connecting to the server:</p>
<blockquote><p><code><font size="2" color="white">connect <span style="color:#fbde2d;">&lt;~</span> [[@server, [ip_port, @nick]]] </font></code></p></blockquote>
<p>and then runs the following logic:</p>
<blockquote><p><code><font size="2" color="white">mcast <span style="color:#fbde2d;">&lt;~</span> <span style="color:#fbde2d;">stdio do</span> |s|<br />
 &nbsp;&nbsp;[@server,<br />
 &nbsp;&nbsp;&nbsp;[ip_port, @nick, <span style="color:#4466ff;">Time</span>.<span style="color:#fbde2d;">new</span>.strftime(<span style="color:#00ff00;">"%I:%M.%S"</span>), s.line]]<br />
<span style="color:#fbde2d;">end</span><br />
<span style="color:#fbde2d;">stdio &lt;~</span> mcast { |m| [pretty_print(m.val)] }</font></code></p></blockquote>
<p>The first statement of this batch takes input from the terminal (&#8216;stdio&#8217;), formats it, and sends it to the server on the &#8216;mcast&#8217; channel.  The second statement takes messages from the &#8216;mcast&#8217; channel (forwarded by the server) and prints them on the terminal.</p>
<div>Hopefully this little (but working) example gives you a taste of why I&#8217;m excited about Bloom: I think it captures the essence of distributed programming in a clean, readable way.  <a href="https://github.com/bloom-lang/bud/tree/master/examples/chat">The full implementation</a> is a bit longer, but most of the extra code is simple Ruby boilerplate.
</div>
<p></p>
<div>It&#8217;s been a blast designing the language and coding up the Bud prototype with my <a href="https://github.com/bloom-lang">most excellent teammates</a> over the last 9 months.  Hats off, gents.</div>
<p></p>
<div>In subsequent posts I&#8217;ll highlight some of the tools we ship with Bud that help with the really hard stuff: using the CALM principle to reason about the consistency and non-determinism of your Bloom code.</div>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/databeta.wordpress.com/438/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/databeta.wordpress.com/438/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=databeta.wordpress.com&#038;blog=5435607&#038;post=438&#038;subd=databeta&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://databeta.wordpress.com/2011/04/08/bud-bloom-under-development/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/893729486f1eb230165d761492e95451?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jmh</media:title>
		</media:content>

		<media:content url="http://www.bloom-lang.net/wp-content/uploads/2011/04/bloomlogoleft2.png" medium="image">
			<media:title type="html">Bloom logo</media:title>
		</media:content>
	</item>
	</channel>
</rss>
