
For a class I’m teaching, I’d like to collect a list of favorite “maxims” or “aphorisms” for computer systems.

I’d be very grateful if you would add your favorites to the comments below, preferably with a link to a source that either introduces or references the maxim.  It’s OK to agree or disagree with the maxim.

I’d enjoy seeing people’s support/critiques for these below as well — may merit more focused posts another day.


What else?

Hadoop is not healthy for children and other living things.

I sat at Berkeley CS faculty lunch this past week with Brian Harvey and Dan Garcia, two guys who think hard about teaching computing to undergraduates.  I was waxing philosophical about how we need to get data-centric thinking embedded deep into the initial CS courses—not just as an application of traditional programming, but as a key frame of reference for how to think about computing per se.

Dan pointed out that he and Brian and others took steps in this direction years ago at Berkeley, by introducing MapReduce and Hadoop in our initial 61A course.  I have argued elsewhere that this is a Good Thing, because it gets people used to the kind of disorderly thinking needed for scaling distributed and data-centric systems.

But as a matter of both pedagogy and system design, I have begun to think that Google’s MapReduce model is not healthy for beginning students.  The basic issue is that Google’s narrow MapReduce API conflates logical semantics (define a function over all items in a collection) with an expensive physical implementation (utilize a parallel barrier). As it happens, many common cluster-wide operations over a collection of items do not require a barrier even though they may require all-to-all communication.  But there’s no way to tell the API whether a particular Reduce method has that property, so the runtime always does the most expensive thing imaginable in distributed coordination: global synchronization.

From an architectural point of view, a good language for parallelism should expose pipelining, and MapReduce hides it. Brian suggested I expand on this point somewhere so people could talk about it.  So here we go.
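To make the barrier-versus-pipelining distinction concrete, here is a toy Python sketch of my own (not Hadoop's API): a barrier-style reduce must buffer all of a key's values before applying the user's function, while an aggregate like a running maximum can fold each value in as it streams past, with nothing to wait for.

```python
# Toy illustration (not Hadoop itself): a barrier-style reduce groups
# every value for a key before calling the user's Reduce function,
# while an aggregate like a running maximum can be pipelined, folding
# in each value as it arrives.

def barrier_reduce(pairs, reduce_fn):
    """Group all values per key, THEN reduce -- a global barrier."""
    groups = {}
    for key, val in pairs:
        groups.setdefault(key, []).append(val)
    return {k: reduce_fn(vs) for k, vs in groups.items()}

def pipelined_max(pairs):
    """Fold each value in as it streams past; no end-of-input wait."""
    running = {}
    for key, val in pairs:
        running[key] = max(running.get(key, val), val)
    return running

pairs = [("a", 3), ("b", 7), ("a", 9), ("b", 2)]
assert barrier_reduce(pairs, max) == pipelined_max(pairs)  # same answer, different cost
```

The point of the sketch: both functions compute the same result, but nothing in the narrow map/reduce API lets the programmer tell the runtime that the second, cheaper execution strategy is legal.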


MADlib is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha to its first beta release, version 0.20beta. Hats off to the MADlib team!

Forget your previous associations with low-tech SQL analytics, including so-called “business intelligence”, “OLAP”, “data cubes” and the like. This is the real deal: statistical and machine learning methods running at scale within the database, massively parallel, close to the data. Much of the code is written in SQL (a language that doesn’t get enough credit as a basis for parallel statistics), with key extensions in C/C++ for performance, and the occasional Python glue code. The suite of methods in the beta includes:

  • standard statistical methods like multivariate linear and logistic regression,
  • supervised learning methods including support-vector machines, naive Bayes, and decision trees,
  • unsupervised methods including k-means clustering, association rules, and Latent Dirichlet Allocation,
  • descriptive statistics and data profiling, including one-pass Flajolet-Martin and CountMin sketch methods (my personal contributions to the library) to compute distinct counts, range-counts, quantiles, various types of histograms, and frequent-value identification,
  • statistical support routines including an efficient sparse vector library, array operations, and conjugate gradient optimization.
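For a flavor of the sketch-based methods, here is a hypothetical toy version of one-pass Flajolet-Martin distinct counting in Python — an illustration of the idea, not MADlib's actual implementation (which lives in SQL and C++): hash each item, remember the deepest run of trailing zero bits seen, and scale by the classic FM correction constant.

```python
import hashlib

# Toy Flajolet-Martin sketch (hypothetical, for illustration): hash each
# item, track R = the maximum number of trailing zero bits observed, and
# estimate the distinct count as 2^R / 0.77351.

def trailing_zeros(n, width=32):
    """Count trailing zero bits of n (width for the all-zero case)."""
    if n == 0:
        return width
    tz = 0
    while n & 1 == 0:
        n >>= 1
        tz += 1
    return tz

def fm_distinct_estimate(items):
    r = 0
    for item in items:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF
        r = max(r, trailing_zeros(h))
    return (2 ** r) / 0.77351

est = fm_distinct_estimate(range(1000))  # a rough estimate of 1000
```

The appeal for parallel SQL is that the sketch is single-pass, order-insensitive, and mergeable across machines; real deployments average many hash functions to tighten the (otherwise coarse) estimate.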

More methods are planned for future releases.  Myself, I’m working with Daisy Wang on merging her SQL-based Conditional Random Fields and Bayesian inference implementations into the library for an upcoming release, to support sophisticated text processing.


The recent July 2011 issue of Communications of the ACM includes our article on the technical aspects of the search for Jim Gray’s boat Tenacious.  This was a hard article to write, for both technical and personal reasons. It took far too long to finish, so at some point it was time to just pack it in (at which point the CACM folks informed us it had to be cut in length by half, which delayed things further). The longer version is up as a Berkeley tech report.

Meanwhile, some of the experience is even more relevant to current technology trends than it was 4 years ago, so hopefully folks interested in social computing, software engineering, image processing, crisis response, and other related areas will find something of use in there.

For those of you whose work is represented (or underrepresented) by the article, my apologies for its shortcomings.  I still don’t have the full picture of what happened—nobody does, really.  As a result I decided to avoid using the personal names of volunteers in general, rather than attribute credit unevenly. I know the result seems oddly impersonal.  Setting the tone of the article was as hard as capturing the content.

Meanwhile, I encourage you to add corrections and perspective to the article in the comment box at the end of the CACM link above. Comments are welcome here too, but they may not be as widely viewed or as well archived.

Today was a big day in the BOOM group: we launched the alpha version of Bud: Bloom Under Development. If you’re new to this blog, Bloom is our new programming language for cloud computing and other distributed systems settings. Bud is the first fully-functional release of Bloom, implemented as a DSL in Ruby.

I’ve written a lot about Bloom in research papers and on the new Bloom website, and I have lots to say about distributed programming that I won’t recap. Instead, I want to focus here on the tangible: working code. If you’re looking for something serious, check out the walkthrough of the bfs distributed filesystem, a GFS clone. But to get the flavor, consider the following two lines of code, which implement what you might consider to be “hello, world” for distributed systems: a chat server.

nodelist <= connect.payloads
mcast <~ (mcast * nodelist).pairs { |m,n| [n.key, m.val] }

That’s it.

The first line says “if you get a message on a channel called ‘connect’, remember the payload in a table called ‘nodelist’”. The second says “if you get a message on the ‘mcast’ channel, then forward its contents to each address stored in ‘nodelist’”. That’s all that’s needed for a bare-bones chat server. Nice, right?
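For readers who don't speak Bloom, here is a rough, non-distributed Python paraphrase of what those two rules do — an illustration only, with made-up handler names; real Bloom evaluates its rules continuously over network channels rather than via callbacks.

```python
# A rough, non-distributed paraphrase of the two Bloom rules above
# (illustration only -- handler names and structure are hypothetical).

nodelist = set()   # stands in for the 'nodelist' table
outbox = []        # messages queued for delivery

def on_connect(payload):
    """Rule 1 analogue: remember each connecting client's address."""
    nodelist.add(payload)

def on_mcast(msg):
    """Rule 2 analogue: forward the message to every known address."""
    for addr in nodelist:
        outbox.append((addr, msg))

on_connect("alice:1234")
on_connect("bob:5678")
on_mcast("hello, world")
# outbox now holds one copy of the message per connected address
```

The contrast is the point: the callback version spells out control flow that the two Bloom rules leave entirely to the runtime.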


I often hear that many of the leading data analysts in the field have PhDs in physics or biology or the like, rather than computer science.  Computer scientists are typically interested in methods; physical scientists are interested in data.

Another thing I often hear is that a large fraction of the time spent by analysts — some say the majority of time — involves data preparation and cleaning: transforming formats, rearranging nesting structures, removing outliers, and so on.  (If you think this is easy, you’ve never had a stack of ad hoc Excel spreadsheets to load into a stat package or database!)
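To make "data preparation" concrete, here is a toy Python sketch with made-up data: coerce messy spreadsheet-style cells to numbers, then drop outliers with a robust median/MAD rule (a naive mean/stddev cutoff is easily masked by the very outlier it is meant to catch).

```python
import statistics

# A toy flavor of the "low-function work": coerce messy spreadsheet
# cells to numbers, then drop outliers via a robust z-score built on
# the median and median absolute deviation. (Hypothetical data.)

raw = ["12", " 15 ", "n/a", "14.5", "13", "12.5", "14", "13.5", "9000", ""]

def to_float(cell):
    try:
        return float(cell.strip())
    except ValueError:
        return None  # unparseable cells become missing values

values = [v for v in (to_float(c) for c in raw) if v is not None]
med = statistics.median(values)
mad = statistics.median([abs(v - med) for v in values])
# keep values whose robust z-score is under 3.5 (1.4826 rescales MAD
# to be comparable to a standard deviation under normality)
clean = [v for v in values if abs(v - med) <= 3.5 * 1.4826 * mad]
assert 9000.0 not in clean and 12.0 in clean
```

Even this tiny example needs type coercion, missing-value handling, and a statistically careful outlier rule — which is exactly the drudgery that eats analysts' time at scale.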

Putting these together, something is very wrong: high-powered people are wasting most of their time doing low-function work.  And the challenge of improving this state of affairs has fallen through the cracks between the analysts and computer scientists.

DataWrangler is a new tool we’re developing to address this problem, which I demo’d today at the O’Reilly Strata Conference.  DataWrangler is an intelligent visual data transformation tool that lets users reshape, transform and clean data in an intuitive way that surprises most people who’ve worked with data.  As you manipulate data in a grid layout, the tool automatically infers information both about the data, and about your intentions for transforming the data.  It’s hard to describe, but the lead researcher on the project — Stanford PhD student Sean Kandel — has a quick video up on the DataWrangler homepage that shows how it works.  Sean has put DataWrangler live on the site as well.

Tackling these problems fundamentally requires a hybrid technical strategy.  Under the covers, DataWrangler is a heady mix of second-order logic, machine learning methods, and human-computer interaction design methodology.   We wrote a research paper about it that will appear in this year’s SIGCHI.

If you’re interested in this space, also have a look at Shankar Raman’s very prescient Potter’s Wheel work from a decade ago, the PADS project at AT&T and Princeton, recent research from Sumit Gulwani at Microsoft Research, and David Huynh’s most excellent Google Refine.  All good stuff!

In today’s episode of the Twilight Zone, a young William Shatner stumbles into a time machine and travels back into the past. Cornered in a dark alley, he is threatened by a teenage hooligan waving a loaded pistol. A tussle ensues, and in trying to wrest the gun from his assailant, Shatner fires, killing him dead. Examining the contents of the dead youth’s wallet, Bill comes to a shocking conclusion: he has just killed his own grandfather. Tight focus: Shatner howling soundlessly as he stares at his own hand flickering in and out of view.

Shatner? Or Not(Shatner)? Having now changed history, he could not have been born, meaning he could not have traveled back in time and changed history, meaning he was indeed born, meaning…?

You see where this goes.  It’s the old grandfather paradox, a hoary chestnut of SciFi and AI.  Personally I side with Captain Kirk: I don’t like mysteries. They give me a bellyache. But whether or not you think a discussion of “p if Not(p)” is news that’s fit to print, it is something to avoid in your software.  This is particularly tricky in distributed programming, where multiple machines have different clock settings, and those clocks may even turn backward on occasion. The theory of Distributed Systems is built on the notion of Causality, which enables programmers and programs to avoid doing unusual things like executing instructions in orders that could not have been specified by the program that generated them. Causality is established by distributed clock protocols. These protocols are often used to enforce causal orderings, i.e., to make machines wait for messages. And waiting for messages, as we know, is bad.

So I’m here to tell you today that Causality is overrated, and we can often skip the wait. To hell with distributed clocks: time travel can be fine.  In many cases it’s even fine to change history. Here’s the thing: Causality is Required Only to control Non-monotonicity. I call this the CRON principle.
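Here is a small illustration of the intuition — my own sketch, not Bloom code: a monotonic operation like set union reaches the same final state under every message delivery order, so no causal waiting is needed; a non-monotonic test like "is 'c' absent?" gives different interim answers depending on order, so acting on an early answer risks having to retract it.

```python
from itertools import permutations

# Monotonic vs. non-monotonic under message reordering (toy sketch):
# union-only state is order-insensitive; an absence test is not.

messages = [{"a"}, {"b"}, {"c"}]
finals = set()    # final states observed across all delivery orders
flipped = False   # did the predicate's answer change mid-stream?

for order in permutations(messages):
    state = set()
    answers = []
    for msg in order:
        state |= msg                      # monotonic: state only grows
        answers.append("c" not in state)  # non-monotonic predicate
    finals.add(frozenset(state))
    flipped |= (True in answers and False in answers)

assert len(finals) == 1   # union: same final state, no barrier required
assert flipped            # the absence test was unstable until input ended
```

The monotonic part can run full speed ahead with no clocks at all; only the non-monotonic predicate needs coordination to know its input is complete — which is the CRON principle in miniature.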


12/16/2010: final version of CALM/Bloom paper for CIDR now posted

Conventional Wisdom:
In large distributed systems, perfect data consistency is too expensive to guarantee in general. “Eventually consistent” approaches are often a better choice, since temporary inconsistencies work out in most cases. Consistency mechanisms (transactions, quorums, etc.) should be reserved for infrequent, small-scale, mission-critical tasks.

Most computer systems designers agree on this at some level (once you get past the NoSQL vs. ACID sloganeering). But like lots of well-intentioned design maxims, it’s not so easy to translate into practice — all kinds of unavoidable tactical questions pop up:


  • Exactly where in my multifaceted system is eventual consistency “good enough”?
  • How do I know that my “mission-critical” software isn’t tainted by my “best effort” components?
  • How do I maintain my design maxim as software evolves? For example, how can the junior programmer in year n of a project reason about whether their piece of the code maintains the system’s overall consistency requirements?

If you think you have answers to those questions, I’d love to hear them. And then I’ll raise the stakes, because I have a better challenge for you: can you write down your answers in an algorithm?

Write a program checker that will either “bless” your code’s inconsistency as provably acceptable, or identify the locations of unacceptable consistency bugs.

The CALM Conjecture is my initial answer to that challenge.


By now it’s a truism in cloud computing and internet infrastructure that component failure happens frequently. It’s simple statistics: (lots of components) * (a small failure rate per component) = a high component failure rate across the collection. People now routinely architect distributed systems for this reality.
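The arithmetic is worth spelling out. A back-of-the-envelope sketch with assumed numbers (not from any real deployment): 10,000 components, each with a 0.1% chance of failing on a given day.

```python
# Back-of-the-envelope failure arithmetic (assumed numbers, not from
# any real cluster): many components times a small per-component
# failure rate yields frequent failures in aggregate.
n, p = 10_000, 0.001

expected_failures_per_day = n * p        # about 10 failures per day
p_any_failure = 1 - (1 - p) ** n         # P(at least one failure today)

assert abs(expected_failures_per_day - 10) < 1e-9
assert p_any_failure > 0.9999            # failure today is a near-certainty
```

At those (made-up but plausible) numbers, a failure-free day is essentially impossible — which is why failure handling must be a first-class design concern, not an afterthought.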

That’s necessary but not sufficient: you also need correct failure-handling protocols and faithful implementations.

So, do today’s popular distributed systems handle component failure well? Or, taking the longer view: what kinds of tools will help the engineers who build those systems ensure that they handle component failure well?

My postdoc Haryadi Gunawi and his team have taken some big steps to answer these questions, and written them up in their report on FATE and DESTINI: A Framework for Cloud Recovery Testing. They take a systematic combinatorial approach to generating faults (FATE) and a formal approach to specifying correctness (DESTINI) that grows out of our work on declarative languages. Upshot: research that produces real tools, which help developers find (and then fix) real failure-handling bugs, including 16 new bug reports to HDFS (7 design bugs and 9 implementation bugs). Pretty nice, given the intricacies of failure-recovery protocols.
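The combinatorial idea can be sketched in a few lines of hypothetical Python — a stand-in to convey the shape of the approach, not the actual FATE/DESTINI implementation: enumerate every combination of fault type and injection point, exercise the system under each, and check a correctness specification against the result.

```python
from itertools import product

# Hypothetical sketch of combinatorial fault injection (names and
# structure are illustrative, not FATE's real API).

fault_types = ["crash", "disk_error", "network_partition"]
injection_points = ["before_write", "after_write", "during_replication"]

def run_with_fault(fault, point):
    """Stand-in for executing the system with one fault injected,
    then evaluating a declarative correctness spec on the outcome."""
    return {"fault": fault, "point": point, "spec_holds": True}

results = [run_with_fault(f, pt) for f, pt in product(fault_types, injection_points)]
violations = [r for r in results if not r["spec_holds"]]
# 3 fault types x 3 injection points = 9 executions to check
assert len(results) == 9
```

The payoff of being systematic is coverage: every fault/point pair gets exercised, rather than only the failures a test author happened to imagine.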


I don’t usually post about business deals on my blog. But today’s acquisition of Greenplum by EMC is too close to home not to comment. I’ve been involved as a technical advisor at Greenplum for almost three years, and joined the EMC technical advisory board this spring — so I have some interest in the deal.
Below is my take on things from the technical side. Note that I’m not privy to any private information about the deal, and I’m generally more interested in the tech than the finance. No need to try and read financial tea leaves here — there aren’t any. This is a computer scientist’s view of the technology implications.  Here goes: