Skip navigation

Category Archives: distributed systems

Original Author: Nick Youngson - link to - http://www.nyphotographic.com/

Original Image: http://www.picpedia.org/highway-signs/c/calm.html

For folks who care about what’s possible in distributed computing: Peter Alvaro and I wrote an introduction to the CALM Theorem and subsequent work that is now up on arXiv. The CALM Theorem formally characterizes the class of programs that can achieve distributed consistency without the use of coordination.

I spent a good fraction of my academic life in the last decade working on a deeper understanding of how to program the cloud and other large-scale distributed systems. I was enormously lucky to collaborate with and learn from amazing friends over this period in the BOOM project, and see our work picked up and extended by new friends and colleagues.

Our research was motivated by simple questions, chief among them this:

Q: “What is the hardest thing about distributed systems?”
A: “Coordination and consistency.”

Protocols like Two-Phase Commit, Paxos and their myriad offspring are celebrated for being tricky, and as such form the backbone of academic classes on distributed computing. But trickiness is not a hallmark of good software design. In practice, coordination is the source of much of the complexity and inefficiency of distributed systems, and it is avoided when possible by good engineers.

So we moved to a more fundamental question:

Q: When can we correctly avoid coordination, and when are we absolutely required to use it?
A (circa 2010): Unknown.

Surprisingly, this computability question was one that the pioneers of distributed systems never answered, at least not in any sense of algorithms or program semantics. The discussion in the literature was cast in terms of “memory models” or “storage consistency” guarantees so low down the stack as to be irrelevant and unhelpful to most application designers.

In a keynote talk at PODS 2010, I proposed an answer to this open question. I conjectured—based on my team’s experience with streaming queries and declarative networking—that coordination was needed if and only if you had a computational task that could not be expressed with monotonic logic. I called this idea CALM: Consistency as Logical Monotonicity. Not long thereafter a formalization and proof of the CALM Theorem was provided by Ameloot, Neven and Van den Bussche over at Hasselt University in Belgium. Related work ensued across both sides of the Atlantic on additional theoretical results and practical uses of the idea for program analysis.

I sense that this body of work deserves more attention today, when distributed computing is becoming the norm rather than the exception. CALM provides a formal basis for a myriad of conversations over the last 15 years regarding what is possible to get correct with “eventual consistency”, “noSQL”, “commutativity”, “ACID 2.0”, “CRDTs” and other pragmatics. It provides the nuanced answer to screeds about “beating the CAP Theorem”. It also lays the groundwork for what we did later with the Bloom language: provide a programming model where the really hard issues of distributed programming are first-order concerns of the language and its syntax.

To bring these issues to a wider audience, I sat down with the inimitable Peter Alvaro to write up what we hope is an approachable but sufficiently meaty intro to the CALM Theorem, its implications, and the many open questions remaining. It took a while for this to get to the top of our stacks, but the paper is now up on arXiv.

We’re spinning up a new generation of work on cloud programming here at Berkeley’s RISELab that builds on these lessons. Watch this space!

escharian_stairs_fbtl;dr: Colleagues at Berkeley and I have a new paper on the state of serverless computing that will appear at CIDR ’19. It celebrates the arrival of public-facing autoscaling cloud programming, but critiques the current serverless offerings for thwarting the hallmarks of what makes the cloud exciting: data-centric and distributed computing. We hope the paper will start a constructive discussion on how to expose the right programming APIs and runtimes for the cloud.


I’ve been fascinated by the potential of cloud computing for a decade now, prior to starting the Berkeley Orders Of Magnitude (BOOM) project. The cloud is the machine of a dream: more data and computing power than anyone could ever need, available to everyone.

Since the beginning, I’ve felt that a new platform like the cloud needs new programming languages that suit its “physics”. Once that is achieved, unexpected innovation will follow from the creativity of a world of developers. In the case of the cloud, the physical reality is a deeply data-rich, massively distributed, heterogeneous computing environment with the ability to grow and shrink consumption on demand. We have never had a computer with this power or this programming complexity.

After 10 years of people writing cloud programs in legacy sequential languages like Java, the public cloud providers are finally proposing a  programming model for the cloud. They are calling it Serverless Computing, or more descriptively “Functions as a Service” (FaaS). As an interface to the unprecedented potential of the cloud, FaaS today is a disappointment. Current FaaS offerings do provide a taste of the power of autoscaling, but they have fatal flaws when it comes to the basic physics of the cloud: they make it impossible to do serious distributed computing, and crazy expensive/slow to work with data at scale. This is not a roadmap for harnessing the creativity of the developer community.

I sat down with colleagues at Berkeley to write what we think is a tough but fair assessment of where Serverless Computing is today, and describe the changes we think are needed to provide a programming model that matches the cloud’s physics and unlocks its potential. The paper will appear at CIDR 2019 in January, but a preprint is available on arXiv. We’re looking forward to the conversation that ensues.

We also have constructive research ongoing in this domain … watch this space for more!

ANHU-1024x768-Alan-VernonThere’s fast and there’s fast. This post is about Anna*, a key/value database design from our team at Berkeley that’s got phenomenal speed and buttery smooth scaling, with an unprecedented range of consistency guarantees. Details are in our upcoming ICDE18 paper on Anna.

Conventional wisdom (or at least Jeff Dean wisdom) says that you have to redesign your system every time you scale by 10x. As researchers, we asked the counter-cultural question:

What would it take to build a key-value store that would excel across many orders of magnitude of scale, from a single multicore box to the global cloud?

Turns out this kind of curiosity can lead to a system with pretty interesting practical implications.

Read More »

2000px-Typing_monkey.svgAs mentioned in my previous post, Peter Alvaro turned in his PhD thesis a month back, and is now in full swing as a professor at UC Santa Cruz. In the midst of that nifty academic accomplishment, he succeeded in taking the last chapter of his thesis from our BOOM project out of the ivory tower and into use at Netflix. Peter’s collaborators at Netflix recently blogged about the use of his ideas alongside their (very impressive) testing infrastructure, famously known as ChaosMonkey, a.k.a. the Netflix Simian Army.  This also generated some press, vaguely inappropriate headline and all.  Their work raised the flag on five potential failure scenarios in a single request interface.  That’s nice.  Even nicer is what would have happened without the ideas from Peter’s research:

Brute force exploration of this space would take 2^100 iterations (roughly 1 with 30 zeros following), whereas our approach was able to explore it in ~200 experiments.

It always feels good when your good ideas carve 28 zeroes of performance off the end of standard practice.

I encourage you to read the Netflix blog post, as well as Peter’s paper on the research prototype, Molly.  Or you can watch his RICON 2015 keynote on the subject.

computer on fireA major source of frustration in distributed programming is that contemporary software tools—think compilers and debuggers—have little to say about the really tricky bugs that distributed systems developers face.  Sure, compilers can find type and memory errors, and debuggers can single-step you through sequential code snippets. But how do they help with distributed systems issues?  In some sense, they don’t help at all with the stuff that matters—things like:

  • Concurrency: Does your code have bugs due to race conditions?  Don’t forget that a distributed system is a parallel system!
  • Consistency: Are there potential consistency errors in your program due to replicated state? Can you get undesirable non-deterministic outcomes based on network delays?  What about the potential for the awful “split-brain” scenario where the state of multiple machines gets irrevocably out of sync?
  • Coordination performance: Do you have performance issues due to overly-aggressive coordination or locking? Can you avoid expensive coordination without incurring bugs like the ones above?

These questions are especially tricky if you use services or libraries, where you don’t necessarily know how state and communication are managed.  What code can you trust, and what about that code do you need to know to trust it?

Peter Alvaro has been doing groundbreaking work in the space, and recently started taking the veil off his results.  This is a big deal. Read More »

Photo Credit: Karthick R via Compfight cc

We just finished writing up an overview of our most recent thinking about distributed consistency. The paper is entitled Consistency Without Borders, and it’s going to appear in the ACM SoCC conference next month in Silicon Valley.

It starts with two things we all know:

  1. Strong distributed consistency can be expensive and dangerous. (My favorite exposition: the LADIS ’08 conference writeup by Birman, Chockler and van Renesse. See especially the quotes from James Hamilton and Randy Shoup. And note that recent work like Spanner changes little: throughput of 10’s to 100’s of updates per second is only useful at the fringes.)
  2. Managing coordination in application logic is fraught with software engineering peril: you have to spec, build, test and maintain special-case, cross-stack distributed reasoning over time. Here be dragons.

The point of the paper is to try to reorient the community to explore the design space in between these extremes. Distributed consistency is one of the biggest CS problems of our day, and the technical community is spending way too much of its energy at these two ends of the design space.

We’ll be curious to hear feedback here, and at the conference.

CopyIf you follow this blog, you know that my BOOM group has spent a lot of time in the past couple years formalizing eventual consistency (EC) for distributed programs, via the CALM theorem and practical tools for analyzing Bloom programs.

In recent months, my student Peter Bailis and his teammate Shivaram Venkataraman took a different tack on the whole EC analysis problem which they call PBS: Probabilistically Bounded Staleness. The results are interesting, and extremely relevant to current practice.  (See, for example, the very nice blog post by folks at DataStax).

Many people today deal with EC in the specific context of replica consistency, particularly in distributed NoSQL-style Key-Value Stores (KVSs). It is typical to configure these stores with so-called “partial” quorum replication, to get a comfortable mix of low latency with reasonable availability. The term “partial” signifies that you are not guaranteed consistency of writes by these configurations — at best they guarantee a form of eventual consistency of final writes, but readers may well read stale data along the way. Lots of people are deploying these configurations in the field, but there’s little information on how often the approach messes up, and how badly.

Jumping off from earlier theoretical work on probabilistic quorum systems, Peter and Shivaram answered two natural questions about how these systems should perform in current practice:

  1. How many versions ago?  On expectation, if you do a read in a partial-quorum KVS, how many versions behind are you? Peter and Shivaram answer this one definitively, via a closed-form mathematical analysis.
  2. How stale on the (wall-)clock?  On expectation, if you do a read in a partial-quorum KVS, how out-of-date will your version be in terms of wall-clock time? Answering this one requires modeling a read/write workload in wall-clock time, as well as system parameters like replica propagation (“anti-entropy”). Peter and Shivaram address this with a Monte Carlo model, and run the model with parameters grounded in real-world performance numbers generously provided by two of our most excellent colleagues: Alex Feinberg at LinkedIn and Coda Hale at Yammer (both of whom also guest-lectured in my Programming the Cloud course last fall.)  Peter and Shivaram validated their models in practice using Cassandra, a widely-used KVS.

On the whole, PBS shows that being sloppy about consistency doesn’t bite you often or badly — especially if you’re in a single datacenter and you use SSDs. But things get more complex with magnetic disks, garbage collection delays (grr), and wide-area replication.

Interested in more detail?  You can check out two things:

Hadoop is not healthy for children and other living things.I sat at Berkeley CS faculty lunch this past week with Brian Harvey and Dan Garcia, two guys who think hard about teaching computing to undergraduates.  I was waxing philosophical about how we need to get data-centric thinking embedded deep into the initial CS courses—not just as an application of traditional programming, but as a key frame of reference for how to think about computing per se.

Dan pointed out that he and Brian and others took steps in this direction years ago at Berkeley, by introducing MapReduce and Hadoop in our initial 61A course.  I have argued elsewhere that this is a Good Thing, because it gets people used to the kind of disorderly thinking needed for scaling distributed and data-centric systems.

But as a matter of both pedagogy and system design, I have begun to think that Google’s MapReduce model is not healthy for beginning students.  The basic issue is that Google’s narrow MapReduce API conflates logical semantics (define a function over all items in a collection) with an expensive physical implementation (utilize a parallel barrier). As it happens, many common cluster-wide operations over a collection of items do not require a barrier even though they may require all-to-all communication.  But there’s no way to tell the API whether a particular Reduce method has that property, so the runtime always does the most expensive thing imaginable in distributed coordination: global synchronization.

From an architectural point of view, a good language for parallelism should expose pipelining, and MapReduce hides it. Brian suggested I expand on this point somewhere so people could talk about it.  So here we go.

Read More »

The recent July 2011 issue of Communications of the ACM includes our article on the technical aspects of the search for Jim Gray’s boat Tenacious.  This was a hard article to write, for both technical and personal reasons. It took far too long to finish, so at some point it was time to just pack it in (at which point the CACM folks informed us it had to be cut in length by half, which delayed things further.  The longer version is up as a Berkeley tech report.)

Meanwhile, some of the experience is even more relevant to current technology trends than it was 4 years ago, so hopefully folks interested in social computing, software engineering, image processing, crisis response, and other related areas will find something of use in there.

For those of you whose work is represented (or underrepresented) by the article, my apologies for its shortcomings.  I still don’t have the full picture of what happened—nobody does, really.  As a result I decided to avoid using personal names of volunteers in general to avoid attributing credit unevently. I know the result seems oddly impersonal.  Setting the tone of the article was as hard as capturing the content.

Meanwhile, I encourage you to add corrections and perspective to the article in the comment box at the end of the CACM link above. Comments are welcome here too, but they might not get as well-viewed or -archived.

Today was a big day in the BOOM group: we launched the alpha version of Bud: Bloom Under Development. If you’re new to this blog, Bloom is our new programming language for cloud computing and other distributed systems settings. Bud is the first fully-functional release of Bloom, implemented as a DSL in Ruby.

I’ve written a lot about Bloom in research papers and on the new Bloom website, and I have lots to say about distributed programming that I won’t recap. Instead, I want to focus here on the tangible: working code. If you’re looking for something serious, check out the walkthrough of the bfs distributed filesystem, a GFS clone. But to get the flavor, consider the following two lines of code, which implement what you might consider to be “hello, world” for distributed systems: a chat server.

nodelist <= connect.payloads
mcast <~ (mcast * nodelist).pairs { |m,n| [n.key, m.val] }

That’s it.

The first line says “if you get a message on a channel called ‘connect’, remember the payload in a table called ‘nodelist'”. The second says “if you get a message on the ‘mcast’ channel, then forward its contents to each address stored in ‘nodelist'”. That’s all that’s needed for a bare-bones chat server.  Nice, right?

Read More »