If you follow this blog, you know that my BOOM group has spent a lot of time in the past couple years formalizing eventual consistency (EC) for distributed programs, via the CALM theorem and practical tools for analyzing Bloom programs.
In recent months, my student Peter Bailis and his teammate Shivaram Venkataraman took a different tack on the whole EC analysis problem, which they call PBS: Probabilistically Bounded Staleness. The results are interesting, and extremely relevant to current practice. (See, for example, the very nice blog post by folks at DataStax.)
Many people today deal with EC in the specific context of replica consistency, particularly in distributed NoSQL-style Key-Value Stores (KVSs). It is typical to configure these stores with so-called “partial” quorum replication, to get a comfortable mix of low latency with reasonable availability. The term “partial” signifies that you are not guaranteed consistency of writes by these configurations — at best they guarantee a form of eventual consistency of final writes, but readers may well read stale data along the way. Lots of people are deploying these configurations in the field, but there’s little information on how often the approach messes up, and how badly.
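To make the "partial" part concrete, here is a minimal sketch of the classic quorum rule, not tied to any particular KVS: with N replicas, a write quorum of W and a read quorum of R, requiring R + W > N forces every read to overlap the latest write; partial quorums relax that inequality to buy latency and availability.

```python
# Minimal sketch of the strict-vs-partial quorum condition.
# N = replicas per key, W = replicas that must ack a write,
# R = replicas consulted on a read.

def is_strict_quorum(n: int, r: int, w: int) -> bool:
    """True if every read quorum must intersect every write quorum."""
    return r + w > n

# A common low-latency configuration: N=3, R=1, W=1.
print(is_strict_quorum(3, 1, 1))  # False -> a read may miss the latest write
print(is_strict_quorum(3, 2, 2))  # True  -> reads always see the latest ack'd write
```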
Jumping off from earlier theoretical work on probabilistic quorum systems, Peter and Shivaram answered two natural questions about how these systems should perform in current practice:
- How many versions ago? In expectation, if you do a read in a partial-quorum KVS, how many versions behind are you? Peter and Shivaram answer this one definitively, via a closed-form mathematical analysis (a simplified sketch of this style of analysis appears after this list).
- How stale on the (wall-)clock? In expectation, if you do a read in a partial-quorum KVS, how out-of-date will your version be in terms of wall-clock time? Answering this one requires modeling a read/write workload in wall-clock time, as well as system parameters like replica propagation ("anti-entropy"). Peter and Shivaram address this with a Monte Carlo model (a toy version appears below), and run the model with parameters grounded in real-world performance numbers generously provided by two of our most excellent colleagues: Alex Feinberg at LinkedIn and Coda Hale at Yammer (both of whom also guest-lectured in my Programming the Cloud course last fall). Peter and Shivaram validated their models in practice using Cassandra, a widely-used KVS.
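To give a feel for the version-staleness analysis, here is a minimal sketch under a deliberately simplified model of my own (read and write quorums chosen independently and uniformly at random); it is not the paper's exact derivation, just the flavor of the combinatorics involved.

```python
# Simplified k-staleness sketch (my own toy model, not the paper's).
# With N replicas, a write lands on W of them and a read samples R of them.
# The chance a single read misses the latest write is C(N-W, R) / C(N, R);
# under an independence assumption, missing the last k writes (i.e., reading
# a value more than k versions stale) has probability roughly that raised to k.

from math import comb

def p_miss_latest(n: int, r: int, w: int) -> float:
    """Probability a random read quorum contains no replica with the latest write."""
    return comb(n - w, r) / comb(n, r)

def p_staler_than_k(n: int, r: int, w: int, k: int) -> float:
    """Probability a read is more than k versions stale (independence assumption)."""
    return p_miss_latest(n, r, w) ** k

N, R, W = 3, 1, 1
for k in (1, 2, 3):
    print(f"P(more than {k} versions stale) ~ {p_staler_than_k(N, R, W, k):.3f}")
```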
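And here is a toy Monte Carlo sketch of the wall-clock question, in the spirit of the model described above but heavily simplified. The assumptions are mine, not the paper's: each replica receives a write after an independent exponentially distributed propagation delay, and a read issued t milliseconds later samples R replicas uniformly at random.

```python
# Toy Monte Carlo estimate of wall-clock staleness (simplified, illustrative only).
import random

def p_stale_read(n: int, r: int, mean_delay_ms: float, t_ms: float,
                 trials: int = 100_000) -> float:
    """Estimate P(a read at time t returns a value older than the latest write)."""
    stale = 0
    for _ in range(trials):
        # Propagation delay to each replica (exponential with the given mean).
        delays = [random.expovariate(1.0 / mean_delay_ms) for _ in range(n)]
        # Read quorum: R replicas chosen uniformly at random.
        quorum = random.sample(range(n), r)
        # Stale if no sampled replica has received the write by time t.
        if all(delays[i] > t_ms for i in quorum):
            stale += 1
    return stale / trials

# Example: N=3, R=1, 10ms mean propagation delay, read issued 5ms after the write.
print(p_stale_read(n=3, r=1, mean_delay_ms=10.0, t_ms=5.0))
```

Sweeping t_ms in a sketch like this is what produces the staleness-versus-time tradeoff curves that the real model (with measured latency distributions) quantifies properly.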
On the whole, PBS shows that being sloppy about consistency doesn’t bite you often or badly — especially if you’re in a single datacenter and you use SSDs. But things get more complex with magnetic disks, garbage collection delays (grr), and wide-area replication.
Interested in more detail? You can check out two things:
- The paper (currently under submission to a conference) is available as a Berkeley Tech Report.
- Peter put up a web-based version of the Monte Carlo simulation that allows you to specify quorum parameters and workload parameters, and observe the tradeoff between those parameters and the probability of various levels of staleness.