12/16/2010: final version of CALM/Bloom paper for CIDR now posted

Conventional Wisdom:
In large distributed systems, perfect data consistency is too expensive to guarantee in general. “Eventually consistent” approaches are often a better choice, since temporary inconsistencies work out in most cases. Consistency mechanisms (transactions, quorums, etc.) should be reserved for infrequent, small-scale, mission-critical tasks.

Most computer systems designers agree on this at some level (once you get past the NoSQL vs. ACID sloganeering). But like lots of well-intentioned design maxims, it’s not so easy to translate into practice — all kinds of unavoidable tactical questions pop up:

Questions:

  • Exactly where in my multifaceted system is eventual consistency “good enough”?
  • How do I know that my “mission-critical” software isn’t tainted by my “best effort” components?
  • How do I maintain my design maxim as software evolves? For example, how can the junior programmer in year n of a project reason about whether their piece of the code maintains the system’s overall consistency requirements?

If you think you have answers to those questions, I’d love to hear them. And then I’ll raise the stakes, because I have a better challenge for you: can you write down your answers in an algorithm?

Challenge:
Write a program checker that will either “bless” your code’s inconsistency as provably acceptable, or identify the locations of unacceptable consistency bugs.

The CALM Conjecture is my initial answer to that challenge.
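To make the shape of such a checker concrete, here is a toy sketch, not the CALM analysis itself: assuming a hypothetical rule language where each rule is tagged with its operator, a checker can "bless" rules built only from monotonic operators (selection, projection, join, union) and flag the locations of non-monotonic ones (negation, aggregation) as potential consistency trouble spots.

```python
# Toy sketch of a consistency "blessing" checker. The rule format and
# operator names below are illustrative assumptions, not Bloom's API.
MONOTONIC_OPS = {"select", "project", "join", "union"}
NON_MONOTONIC_OPS = {"negate", "count", "min", "max", "sum"}

def check_rules(rules):
    """Given (line_number, operator) pairs, bless a fully monotonic program
    or return the locations of operators that may need coordination."""
    warnings = [
        (lineno, f"non-monotonic op '{op}' may need coordination")
        for lineno, op in rules
        if op in NON_MONOTONIC_OPS
    ]
    return (len(warnings) == 0, warnings)

# A three-rule program: two monotonic rules, one aggregation.
rules = [(1, "select"), (2, "join"), (3, "count")]
blessed, warnings = check_rules(rules)
print(blessed)   # False: rule 3 is flagged, rules 1-2 are fine
```

The point of the sketch is the *interface*: the checker either approves the whole program or pinpoints the specific rules where eventual consistency is not automatically safe.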

By now it’s a truism in cloud computing and internet infrastructure that component failure happens frequently. It’s simple statistics: (lots of components) * (a small failure rate per component) = a high component failure rate across the collection. People now routinely architect distributed systems for this reality.
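A quick worked instance of that statistic: with n independent components, each failing in some window with small probability p, the chance that at least one fails is 1 − (1 − p)ⁿ, which climbs quickly as n grows (the component counts here are made up for illustration).

```python
# Probability that at least one of n independent components fails,
# given a per-component failure probability p for the window.
def p_any_failure(n, p):
    return 1 - (1 - p) ** n

# One component with a 0.01% failure chance: negligible risk.
# Ten thousand such components: failure is more likely than not.
print(round(p_any_failure(1, 0.0001), 4))       # 0.0001
print(round(p_any_failure(10_000, 0.0001), 3))  # 0.632
```

Even a one-in-ten-thousand per-component failure rate yields roughly a 63% chance of some failure across ten thousand components, which is why failure handling is a first-class design concern, not an edge case.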

That’s necessary but not sufficient: you also need correct failure-handling protocols and faithful implementations.

So, do today’s popular distributed systems handle component failure well? Or, taking the longer view: what kinds of tools will help the engineers who build those systems ensure that they handle component failure well?

My postdoc Haryadi Gunawi and his team have taken some big steps to answer these questions, and written them up in their report on FATE and DESTINI: A Framework for Cloud Recovery Testing. They take a systematic combinatorial approach to generating faults (FATE) and a formal approach to specifying correctness (DESTINI) that grows out of our work on declarative languages. Upshot: research that produces real tools, which help developers find (and then fix) real failure-handling bugs, including 16 new bug reports to HDFS (7 design bugs and 9 implementation bugs). Pretty nice, given the intricacies of failure-recovery protocols.
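To give a feel for the combinatorial side, here is a minimal sketch of systematic fault generation in the FATE style. The injection-point names and fault types below are hypothetical placeholders, not FATE's actual interface: the idea is simply to enumerate every assignment of up to k faults to the program's fault-injection points, so no small failure combination goes untested.

```python
# Sketch of combinatorial fault-scenario generation (illustrative names,
# not FATE's real API): enumerate which injection points fail, and how.
from itertools import combinations, product

def fault_scenarios(points, fault_types, max_faults=2):
    """Yield every assignment of 1..max_faults fault types to distinct
    injection points, as tuples of (point, fault_type) pairs."""
    for k in range(1, max_faults + 1):
        for chosen in combinations(points, k):
            for assignment in product(fault_types, repeat=k):
                yield tuple(zip(chosen, assignment))

# Hypothetical injection points in a write pipeline, two fault types:
points = ["write_block", "ack_pipeline", "update_namenode"]
faults = ["crash", "disk_error"]
scenarios = list(fault_scenarios(points, faults))
print(len(scenarios))  # 18: 6 single-fault + 12 two-fault scenarios
```

Each generated scenario would drive one test run, with a declarative correctness specification (the DESTINI side) checked against the run's observed behavior.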
