By now it’s a truism in cloud computing and internet infrastructure that component failure happens frequently. It’s simple statistics: (lots of components) * (a small failure rate per component) = a high component failure rate across the collection. People now routinely architect distributed systems for this reality.
That’s necessary but not sufficient: you also need correct failure-handling protocols and faithful implementations.
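To put rough numbers on the arithmetic above, here is a back-of-the-envelope sketch. The fleet size and per-component failure probability are made up for illustration, the function names are mine, and the calculation assumes components fail independently, which real clusters often violate.

```python
def expected_failures(n_components: int, p_fail: float) -> float:
    """Expected number of failed components in one time window."""
    return n_components * p_fail


def prob_at_least_one_failure(n_components: int, p_fail: float) -> float:
    """Chance that at least one component fails, assuming independence."""
    return 1.0 - (1.0 - p_fail) ** n_components


if __name__ == "__main__":
    # Hypothetical fleet: 10,000 components, each with a 0.1% chance
    # of failing in a given window.
    n, p = 10_000, 0.001
    print(expected_failures(n, p))
    print(prob_at_least_one_failure(n, p))
```

Even with a tiny per-component rate, the chance that something in the collection is broken at any given moment is close to certain.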
So, do today’s popular distributed systems handle component failure well? Or, taking the longer view: what kinds of tools will help the engineers who build those systems ensure that they handle component failure well?
My postdoc Haryadi Gunawi and his team have taken some big steps toward answering these questions, and written up the results in their report, FATE and DESTINI: A Framework for Cloud Recovery Testing. They take a systematic combinatorial approach to generating faults (FATE) and a formal approach to specifying correctness (DESTINI) that grows out of our work on declarative languages. Upshot: research that produces real tools, which help developers find (and then fix) real failure-handling bugs, including 16 new bug reports to HDFS (7 design bugs and 9 implementation bugs). Pretty nice, given the intricacies of failure-recovery protocols.
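To give a flavor of the two ideas, here is a toy sketch of my own. It is not FATE or DESTINI, and it uses none of their code or interfaces. It systematically enumerates combinations of node failures against a made-up replicated write, then checks each outcome against a small specification stated separately from the system under test. The node names, the replication factor, and the deliberately naive recovery logic are all hypothetical.

```python
from itertools import combinations

NODES = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical cluster
REPLICATION = 3                          # hypothetical target replica count


def toy_write(failed: frozenset) -> set:
    """Toy write path with a deliberately naive recovery:
    it replaces at most ONE failed pipeline node with a spare."""
    pipeline = NODES[:REPLICATION]
    spares = [n for n in NODES[REPLICATION:] if n not in failed]
    replicas = {n for n in pipeline if n not in failed}
    if len(replicas) < REPLICATION and spares:
        replicas.add(spares[0])  # single replacement attempt, then give up
    return replicas


def spec_holds(failed: frozenset, replicas: set) -> bool:
    """Specification, kept apart from the implementation: the write should
    land on as many live nodes as the target allows, and never on a dead one."""
    live = [n for n in NODES if n not in failed]
    expected = min(REPLICATION, len(live))
    return len(replicas) == expected and replicas.isdisjoint(failed)


def explore(max_faults: int) -> list:
    """Try every combination of up to max_faults failed nodes and
    return the combinations that violate the specification."""
    violations = []
    for k in range(max_faults + 1):
        for combo in combinations(NODES, k):
            failed = frozenset(combo)
            if not spec_holds(failed, toy_write(failed)):
                violations.append(combo)
    return violations


if __name__ == "__main__":
    print(explore(1))  # single failures
    print(explore(2))  # pairs of failures
```

In this toy, every single-failure case passes, and the recovery bug only shows up once two pipeline nodes fail together. That is the kind of case that hand-written tests tend to skip and that systematic enumeration reaches.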
Haryadi conceived of and drove this project, and like his earlier work fixing the sorry state of file system checkers, it takes good clean research designs and uses them to improve substantially on real-world practice.
Haryadi started the project like a social scientist, combing through the Jira issue-tracker reports for HDFS and classifying the recovery bugs. Then he set about generalizing issues, designing techniques, and building tools. After that he used the Jira reports to make sure his tools were getting good coverage (they found the already-reported bugs automatically) and generating tangible benefits (surfacing new bugs). He then moved from his first experimental setting (HDFS) to two new ones: Cassandra and Zookeeper. Results there are preliminary but look promising.
In the end, the whole package feels simple, sensible, and useful. That is exactly because of the way he combined grounding in practice with the elegance of his research ideas. Now that the initial results are written up, Haryadi is beginning discussions with Yahoo, Facebook, Cloudera and others to get this stuff out in the field.