Debugging Distributed Programs with Blazes

A major source of frustration in distributed programming is that contemporary software tools—think compilers and debuggers—have little to say about the really tricky bugs that distributed systems developers face. Sure, compilers can find type and memory errors, and debuggers can single-step you through sequential code snippets. But how do they help with distributed systems issues? In some sense, they don’t help at all with the stuff that matters—things like:

Concurrency: Does your code have bugs due to race conditions? Don’t forget that a distributed system is a parallel system!
Consistency: Are there potential consistency errors in your program due to replicated state? Can you get undesirable non-deterministic outcomes based on network delays? What about the potential for the awful “split-brain” scenario where the state of multiple machines gets irrevocably out of sync?
Coordination performance: Do you have performance issues due to overly-aggressive coordination or locking? Can you avoid expensive coordination without incurring bugs like the ones above?
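To make the consistency question concrete, here is a minimal sketch of two replicas that receive the same pair of updates in different orders because of network delay. It is not taken from Blazes; the `apply_ops` helper and the "add"/"mul" operations are hypothetical, chosen only because they don't commute.

```python
def apply_ops(initial, ops):
    """Apply a sequence of (kind, amount) updates to one replica's value."""
    value = initial
    for kind, amount in ops:
        if kind == "add":
            value += amount
        elif kind == "mul":
            value *= amount
        else:
            raise ValueError(f"unknown op: {kind}")
    return value

# Hypothetical updates broadcast to both replicas.
updates = [("add", 10), ("mul", 2)]

# Network delay delivers the same updates in different orders.
replica_a = apply_ops(100, updates)                  # add first, then multiply
replica_b = apply_ops(100, list(reversed(updates)))  # multiply first, then add

print(replica_a, replica_b)  # 220 210: same inputs, divergent state
```

Nothing here would trip a compiler or a single-node debugger: each replica runs correct sequential code, and the bug only exists in the interleaving.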

These questions are especially tricky if you use services or libraries, where you don’t necessarily know how state and communication are managed. What code can you trust, and what about that code do you need to know to trust it?

Peter Alvaro has been doing groundbreaking work in this space, and recently started unveiling his results. This is a big deal. Peter built a program analysis tool called Blazes that can identify the points in distributed code where concurrency and consistency problems can arise, and automatically propose coordination fixes, choosing between message-ordering approaches and barrier approaches.

Blazes draws inspiration from the CALM theorem, but Blazes is much more than a formalism—it’s a tool for analyzing working code. In fact, a cool thing about Blazes is that it’s language-agnostic, and can be integrated into any framework based on a pattern of messaging between components: this includes dataflow and stream programming models like Apache Storm, logic programming models like Bloom, actor-style programming, and any SOA approach based on orchestrating a fixed set of services. (Logic languages like Bloom that are amenable to automated CALM analysis make Blazes effortless to use. But even with interfaces like Storm’s, programmers only have to provide a small set of annotations to get a Blazes analysis.)
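The flavor of annotation-driven analysis can be illustrated with a toy sketch. To be clear, this is not Blazes's actual annotation language or algorithm: the "confluent" and "order_sensitive" labels, the component names, and the `needs_coordination` function are all assumptions made up for illustration. The idea is just that once each component is labeled, a tool can walk the dataflow graph and flag where reordered messages could matter.

```python
# Hypothetical annotations: the programmer labels each component as either
# "confluent" (output does not depend on input order) or "order_sensitive".
# These labels are illustrative only, not real Blazes syntax.
components = {
    "splitter": "confluent",
    "counter": "order_sensitive",
    "committer": "confluent",
}

# Dataflow edges (sender, receiver). Assume every edge is an asynchronous
# network channel, so messages on it may arrive reordered.
edges = [("splitter", "counter"), ("counter", "committer")]

def needs_coordination(components, edges):
    """Return the order-sensitive components that receive asynchronous input."""
    receivers = {dst for _, dst in edges}
    return sorted(
        name
        for name, label in components.items()
        if label == "order_sensitive" and name in receivers
    )

flagged = needs_coordination(components, edges)
print(flagged)  # ['counter']
```

In this toy version only `counter` is flagged, so coordination (ordering or a barrier) would be considered there and nowhere else. That selectivity is the appeal of this style of analysis.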

Peter presented a paper on Blazes at ICDE 2014 last month; slides from his talk are here. He’d love to get feedback, and to get examples of Storm or Bloom programs that could use Blazes analysis.

PS: Yes, Peter is working on Fault Tolerance too. Stay tuned!
