A major source of frustration in distributed programming is that contemporary software tools—think compilers and debuggers—have little to say about the really tricky bugs that distributed systems developers face. Sure, compilers can find type and memory errors, and debuggers can single-step you through sequential code snippets. But how do they help with distributed systems issues? In some sense, they don’t help at all with the stuff that matters—things like:
- Concurrency: Does your code have bugs due to race conditions? Don’t forget that a distributed system is a parallel system!
- Consistency: Are there potential consistency errors in your program due to replicated state? Can you get undesirable non-deterministic outcomes based on network delays? What about the potential for the awful “split-brain” scenario where the state of multiple machines gets irrevocably out of sync?
- Coordination performance: Do you have performance issues due to overly-aggressive coordination or locking? Can you avoid expensive coordination without incurring bugs like the ones above?
These questions are especially tricky if you use services or libraries, where you don’t necessarily know how state and communication are managed. What code can you trust, and what about that code do you need to know to trust it?