There’s fast and there’s fast. This post is about Anna*, a key/value database design from our team at Berkeley that’s got phenomenal speed and buttery-smooth scaling, with an unprecedented range of consistency guarantees. Details are in our upcoming ICDE18 paper on Anna.

Conventional wisdom (or at least Jeff Dean wisdom) says that you have to redesign your system every time you scale by 10x. As researchers, we asked the counter-cultural question:

What would it take to build a key-value store that would excel across many orders of magnitude of scale, from a single multicore box to the global cloud?

Turns out this kind of curiosity can lead to a system with pretty interesting practical implications.

The key design point for answering our question centered on an ongoing theme in my research group over the last many years: designing distributed systems that avoid coordination. We’ve developed fundamental theory (the CALM Theorem), language design (Bloom), program checkers (Blazes), and transactional protocols (HATs, Invariant Confluence). But until now we hadn’t demonstrated the kind of performance and scale these principles can achieve across both multicore and cloud environments. Consider that domino fallen.

Side-note: Here’s what you need to know about coordination. In every computer — from a multicore chip in your phone to a cloud data center — many threads of execution are running at once. Almost every piece of software you run wastes enormous amounts of time coordinating with other threads to “leave its swimlane” … usually to modify bits of shared data. If each thread would just “stay in its swimlane”, all the threads would run at full speed. Don’t let anybody tell you otherwise (including peddlers of so-called “lock-free” data structures): updating shared data requires coordination and is a #GoSlow button for your software.
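
To see that cost in miniature, here’s a toy sketch of my own (illustrative only, not Anna code): a shared atomic counter, where every increment coordinates on a single contended cache line, versus per-thread private counters that are merged once at the end. On a typical multicore box the private version runs dramatically faster for the same total work.

    // Toy contrast (hypothetical, not Anna code): coordinated vs. coordination-free counting.
    // Build (hypothetical file name): g++ -O2 -std=c++17 -pthread swimlane.cc
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    constexpr int kThreads = 4;
    constexpr long kOpsPerThread = 1'000'000;

    // Each private counter gets its own cache line, so threads never share data.
    struct alignas(64) PaddedCounter { long v = 0; };

    int main() {
      // Coordinated: every increment contends for one shared cache line.
      std::atomic<long> shared{0};
      std::vector<std::thread> ts;
      for (int i = 0; i < kThreads; i++)
        ts.emplace_back([&] {
          for (long j = 0; j < kOpsPerThread; j++)
            shared.fetch_add(1, std::memory_order_relaxed);  // the #GoSlow button
        });
      for (auto& t : ts) t.join();
      ts.clear();

      // Coordination-free: each thread stays in its swimlane, updating private
      // state at full speed; results are merged once, after all threads finish.
      std::vector<PaddedCounter> priv(kThreads);
      for (int i = 0; i < kThreads; i++)
        ts.emplace_back([&priv, i] {
          for (long j = 0; j < kOpsPerThread; j++) priv[i].v++;
        });
      for (auto& t : ts) t.join();

      long total = 0;
      for (auto& p : priv) total += p.v;
      std::printf("shared=%ld private-then-merge=%ld\n", shared.load(), total);
    }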

Anna offers world-beating speed at many scales. The paper includes numbers showing it beating Redis by over 10x on a single AWS instance, and beating Cassandra by 10x across the globe on a standard interactive benchmark. To get down to the details, we also benchmarked Anna against stronger contenders on a single-node batch request benchmark, to see just how fast it could go at its core task of puts and gets. Here Anna beats the pants off the competition, including comparable “state of the art” performance-oriented KVS systems: it was up to 700x faster than Masstree, and up to 800x faster than Intel’s “lock-free” TBB hash table. In fairness, those systems provide linearizable consistency and Anna does not. But Anna was still up to 126x faster than a “hogwild”-style completely inconsistent C++ hashtable, thanks to cache locality for private state, while providing quite attractive coordination-free consistency. And when you want to scale up to the cloud (which Anna does but those systems cannot), you can’t realistically maintain linearizability anyhow. More on Anna’s consistency models in a moment.

Anna gets its performance and its scalability from a fully coordination-free implementation of simple actors with private state, which communicate only via background gossip. Essentially it’s a distributed system deployed across cores as well as nodes. Coordination-freeness keeps every actor in its swimlane doing useful work: under high-contention workloads we see 90% of Anna’s cycles going toward serving requests. For the same workloads, systems like Masstree and Intel TBB get well below 10% of their cycles serving requests, because they spend over 90% of their cycles waiting on coordination. And even under low-contention workloads, those systems suffer from processor cache misses induced by shared memory.
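
For intuition, here’s a deliberately tiny, hypothetical sketch of the actor pattern (single-threaded for clarity, and not Anna’s actual code): each actor serves puts and gets purely from a private map, and replicas reconcile only by merging background gossip. Because the merge keeps the higher version, gossip can arrive in any order and the replicas still converge.

    // Hypothetical sketch of the actor pattern (single-threaded for clarity).
    #include <cstdio>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Values carry a version; merge keeps the higher version (last-writer-wins).
    struct VersionedValue {
      long version = 0;
      std::string value;
    };

    struct Actor {
      std::map<std::string, VersionedValue> state;  // private: no locks anywhere
      std::vector<std::pair<std::string, VersionedValue>> outbox;  // pending gossip

      void put(const std::string& k, const std::string& v, long version) {
        merge(k, {version, v});
        outbox.push_back({k, state[k]});  // queued for background gossip
      }
      const std::string* get(const std::string& k) {
        auto it = state.find(k);
        return it == state.end() ? nullptr : &it->second.value;
      }
      void merge(const std::string& k, const VersionedValue& in) {
        auto& cur = state[k];
        if (in.version > cur.version) cur = in;  // order-insensitive, idempotent
      }
    };

    int main() {
      std::vector<Actor> actors(2);
      actors[0].put("x", "hello", 1);  // each put is served from private state
      actors[1].put("x", "world", 2);

      // One background gossip round: deliver every outbox to every peer.
      for (size_t i = 0; i < actors.size(); i++)
        for (auto& kv : actors[i].outbox)
          for (size_t j = 0; j < actors.size(); j++)
            if (j != i) actors[j].merge(kv.first, kv.second);

      // Replicas converge without any actor ever blocking on another.
      std::printf("actor0=%s actor1=%s\n",
                  actors[0].get("x")->c_str(), actors[1].get("x")->c_str());
    }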

I like Anna’s speed, but what’s also interesting is the palette of degrees of consistency Anna can achieve at that speed. A couple years back, we published the HATs work showing that there is a rich space of distributed consistency and transactional isolation that can (in principle) be achieved coordination-free. This includes fairly rich things like causal consistency or Read Committed transactions. We get this rich consistency in Anna with a very clean codebase, by porting design patterns of monotone lattice composition from Bloom to C++. The state of each Anna actor is a monotone lattice composition. Anna is the first system to offer all these consistency levels, and the various choices differ in only a couple dozen lines of C++ each. And thank goodness—because simplicity is key to this kind of speed.
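
The lattice pattern is easy to see in miniature. Below is an illustrative sketch (hypothetical class names, not Anna’s code): a merge method that is associative, commutative, and idempotent, plus a map lattice composed key-wise from any element lattice. Swapping the element lattice (say, one carrying causal metadata in place of this max lattice) is the kind of small, local change that selects a different consistency level.

    // Illustrative lattice sketch (hypothetical names, not Anna’s classes).
    #include <algorithm>
    #include <cstdio>
    #include <map>
    #include <string>

    // The merge is associative, commutative, and idempotent, so gossip may
    // arrive in any order, any number of times; replicas still converge.
    struct MaxIntLattice {
      long v = 0;
      void merge(const MaxIntLattice& o) { v = std::max(v, o.v); }
    };

    // Composition: a map whose values are lattices is itself a lattice.
    template <typename L>
    struct MapLattice {
      std::map<std::string, L> m;
      void merge(const MapLattice& o) {
        for (const auto& kv : o.m) m[kv.first].merge(kv.second);  // key-wise merge
      }
    };

    int main() {
      MapLattice<MaxIntLattice> a, b;
      a.m["clicks"].v = 3;
      b.m["clicks"].v = 5;
      a.merge(b);
      a.merge(b);  // idempotent: re-delivered gossip is harmless
      std::printf("clicks=%ld\n", a.m["clicks"].v);  // 5
    }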

Anna is a prototype and we learned a ton doing it. I think the lessons of what we did apply well beyond key-value databases to any distributed system that manages internal state—basically everything. We’re now actively working on an extended system, codename Bedrock, based on Anna. Bedrock will provide a hands-off, cost-effective version of this design in the cloud, which we’ll be open-sourcing and supporting more aggressively. Watch this space!

Credits:

  • Chenggang Wu was the fearless leader and key developer on Anna; Jose Faleiro and I were involved in the design. Props to Chenggang!
  • Thanks to old friends Peter Bailis (of HATs fame) and Neil Conway (of Bloom lattice fame) for feedback on this post, which builds directly on their influential earlier work!

* A native of California, Anna’s Hummingbird is the fastest animal alive relative to its size—a record established in prior research from UC Berkeley.

One thing we did not have to think about much in Anna was fast async messaging across cores and nodes—we got this from a lightweight usage of 0MQ, which is fabulous. Hats off to the 0MQ team!
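
For flavor, here’s about the smallest possible 0MQ example (the libzmq calls are standard; the wiring is my toy, not how Anna hooks it up): a push/pull socket pair where a sender fires off a gossip message asynchronously and the receiver drains its inbox on its own schedule.

    // Minimal 0MQ push/pull pair (toy wiring, not Anna’s). Build: g++ gossip.cc -lzmq
    #include <zmq.h>
    #include <cstdio>
    #include <cstring>

    int main() {
      void* ctx = zmq_ctx_new();
      void* pull = zmq_socket(ctx, ZMQ_PULL);   // think: an actor’s inbox
      zmq_bind(pull, "inproc://gossip");
      void* push = zmq_socket(ctx, ZMQ_PUSH);   // think: a peer sending gossip
      zmq_connect(push, "inproc://gossip");

      const char* update = "key=x version=2 value=world";
      zmq_send(push, update, std::strlen(update), 0);  // fire-and-forget, asynchronous

      char buf[64] = {0};
      zmq_recv(pull, buf, sizeof(buf) - 1, 0);  // inbox drained on the actor’s schedule
      std::printf("gossip received: %s\n", buf);

      zmq_close(push);
      zmq_close(pull);
      zmq_ctx_destroy(ctx);
    }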

8 Comments

  1. There is already a database product called Bedrock, using SQLite:

    http://bedrockdb.com/

  2. http://db.cs.berkeley.edu/jmh/papers/anna_ieee18.pdf :
    The ⊔ operator induces a partial order between elements of S. For any two elements a, b in S, if ⊔(a, b) = b, then we say that b’s order is higher than a, i.e. a ≺ b.
    Commutativity: ⊔(a, b) = ⊔(b, a) ∀ a, b ∈ S

  3. Amazing! If you want to go even faster, you could consider Aeron instead of 0MQ.

  4. From the paper: Anna actors engage in epoch-based key exchange to propagate key updates at a given actor to other masters in the key’s replication group.

    Does this mean that a write is safely replicated only after a *successful* key exchange with N replicas? Is it possible to ensure quorum writes or reads, to ensure that a write is safely replicated?

    • If I understand your question correctly, the visibility of writes to other readers (akin to what you call “safe” or “successful”) depends on the consistency model you configure. Anna is quite flexible in that regard. Anna does not require quorums, but easily supports them — it’s natural to define a composite lattice that includes a vector of ACKs for an update, which would grow as nodes gossip (a rough sketch follows below). You may want to look back at the cited paper by Bailis et al (http://www.vldb.org/pvldb/vol7/p181-bailis.pdf), which digs in more deeply regarding what’s possible to do coordination-free and what isn’t.
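
      To make that concrete, here’s a rough, hypothetical sketch of such a composite lattice (my illustration, not Anna’s code): a versioned value paired with a grow-only set of replica ACKs, where a client could choose to treat a write as “safe” once the ACK set reaches a quorum.

        // Hypothetical composite lattice: a versioned value plus a grow-only ACK set.
        #include <cstdio>
        #include <set>
        #include <string>

        struct AckedValue {
          long version = 0;
          std::string value;
          std::set<int> acks;  // replica ids known to have seen this version

          void merge(const AckedValue& o) {
            if (o.version > version) {
              version = o.version; value = o.value; acks = o.acks;
            } else if (o.version == version) {
              acks.insert(o.acks.begin(), o.acks.end());  // ACK sets union via gossip
            }
          }
          // A client-side policy, not a server-side wait: is a majority known?
          bool quorum(int n_replicas) const { return (int)acks.size() > n_replicas / 2; }
        };

        int main() {
          AckedValue a{2, "world", {0}};  // replica 0 has ACKed version 2
          AckedValue b{2, "world", {1}};  // replica 1 has ACKed version 2
          a.merge(b);
          std::printf("quorum-of-3=%d\n", a.quorum(3));  // 1: two of three ACKed
        }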

  5. Hi Joe, hope you are doing well!

    I read your post & paper on Anna – it makes some very interesting observations! Cool stuff. I have some comments.

    First, I’m not sure “partitioning with replication” is the golden solution at all levels of distribution. Within one machine, I’ve found shared-memory architectures to be very efficient, skew-tolerant, and scalable. At SIGMOD 2018, we have a new key-value store called FASTER (https://www.microsoft.com/en-us/research/publication/faster-concurrent-key-value-store-place-updates/) that achieves bare-metal performance in a shared-memory setting: ~160M ops/sec when the working set fits in main memory, while at the same time providing support for larger-than-memory data, persistence, and consistency! In other words, it is possible to “have it all” at this level of scale. Apart from a new concurrent hash index, FASTER exploits a novel concurrent “hybrid log” for records that seamlessly combines in-place updates and copy-on-write for performance. With FASTER’s design, concurrency control for mutable records is left to the application, which may choose to partition the input workload or not — the system handles both cases just fine.

    W.r.t. replication, Anna’s replication factor seems to be, at least currently, chosen statically for the entire database: not per key, and not changing over time. The key-value store deployments I have seen in practice are extremely memory-hungry, and deciding on a static replication factor for the store would just not work due to the blow-up in memory utilization. Your future-work point on selecting hot keys for replication sounds very interesting, and I would love to see what you guys come up with in this space!

    W.r.t. skew, I think there is more of a spectrum of skewedness, and carefully designed shared memory can win for a good range of that spectrum (on one machine), including no skew and realistic skew. Side note: a Zipf factor of 4 is IMHO a bit too skewed to be meaningful … I would have loved to see a better understanding in the paper of the crossover point where replication becomes the preferred choice over strict partitioning.

    In Trill, I had chosen a partitioned design (a streaming grouped aggregate is basically a 100% read-modify-write workload), but with FASTER as the per-node “state store” we can now get a much better overall solution that handles memory and skew nicely, and avoids shuffle, all the while providing great performance and consistency.

    Regards,
    Badrish

    • Hi Badrish,

      Thank you for your insightful comment!

      It’s very nice to see that with novel memory management techniques, FASTER is able to deliver high performance, fault-tolerance, and consistency at the same time within a multi-core machine.

      There are two high-level design goals that separate Anna and FASTER.

      First of all, for Anna, we set out to explore an execution model that’s truly coordination-free: each thread accepts requests, performs computation, and sends out responses without communicating with or waiting for other threads. We believe a coordination-free execution model is the key to fully exploiting multi-core parallelism within a single machine and to scaling out smoothly to a distributed setting. We acknowledge that the fundamental caveat of a coordination-free execution model is that strong consistency levels (linearizability, serializability) are not achievable. Anna instead offers the wide spectrum of coordination-free consistency levels taxonomized in Bailis et al’s HAT paper (http://www.vldb.org/pvldb/vol7/p181-bailis.pdf).

      In addition, in Anna we focus on exploring a unified architecture that works at any scale, from a single multi-core machine to NUMA to a geo-distributed setting. Under this goal, architectures that rely on shared memory within a machine (including FASTER) need to be redesigned as we move to a distributed setting. This complicates the software, and can introduce challenges in maintaining consistency, since the execution models within a node and across nodes are now different.

      Anna currently focuses on workloads that fit in memory. For larger-than-memory data, we believe Anna can benefit from the hybrid-logging technique in FASTER for efficiently persisting data to stable storage.

      With respect to replication: it is true that Anna employs a single replication factor across all keys, and this can sometimes be suboptimal (wasting memory and increasing gossip overhead under low contention). To address this limitation, the follow-up project to Anna, Bedrock, has a monitoring system that dynamically detects hot and cold keys and performs selective key replication. Hot keys are replicated across different threads (or nodes) to spread the load, while cold keys’ replication factors are minimized to save memory and reduce gossip overhead.

      With respect to skew, we think that in real workloads, where intensity, skewness, and hotspots change over time, there is no single “crossover” point where replication becomes preferable to partitioning or vice versa. Instead, systems should be able to scale elastically and perform selective replication to adapt to changes in the workload. We explore these mechanisms in Bedrock and have seen promising results. Stay tuned!

      Best,
      Chenggang

  6. Is the production version ready yet? Any estimate on when the open-source project codenamed Bedrock will be available? I’m willing to contribute so that there is a production-ready Anna DB.

    My contact: hi@avisri.com


