
Category Archives: database

Cross-posted from the Berkeley RISELab Blog; comments should go there…

A key aspect of the RISELab agenda is to aggressively harness data—lots of it, both historical and live. Of course bits in computers don’t provide value on their own. We need a broader context for data: where it came from, what it represents, and how it gets used. Traditionally, people called this metadata: the data about our data.

Requirements for metadata have changed drastically in recent years in response to technology trends. There’s an emerging groundswell to address these new requirements and explore new opportunities. This includes our work on the broader notion of data context in the Ground system.

Metadata Megafail!

How should data-driven organizations respond to these changing requirements? In the tradition of Berkeley advice like how to build a bad research center and how to give a bad talk, let me offer some pointedly lousy ideas for a future-facing metadata strategy. Hopefully by reading this and doing the opposite, you’ll be walking the path toward a healthy metadata future.

Without further ado, here are 3 easy steps to Megafail with Metadata.


3 Steps to Metadata Megafail

Step 1. Metadata First

Reject any data that doesn’t come well-prepared with metadata in advance.

There’s a famous old slogan in computing: “Garbage In, Garbage Out”, a.k.a. #GIGO. What a great slogan! Wikipedia dates it back to 1957, and like many ideas from back then, it’s time to bring it back.

How can you tell if you have garbage data coming in? It breaks the rules on data format, content or sourcing. Which means it’s critical to formalize those rules in advance of loading any data, a strategy I like to call #MetadataFirst.  
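
If you want to see how little it takes to implement this terrible idea, here is a toy sketch of such a #MetadataFirst gate in Python. The field names and rules are hypothetical, purely for illustration: any record that arrives without a complete metadata envelope goes straight in the bin.

```python
# Toy #MetadataFirst ingest gate (deliberately bad advice; see above).
# Field names and required rules are hypothetical, for illustration only.

REQUIRED_METADATA = {"schema", "owner", "source", "collected_at"}

def ingest(records):
    """Keep only records that arrive with a complete metadata envelope."""
    kept = []
    for rec in records:
        meta = rec.get("metadata", {})
        if REQUIRED_METADATA.issubset(meta):
            kept.append(rec)
        # else: garbage in? Straight to the bin. #GIGO
    return kept

if __name__ == "__main__":
    batch = [
        {"metadata": {"schema": "clicks_v1", "owner": "web", "source": "nginx",
                      "collected_at": "2017-03-01"}, "payload": "..."},
        {"payload": "99% of real data arrives looking like this"},  # rejected
    ]
    print(len(ingest(batch)), "of", len(batch), "records survived the gate")
```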

What’s that you say? 99% of your data arrives without meaningful metadata? My goodness—stay vigilant! If you have data with any whiff of garbage, then by all means throw that data away.

Now you might say that it’s a lot cheaper to store data now than it was in 1957. So why not keep data around even if it’s dirty and poorly-described? And you might say that analysts armed with AI-assisted data wrangling and analytics techniques can extract valuable signals from noisy raw data. So why not “figure out” useful metadata over time?

Sure, you might say those things. But you shouldn’t! Because really… can you guarantee that those “signals” you’re extracting are true? You can’t! There are no airtight guarantees to be had unless they were enforced at the time of data collection. Because if it was garbage coming in, well … you know the saying.

#GIGO #MetadataFirst!

Step 2. Lock it up, Lock it in.

Deploy a metadata system that is as inflexible and proprietary as possible.

As we know, metadata is a critical asset: the key to enabling people to discover relevant data, assess its content, and figure out how to make use of it. A critical asset like metadata should not be left lying around. It should be locked away in a safe place, where it is difficult to access. I call that place #MetaJail.

You’re probably wondering how you can make a great metajail—one that renders your metadata really inaccessible. I’m here to help.

To begin, a good metajail should be prescriptive: it should impose a complex model that all metadata must conform to. This ensures that a diverse range of organizations—from biolabs to banks to bureaucracies—can all have an equally difficult time coercing their metadata into the provided model.

A great metajail is not only prescriptive, it’s proprietary. Vendor-specific solutions ensure both that metadata is locked up, and that the larger organization is locked in to the vendor as well. Even better is to put your metajail in a proprietary smokestack. Choose a vendor whose metajail only works with their other components: prep, ingest, storage, analytics, charting, etc. This is like a MetaPenalColony!

I hope it goes without saying that you should be the warden of the metajail. Wherever possible, require manual approvals for authorization. Then people who need to use data will have to go through you to get access. This maximizes your control over metadata, and hence your control over data, and hence your control over people! All of which increases your job security and keeps change at bay.

#MetaJail for the #MegaFail!

Step 3. One Truth

Ensure that the metadata enforces a single interpretation of data.

Traditional metadata solutions are often characterized as the Single Source of Truth for an organization. This is another great slogan from the past. #SSOT!

I like everything about this slogan. “SS” gets things started on an aggressive note, emphasizing central control in a single source (#MetaJail!) and rejection of any unapproved sources (#GIGO!).

But this step is not just a rehash of the previous two; it adds the final T: “Truth”. Now there’s a word that we don’t hear enough of in modern computing. The terrific thing about managing the Truth is that we get to define it. Remember, organizations are trying to do more and more with data, across an increasing number of use cases. Resist this trend! It is mayhem in the making. Consider a humble log file from a webserver. The Marketing department uses it to optimize the website layout. The IT department wants to use it to understand and predict server downtimes. The Online Services department wants to use it to build a product recommender system. Each one of these groups wants to extract different features and schemas from the log file, resulting in different metadata.

As you may know, Chief Marketing Officers supposedly spend more on IT than CIOs. (Which means, one assumes, that Marketing should control the Truth for your organization.) Hence in our example, the metadata describing the web logs should be defined by the Marketing department, and the log files should be transformed to suit that use case. Other representations of that information should be rejected as metadata Falsehoods.

So mandate a #SSOT, and manage Truth. Ask yourself: are you #SSoTOrNot?


Better Advice

At the risk of stating the obvious … all of the above steps are terrible ideas. But sometimes it’s good to rule some things out before we decide what we should do.

Want to know more about our serious ideas for metadata management? Have a look at our initial paper on Ground from CIDR 2017. As a brief excerpt, here are some design requirements for evaluating a modern metadata or data context system (a toy code sketch follows the list). It should be:

    1. Model-Agnostic. For broad adoption and usage, a data context system cannot impose opinions on metadata modeling.
    2. Relativistic. It should be possible to record multiple interpretations of the same data, allowing users and applications to agree to disagree.
    3. Politically Neutral. The system should not be controlled by a single vendor or interest group. Like Internet protocols, data context is a “narrow waist”. To succeed, it must interoperate with a wide range of services and systems from third parties, some of which have yet to be born.
    4. Immutable. Data context must be immutable and versioned; updating stored context in place is tantamount to erasing history.
    5. Scalable. It is a common misconception that metadata is small. In many settings, the data context is far larger than the data itself.
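
To make a few of these requirements concrete, here is a toy sketch of my own in Python, not Ground’s actual data model or API: a context store that is model-agnostic (arbitrary payloads), relativistic (multiple interpretations of the same data coexist), and immutable and versioned (records are appended, never overwritten).

```python
# Toy sketch of a model-agnostic, immutable, versioned context store.
# An illustration of the requirements above, not Ground's data model or API.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional
import itertools

_versions = itertools.count(1)

@dataclass(frozen=True)               # immutable: no in-place updates
class ContextVersion:
    data_id: str                      # the data item being described
    author: str                       # who asserted this interpretation
    payload: Dict[str, Any]           # model-agnostic: any metadata shape
    parent: Optional[int] = None      # previous version, for lineage
    version: int = field(default_factory=lambda: next(_versions))

class ContextStore:
    """Append-only store: interpretations can disagree, history is never erased."""
    def __init__(self):
        self._log = []

    def record(self, data_id, author, payload, parent=None):
        v = ContextVersion(data_id, author, dict(payload), parent)
        self._log.append(v)           # append, never overwrite
        return v.version

    def interpretations(self, data_id):
        return [v for v in self._log if v.data_id == data_id]

# Marketing and IT "agree to disagree" about the same web log:
store = ContextStore()
store.record("weblog-2017-03", "marketing", {"grain": "session", "fields": ["page", "campaign"]})
store.record("weblog-2017-03", "it-ops", {"grain": "request", "fields": ["ts", "status", "latency_ms"]})
print(len(store.interpretations("weblog-2017-03")), "interpretations recorded")
```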

Ground’s design is based on collaborations with colleagues from the field including Awake Networks, Capital One, Cloudera, Dataguise, LinkedIn, SkyHigh Networks and Trifacta. Slides from the conference talk are available as well.


It’s been a while since I’ve taken the time to write a blog post here. If there’s one topic that deserves a catchup post in the last few months, it’s the end of an era for my former students Peter Alvaro and Peter Bailis—henceforth Professor Peter A of UC Santa Cruz, and Professor Peter B of Stanford. Each of them officially turned in their dissertation in December. Both dissertations spanned an impressive range of theory, practice and genuine applicability. There’s tons of good stuff in each document. Here’s a bit of an overview with references for those of you who might not be diving in to read them cover-to-cover.

Peter Alvaro was a pillar of my BOOM research project from its inception. His thesis is entitled Data-centric Programming for Distributed Systems, and it covers a beautiful arc of results:

  • It starts with his insightful exploration of commit and consensus protocols in a declarative language, and his collaboration on the BOOM Analytics work that built a ridiculously high-function HDFS clone in ridiculously few lines of code and hours of developer time
  • It also includes his foundational design of the Dedalus logic for distributed programming that has become a touchstone for the database theory community, in addition to our team
  • and his contributions to the Bloom language including the core semantics and many pragmatic features
  • It covers in depth his work on the Blazes system for analyzing eventual consistency at the level of program semantics both for Bloom and for dataflow languages like Storm, and automatically synthesizing coordination code where needed including a high-performance solution called sealing
  • and finally it presents his work on Lineage Driven Fault Injection (LDFI) and the Molly prototype, which extracted new benefits from declarative programming in large-scale testing, and was recently adapted for use at Netflix.

The thesis leaves out a bunch of additional work he did at Berkeley, including contributions to the much-cited MapReduce Online effort, and his work on distributed system testing with BloomUnit. But what I’ll remember most from his graduate years is the team-teaching we did on Programming the Cloud, where we used our work on Bloom to get undergraduates learning the fundamentals of distributed systems via live coding. This was without question the most creative and audacious teaching I’ve been involved with, and it worked surprisingly well thanks in large part to Peter’s hard work and more importantly his warm and thoughtful spirit. I’m excited to see Peter A teaching it again this coming quarter at UC Santa Cruz.

Peter Bailis’ thesis is called Coordination Avoidance in Distributed Databases, and it’s a timely tour de force of fertile ideas found in what many considered a picked-over wasteland—transaction processing. Peter’s thesis includes a range of big ideas married to practical observations, including:

  • An empirical level-set on the costs of coordination in modern distributed databases.
  • The notion of Invariant Confluence, which attacks the distributed database problem of consistency without coordination by taking Church-Rosser graphs and applying them to databases with invariants (a toy illustration of the intuition follows this list).
  • An analysis of Invariant Confluence in the wild, via mining Github repos with Ruby on Rails apps to determine how “real” programmers trade off application constraints and database constraints. Not only did Peter do the legwork here to understand what programmers do, he brought it home to force us all to ask the questions of why they do what they do, and how the push and pull of technical communities can lead to better outcomes.
  • A new and very sensible (if you’re into that kind of thing) weak isolation level for transactions called Read Atomic, with a range of efficient implementations for distributed systems via RAMP protocols.
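
For a flavor of the Invariant Confluence intuition, here is a toy example of my own (not drawn from the thesis): concurrent deposits to an account preserve a “never overdraw” invariant no matter how the replicas’ effects are merged, while concurrent withdrawals can each look safe locally and still violate the invariant once merged, which is exactly where coordination is needed.

```python
# Toy illustration of the Invariant Confluence intuition (not from the thesis):
# which operations can replicas apply without coordinating?

def invariant(balance):
    return balance >= 0          # e.g., no overdrafts

def merged(initial, deltas):
    """Effect of applying every replica's concurrent operations."""
    return initial + sum(deltas)

initial = 100

# Concurrent deposits: every local state and the merged state satisfy the
# invariant, so no coordination is needed.
deposits = [+30, +50]
assert all(invariant(initial + d) for d in deposits)
assert invariant(merged(initial, deposits))

# Concurrent withdrawals: each replica thinks it is safe locally...
withdrawals = [-80, -80]
assert all(invariant(initial + w) for w in withdrawals)
# ...but the merged outcome overdraws the account. These operations are not
# invariant confluent with respect to this invariant; they need coordination.
print("merged balance:", merged(initial, withdrawals))      # -60
assert not invariant(merged(initial, withdrawals))
```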

Peter B’s thesis also leaves out a range of important work he did at Berkeley, including the popular PBS statistical-empirical explanation of why NoSQL stores seem to work, his bolt-on causal consistency work and analysis, the initial design of the Velox model-serving system with colleagues in the AMPLab, and his popular shaming of the SQL transaction world by exposing how few SQL systems provide ACID transactions by default (or at all). I remember with gratitude how Peter took on half the work of teaching graduate databases at Berkeley (the first offering in years!) while I was deeply involved in running Trifacta. And finally, it has been a bracing dose of research and academic politics having him join in the latest edition of Readings in Database Systems; he did it with grace and intelligence.

Without question the best part of teaching at Berkeley is the students you get to work with. Peter & Peter: it has been a great pleasure. I suspect that being colleagues could be even more fun. Good to have you both still in town!

A major source of frustration in distributed programming is that contemporary software tools—think compilers and debuggers—have little to say about the really tricky bugs that distributed systems developers face. Sure, compilers can find type and memory errors, and debuggers can single-step you through sequential code snippets. But how do they help with distributed systems issues? In some sense, they don’t help at all with the stuff that matters—things like:

  • Concurrency: Does your code have bugs due to race conditions?  Don’t forget that a distributed system is a parallel system!
  • Consistency: Are there potential consistency errors in your program due to replicated state? Can you get undesirable non-deterministic outcomes based on network delays?  What about the potential for the awful “split-brain” scenario where the state of multiple machines gets irrevocably out of sync?
  • Coordination performance: Do you have performance issues due to overly-aggressive coordination or locking? Can you avoid expensive coordination without incurring bugs like the ones above?

These questions are especially tricky if you use services or libraries, where you don’t necessarily know how state and communication are managed.  What code can you trust, and what about that code do you need to know to trust it?
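
To make the consistency questions above concrete, here is a toy sketch of my own (not from Peter’s tools): two replicas that receive the same pair of updates in different orders silently diverge when the updates do not commute, while commutative updates converge regardless of delivery order.

```python
# Toy illustration: order-sensitive updates make replicas diverge;
# commutative updates do not. (Illustration only.)

def apply_all(initial, ops):
    state = initial
    for op in ops:
        state = op(state)
    return state

# Two updates to a replicated value that do NOT commute:
set_to_10 = lambda s: 10
double    = lambda s: s * 2

replica_a = apply_all(0, [set_to_10, double])   # "set" delivered first -> 20
replica_b = apply_all(0, [double, set_to_10])   # "double" delivered first -> 10
print(replica_a, replica_b)                     # 20 != 10: the replicas diverged

# Commutative updates (pure increments) converge in any delivery order:
inc3 = lambda s: s + 3
inc5 = lambda s: s + 5
assert apply_all(0, [inc3, inc5]) == apply_all(0, [inc5, inc3])   # both are 8
```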

Peter Alvaro has been doing groundbreaking work in this space, and recently started lifting the veil on his results. This is a big deal.


We just finished writing up an overview of our most recent thinking about distributed consistency. The paper is entitled Consistency Without Borders, and it’s going to appear in the ACM SoCC conference next month in Silicon Valley.

It starts with two things we all know:

  1. Strong distributed consistency can be expensive and dangerous. (My favorite exposition: the LADIS ’08 conference writeup by Birman, Chockler and van Renesse. See especially the quotes from James Hamilton and Randy Shoup. And note that recent work like Spanner changes little: throughput of 10’s to 100’s of updates per second is only useful at the fringes.)
  2. Managing coordination in application logic is fraught with software engineering peril: you have to spec, build, test and maintain special-case, cross-stack distributed reasoning over time. Here be dragons.

The point of the paper is to try to reorient the community to explore the design space in between these extremes. Distributed consistency is one of the biggest CS problems of our day, and the technical community is spending way too much of its energy at these two ends of the design space.

We’ll be curious to hear feedback here, and at the conference.

The big news around here today is the public announcement of Trifacta, a company I’ve been quietly cooking over the last few months with colleagues Jeff Heer and Sean Kandel of Stanford. Trifacta is taking on an important and satisfying challenge: to build a new generation of user-centric data management software that is beautiful, powerful, and eminently useful.

Before I talk more about the background let me say this: We Are Hiring. We’re looking for people with passion and talent in Interaction Design, Data Visualization, Databases, Distributed Systems, Languages, and Machine Learning. We’re looking for folks who want to reach across specialties, and work together to build integrated, rich, and deeply satisfying software. We’ve got top-shelf funding and a sun-soaked office in the heart of SOMA in San Francisco, and we’re building a company with clear, tangible value. It’s early days and the fun is ahead. If you ever considered joining a data startup, this is the one. Get in touch.


When the folks at ACM SIGMOD asked me to be a guest blogger this month, I figured I should highlight the most community-facing work I’m involved with. So I wrote up a discussion of MADlib, and the fact that this open-source in-database analytics library is now open to community contributions. (A bunch of us recently wrote a paper on the design and use of MADlib, which made my writing job a bit easier.) I’m optimistic about MADlib closing a gap between algorithm researchers and working data scientists, using familiar SQL as a vector for adoption on both fronts.


If you follow this blog, you know that my BOOM group has spent a lot of time in the past couple years formalizing eventual consistency (EC) for distributed programs, via the CALM theorem and practical tools for analyzing Bloom programs.

In recent months, my student Peter Bailis and his teammate Shivaram Venkataraman took a different tack on the whole EC analysis problem which they call PBS: Probabilistically Bounded Staleness. The results are interesting, and extremely relevant to current practice.  (See, for example, the very nice blog post by folks at DataStax).

Many people today deal with EC in the specific context of replica consistency, particularly in distributed NoSQL-style Key-Value Stores (KVSs). It is typical to configure these stores with so-called “partial” quorum replication, to get a comfortable mix of low latency with reasonable availability. The term “partial” signifies that you are not guaranteed consistency of writes by these configurations — at best they guarantee a form of eventual consistency of final writes, but readers may well read stale data along the way. Lots of people are deploying these configurations in the field, but there’s little information on how often the approach messes up, and how badly.

Jumping off from earlier theoretical work on probabilistic quorum systems, Peter and Shivaram answered two natural questions about how these systems should perform in current practice:

  1. How many versions ago?  On expectation, if you do a read in a partial-quorum KVS, how many versions behind are you? Peter and Shivaram answer this one definitively, via a closed-form mathematical analysis.
  2. How stale on the (wall-)clock?  On expectation, if you do a read in a partial-quorum KVS, how out-of-date will your version be in terms of wall-clock time? Answering this one requires modeling a read/write workload in wall-clock time, as well as system parameters like replica propagation (“anti-entropy”). Peter and Shivaram address this with a Monte Carlo model, and run the model with parameters grounded in real-world performance numbers generously provided by two of our most excellent colleagues: Alex Feinberg at LinkedIn and Coda Hale at Yammer (both of whom also guest-lectured in my Programming the Cloud course last fall.)  Peter and Shivaram validated their models in practice using Cassandra, a widely-used KVS.

On the whole, PBS shows that being sloppy about consistency doesn’t bite you often or badly — especially if you’re in a single datacenter and you use SSDs. But things get more complex with magnetic disks, garbage collection delays (grr), and wide-area replication.
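
For intuition on question 2 above, here is a heavily simplified Monte Carlo sketch of my own, not the actual PBS model, and with made-up parameters: a write lands on one of N replicas immediately and reaches the others after random anti-entropy delays; a read hits one randomly chosen replica some time later, and we count how often it misses the latest write.

```python
# Grossly simplified Monte Carlo sketch in the spirit of PBS (not the real model;
# parameters are invented). How often does a read miss the latest write?
import random

def stale_read_probability(n_replicas=3, read_delay_ms=10.0,
                           mean_antientropy_ms=20.0, trials=100_000):
    stale = 0
    for _ in range(trials):
        # The write lands on one replica immediately; each other replica
        # receives it after an exponentially distributed anti-entropy delay.
        arrival = [0.0] + [random.expovariate(1.0 / mean_antientropy_ms)
                           for _ in range(n_replicas - 1)]
        # Read one randomly chosen replica `read_delay_ms` after the write.
        if arrival[random.randrange(n_replicas)] > read_delay_ms:
            stale += 1
    return stale / trials

for delay in (1, 10, 50, 200):
    print(f"read {delay:>3} ms after write: "
          f"P(stale) ~ {stale_read_probability(read_delay_ms=delay):.3f}")
```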

Interested in more detail?  You can check out two things:

Hadoop is not healthy for children and other living things.

I sat at Berkeley CS faculty lunch this past week with Brian Harvey and Dan Garcia, two guys who think hard about teaching computing to undergraduates. I was waxing philosophical about how we need to get data-centric thinking embedded deep into the initial CS courses—not just as an application of traditional programming, but as a key frame of reference for how to think about computing per se.

Dan pointed out that he and Brian and others took steps in this direction years ago at Berkeley, by introducing MapReduce and Hadoop in our initial 61A course.  I have argued elsewhere that this is a Good Thing, because it gets people used to the kind of disorderly thinking needed for scaling distributed and data-centric systems.

But as a matter of both pedagogy and system design, I have begun to think that Google’s MapReduce model is not healthy for beginning students.  The basic issue is that Google’s narrow MapReduce API conflates logical semantics (define a function over all items in a collection) with an expensive physical implementation (utilize a parallel barrier). As it happens, many common cluster-wide operations over a collection of items do not require a barrier even though they may require all-to-all communication.  But there’s no way to tell the API whether a particular Reduce method has that property, so the runtime always does the most expensive thing imaginable in distributed coordination: global synchronization.

From an architectural point of view, a good language for parallelism should expose pipelining, and MapReduce hides it. Brian suggested I expand on this point somewhere so people could talk about it.  So here we go.
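
To illustrate the point with a toy sketch (mine, not Google’s API or Hadoop’s): a barrier-style reduce cannot emit anything until every mapper output has arrived, while an algebraic aggregate like a sum can fold inputs into a running partial and stream results downstream as they trickle in.

```python
# Toy contrast: barrier-style reduce vs. pipelined aggregation. (Illustration only.)

def mapper_outputs():
    """Stand-in for mapper results arriving one at a time over the network."""
    for value in [4, 8, 15, 16, 23, 42]:
        yield value

def barrier_reduce(inputs):
    """MapReduce-style: wait for *all* inputs (a global barrier), then reduce."""
    collected = list(inputs)    # nothing downstream can start until this finishes
    return sum(collected)

def pipelined_sum(inputs):
    """Same logical result, but emits a running partial after each arrival,
    so downstream operators could start consuming before the last input lands."""
    total = 0
    for value in inputs:
        total += value
        yield total             # partial results flow downstream immediately

print("barrier result:", barrier_reduce(mapper_outputs()))
print("pipelined partials:", list(pipelined_sum(mapper_outputs())))
```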


MADlib is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha to its first beta release, version 0.20beta. Hats off to the MADlib team!

Forget your previous associations with low-tech SQL analytics, including so-called “business intelligence”, “OLAP”, “data cubes” and the like. This is the real deal: statistical and machine learning methods running at scale within the database, massively parallel, close to the data. Much of the code is written in SQL (a language that doesn’t get enough credit as a basis for parallel statistics), with key extensions in C/C++ for performance, and the occasional Python glue code. The suite of methods in the beta includes:

  • standard statistical methods like multi-variate linear and logistic regressions,
  • supervised learning methods including support-vector machines, naive Bayes, and decision trees
  • unsupervised methods including k-means clustering, association rules and Latent Dirichlet Allocation
  • descriptive statistics and data profiling, including one-pass Flajolet-Martin and CountMin sketch methods (my personal contributions to the library) to compute distinct counts, range-counts, quantiles, various types of histograms, and frequent-value identification (a toy rendition of the Flajolet-Martin idea follows this list)
  • statistical support routines including an efficient sparse vector library and array operations, and conjugate gradient optimization.
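
Here is that toy rendition of the Flajolet-Martin idea in Python, an illustration of the technique only and not MADlib’s SQL/C++ implementation: hash each item, track the largest count of trailing zero bits seen, and use it to estimate the number of distinct values in a single pass.

```python
# Toy one-pass Flajolet-Martin distinct-count estimate.
# Illustrates the idea only; not MADlib's implementation.
import hashlib

def trailing_zeros(x: int) -> int:
    return (x & -x).bit_length() - 1 if x else 32

def fm_distinct_estimate(items, num_hashes=64):
    """Average the max-trailing-zeros statistic over several salted hashes."""
    max_tz = [0] * num_hashes
    for item in items:
        for salt in range(num_hashes):
            digest = hashlib.md5(f"{salt}:{item}".encode()).digest()
            value = int.from_bytes(digest[:4], "big")
            max_tz[salt] = max(max_tz[salt], trailing_zeros(value))
    mean_r = sum(max_tz) / num_hashes
    return (2 ** mean_r) / 0.77351        # Flajolet-Martin correction factor

data = [f"user{i % 1000}" for i in range(10_000)]   # 1,000 distinct values
print("estimated distinct values:", round(fm_distinct_estimate(data)))
```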

More methods are planned for future releases.  Myself, I’m working with Daisy Wang on merging her SQL-based Conditional Random Fields and Bayesian inference implementations into the library for an upcoming release, to support sophisticated text processing.


The recent July 2011 issue of Communications of the ACM includes our article on the technical aspects of the search for Jim Gray’s boat Tenacious. This was a hard article to write, for both technical and personal reasons. It took far too long to finish, so at some point it was time to just pack it in (at which point the CACM folks informed us it had to be cut in length by half, which delayed things further; the longer version is up as a Berkeley tech report).

Meanwhile, some of the experience is even more relevant to current technology trends than it was 4 years ago, so hopefully folks interested in social computing, software engineering, image processing, crisis response, and other related areas will find something of use in there.

For those of you whose work is represented (or underrepresented) by the article, my apologies for its shortcomings. I still don’t have the full picture of what happened—nobody does, really. As a result, I decided not to use the personal names of volunteers, to avoid attributing credit unevenly. I know the result seems oddly impersonal. Setting the tone of the article was as hard as capturing the content.

Meanwhile, I encourage you to add corrections and perspective to the article in the comment box at the end of the CACM link above. Comments are welcome here too, but they might not get as well-viewed or -archived.