
Cross-posted from the Berkeley RISELab Blog; comments should go there…

A key aspect of the RISELab agenda is to aggressively harness data—lots of it, both historical and live. Of course bits in computers don’t provide value on their own. We need a broader context for data: where it came from, what it represents, and how it gets used. Traditionally, people called this metadata: the data about our data.

Requirements for metadata have changed drastically in recent years in response to technology trends. There’s an emerging groundswell to address these new requirements and explore new opportunities. This includes our work on the broader notion of data context in the Ground system.

Metadata Megafail!

How should data-driven organizations respond to these changing requirements? In the tradition of Berkeley advice like how to build a bad research center and how to give a bad talk, let me offer some pointedly lousy ideas for a future-facing metadata strategy. Hopefully by reading this and doing the opposite, you’ll be walking the path toward a healthy metadata future.

Without further ado, here are 3 easy steps to Megafail with Metadata.


3 Steps to Metadata Megafail

Step 1. Metadata First

Reject any data that doesn’t come well-prepared with metadata in advance.

There’s a famous old slogan in computing: “Garbage In, Garbage Out”, a.k.a. #GIGO. What a great slogan! Wikipedia dates it back to 1957, and like many ideas from back then, it’s time to bring it back.

How can you tell if you have garbage data coming in? It breaks the rules on data format, content or sourcing. That means it’s critical to formalize those rules in advance of loading any data, a strategy I like to call #MetadataFirst.

What’s that you say? 99% of your data arrives without meaningful metadata? My goodness—stay vigilant! If you have data with any whiff of garbage, then by all means throw that data away.
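For a flavor of what this discipline looks like in code, here’s a minimal sketch (in Python; the required fields are hypothetical, not any standard) of a #MetadataFirst gatekeeper that silently discards anything arriving without fully formed metadata:

```python
# A #MetadataFirst gatekeeper: any record arriving without complete,
# well-formed metadata is garbage, and garbage gets thrown away.
# (The required fields below are hypothetical; pick your own rules.)

REQUIRED_METADATA = {"source", "schema", "collected_at", "owner"}

def ingest(records):
    """Yield only records whose metadata was formalized in advance."""
    for record in records:
        metadata = record.get("metadata", {})
        if REQUIRED_METADATA <= metadata.keys():
            yield record
        # else: the merest whiff of garbage -- silently discard

incoming = [
    {"metadata": {"source": "sensor-7", "schema": "v2",
                  "collected_at": "2017-03-01", "owner": "ops"},
     "payload": "42"},
    {"payload": "who knows where this came from"},  # no metadata: rejected
]
print(f"kept {len(list(ingest(incoming)))} of {len(incoming)} records")  # kept 1 of 2
```

Run on the sample records, it keeps one of two. Apply that standard to the 99% of real-world data that arrives without metadata, and the megafail takes care of itself.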

Now you might say that it’s a lot cheaper to store data now than it was in 1957. So why not keep data around even if it’s dirty and poorly described? And you might say that analysts armed with AI-assisted data wrangling and analytics techniques can extract valuable signals from noisy raw data. So why not “figure out” useful metadata over time?

Sure, you might say those things. But you shouldn’t! Because really… can you guarantee that those “signals” you’re extracting are true? You can’t! There are no airtight guarantees to be had unless they were enforced at the time of data collection. Because if it was garbage coming in, well … you know the saying.

#GIGO #MetadataFirst!

Step 2. Lock It Up, Lock It In

Deploy a metadata system that is as inflexible and proprietary as possible.

As we know, metadata is a critical asset: the key to enabling people to discover relevant data, assess its content, and figure out how to make use of it. A critical asset like metadata should not be left lying around. It should be locked away in a safe place, where it is difficult to access. I call that place #MetaJail.

You’re probably wondering how you can make a great metajail—one that renders your metadata really inaccessible. I’m here to help.

To begin, a good metajail should be prescriptive: it should impose a complex model that all metadata must conform to. This ensures that a diverse range of organizations—from biolabs to banks to bureaucracies—can all have an equally difficult time coercing their metadata into the provided model.

A great metajail is not only prescriptive, it’s proprietary. Vendor-specific solutions ensure both that metadata is locked up, and that the larger organization is locked in to the vendor as well. Even better is to put your metajail in a proprietary smokestack. Choose a vendor whose metajail only works with their other components: prep, ingest, storage, analytics, charting, etc. This is like a MetaPenalColony!

I hope it goes without saying that you should be the warden of the metajail. Wherever possible, require manual approvals for authorization. Then people who need to use data will have to go through you to get access. This maximizes your control over metadata, and hence your control over data, and hence your control over people! All of which increases your job security and keeps change at bay.

#MetaJail for the #MegaFail!

Step 3. One Truth

Ensure that the metadata enforces a single interpretation of data.

Traditional metadata solutions are often characterized as the Single Source of Truth for an organization. This is another great slogan from the past. #SSOT!

I like everything about this slogan. “SS” gets things started on an aggressive note, emphasizing central control in a single source (#MetaJail!) and rejection of any unapproved sources (#GIGO!).

But this step is not just a rehash of the previous two; it adds the final T: “Truth”. Now there’s a word that we don’t hear enough of in modern computing. The terrific thing about managing the Truth is that we get to define it.

Remember, organizations are trying to do more and more with data, across an increasing number of use cases. Resist this trend! It is mayhem in the making. Consider a humble log file from a webserver. The Marketing department uses it to optimize the website layout. The IT department wants to use it to understand and predict server downtimes. The Online Services department wants to use it to build a product recommender system. Each one of these groups wants to extract different features and schemas from the log file, resulting in different metadata.
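To see how quickly “Truth” fractures, here’s a small sketch (in Python; all field choices are hypothetical) showing the same Apache-style log line extracted three different ways:

```python
import re

# One Apache-style log line, three departmental interpretations.
# All field choices below are hypothetical, for illustration only.
LINE = ('10.0.0.1 - alice [01/Mar/2017:10:15:32 -0800] '
        '"GET /products/42 HTTP/1.1" 500 2326')

PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)')
m = PATTERN.match(LINE).groupdict()

# Marketing: clickstream features for layout optimization
marketing = {"page": m["path"], "visitor": m["ip"]}

# IT: error signals for downtime prediction
it_ops = {"time": m["time"], "is_error": int(m["status"]) >= 500}

# Online Services: (user, item) pairs for a recommender
recsys = {"user": m["user"], "item": m["path"].rsplit("/", 1)[-1]}

print(marketing, it_ops, recsys, sep="\n")
```

Three extractions, three schemas, three candidate “Truths” for the same bytes. Under #SSOT, two of them must be branded Falsehoods.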

As you may know, Chief Marketing Officers supposedly spend more on IT than CIOs. (Which means, one assumes, that Marketing should control the Truth for your organization.) Hence in our example, the metadata describing the web logs should be defined by the Marketing department, and the log files should be transformed to suit that use case. Other representations of that information should be rejected as metadata Falsehoods.

So mandate a #SSOT, and manage Truth. Ask yourself: are you #SSoTOrNot?


Better Advice

At the risk of stating the obvious … all of the above steps are terrible ideas. But sometimes it’s good to rule some things out before we decide what we should do.

Want to know more about our serious ideas for metadata management? Have a look at our initial paper on Ground from CIDR 2017. As a brief excerpt, here are some design requirements for evaluating a modern metadata or data context system. It should be:

    1. Model-Agnostic. For broad adoption and usage, a data context system cannot impose opinions on metadata modeling.
    2. Relativistic. It should be possible to record multiple interpretations of the same data, allowing users and applications to agree to disagree.
    3. Politically Neutral. The system should not be controlled by a single vendor or interest group. Like Internet protocols, data context is a “narrow waist”. To succeed, it must interoperate with a wide range of services and systems from third parties, some of which have yet to be born.
    4. Immutable. Data context must be immutable and versioned; updating stored context in place is tantamount to erasing history. (See the sketch after this list.)
    5. Scalable. It is a common misconception that metadata is small. In many settings, the data context is far larger than the data itself.
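To make requirements 2 and 4 a bit more concrete, here is a minimal sketch of an append-only context store. To be clear, this is a hypothetical illustration in Python, not Ground’s actual API:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass(frozen=True)
class ContextVersion:
    """One immutable, versioned interpretation of a dataset."""
    dataset: str
    author: str       # who holds this interpretation
    version: int
    metadata: Dict[str, Any]

class ContextStore:
    """Append-only: interpretations coexist; nothing is updated in place."""

    def __init__(self):
        self._log: List[ContextVersion] = []

    def record(self, dataset: str, author: str, metadata: dict) -> ContextVersion:
        version = sum(1 for v in self._log
                      if v.dataset == dataset and v.author == author)
        cv = ContextVersion(dataset, author, version, dict(metadata))
        self._log.append(cv)  # history is appended, never erased
        return cv

    def interpretations(self, dataset: str) -> List[ContextVersion]:
        """Every recorded view of a dataset: users agree to disagree."""
        return [v for v in self._log if v.dataset == dataset]

store = ContextStore()
store.record("weblogs", "marketing", {"fields": ["page", "visitor"]})
store.record("weblogs", "it", {"fields": ["time", "is_error"]})
store.record("weblogs", "it", {"fields": ["time", "is_error", "status"]})  # v1; v0 survives
print(len(store.interpretations("weblogs")))  # 3
```

Every new interpretation or revision appends a fresh version; nothing is ever overwritten, so history (and disagreement) is preserved.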

Ground’s design is based on collaborations with colleagues from the field including Awake Networks, Capital One, Cloudera, Dataguise, LinkedIn, SkyHigh Networks and Trifacta. Slides from the conference talk are available as well.