TL;DR – A core principle of systems engineering and software engineering isn’t readily applied (or even recognized!) among people who build, operate, and maintain enterprise data ecosystems… leading us to inadvertently make fragile systems more fragile, and confusing data flows even more confusing.

The Holy Grail is a legendary artifact in Christian mythology. Typically imagined to be the cup used by Jesus at the Last Supper, it is said to possess miraculous powers, and represents an ultimate object of achievement or desire. Over the centuries, it’s been the subject of numerous literary works, from medieval romances to modern novels. King Arthur launched a quest to find it… so did Indiana Jones. The Holy Grail has enduring appeal as a symbol of the ultimate treasure, the solution to the most vexing of problems.
And from what I’ve observed, there’s a Holy Grail in enterprise data management. It’s elusive in a strange way: it sits within reach of nearly everyone, yet it shimmers in and out of existence and is only rarely spotted, even in the most mature organizations. It’s separation of concerns (SoC). (I talked about this in my presentation at the MIT Chief Data Officer conference last week, illustrating how quality and triage information can “pop” from the business perspective when your underlying system adheres to SoC.)
Separation of concerns is an essential principle in software and systems engineering that promotes modular design. The idea is to break a program into distinct sections that each address a specific concern or function. The modules should be loosely coupled (to reduce the risk that a change made in one component, like a schema or a UI, creates unexpected changes in other components) and highly cohesive (related functions are grouped together).
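To make “loosely coupled, highly cohesive” concrete, here’s a minimal Python sketch (all names are hypothetical, not from any particular system). The two modules share only a small record contract, so a vendor renaming a source column touches exactly one function:

```python
from dataclasses import dataclass

# The contract between modules: the only thing they share.
@dataclass
class CustomerRecord:
    customer_id: str
    signup_date: str  # ISO 8601

def ingest(raw_row: dict) -> CustomerRecord:
    """Cohesive: everything about parsing the source lives here.
    If the vendor renames a column, only this function changes."""
    return CustomerRecord(
        customer_id=raw_row["cust_id"],
        signup_date=raw_row["created_at"],
    )

def report(records: list[CustomerRecord]) -> int:
    """Loosely coupled: depends on CustomerRecord, not on the
    vendor's column names or the ingestion logic."""
    return len({r.customer_id for r in records})

rows = [{"cust_id": "A1", "created_at": "2024-01-01"}]
print(report([ingest(r) for r in rows]))  # -> 1
```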
Every time (except one) that I’ve asked a data team how they’re implementing SoC, I get blank stares. Then I ask “where do you keep your raw data?” and “where do you keep your clean data?” and “how do you know the difference?” and you can see the gears start to turn when people realize that they pretty much don’t distinguish what’s essential from what’s trash.
While it’s table stakes in software and network architecture, the Frankenstein systems we’ve evolved for managing data and producing analytics tend to mash together all the steps in a data flow (in varied and very creative ways). Products that manage data delivery to the end user, like Tableau and Power BI, often embed scripts that acquire, transform, clean, test, and produce new information… instead of just packaging up data and information so that users can slice, dice, and visualize it. Our data ecosystems are often tightly coupled and not cohesive at all.
Without separation of concerns, we lose two critical things: 1) the ability to reuse raw or clean data to generate new information that becomes data products, which leads to tremendous duplication of effort across a company (especially over time), and 2) the ability to collect diagnostics that quickly illuminate the root cause of each failure by pinpointing the stage at which it occurs. By applying separation of concerns, we can look independently at each stage (sketched in code right after this list):
- Arrival/acquisition tasks generate raw data
- Integration tasks structure, clean, and transform raw data into clean data
- Analysis and processing tasks use clean data to produce entirely new information (data products)
- Clean data and data products are delivered to end users in a multitude of ways, using a variety of styles (reports, dashboards, UIs)
- Stakeholders engage with those delivery mechanisms to generate insights and business value
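Here’s one hedged sketch of what those stages can look like when they’re kept as separate modules with separate storage. The paths and function names are illustrative, not a prescription:

```python
import json
from pathlib import Path

RAW = Path("data/raw")            # arrival/acquisition writes here, nothing else does
CLEAN = Path("data/clean")        # integration writes here
PRODUCTS = Path("data/products")  # analysis writes here

def acquire(source_rows: list[dict]) -> Path:
    """Arrival/acquisition: land the data exactly as received."""
    RAW.mkdir(parents=True, exist_ok=True)
    out = RAW / "orders.json"
    out.write_text(json.dumps(source_rows))
    return out

def integrate(raw_path: Path) -> Path:
    """Integration: structure, clean, and transform raw -> clean."""
    rows = json.loads(raw_path.read_text())
    cleaned = [r for r in rows if r.get("amount") is not None]
    CLEAN.mkdir(parents=True, exist_ok=True)
    out = CLEAN / "orders.json"
    out.write_text(json.dumps(cleaned))
    return out

def analyze(clean_path: Path) -> Path:
    """Analysis/processing: produce new information (a data product)."""
    rows = json.loads(clean_path.read_text())
    product = {"total_revenue": sum(r["amount"] for r in rows)}
    PRODUCTS.mkdir(parents=True, exist_ok=True)
    out = PRODUCTS / "revenue.json"
    out.write_text(json.dumps(product))
    return out

# Delivery and stakeholder engagement read from CLEAN and PRODUCTS; they never write.
analyze(integrate(acquire([{"amount": 10}, {"amount": None}])))
```

Because raw and clean live in distinct places, the “where do you keep your raw data?” question has a one-line answer, and any new data product can start from CLEAN instead of re-acquiring and re-cleaning from scratch.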
Are your pipelines failing? The way you respond will be totally different (and require a lot less effort) if you can see that the failing pipelines share a common task or element. Why is one business user happy with the data and another totally unhappy with the same data? Being able to look at diagnostics through the lens of what’s important to each person can illuminate the reason.
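To show what that diagnostic lens can look like in practice, here’s an illustrative wrapper (not any specific tool) that tags every failure with the stage it came from:

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)

def run_stage(name: str, fn: Callable, payload):
    """Run one pipeline stage and tag any failure with the stage name,
    so a broken pipeline points straight at the responsible layer."""
    try:
        result = fn(payload)
        logging.info("stage=%s status=ok", name)
        return result
    except Exception:
        logging.exception("stage=%s status=failed", name)
        raise

data = run_stage("acquisition", lambda p: p, [{"amount": 10}])
data = run_stage("integration", lambda rows: [r for r in rows if r["amount"]], data)
```

When a run fails, the log already says which stage broke, so triage starts at the right layer instead of with a line-by-line autopsy.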
Postscript: Someone asked me “who was the ONE? what was unique about THEM?” – and the answer is, they were a huge fan of Delta Lake (which comes with a built-in SoC conceptual model) and the Cube semantic layer (which does too). They were specifically looking for ways to keep the layers in their data ecosystem distinct.
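For readers who haven’t seen that model: Delta Lake is typically organized in medallion layers (bronze = raw, silver = clean, gold = data products), which bakes the separation in. Here’s a minimal PySpark sketch, assuming the delta-spark package is installed; the paths and column names are illustrative, not from any real system:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Configure a local Spark session with Delta Lake support.
builder = (
    SparkSession.builder.appName("medallion-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Bronze: raw data, landed exactly as it arrived.
raw = spark.createDataFrame([("A1", 10), ("A2", None)], ["cust_id", "amount"])
raw.write.format("delta").mode("overwrite").save("/tmp/lake/bronze/orders")

# Silver: cleaned and conformed -- a separate table, not an overwrite of bronze.
bronze = spark.read.format("delta").load("/tmp/lake/bronze/orders")
bronze.dropna(subset=["amount"]) \
    .write.format("delta").mode("overwrite").save("/tmp/lake/silver/orders")

# Gold: a data product derived only from clean (silver) data.
silver = spark.read.format("delta").load("/tmp/lake/silver/orders")
silver.agg(F.sum("amount").alias("revenue")) \
    .write.format("delta").mode("overwrite").save("/tmp/lake/gold/revenue")
```

Each layer is a distinct Delta table, so the “where do you keep your raw data?” question from earlier has a built-in answer.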