What is data?
As data practitioners, we get this question a lot. Well, not literally “what is it?”, but rather: what makes a data practice different from normal software development? Why is a Data Engineer different from a Software Engineer? Can we just apply normal software development practices to data projects? As data is an increasingly central component of software applications, do these distinctions even matter? Yes, they do matter. Data is different because, in the concoction of people, things, and software that comprises a modern business, it is what carries meaning.
Most of the software we use nowadays, including “personal” computing applications like word processors, is moving data over networks behind the scenes. It’s increasingly common for these applications to hit all three Vs of “big” data: volume, velocity, and variety. Even before the explosion of generative models this year, many applications included some form of AI. And within businesses, reporting dashboards and concepts like IT/OT convergence are blurring the line between applications and analytics. The ubiquity of data makes it even more important to understand the difference between software concerns and data concerns.
What is the difference between data and application development?
An essential quality of data is that it describes things in the world. It carries meaning beyond its immediate context. This is true in science, where scientists use data to test hypotheses about nature. It is also true in business, where marketers use data to sell products more efficiently. If the data doesn’t contain reliable information, or if we can’t understand and interpret the information, then the data has no value.
In contrast, application development seeks to shape the world. Developers identify problems and build software that solves them by changing how (a small part of) the world works. We prefer to use data to assess whether we have been successful, but ultimately the proof is in the pudding. The value of an application lies in how it intervenes in the real world and shapes outcomes, and this is reflected in how we write user stories during development. For the most part, if the user is happy, we’re happy. It doesn’t matter if the application “breaks the rules.” In some ways, it may be better if it does.
What about the data flowing through applications?
We can think of it as coming in two types: data that stays close to home, and data that is emitted to or ingested from other sources. The difference is closely connected to the concept of Bounded Contexts in Domain-Driven Design. We define Bounded Contexts using a Ubiquitous Language that is internally consistent within the context but may not be consistent with other contexts. We strive for context boundaries that have high internal cohesion and low coupling between contexts.
The philosophy of Domain-Driven Design tells us that we should name things within our software using the natural language of the domain, but the reason to do this is that it makes it easier for humans to reason about the software. The internal representations within a Bounded Context don’t need to be meaningful to anyone on the outside. All that matters is that they are internally consistent, and this includes data. We can use integers or natural language names to keep track of enumerations. The software doesn’t care.
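As a minimal sketch of this point, assume a hypothetical order-handling context (the names OrderStatus and can_ship are invented for illustration and do not come from any real system). Inside the context, the logic behaves the same whether a status is written as a bare integer or as a named member; the name only helps the humans reading the code.

```python
from enum import IntEnum


class OrderStatus(IntEnum):
    """Hypothetical internal enumeration; the labels exist for human readers."""
    NEW = 1
    PAID = 2
    SHIPPED = 3


def can_ship(status: int) -> bool:
    # The software only compares values; it does not care about the label.
    return status == OrderStatus.PAID


print(can_ship(2))                 # bare integer code
print(can_ship(OrderStatus.PAID))  # named member, same result
```

Either representation is fine as long as everything inside the context agrees on it.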
Preserving Meaning and Value
It’s once the data leaves a Bounded Context that meaning becomes fragile. Imagine we have an application in one context that produces data and a second in another context that consumes it. Once the data leaves the first context, its meaning has to be fully self-contained. Otherwise, there would be an unacceptably high level of coupling between the contexts. The value of the data becomes independent of either piece of software. While the relationship between producers and consumers is conceptually simple, making it work in an enterprise setting requires unique expertise.
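One way to picture “fully self-contained” is the sketch below. It assumes a hypothetical producer that keeps order status as an internal integer code and translates it into a self-describing JSON message at the boundary. The schema label, field names, and functions are illustrative assumptions, not a prescribed format.

```python
import json
from enum import IntEnum


class OrderStatus(IntEnum):
    # Hypothetical internal codes, meaningful only inside the producing context.
    NEW = 1
    PAID = 2


def to_event(order_id: str, status: OrderStatus, total_cents: int) -> str:
    """Translate internal state into a message that explains itself."""
    return json.dumps({
        "schema": "order-status-changed/v1",  # assumed naming convention
        "order_id": order_id,
        "status": status.name,                # "PAID", not the opaque integer 2
        "total_cents": total_cents,           # unit is part of the field name
        "currency": "USD",
    })


def consume(message: str) -> str:
    # The consumer interprets the payload without importing OrderStatus.
    event = json.loads(message)
    return f"{event['order_id']} is {event['status']}"


print(consume(to_event("A-17", OrderStatus.PAID, 1999)))  # prints "A-17 is PAID"
```

The consumer never sees the producer's integer codes, so the producer can renumber them without breaking anything downstream. That independence is the low coupling the two contexts need.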
As data practitioners, we need to preserve meaning through accuracy, transparency, and interoperability. We apply our craft to systems of many parts (collection, transformation, storage, transport, metadata, and analysis) that work across contexts. When we build software that uses this data, success or failure depends, as always, on how well it solves users’ business problems. What makes data different is that its value also relies on meaning, and that meaning can become degraded in subtle ways.
Meaningful data isn’t just crucial for data practitioners, but also for business leaders. It enables seamless fulfillment of business requirements and reduces technical debt. It allows your teams to resolve issues faster without having to go through the entire pipeline. It also empowers your employees and makes them more autonomous, while providing you with a deeper understanding of your organization and customers.
Because the producers and consumers of data are the ones who know the most about its meaning and value, we approach data projects using the tools of Domain-Driven Design. We also favor an Enterprise Data Framework based on the concepts of Data Mesh.
If your organization is overwhelmed by data that seems to have lost its meaning—a form of dark data—we’d love to help!