In a previous post, I explored a key difference between data and software. Software shapes the world, and data conveys meaning about it. They are different concerns—with distinct yet interconnected business outcomes—and developers and business leaders alike need to be conscious of the difference. In this post, I want to explore how we can treat these concerns in a unified way. If you have been following SingleStone lately, you won’t be surprised to learn that Domain Driven Design (DDD) is a big part of our thinking about both topics.
DDD for Data
DDD started as a way to align software architecture with business objectives. Through concepts like Bounded Contexts and Ubiquitous Language, DDD provides a framework for making software easier to build and maintain by creating high cohesion and low coupling. If one part of the business changes, then only the corresponding part of the software changes. We’ve found that this works well for software projects because it facilitates everything from high-level strategic planning to naming things.
It’s been less clear how to apply DDD to data projects. Data has to carry meaning across many contexts. DDD has established patterns for communication between specific contexts, but they seem better suited to transactional information flows than to sharing data broadly. Anticorruption layers, for example, aren’t an appealing approach to managing data.
Introducing Data Mesh
Fortunately, Zhamak Dehghani has brought a great deal of clarity to this problem with her Data Mesh architecture. The key idea is that we’ve figured out how to make software—the part that shapes the world—well-aligned with the business. But when it comes to the part that carries meaning about the world, we toss it over the wall to data teams that are separated from the teams that produce and consume the data. Those data teams respond by building one-size-fits-all architectures that treat all data as if it were the same.
This approach to data isn’t obviously wrong. Data has to hold onto meaning independently of producing and consuming contexts, so it seems right to put it into a generic ether that floats around all the more specific Bounded Contexts. The problem is that it rarely works. Dehghani suggests this is because the context-aligned teams are the ones who understand the meaning of the data: communicating that meaning to other teams is hard, and nobody is incentivized to do it with enough fidelity to keep the data usable.
The right solution, according to the Data Mesh philosophy, is to make data one of the products of the domain that produces it. It then becomes the responsibility of teams working in that domain to preserve meaning and satisfy consumers. Because consumers care a lot about meaning, and producers know the most about it, it is much easier to cut out the data-specific teams in the middle. If data engineering expertise is required, then data engineers should be added to the producing or consuming teams.
Standardization and Economies of Scale
That sounds good, but there are a couple of things that appear to be lost: standardization and economies of scale.
Let’s look at standardization first
If we distribute responsibility for making data available to domain-aligned product teams, we gain responsiveness but run the risk that individual teams will shirk the responsibility or arrive at overly divergent solutions. We solve this problem with a unified data governance program, closely tailored to the needs of the enterprise, that coordinates activities and ensures everyone does their part. Rather than enforcing governance after the fact, we “shift it left” through DevSecOps practices that help domain-aligned product teams integrate best practices into their projects from the start. For example, we establish automated pipelines that flag issues and apply preventive controls before changes are pushed to production.
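To make this kind of preventive control concrete, here is a minimal sketch of a pipeline gate that validates a batch of records against a declared data contract before promotion. The field names, the contract itself, and the `gate_deployment` function are all hypothetical illustrations, not part of any particular governance tool.

```python
# Minimal sketch of a "shift-left" governance check (hypothetical names).
# A domain team declares a data contract; the pipeline validates each
# batch against it before changes are promoted to production.

REQUIRED_FIELDS = {"order_id", "customer_id", "order_total"}  # assumed contract
NON_NULL_FIELDS = {"order_id", "customer_id"}

def validate_batch(records):
    """Return a list of governance violations; empty means the batch passes."""
    violations = []
    for i, record in enumerate(records):
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            violations.append(f"row {i}: missing fields {sorted(missing)}")
        nulls = {f for f in NON_NULL_FIELDS if record.get(f) is None}
        if nulls:
            violations.append(f"row {i}: null values in {sorted(nulls)}")
    return violations

def gate_deployment(records):
    """Preventive control: block promotion when the contract is violated."""
    violations = validate_batch(records)
    if violations:
        raise ValueError("governance check failed: " + "; ".join(violations))
    return True
```

The point is that the check runs inside the domain team’s own pipeline, so violations surface before anything reaches consumers rather than in an after-the-fact audit.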
Next, economies of scale
It’s also desirable that we not lose economies of scale by spreading out responsibility. Building and maintaining an enterprise data warehouse or data lake is a lot of effort, so does it really make sense for each domain to have its own “mini-warehouse”? It does. Much of the development burden of centralized warehouses comes from the tight coupling they create among unrelated concerns: one-size-fits-all solutions end up requiring hundreds of alterations to squeeze in oddly shaped use cases. Just as cloud services facilitate microservice architectures in software, they make it straightforward to maintain many small, loosely coupled data systems.
Recent developments in data tooling have made a distributed approach much easier. For example, dbt allows small teams to create reproducible, testable, cloud-native transformation workflows that naturally embed domain knowledge and governance standards. We use it along with complementary tools such as Airflow and Meltano in an Enterprise Data Platform that allows us to rapidly prototype and deploy data products that don’t depend on a slowly-changing centralized warehouse. While one needs to be careful about controlling costs and limiting complexity, a rule of thumb of one or two services per bounded context seems manageable.
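To give a flavor of what such a workflow looks like without reproducing dbt itself, here is an illustrative Python sketch of the kind of small, testable transformation a domain team might own. The `transform_orders` function and the `check_unique`/`check_not_null` helpers are hypothetical stand-ins for a dbt model and its schema tests, not real dbt APIs.

```python
# Illustrative sketch (not dbt itself): a small transformation of the
# kind a domain team might own, with dbt-style "unique" and "not_null"
# assertions kept alongside the logic so tests travel with the model.

def transform_orders(raw_orders):
    """Keep completed orders and derive a revenue column."""
    return [
        {"order_id": o["order_id"],
         "revenue": o["quantity"] * o["unit_price"]}
        for o in raw_orders
        if o.get("status") == "completed"
    ]

def check_unique(rows, column):
    """dbt-style uniqueness test: no duplicate values in the column."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def check_not_null(rows, column):
    """dbt-style not-null test: every row has a value in the column."""
    return all(r.get(column) is not None for r in rows)
```

Because the transformation and its tests live together in the domain team’s repository, the domain knowledge and the governance standards are versioned, reviewed, and deployed as one unit.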
As with software, it all comes down to context
Data is different from software. Its meaning must be preserved beyond the context that created it, and this poses special challenges for development teams. Nevertheless, data and software are unified by the fact that context is key to maximizing value. Domain Driven Design helps us meet these challenges for the same reasons it helps manage complexity in software architecture: the meaning of data comes from the Bounded Context that produces it, and the product teams within that context understand that meaning best. New approaches to data governance and architecture enable these teams to build and maintain data services that best serve the needs of consumers.
If your data warehouse does everything but make data available to teams that need it, we’d like to help.