The Brittle Nature of Data Warehouses

What factors make data warehouses so brittle?
In my prior blog post, I summarized the organizational pressures that put stress on data warehouse ecosystems. In this post, I summarize some of the predominant technical and architectural problems that plague DW ecosystems.
I think you’ll agree that, for many reasons and by many measures, data warehouses haven’t fully delivered on their promise. Let’s examine the four main issues that traditional data warehouse ecosystems have struggled with.
Scope of Data Replication
Most DW ecosystems are built to move massive batches of data from source systems through some staging architecture to a series of databases.
Out of all the data moved about, how much will actually be used to make business decisions? Usually, only a small percentage of the data are ever used.
So why bother? The total cost of ownership (TCO) for extracting, copying, converting, transferring, transforming, integrating, propagating, backing up, loading, and verifying the data skyrockets far beyond the value of the data, and it injects significant risk and brittleness into the entire ecosystem.
Why not load only those transactions that have meaningful data that will actually be used to make real business decisions?
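As a minimal sketch of that idea, assuming we already know which fields downstream decisions rely on, a load step could keep only the records that carry those fields and drop everything else. The field names and the `select_for_load` helper are hypothetical, chosen only for illustration.

```python
# Hypothetical set of fields that business reports actually consume.
FIELDS_IN_USE = {"order_id", "customer_id", "amount"}


def select_for_load(records, fields_in_use=FIELDS_IN_USE):
    """Keep records that have every in-use field, trimmed to just those fields."""
    selected = []
    for record in records:
        # Skip records that are missing any field a decision depends on.
        if all(record.get(field) is not None for field in fields_in_use):
            # Drop source-only columns the warehouse never reads.
            selected.append({field: record[field] for field in fields_in_use})
    return selected


source_batch = [
    {"order_id": 1, "customer_id": "C1", "amount": 25.0, "internal_flag": "x"},
    {"order_id": 2, "customer_id": None, "amount": 10.0, "internal_flag": "y"},
]
loaded = select_for_load(source_batch)
```

Here only the first record would be loaded, and without its `internal_flag` column. Knowing which fields belong in `FIELDS_IN_USE` is the hard part in practice; the sketch simply assumes that list exists.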
Too Many “Moving Parts.”
There are myriad locations within a typical DW ecosystem where mistakes are made, inaccurate data are created, errors occur and defects are injected, such as:
- ETL transforms
- Metadata repositories (rarely integrated or synchronized)
- Data models and schemas
- Data quality and augmentation routines
- Exception handling infrastructure
- Reports (created and used once)
We all know that change is inevitable. Due to the complexity of this architectural model, changes trigger ripple effects that cannot be fully tested or even understood. Obviously, the more moving parts there are, and the more interdependent they are, the more brittle the ecosystem becomes. We must simplify! More on that in future blogs.
Data Quality Management Complexity
Data quality problems arise all over the place within an enterprise. Errors are injected in the strangest places, but the “fixes” can be even stranger, depending on the endpoint systems’ architectures, the skills and abilities of the software engineers, and so on.
One of the biggest problems is that we let the errors occur, then we try to fix them as the data flow into a data warehouse. Isn’t that a little too late? Isn’t this like letting cyanide into the municipal water supply and trying to stop the effects by making ice cubes with filtered water?
A much better time to focus on data quality is at the time of the transaction. So why don’t we do that, instead of checking data quality only as data flow into the DW?
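To make the contrast concrete, here is a small sketch of checking quality at the point of entry, so that a bad record is rejected before it is ever stored. The rules, field names, and the in-memory `store` list are assumptions for illustration, not a description of any particular system.

```python
class ValidationError(ValueError):
    """Raised when a transaction fails its quality checks."""


def validate_order(order):
    """Return a list of problems; an empty list means the order is clean."""
    problems = []
    if not order.get("customer_id"):
        problems.append("customer_id is required")
    amount = order.get("amount")
    if not isinstance(amount, (int, float)) or amount <= 0:
        problems.append("amount must be a positive number")
    return problems


def record_transaction(order, store):
    """Validate first, then persist; a bad order never reaches the store."""
    problems = validate_order(order)
    if problems:
        raise ValidationError("; ".join(problems))
    store.append(order)
    return order


store = []  # stands in for the source system's own database
record_transaction({"customer_id": "C1", "amount": 25.0}, store)
```

Because the check runs inside the transaction path, whatever later flows into the DW has already passed the same rules, and the warehouse load no longer has to guess at repairs.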
Tight Coupling
Traditional thinking has been that, in order to get data from a system, there needs to be intimate knowledge of a system’s internal schema, and that the DW should know about the private rules, keys, structures, constraints, etc. This is no longer true.
A better approach is to hide private details of a source system behind a service. This decreases how much of the ecosystem is affected whenever there is a change to a source system’s schema. Then, we can implement common, standardized schemas for enforcing business rules and ignore the idiosyncrasies of the myriad sources.
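A rough sketch of what such a service might look like follows. The `Customer` class stands in for a common, standardized schema, and `BillingCustomerService` wraps a made-up billing system whose column names (`CUST_NO`, `FNAME`, `LNAME`) are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """Hypothetical shared schema that every consumer depends on."""
    customer_id: str
    full_name: str


class BillingCustomerService:
    """Service over a hypothetical billing system's private row layout."""

    def __init__(self, rows):
        # Private detail: raw rows keyed by the source system's own column names.
        self._rows = rows

    def get_customers(self):
        """Translate private rows into the shared Customer schema."""
        return [
            Customer(
                customer_id=str(row["CUST_NO"]),
                full_name=f'{row["FNAME"]} {row["LNAME"]}',
            )
            for row in self._rows
        ]


service = BillingCustomerService(
    [{"CUST_NO": 7, "FNAME": "Ada", "LNAME": "Lovelace"}]
)
customers = service.get_customers()
```

If the billing system renames `CUST_NO` tomorrow, only `get_customers` has to change. The DW and every other consumer keep reading `Customer` objects and never see the source's private keys or structures.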
There’s still more to explore on the data warehouse front. What has your experience been?