A New Approach: Information Theory Applied to Data Quality for MDM

Information theory may offer a way to measure data quality.
As we discussed in a previous post, data professionals lack a common methodology to measure data quality objectively in terms of scientifically defined metrics and compare data sets across systems, departments and enterprises.
Two existing approaches, the Fit-for-Purpose Approach and the Data Profiling Approach, each have their strengths and weaknesses. However, a third approach based on information theory may be better suited to meet these ongoing challenges and can usefully complement the first two approaches.
The primary function of a Master Data Service (MDS) is to create the “golden view” of the master entities. We will assume that the MDS has been successfully created and maintains the “golden view” of entity F in the data hub. This “golden record” can be dynamic or persistent.
A number of data sources across the enterprise hold data corresponding to domain F. These include the source systems that feed the data hub and other data sources that are not integrated with the data hub.
We will define an external data set f whose data quality is to be quantified with respect to F. For the purpose of this discussion, f can represent any data set, such as a single data source or multiple sources.
Our goal is to compare the source data set f with the entity data set F. The data quality of the data set f will be characterized by how well it represents the benchmark entity F defined as the “golden view” for the data in domain F. We are making an assumption here that the “golden view” was created algorithmically and then validated by the data stewards.
In Information Theory the information quantity associated with the entity F is expressed in terms of the entropy:
H(F) = - ∑ Pk log Pk (1)
where Pk are the probabilities of the attribute (token) values in the “golden” data set F. The index “k” runs over all records in F and all attributes. The base of the log function is 2.
H(F) represents the quantity of information in the “golden” representation of entity F.
Similarly, for the comparison data set f:
H(f) = - ∑ pi log pi (2)
We will use small “p” for the probabilities associated with f while capital letter “P” is used for the probabilities characterizing the “golden” entity record.
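As an illustration of equations (1) and (2), here is a minimal Python sketch. It assumes the probabilities are estimated as relative frequencies of the observed values, which the post does not specify; the function name `entropy` and the sample last names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`.

    Assumption: each probability is estimated as a relative frequency.
    """
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical last-name tokens: golden data set F and a source data set f.
golden_names = ["Smith", "Jones", "Smith", "Brown"]
source_names = ["Smith", "Smith", "Smith", "Brown"]

h_F = entropy(golden_names)  # H(F), equation (1)
h_f = entropy(source_names)  # H(f), equation (2)
print(f"H(F) = {h_F:.3f} bits, H(f) = {h_f:.3f} bits")
```

In this toy sample the source has collapsed “Jones” into “Smith”, so it carries less information than the golden set.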
Mutual entropy J(f,F) characterizes how well f represents F:
J(f,F) = H(f) + H(F) - H(f,F) (3)
In (3), H(f,F) is the joint entropy of f and F. It is expressed in terms of the probabilities of combined events, e.g. the probability that name = “Smith” in the “golden record” F and name = “Schmidt” in the source record linked to the same entity. The behavior of J makes it a good candidate for quantifying the data quality of f with respect to F. When the data quality is low, the correlation between f and F is low. In the extreme case of very low data quality, f does not correlate with F and the two variables are independent. Then
H(f,F) = H(f) + H(F) (4)
and
J(f,F) = 0 (5)
If f represents F extremely well, e.g. f = F, then H(f) = H(F) = H(f,F) and
J(f,F) = H(F) (6)
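The two limiting cases in (4) through (6) can be checked numerically. The sketch below is an illustration, not a reference implementation: it assumes source and golden values are already linked position by position, and it estimates all probabilities, including the joint ones, as relative frequencies. The names `entropy` and `mutual_information` and the sample values are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def mutual_information(source, golden):
    """J(f,F) = H(f) + H(F) - H(f,F), equation (3).

    Assumption: source[i] and golden[i] are linked to the same entity.
    """
    if len(source) != len(golden):
        raise ValueError("source and golden must be linked one-to-one")
    pairs = list(zip(source, golden))  # combined events for H(f,F)
    return entropy(source) + entropy(golden) - entropy(pairs)

golden = ["Smith", "Jones", "Smith", "Jones"]
identical = list(golden)            # f = F
unrelated = ["A", "A", "B", "B"]    # f statistically independent of F

j_identical = mutual_information(identical, golden)  # equals H(F), eq. (6)
j_unrelated = mutual_information(unrelated, golden)  # equals 0, eq. (5)
print(j_identical, j_unrelated)
```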
We define Data Quality of f with respect to F by the following equation:
DQ(f,F) = J(f,F)/H(F) (7)
With this definition, DQ ranges from 0 to 1. DQ = 0 indicates that the data quality of f is minimal: f does not represent F. DQ = 1 indicates that f represents F perfectly and the data quality of f with respect to F is 100%.
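Equation (7) could be sketched in Python as follows. This is a minimal example under stated assumptions: records are already linked position by position, probabilities are relative frequencies, and the function names and sample values are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def data_quality(source, golden):
    """DQ(f,F) = J(f,F) / H(F), equation (7)."""
    h_golden = entropy(golden)
    if h_golden == 0:
        # A golden set with a single repeated value carries no information.
        raise ValueError("H(F) is zero, so DQ is undefined")
    joint = entropy(list(zip(source, golden)))        # H(f,F)
    mutual = entropy(source) + h_golden - joint       # J(f,F), equation (3)
    return mutual / h_golden

# Hypothetical linked values: the source mis-records "Jones" as "Smith".
golden = ["Smith", "Jones", "Brown", "Smith"]
source = ["Smith", "Smith", "Brown", "Smith"]

dq = data_quality(source, golden)
print(f"DQ = {dq:.3f}")
```

One property worth keeping in mind when reading the score: J measures how predictable F is from f, so a source that consistently writes “Schmidt” wherever F has “Smith” would still score 1.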
The approach can also be used to determine partial attribute/token level data quality. This will provide additional insights into what causes most significant data quality issues.
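An attribute-level breakdown might look like the sketch below. It assumes each record is a dictionary, that source and golden records are linked position by position, and that probabilities are relative frequencies; the record layout, attribute names, and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def attribute_dq(source_records, golden_records, attribute):
    """DQ(f,F) from equation (7), restricted to a single attribute."""
    source = [r[attribute] for r in source_records]
    golden = [r[attribute] for r in golden_records]
    h_golden = entropy(golden)
    if h_golden == 0:
        raise ValueError(f"H(F) is zero for attribute {attribute!r}")
    joint = entropy(list(zip(source, golden)))
    return (entropy(source) + h_golden - joint) / h_golden

# Hypothetical linked records: names match, but the source defaults every city.
golden_records = [
    {"name": "Smith", "city": "Boston"},
    {"name": "Jones", "city": "Austin"},
    {"name": "Brown", "city": "Boston"},
    {"name": "Smith", "city": "Austin"},
]
source_records = [
    {"name": "Smith", "city": "Boston"},
    {"name": "Jones", "city": "Boston"},
    {"name": "Brown", "city": "Boston"},
    {"name": "Smith", "city": "Boston"},
]

dq_by_attribute = {
    attr: attribute_dq(source_records, golden_records, attr)
    for attr in ("name", "city")
}
print(dq_by_attribute)
```

Here the per-attribute scores point at “city” as the attribute driving the quality problem, while “name” is fully consistent with the golden records.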
Data quality improvement should be done iteratively. Changes in the source data may impact the “golden record”. Equations (1) and (7) are then applied again to recalculate the information quantity and data quality characteristics.
In the next post, we’ll examine some of the Benefits and Scenarios for the Information Theory Approach to Data Quality.
This is part 2 of a four-part data quality series. Catch up with the other posts:
Part 1: The Existing Data Quality Approaches: What are they still missing?
Part 3: Benefits and Scenarios for the Information Theory Approach to Data Quality
Part 4: Data Quality Reporting for Senior Management, Board of Directors and Regulatory Agencies
7 Responses »
Hi there.
I noticed this post some days ago, but did not get the picture.
Having a quiet Saturday morning, and having also read the follow-up scenario post, I must admit I’m still lost.
Perhaps I’m stupid, but could you please explain in more detail how
• the entity data set F
• the benchmark entity F
• the “golden view” for the data in domain F
• “the golden record” F
are established.
To me, that’s the challenge. The rest is math.
Henrik,
In this blog we introduced a way to quantify the data quality of a data set f against the golden record F.
One of the most practical scenarios is when there is a data hub with cleansed data. We refer to the data hub as recordset F. There exist multiple source systems with "dirty" data. The recordset representing a source system is referred to as f. If the data in the source system is good, f represents F well. If the quality of data in the source system is bad, f doesn't represent F well.
The third (and final) blog of this series will discuss specific benefits and scenarios. They may refine some of the questions you have.
As you say, "the rest is math". That's true. I wouldn't discard the importance of math though. I strongly believe that Data Quality and Data Governance need math. Both DQ and DG have to become more scientific and less buzz word oriented.
We would welcome a discussion that goes deeper into the matter.
Best,
Larry
Thanks Larry.
Don’t get me wrong. I am actually also a “math” person and do see the need to measure the uncertainty of data. I agree that buzzwords don’t help. I also find your comments on the previous post, about the excessive number of rules to be defined and managed, very true.
But how do we actually establish the benchmark, the capital F? If we have a cleansed hub with “good quality data” then of course you may compare that with other sources, and sure, we may have a data integration challenge, which is very much about technology.
The challenge of establishing F is the big conundrum as I see it. The questions are how we fulfil multiple purposes in F, to what degree F must represent the real world, what the costs of establishing F are, and what the business value is. Is F the one truth, or may there exist multiple versions of the truth being F?
To understand the angle of this article, please take into account that building F (the benchmark, i.e. entity resolution) is what a data hub does. We build F very well by leveraging advanced algorithms and powerful user interfaces. Our software and methodology have made great progress in how to build F. That said, there is an important qualifier here: we build F in the data hub and not directly in the source systems where the data (f) is created and maintained. Our customers want to achieve data quality in all systems across the enterprise and care about the quality of data in the hub only because it helps them solve their problems in other data sources. This is why the DQ metrics measuring DQ in the sources against the hub are on the critical path for our DQ strategy. This is critical for 80% of enterprise MDM programs.
Please keep in mind that the data hub is a service that actually builds cleansed data F, as opposed to just storing data cleansed elsewhere. Please read my article that elaborates on this: http://www.sdtimes.com/GUEST_VIEW_OLD_THINKING_DOES_A_DISSERVICE_TO_NEW_DATA_HUBS/By_LARRY_DUBOV/About_DATABASES/33828. In essence, the data hub is a powerful DQ toolset with only one limitation: it cleanses the data and creates the “golden record” within itself and not across the enterprise. Even though the data hub has data integration capabilities, it doesn’t really solve the problem.
Multiple legacy systems that create and update information are not necessarily synchronized with the data hub. Typically it can't be done automatically. The "owners" of the source systems that create f don't allow this. This is the reality we have seen in over 200 implementations. Our customers want to retain control over their data in the source systems. Therefore it is not just a data integration issue but rather a data governance issue: how to synchronize the source systems with the hub. To start and maintain continuous synchronization with the data hub, the DQ improvement process should be defined and built, progress measured, and accountabilities for the progress established. The metrics here are on the critical path. If the data quality in the source systems against the data hub is not continuously measured, you can't effectively execute a data quality improvement process. This is critical in our data quality vision for MDM.
I hope this explains our DQ position.
It's a good discussion. I think the formulas are focused on syntax, but not necessarily context. In that light, they make sense. But, you have to take context into account to provide data quality. Because of context, the instant you build it, F starts to become f-like. (If I understand it correctly.) In the case of CDI for example, people get married, divorced, move and even die. An outside data source (f) from, say the post office or other government agency may be a more authoritative source if they track those things and your company doesn't. Your f source may contain poor address information, but excellent e-mail data that you’d like to integrate into F. It may be more authoritative in some areas and not others.
The data governance practitioners must decide which data source they trust more. They must have a process of remediation, which many MDM solutions provide, and a process to take the best parts of matching records down to the entity level and build a golden record. That’s the holy grail of MDM.
When the data in a source changes, the hub receives an update with a sub-second delay. The hub is configured based on data governance and business requirements. If a certain data source is the trusted source for a given attribute, the data hub is aware of this business rule and will change the value of the attribute instantly. Consequently, at any point in time (with a sub-second delay) the hub holds the best data to the best of the enterprise's knowledge. This makes the hub the benchmark for all practical DQ purposes.