Defining Data Quality Metrics: Uniqueness, Completeness, Latency & Consistency

Establish data quality metrics that measure uniqueness, completeness, latency and consistency
Last week, I discussed some of the requirements for establishing data quality metrics for your MDM program. Now, let’s define each of the most common categories.
Uniqueness
Two types of uniqueness metrics should be considered for MDM projects.
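Uniqueness in the source systems can be defined as the ratio of the number of “golden records” to the number of records in the source linked to the corresponding “golden records.” A ratio of 1 means the source holds no duplicates of the linked entities; the lower the ratio, the more duplicates the source contains.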
Uniqueness in the data hub is measured in terms of the level of confidence (tolerance) expressed through the probabilities of false positive and false negative matches. These characteristics can be evaluated with ROC (receiver operating characteristic) curve techniques, which help establish the confidence levels for false positives (e.g., 0.01%) and false negatives (e.g., 3%). The same methods can also be used to quantify the impact of data quality issues and of specific data quality improvements.
An additional metric associated with the level of tolerance to false negatives and false positives is the width of the clerical review area, expressed in terms of the number of data stewardship tasks. Lower rates of false positive and false negative matches may require a higher number of data stewardship tasks to be resolved manually.
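To make the two uniqueness metrics concrete, here is a minimal Python sketch. It is an illustration under my own assumptions, not the method of any particular MDM product: the function names, the `ReviewBand` type, and the idea of a hand-labeled sample of scored record pairs with a lower and an upper match threshold are all hypothetical. Pairs scoring at or above the upper threshold are auto-linked, pairs below the lower threshold are auto-rejected, and everything in between lands in the clerical review area.

```python
from dataclasses import dataclass


def source_uniqueness(golden_count: int, linked_source_count: int) -> float:
    """Ratio of golden records to the source records linked to them."""
    if linked_source_count <= 0:
        raise ValueError("linked_source_count must be positive")
    return golden_count / linked_source_count


@dataclass
class ReviewBand:
    false_positive_rate: float  # share of true non-matches that were auto-linked
    false_negative_rate: float  # share of true matches that were auto-rejected
    clerical_tasks: int         # pairs left for manual data stewardship


def evaluate_thresholds(scored_pairs, lower: float, upper: float) -> ReviewBand:
    """Evaluate one (lower, upper) threshold pair on a labeled sample.

    scored_pairs is an iterable of (score, is_true_match) tuples; the labels
    are assumed to come from a manually reviewed sample (hypothetical input).
    """
    fp = fn = tasks = matches = non_matches = 0
    for score, is_match in scored_pairs:
        if is_match:
            matches += 1
        else:
            non_matches += 1
        if score >= upper:        # auto-linked
            if not is_match:
                fp += 1
        elif score < lower:       # auto-rejected
            if is_match:
                fn += 1
        else:                     # clerical review area
            tasks += 1
    return ReviewBand(
        false_positive_rate=fp / non_matches if non_matches else 0.0,
        false_negative_rate=fn / matches if matches else 0.0,
        clerical_tasks=tasks,
    )


# Made-up sample: (match score, is it really the same entity?)
sample = [(0.95, True), (0.90, False), (0.70, True), (0.60, False),
          (0.30, True), (0.20, False), (0.10, False)]

narrow = evaluate_thresholds(sample, lower=0.50, upper=0.80)
wide = evaluate_thresholds(sample, lower=0.25, upper=0.92)
```

On this made-up sample, widening the review area removes both kinds of matching error but doubles the number of stewardship tasks, which is the trade-off described above. Sweeping the thresholds over a larger labeled sample gives the points of an ROC-style curve.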
Completeness
MDM data hub matching engines can score two records to measure their similarity. The same mechanism can be used when a record is self-scored. In this case, the score characterizes record completeness from the entity resolution and matching perspectives.
The average self-score and the 10% quantile can be used as the default measures of data completeness, and the data governance organization can use them to monitor record completeness in large data sets. The data governance organization can also establish a policy threshold for record completeness. The number of records below that threshold then becomes a completeness metric.
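As a rough illustration, the three completeness measures can be computed from a list of self-scores like this. The sketch assumes the self-scores have already been exported from the matching engine as plain numbers; the function names, the nearest-rank quantile definition, and the sample values are my own choices for the example.

```python
from statistics import mean


def nearest_rank_quantile(values, percent: int):
    """Nearest-rank quantile: smallest value with at least `percent`% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, so no floating-point rounding is involved.
    rank = max(1, -(-percent * len(ordered) // 100))
    return ordered[rank - 1]


def completeness_metrics(self_scores, threshold):
    """Summarize record completeness from self-scores (hypothetical export)."""
    return {
        "average_self_score": mean(self_scores),
        "quantile_10": nearest_rank_quantile(self_scores, 10),
        "records_below_threshold": sum(1 for s in self_scores if s < threshold),
    }


# Made-up self-scores for ten records and a made-up policy threshold of 35.
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
metrics = completeness_metrics(scores, threshold=35)
```

The 10% quantile is useful alongside the average because it describes the least complete tail of the data set, which an average over millions of records can hide.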
Latency
MDM data hubs typically store the timestamps of changes in the source systems and the time when each of the changes hit the data hub.
You can use several metrics, including the average delay, the 5% quantile, or the number or percentage of records with latency above a threshold set by your policy.
The latency thresholds may depend on the source system. For instance, it is expected that latency for new systems can be near real time while the latency thresholds for legacy systems can be minutes, hours or even days.
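Here is a small sketch of how these latency metrics could be derived from the two timestamps, with a different threshold per source system. The tuple layout, the source names "crm" and "legacy", and the threshold values are assumptions made for the example, not a description of how any specific data hub stores its history.

```python
from datetime import datetime, timedelta


def latency_metrics(changes, thresholds, default_threshold):
    """Compute latency metrics from change records.

    changes: iterable of (source_system, changed_at_source, arrived_in_hub).
    thresholds: dict mapping source system name to its allowed timedelta.
    """
    delays = []
    late = 0
    for source, changed_at, arrived_at in changes:
        delay = arrived_at - changed_at
        delays.append(delay)
        # Each source system is judged against its own policy threshold.
        if delay > thresholds.get(source, default_threshold):
            late += 1
    if not delays:
        raise ValueError("no change records supplied")
    return {
        "average_delay": sum(delays, timedelta()) / len(delays),
        "late_count": late,
        "late_percent": 100.0 * late / len(delays),
    }


t0 = datetime(2024, 1, 1, 12, 0, 0)
changes = [
    ("crm", t0, t0 + timedelta(seconds=2)),
    ("crm", t0, t0 + timedelta(seconds=10)),
    ("legacy", t0, t0 + timedelta(hours=3)),
    ("legacy", t0, t0 + timedelta(hours=30)),
]
thresholds = {"crm": timedelta(seconds=5), "legacy": timedelta(days=1)}
result = latency_metrics(changes, thresholds, default_threshold=timedelta(hours=1))
```

In this made-up data, one near-real-time record and one legacy record exceed their respective thresholds, so half of the records count as late even though their absolute delays differ by several orders of magnitude.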
Cross-source Consistency with the Data Hub
Typically, an MDM program establishes a data hub as the enterprise benchmark for client/party data, product data, other master entities and the relationships between master entities. It is therefore critical to develop a set of metrics that characterize the extent to which the source systems are consistent with the data hub. Cross-source consistency metrics are discussed in a couple of recent blog posts:
A New Approach: Information Theory Applied to Data Quality for MDM
Quantifying Data Quality with Information Theory
Do you agree with these assertions, or do you define things differently? Next week, we’ll dig into the remaining categories of data quality metrics, including the differences between standardization and validation, availability, user adoption and reference data.