The most common issue hampering successful Master Data Management initiatives is poor data quality. Many organisations spend tens of millions on fancy new MDM applications, only to discover that, once they have taken on the data, they simply have another data source containing duplicated data that is not fit for purpose. They might as well have saved the money!
Navigate the complexities of MDM, data integration, and data quality. Our article ‘MDM, Data Integration, or Data Quality: Breaking It Down’ provides a comprehensive overview of their roles and significance in managing your organization’s data.
Of course, data quality is about much more than matching. But let us accept that, for MDM, the principal goal is to create a single, unique record for each object in our master – be they Material, Contract, Rental Unit, Client or Supplier. So our objective, from a data quality perspective, must be the same.

Table of Contents
- What is data matching?
- Common mistakes when matching data
- So what is the answer?
- A three-step process to ensure match accuracy
- Conclusion
What is Data Matching?
Data matching is the process of comparing and aligning data from different sources or datasets to identify and establish relationships between them. It involves searching for similarities, patterns, or matches in the data elements to determine if they refer to the same entity or share a common attribute. The goal of data matching is to merge or link related data records from various sources into a unified view, enabling more accurate and comprehensive analysis, reporting, and decision-making.
Data matching is also known as data linking.
When I talk about matching I am talking about creating a key (or keys) to link all related records. So duplicate records may still exist but will be identified as duplicates and will share common keys with the master record.
Data matching techniques often involve comparing fields or attributes within datasets, such as names, addresses, phone numbers, dates of birth, or unique identifiers. Various algorithms and approaches are used to perform data matching, including exact matching (identical values), fuzzy matching (allowing for variations or discrepancies), and probabilistic matching (assigning probabilities to potential matches based on the likelihood of similarity).
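To make those three approaches concrete, here is a minimal sketch using only Python’s standard library. The field names, weights and threshold are illustrative assumptions, not a recommendation for production matching.

```python
from difflib import SequenceMatcher

def exact_match(a: str, b: str) -> bool:
    """Exact matching: the values must be identical."""
    return a == b

def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy matching: tolerate small variations or discrepancies."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def probabilistic_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Probabilistic matching: a weighted likelihood that two records agree."""
    total = sum(weights.values())
    return sum(
        w * SequenceMatcher(None, str(rec_a[f]).lower(), str(rec_b[f]).lower()).ratio()
        for f, w in weights.items()
    ) / total

print(exact_match("Smith", "Smyth"))   # False
print(fuzzy_match("Smith", "Smyth"))   # True: similarity ratio 0.8 meets the 0.8 threshold
print(probabilistic_score(
    {"name": "Joe Smith", "dob": "1980-01-01"},
    {"name": "J Smith",   "dob": "1980-01-01"},
    weights={"name": 0.6, "dob": 0.4},
))  # a score between 0 and 1, to be compared against a chosen threshold
```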
Data matching is widely used in various domains, such as customer relationship management, fraud detection, identity verification, data integration, deduplication, and record linkage. By eliminating duplicates, resolving inconsistencies, and creating reliable connections between related data, organizations can enhance data quality, improve operational efficiency, and gain valuable insights from their data assets.
Common mistakes when matching data
Three common mistakes stem from the simplistic matching strategies imposed by many systems:
Mistake 1: Reliance on existing “unique” values.
It is tempting to assume that records have been correctly labelled, particularly when it comes to so-called unique identifiers or foreign keys.
So invoice number 505 will always be the same invoice, right?
Well, what about if the two invoice 505s have different issue dates?
What if they are issued by different suppliers?
What if they are identical except for the invoiced amount?
What if the invoice number is not captured?
Master data frequently comes from different systems or sources, and it will inherit and exacerbate any data quality issues that are not addressed during data take-on. In our experience, duplicated or invalid values in so-called unique identifiers frequently make them an unreliable indicator of uniqueness. Similarly, manual errors made when mapping foreign keys will cause huge inconsistencies if those keys are assumed accurate for the purposes of creating the master.
The bottom line: any match process that relies on a single attribute is unreliable! Ideally, at least three attributes should be compared.
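By way of illustration, a quick profiling pass like the hypothetical sketch below can reveal whether a so-called unique key deserves to be trusted before any matching relies on it. The invoice records and field names are invented for the example.

```python
from collections import defaultdict

# Hypothetical invoice records illustrating the problems described above.
invoices = [
    {"invoice_no": "505", "supplier": "Acme", "issue_date": "2023-01-10"},
    {"invoice_no": "505", "supplier": "Bolt", "issue_date": "2023-03-02"},
    {"invoice_no": None,  "supplier": "Acme", "issue_date": "2023-01-10"},
]

# Group records by the supposedly unique key.
by_key = defaultdict(list)
for inv in invoices:
    by_key[inv["invoice_no"]].append(inv)

for key, group in by_key.items():
    if key is None:
        print(f"{len(group)} record(s) with no invoice number: the key alone cannot match them")
    elif len(group) > 1 and len({i["supplier"] for i in group}) > 1:
        print(f"invoice_no {key} is reused across suppliers: it is not a unique identifier")
```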
Mistake 2: The use of absolute matching (e.g. SQL WHERE x = y).
Absolute matching assumes that two identical elements are the same and, conversely, that two dissimilar attributes are not the same.
So invoice (1) 505 is the same as invoice (2) 505 but different from invoice (3) 000505.
And Mr Smith cannot possibly be the same as Joe Smith.
Real data, particularly when captured in different systems, follows different standards and patterns. Of course, data standardisation carried out as part of a data quality initiative can (and should) be used to improve the quality of data – but not all data can be standardised, and absolute matching will fail.
This approach also fails to address the question: “What if one (or both) of the attributes being compared are blank/null?”
Are two empty fields the same?
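These failure modes are easy to demonstrate. The snippet below mirrors a SQL-style equality comparison in Python, using the values from the examples above.

```python
# Absolute matching: a plain equality test, as in SQL's WHERE x = y.
print("505" == "000505")          # False: the same invoice, differently padded
print("Mr Smith" == "Joe Smith")  # False: yet these may be the same person
print("" == "")                   # True: but two blank fields prove nothing
# Note that SQL is stricter still: NULL = NULL evaluates to unknown,
# so rows with missing keys silently drop out of an equality join.
```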
Mistake 3: The use of MAGIC…
OK, no one uses magic for matching.
But a lot of tools depend on very complex statistical matching algorithms, which use weightings assigned to different attributes and match algorithms to generate a probability that two items are the same.
Everything over a certain threshold (say, 90%) is deemed to be a match.
These solutions often claim to be able to match data without applying any data cleansing, so data stored to different standards may be matched with varying results. So “Mr Smith” and “Smith Mr” may match – but so may “Smit Mr” and “Mrs Smith”.
The problem is that items can match the threshold for a variety of reasons.
Changing the weighting assigned to different attributes can deliver radically different (or inconsistent) results. Even changing the order of the attributes can change the results. This approach is practically guaranteed to falsely match a number of records, and it is very difficult to understand why or to improve the results. Once you have merged “Smit Mr” with “Mrs Smith” it is very difficult to go back – particularly if the MDM application will push the new “Master Data” and overwrite “Mrs Smith” with “Smit Mr”.
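The sensitivity to weightings is easy to reproduce. In the sketch below (illustrative attributes, weights and threshold), the very same pair of records is rejected under one weighting and matched under another, without the data changing at all.

```python
from difflib import SequenceMatcher

def weighted_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted average of per-attribute similarity ratios (0 to 1)."""
    total = sum(weights.values())
    return sum(
        w * SequenceMatcher(None, rec_a[f].lower(), rec_b[f].lower()).ratio()
        for f, w in weights.items()
    ) / total

a = {"name": "Smit Mr",   "address": "12 High St"}
b = {"name": "Mrs Smith", "address": "12 High Street"}

THRESHOLD = 0.75
for weights in ({"name": 0.7, "address": 0.3},    # name-heavy weighting
                {"name": 0.2, "address": 0.8}):   # address-heavy weighting
    score = weighted_score(a, b, weights)
    print(weights, round(score, 2), "MATCH" if score >= THRESHOLD else "no match")
# Prints roughly 0.6 (no match) and 0.77 (MATCH) for the very same pair:
# only the weightings changed, not the data.
```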
Most MDM applications provide a workaround for these common errors.
Because they do not trust the results, all matches are deemed suspect and are pushed through the MDM workflow module for manual validation and verification.
Unfortunately, this places an unacceptable workload on the operations team responsible for validation. Assuming ten minutes per match, it will take twenty-one working days (167 hours) to validate 1,000 matches. Monthly! What if you have millions of matches? The MDM solution implemented by a South African retail bank generates 30,000 exceptions (each requiring manual intervention) every month. This option is untenable!
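For the record, the arithmetic behind those figures, assuming an eight-hour working day:

```python
matches, minutes_each = 1000, 10
hours = matches * minutes_each / 60   # 166.7, roughly 167 hours
days = hours / 8                      # 20.8, about 21 working days
print(round(hours), round(days))      # 167 21
```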
So what is the answer?
What is required is a human-friendly rules management system that allows users to isolate particular match rules and test for accuracy.
If the system cannot provide this level of granularity (or relies on magic), then your match process cannot be relied upon to provide consistent, accurate results.
Of course, no automated system can provide a 100% accurate match.
I would suggest that it is better to have one or two duplicates than to incorrectly match records that are not the same.
The objective should be to get these exceptions down to a manageable number, with a small number of exceptions pushed to the workflow process and no risk of incorrectly matching and overwriting a record with the wrong values.
A three-step process to ensure match accuracy.
Step 1: Assess the data for accuracy and completeness.
Ideally, we need at least three elements to allow us to match. If the ideal attributes are not populated, are inconsistent, or are captured to different standards, then we need to apply data cleansing principles.
Step 2: Apply data cleansing principles so that the records are as consistent, complete and accurate as possible.
Ensure data consistency and reliability through effective data standardization practices. Dive into our discussion on Data standardization to understand its importance in enhancing data quality and usability.
In principle, complex attributes need to be broken up into their unique elements; data stored to different standards should be adapted to the agreed master standard; and data should be enriched where possible to fill in missing information (or to remove garbage) critical to matching. Data quality is essential for trusted matching. Of course, analysis may show that acceptable matches can be derived using a smaller set of attributes.
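As a minimal sketch of these principles, the code below breaks a single name attribute into its elements and normalises order and casing. It assumes simple ‘Title First Last’ style names; real cleansing tools rely on far richer reference data and parsing than this.

```python
import re

TITLES = {"mr", "mrs", "ms", "dr"}

def standardise_name(raw: str) -> dict:
    """Split a complex name attribute into its elements, in a fixed order."""
    parts = [p for p in re.split(r"[\s,]+", raw.strip()) if p]
    title = ""
    if parts and parts[0].rstrip(".").lower() in TITLES:
        title = parts.pop(0).rstrip(".").title()
    elif parts and parts[-1].rstrip(".").lower() in TITLES:  # handles "Smith Mr"
        title = parts.pop().rstrip(".").title()
    first = parts[0].title() if len(parts) > 1 else ""
    last = parts[-1].title() if parts else ""
    return {"title": title, "first": first, "last": last}

print(standardise_name("Mr Smith"))   # {'title': 'Mr', 'first': '', 'last': 'Smith'}
print(standardise_name("Smith Mr"))   # the same elements, despite the different order
print(standardise_name("Joe Smith"))  # {'title': '', 'first': 'Joe', 'last': 'Smith'}
```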
Step 3: Apply a deterministic fuzzy match strategy.
Deterministic matching applies specific, easily understood business rules to match data – the data either matches or it does not.
For example, Rule 1 may state that Clients are the same if they have the same name, the same passport (or social security) number and the same address. Rule 2 may state that the names can be different if they share the same passport number and the same address.
Unlike purely probabilistic match rules, which rely on hard-to-understand statistical manipulations, each rule can be tested and, if necessary, improved. So you may decide that Rule 2, in this example, is not a valid match and exclude it from your results – or push it as an exception to the workflow process. Or you could improve the rule by looking at how similar the names are: say, matching on the same surname without a first name, or allowing a small variation in spelling, provided the passport number and address are the same.
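As a sketch of how such rules stay testable, the two example rules can be written as explicit functions (the field names are illustrative). Each rule either fires or it does not, so Rule 2 can be inspected, tightened or dropped in isolation.

```python
def populated(*values):
    """A rule should never fire on blank or missing values."""
    return all(v not in (None, "") for v in values)

def rule_1(a: dict, b: dict) -> bool:
    """Clients match on identical name, passport number and address."""
    return (populated(a["name"], a["passport_no"], a["address"])
            and a["name"] == b["name"]
            and a["passport_no"] == b["passport_no"]
            and a["address"] == b["address"])

def rule_2(a: dict, b: dict) -> bool:
    """Names may differ, provided passport number and address both agree."""
    return (populated(a["passport_no"], a["address"])
            and a["passport_no"] == b["passport_no"]
            and a["address"] == b["address"])

RULES = [("Rule 1", rule_1), ("Rule 2", rule_2)]

def match(a: dict, b: dict):
    """Return the first rule that fires, or None: every match is explainable."""
    for name, rule in RULES:
        if rule(a, b):
            return name
    return None

a = {"name": "Joe Smith", "passport_no": "A123", "address": "12 High St"}
b = {"name": "J Smith",   "passport_no": "A123", "address": "12 High St"}
print(match(a, b))  # 'Rule 2': the names differ but passport and address agree
```

Because the rule that fired is recorded against each match, every merge decision can be traced back to a named, signed-off condition.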
Conclusion
The bottom line is that this approach will deliver fewer exceptions as each condition can be tested and, once signed off, trusted to provide no false positives.
It will require some work up-front – which can be reduced by using the off-the-shelf cleansing and matching rules that your data quality tool provides for your data set. However, once this work is done, it will typically require far less ongoing effort than probabilistic rules that must be continuously tuned.
In our experience, we have been able to deploy robust, explainable AI matching rules, either off the shelf (for South African name and address data) or within a day for other data sets. This small amount of effort radically reduces the operational load, as very few exceptions requiring a manual decision are generated.
In my next post, I will discuss how data cleansing affects matching and MDM.
