Ask Dr Gary – Finding the perfect match!

Discover the challenges of data matching and learn about three powerful approaches – deterministic, probabilistic, and a combination of both. Find out how to choose the right matching tool for your data quality needs.


The Perfect match

Read our article ‘MDM, Data Integration, or Data Quality: Exploring the Differences‘ to understand their unique roles and benefits for your organization.

A few weeks ago I offered dating advice in the post Eight insights for Fantastic Dates

Keeping to the romantic theme then: Advice on finding the perfect match!

If statistics are to be believed, a significant percentage of marriages end in divorce. And how many people did you date before settling down? Finding a match can be hard.

The challenges of data matching

Matching duplicate records in a data set can also be a challenge – particularly if you don’t have the right tools.

To some extent, matching is a dark art – an arcane science practised by a knowledgeable few. Different vendors and consultants preach different approaches which can add to the confusion.

Three approaches to matching

Technically, there are three possible approaches to matching – deterministic approaches, probabilistic approaches or a combination of both.

Deterministic matching and probabilistic matching are two different approaches used in data deduplication or record linkage processes to identify and match similar or duplicate records within a dataset. Here’s an explanation of each:

Deterministic Matching:

Deterministic matching is a rule-based approach that relies on exact matching criteria to identify and match records. It involves comparing specific fields or attributes of data records against predefined rules or algorithms. If the values in the selected fields match exactly, the records are considered duplicates and are merged or flagged accordingly.

Deterministic matching is useful when there are unique identifiers or fields that can be used to guarantee a perfect match. For example, using a unique customer ID or a national identification number, if two records have the same ID, they are considered duplicates.
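As a minimal sketch of the idea, the Python snippet below groups records whose identifier values are exactly equal. The records, the `customer_id` field name and the `deterministic_duplicates` helper are all hypothetical, made up for illustration only.

```python
from collections import defaultdict

# Hypothetical records; the field names and values are assumptions.
records = [
    {"row": 1, "customer_id": "C-1001", "name": "WIDGETT CO NR 1"},
    {"row": 2, "customer_id": "C-1001", "name": "WIDGET COMPANY NUMBER 1"},
    {"row": 3, "customer_id": "C-2002", "name": "WIDGETT CO NR 2"},
]


def deterministic_duplicates(rows, key="customer_id"):
    """Group rows whose key values are exactly equal; keep groups of 2+."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["row"])
    return {value: ids for value, ids in groups.items() if len(ids) > 1}


duplicates = deterministic_duplicates(records)
print(duplicates)  # {'C-1001': [1, 2]}
```

Rows 1 and 2 are flagged only because they share an identifier. If that identifier were missing or mistyped, this rule would find nothing.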

Advantages of deterministic matching:

  • It provides precise and reliable results when exact matching criteria are available.
  • It is relatively straightforward to implement and understand.

Limitations of deterministic matching:

  • It may not be effective when dealing with variations, errors, or inconsistencies in data, such as misspellings, abbreviations, or different data formats.
  • It relies on the availability and accuracy of unique identifiers or fields for matching.

Probabilistic Matching:

Probabilistic matching, also known as fuzzy matching or similarity matching, is a statistical approach that assesses the similarity between records based on various matching criteria and assigns a probability or weight to determine potential matches. It takes into account the degree of similarity or closeness between records rather than relying solely on exact matches.

Probabilistic matching algorithms use techniques like string comparison, phonetic matching, tokenization, or other similarity measures to evaluate the similarity between data elements. The algorithms assign a matching score or weight to each pair of records, and a threshold is set to determine whether the records should be considered a match or not.
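The score-and-threshold idea can be sketched roughly as follows. This is a simple weighted-similarity illustration rather than a full statistical model: the field names, the weights and the 0.85 threshold are assumptions, and Python's standard-library `difflib.SequenceMatcher` stands in for whatever comparison a real tool would use.

```python
from difflib import SequenceMatcher

# Illustrative assumptions: which fields to compare, how much each one
# counts, and the score at which a pair is treated as a match.
WEIGHTS = {"name": 0.7, "phone": 0.3}
THRESHOLD = 0.85


def similarity(a, b):
    """String similarity between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()


def match_score(rec_a, rec_b):
    """Weighted sum of the per-field similarities."""
    return sum(
        weight * similarity(rec_a[field], rec_b[field])
        for field, weight in WEIGHTS.items()
    )


def is_match(rec_a, rec_b, threshold=THRESHOLD):
    """A pair is a match when its score reaches the threshold."""
    return match_score(rec_a, rec_b) >= threshold


a = {"name": "JOHN SMITH", "phone": "0115550100"}
b = {"name": "JON SMITH", "phone": "0115550100"}   # misspelt name
c = {"name": "MARY JONES", "phone": "0215550199"}  # different person

print(is_match(a, b))  # True
print(is_match(a, c))  # False
```

Raising the threshold trades false positives for false negatives, and lowering it does the opposite.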

Advantages of probabilistic matching:

  • It can handle variations, errors, or inconsistencies in data by considering similarities instead of exact matches.
  • It can identify potential matches even when unique identifiers are missing or unreliable.

Limitations of probabilistic matching:

  • It may introduce false positives or false negatives due to the subjective nature of setting matching thresholds.
  • It can be computationally intensive, especially when dealing with large datasets.

Using a combination

Deterministic and probabilistic matching are often used in combination to improve the accuracy of record linkage. Deterministic matching is typically applied first to identify exact matches, while probabilistic matching is employed to identify potential matches based on similarity measures and assign weights or probabilities to assess the likelihood of a match. This combination helps achieve more comprehensive deduplication or record linkage results.

Applying match strategies in practice

Assume we have three customer names: “WIDGETT CO NR 1”, “WIDGET COMPANY NUMBER 1”, “WIDGETT CO NR 2”

A pure deterministic match would not find any duplicates:

Each value is unique and a deterministic result relies on an exact match.

But a human can see that “WIDGETT CO NR 1” and “WIDGET COMPANY NUMBER 1” are probably the same – “CO” and “NR” are abbreviations for “COMPANY” and “NUMBER”, and “WIDGETT” is a likely misspelling of “WIDGET”.

This kind of missed match is called a false negative. It’s a problem – we may send the same customer two letters instead of one.

A purely probabilistic match might find a result.

Based on statistical similarity, the strings “WIDGETT CO NR 1” and “WIDGETT CO NR 2” are nearly identical: they differ in only one character and are 93% the same.
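The exact percentage depends on the similarity measure used. As a rough sketch, Python's standard-library `difflib.SequenceMatcher` (one possible measure, chosen here as an assumption) gives about the same figure for this pair, and a much lower one for the pair a human would actually match:

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Share of matching characters, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


# The two names that differ only in the final digit.
score_1_vs_2 = similarity("WIDGETT CO NR 1", "WIDGETT CO NR 2")
print(f"{score_1_vs_2:.0%}")  # 93%

# The two names a human would consider the same company.
score_1_vs_long = similarity("WIDGETT CO NR 1", "WIDGET COMPANY NUMBER 1")
print(score_1_vs_long < score_1_vs_2)  # True
```

On raw strings, the measure ranks the wrong pair highest.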

But to a human, we know that company number 1 and company number 2 are probably not the same. This kind of mismatch is called a false positive match.

We have accidentally joined together two (or more) records that may not be related. This is a bigger problem – we now aren’t billing one of our customers.

Reliable matches are achieved when we use a blend of deterministic and probabilistic matching on quality data.

For accurate matching, we may want to improve our data consistency – for example, we may choose to teach our tool that “CO” is equivalent to “COMPANY” and that “NR” is equivalent to “NUMBER”. While data does not need to be exact, a level of quality needs to be in place to support accurate matching, as discussed in the post How important is quality data for matching?

This means that we are now comparing “WIDGETT CO NR 1”, “WIDGET CO NR 1” and “WIDGETT CO NR 2” – we have a more consistent match that allows us to see that all three companies are very similar based on a fuzzy match. Good tools can handle minor spelling and typing errors in data – but they cannot predictably handle vast differences.

By adding an additional layer of logic we can remove matches where there are different numbers in the company name – so the first two will match and the last will not.

“WIDGETT CO NR 1” is the same as “WIDGET COMPANY NUMBER 1” but not the same as “WIDGETT CO NR 2”. We now have a result that matches reality.
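The three layers described above (standardise, fuzzy match, then check the numbers) can be sketched in a few lines of Python. The abbreviation table, the 0.9 threshold, the helper names and the choice of `difflib.SequenceMatcher` are assumptions for illustration, not how any particular tool works.

```python
import re
from difflib import SequenceMatcher

# Assumed lookup table and threshold, for illustration only.
ABBREVIATIONS = {"CO": "COMPANY", "NR": "NUMBER"}
THRESHOLD = 0.9


def standardise(name):
    """Upper-case the name and expand known abbreviations word by word."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in name.upper().split())


def numbers_in(name):
    """All runs of digits in the name, in order of appearance."""
    return re.findall(r"\d+", name)


def same_company(a, b, threshold=THRESHOLD):
    """Fuzzy match on standardised names, vetoed when the numbers differ."""
    a_std, b_std = standardise(a), standardise(b)
    if numbers_in(a_std) != numbers_in(b_std):
        return False  # deterministic rule: different numbers never match
    return SequenceMatcher(None, a_std, b_std).ratio() >= threshold


print(same_company("WIDGETT CO NR 1", "WIDGET COMPANY NUMBER 1"))  # True
print(same_company("WIDGETT CO NR 1", "WIDGETT CO NR 2"))          # False
```

The number check is a deterministic rule layered on top of the fuzzy score, which is what removes the false positive.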

Ultimately, matching is about trust, as discussed in the post Trust is at stake – use the right match approach

Key questions when choosing a matching tool

Ask these questions about how your shortlisted MDM or data quality tool approaches matching:

  1. Does the tool support fuzzy (probabilistic) matching? This is necessary to handle common mistakes, typing errors, missing data and the like.
  2. Does the tool allow you to identify exactly how each match was derived? If not, you cannot test or audit your results.
  3. Can you isolate specific match results? You need to be able to identify incorrect matches and switch them off.
  4. Does the tool provide predefined match rules for your data? Matching is a dark art – predefined best practice rules give you a kick start and guidance that can save you months of pain.
  5. Does the tool allow you to improve data quality? You don’t want to match all the clients that have the telephone number “0000000000” – data quality must be improved in order to get accurate results.
  6. Does the tool provide off-the-shelf data quality rules for your geography? South Africa, the US, Brazil and India have very different data. Tools that work with one set may not work with the others. Test what your vendor can give you off the shelf.
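To illustrate the “0000000000” point, here is a minimal Python sketch that treats placeholder telephone numbers as missing before matching on them. The rule used (a number made up of one repeated digit is a placeholder), the sample clients and the helper names are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical client records.
clients = [
    {"id": 1, "phone": "0000000000"},
    {"id": 2, "phone": "0000000000"},
    {"id": 3, "phone": "011 555 0100"},
    {"id": 4, "phone": "0115550100"},
]


def usable_phone(phone):
    """Return the digits of a phone number, or None for empty or placeholder values."""
    digits = re.sub(r"\D", "", phone or "")
    if not digits or len(set(digits)) == 1:  # e.g. "0000000000"
        return None
    return digits


def group_by_phone(rows):
    """Group client ids by cleaned phone number, skipping unusable values."""
    groups = defaultdict(list)
    for row in rows:
        phone = usable_phone(row["phone"])
        if phone is not None:
            groups[phone].append(row["id"])
    return {phone: ids for phone, ids in groups.items() if len(ids) > 1}


phone_matches = group_by_phone(clients)
print(phone_matches)  # {'0115550100': [3, 4]}
```

Clients 1 and 2 are not matched on their shared placeholder, while clients 3 and 4 match once the formatting difference is cleaned away.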

Explore our article on Matching South African data to understand the challenges and solutions in matching South African data effectively

Last, but not least!

Get the vendor to run a test with your data. Say 50000 or 100000 rows. Get them to do it in front of you.

How long did it take?

Did it get the right results?

Did you understand how it worked and could you do it yourself after you buy the tool?

Explore the differences between deterministic and probabilistic matching methodologies in our comprehensive analysis. Dive into the debate on deterministic matching versus probabilistic matching to understand which approach best suits your data matching needs.

Pick a tool that passes these simple tests and you are on your way to perfect matching.

Image sourced from https://www.flickr.com/photos/promiseproduction/3891351547/in/photostream/

For more insights about matching, check out the eLearningCurve.com course Data Parsing, Matching and De-duplication. Contact us for South African pricing.

Responses to “Ask Dr Gary – Finding the perfect match!”

  1. John Owens

    Hi Gary

    Excellent points that you make.

    However, one thing that totally mystifies me is why, with this great ability to spot duplicates, do enterprises allow them to be created in the first place and THEN try to find and remove them?

    Why not just use the same logic and technology to identify potential duplicates and prevent them being created in the first place? So much time and money saved! So much better customer service!

    Mystifying!

    Kind regards
    John

    1. Gary Allemann

      The case for real time data cleansing and matching at point of capture seems pretty obvious to me too.

      This raises the question of performance – can your technology choice cleanse and match quickly or are you going to wait ages for a result? Certainly not all choices are created equal in this regard…
