A few weeks ago I offered dating advice in the post Eight insights for Fantastic Dates
Keeping to the romantic theme then: Advice on finding the perfect match!
If statistics are to believed a significant percentage of marriages end in divorce. And how many people did you date before settling down? Finding a match can be hard.
Matching duplicate records in a data set can also be a challenge – particularly if you don’t have the right tools.
To some extent matching is a dark art – an arcane science practiced by a knowledgeable few. Different vendors and consultants preach different approaches which can add to the confusion.
Technically, there are three possible approaches to matching – deterministic approaches, probabilistic approaches or a combination of both.
So what is the difference and why is it important?
Assume we have three customer names: “WIDGETT CO NR 1”, “WIDGET COMPANY NUMBER 1”, “WIDGETT CO NR 2”
A pure deterministic match would not find any duplicates:
Each value is unique and a deterministic result relies on exact match.
But a human can see that “WIDGET CO NR 1” and “WIDGET COMPANY NUMBER 1” are probably the same – “CO” and “NR” are abbreviations for “COMPANY” and “NUMBER”.
This kind of missed match is called a false negative. It’s a problem – we may sent the same customer two letters instead of one.
A purely probabilistic match might find a result.
Based on the statistical similarities between the strings “WIDGETT CO NR 1” and “WIDGETT CO NR 2” are very similar. Statistically they are 93% the same.
But to a human, we know that company number 1 and company number 2 are probably not the same. This kind of mismatch is called a false positive match.
We have accidentally joined together two (or more records) that may not be related. This is a bigger problem – we now aren’t billing one of our customers.
Reliable matches are achieved when we use a blend of deterministic and probabilistic matching on quality data [Tweet This]
For accurate matching we may want to improve our data consistency – for example, we may choose to teach our tool that “CO” is equivalent to “COMPANY” and that “NO” is equivalent to “NUMBER”. While data does not need to be exact, a level of quality need to be in place to support accurate matching as discussed in the post How important is quality data for matching?
This means that we are now comparing “WIDGETT CO NR 1″, WIDGET CO NR 1” and “WIDGETT CO NR 2” – we have a more consistent match that allows us to see that all three companies are very similar based on a fuzzy match. Good tools can handle minor spelling and typing errors in data – but they cannot predictably handle vast differences.
By adding an additional layer of logic we can remove matches where there are different numbers in the company name – so the first two will match and the last will not.
“WIDGETT CO NR 1” is the same as “WIDGET COMPANY NUMBER 1” but not the same as “WIDGETT CO NR 2”. We now have a result that matches reality.
Ultimately, matching is about trust, as discussed in the post Trust is at stake – use the right match approach
Ask these questions about how your shortlisted mdm or data quality tool approaches matching[Tweet this]
- Does the tool support fuzzy (probabalistic) matching? This is necessary to handle common mistakes, typing errors, missing data and the like
- Does the tool allow you to identify exactly how each match was derived? If not you cannot test or audit your results
- Can you isolate specific match results? You need to be able to identify incorrect matches and switch them off.
- Does the tool provide predefined match rules for your data? matching is a dark art – predefined best practice rules give you a kick start and guidance that can save you months of pain.
- Does the tool allow you to improve data quality? You don’t want to match all the clients that have the telephone number “0000000000” – data quality must be improved in order to get accurate results
- Does the tool proceed off the shelf data quality rules for your geography? South Africa, the US, Brazil and India have very different data. Tools that work with one set may not work with the others. test what your vendor can give you off the shelf.
Last, but not least!
Get the vendor to run a test with your data. Say 50000 or a 100000 rows. Get them to do it in front of you.
How long did it take?
Did it get to the right results?
Did you understand how it worked and could you do it yourself after you buy the tool?
Pick a tool that passes these simple tests and you are on your way to perfect matching.
Image sourced from https://www.flickr.com/photos/promiseproduction/3891351547/in/photostream/