One of my children posted the attached joke on Facebook recently.
It raises an interesting point: ambiguity in data can have life-or-death implications.
In most companies data volumes prohibit manual exception management of ambiguous data – particularly for matching.
Where data volumes are lower, staff complements tend to be low too. So 70 or 80 exceptions a month may be too many for a small business to deal with.
Automated matching technologies are an extremely useful tool for identifying duplicate records in your client base, inventory, supply chain, or elsewhere.
The question that arises is how to ensure that ambiguity is dealt with correctly!
There are two extremes.
At one end you do not match any ambiguous records. “J. Smith” may or may not be “John Smith” – so you continue to treat these as two separate clients.
At the other extreme you would match all ambiguous records – of course “J. Smith” must be “John Smith”!
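To make the two extremes concrete, here is a minimal sketch in Python. Both rules are deliberately crude, hypothetical illustrations, not real matching logic:

```python
# Two extreme policies for an ambiguous pair like "J. Smith" vs "John Smith".
# Both rules are hypothetical illustrations, not production matching logic.

def never_match_ambiguous(a: str, b: str) -> bool:
    """Extreme 1: only an exact string match counts."""
    return a == b

def always_match_ambiguous(a: str, b: str) -> bool:
    """Extreme 2: a shared initial and surname are 'good enough'."""
    return (a[0].lower() == b[0].lower()
            and a.split()[-1].lower() == b.split()[-1].lower())

print(never_match_ambiguous("J. Smith", "John Smith"))     # False -> two clients
print(always_match_ambiguous("J. Smith", "John Smith"))    # True  -> one client
print(always_match_ambiguous("Jane Smith", "John Smith"))  # True  -> a false positive!
```

Note that the loose rule also merges “Jane Smith” into “John Smith” – exactly the risk discussed next.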
False positive matches – where you incorrectly assume that two separate entities are the same – expose your business to far more risk than false negatives, where you continue to treat a single entity as more than one. Which is worse: that as a client I have two account numbers and receive two invoices, or that as a client I cease to exist in your dataset and am never invoiced?
Any matching process that does not first standardise data to remove ambiguity is prone to false positive matches that may never be picked up.
The key requirement for matching is to break data into its elements – e.g. a product description may be made up of the BRAND, the UNIT OF MEASURE, the MATERIAL and the COLOUR. Each of these elements should be standardised to the extent necessary to remove ambiguity – is “H2O too” really the same as “H2O2”? (NO!)
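A minimal sketch of this element-wise approach, again in Python. The parsing rules and reference lists are hypothetical stand-ins for the lookup data you would build from your own products:

```python
# Hypothetical sketch: break a product description into its elements and
# standardise each one. The reference lists are illustrative stand-ins for
# lookup data built from your own products.

UNITS = {"ml": "ML", "millilitre": "ML", "l": "L", "litre": "L"}
COLOURS = {"blk": "BLACK", "black": "BLACK", "wht": "WHITE", "white": "WHITE"}
BRANDS = {"h2o too": "H2O TOO", "h2o2": "H2O2"}  # distinct brands stay distinct

def standardise(description: str) -> dict:
    """Split a raw description into standardised BRAND, UNIT, COLOUR, MATERIAL."""
    elements = {"BRAND": None, "UNIT_OF_MEASURE": None,
                "MATERIAL": None, "COLOUR": None}
    leftover = []
    for token in description.lower().split():
        if token in UNITS:
            elements["UNIT_OF_MEASURE"] = UNITS[token]
        elif token in COLOURS:
            elements["COLOUR"] = COLOURS[token]
        else:
            leftover.append(token)
    rest = " ".join(leftover)
    # Try the longest brand first, so a brand that happens to be a prefix
    # of another cannot win by accident.
    for raw, std in sorted(BRANDS.items(), key=lambda kv: -len(kv[0])):
        if rest.startswith(raw):
            elements["BRAND"] = std
            rest = rest[len(raw):].strip()
            break
    elements["MATERIAL"] = rest.upper() or None
    return elements

print(standardise("H2O too plastic blk"))  # BRAND 'H2O TOO', not 'H2O2'
print(standardise("H2O2 plastic blk"))     # BRAND 'H2O2'
```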
Where ambiguity remains it is better to err on the side of caution – merging two client or product records incorrectly can have catastrophic consequences and be almost impossible to undo. With experience you can quickly identify the key elements of an object and focus your standardisation effort on these alone.
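In code, erring on the side of caution might look like the sketch below: merge only when every key standardised element agrees exactly, and flag anything ambiguous as an exception rather than merging it automatically. The choice of key elements here is an illustrative assumption:

```python
# Hypothetical conservative match rule: merge only when every key element
# agrees exactly after standardisation; anything ambiguous becomes an
# exception for review, never an automatic merge.

KEY_ELEMENTS = ("BRAND", "UNIT_OF_MEASURE", "COLOUR")  # illustrative choice

def match_decision(a: dict, b: dict) -> str:
    agreements = [a.get(k) is not None and a.get(k) == b.get(k)
                  for k in KEY_ELEMENTS]
    if all(agreements):
        return "MERGE"      # safe: every key element agrees
    if any(agreements):
        return "EXCEPTION"  # ambiguous: queue for (optional) review
    return "NO_MATCH"       # clearly different records

rec_a = {"BRAND": "H2O TOO", "UNIT_OF_MEASURE": "L", "COLOUR": "BLACK"}
rec_b = {"BRAND": "H2O2", "UNIT_OF_MEASURE": "L", "COLOUR": "BLACK"}
print(match_decision(rec_a, rec_b))  # EXCEPTION -> flagged, not merged
```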
This gives you certainty that successful matches do not need to be reviewed as exceptions, and allows you to focus your attention on the much smaller subset of records that have been flagged as possible matches (where this matters). This will reduce your operational data management costs and substantially increase your return on investments in MDM, CRM or similar “single view” technologies.
This post was originally published on the Data Quality Matters blog.