In my last post I focused on the importance of a matching approach that can be easily understood by a human. Most important is the ability to isolate (and improve or remove from the match) specific failing instances. This ability enables trust in the automated results – which, in turn, means that the business does not need to wade through and validate every match – an impossible task in any real environment.
The overriding assumption for matching is that data is sufficiently similar for a computer to consistently make the correct decision. Data cleansing is a critical prerequisite for this, for two principal reasons:
- We may need to add missing data for key fields (enrichment).
- We may need to ensure that data is captured more consistently (standardisation).
I have delivered a number of projects for which client name was the only common or shared attribute. Unfortunately, name is a very poor indicator of uniqueness, whether for a business or an individual. By adding additional information – such as date of birth, company registration number, address information, income tax numbers and so on – we can improve our confidence that Mr Smith and Joe Smith are the same individual.
In many cases, the information is available within the corporate environment but may not be shared across all applications.
So, for example, telephone number may be held in the client master and call centre applications but not in the billing system. Similarly, date of birth may be held in the client master and billing applications, but not in the call centre. If we match between the client master and the billing application using name and date of birth, we can add a telephone number to the billing system, which can, in turn, be used to match to the call centre.
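This kind of transitive enrichment can be sketched in a few lines. The records and field names below are invented for illustration – a real implementation would sit on top of a proper matching engine, not an exact-key lookup:

```python
# Illustrative sketch: propagate a telephone number from the client master
# to the billing system by matching on name and date of birth.
# All records and field names are invented for this example.

client_master = [
    {"name": "JOE SMITH", "dob": "1980-11-12", "phone": "011 555 1234"},
]

billing = [
    {"name": "JOE SMITH", "dob": "1980-11-12", "phone": None},
]

def enrich_phone(master, target):
    """Copy phone onto target records that match on name + date of birth."""
    lookup = {(m["name"], m["dob"]): m["phone"] for m in master}
    for rec in target:
        if rec["phone"] is None:
            rec["phone"] = lookup.get((rec["name"], rec["dob"]))

enrich_phone(client_master, billing)
print(billing[0]["phone"])  # the billing record now carries the master's number
```

The enriched telephone number can then serve as an additional match key against the call centre system.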
This is a simplistic example, but in practice, the more information you can derive and add to each system, the more flexibility you will have in your matching to other, less populated systems. This is necessary to engender business confidence and trust in the final result.
The other critical factor to consider is the standardisation of data.
Computers are not good at magically resolving serious ambiguity.
Simple standardisation might involve recoding the “1” and “2” used in one system to the “Male” and “Female” used by another. Is Joe Smith, born on 12/11/09, the same person as Joe Smith, born on 09-12-11? It depends – are both dates stored as day/month/year? By resolving these simple ambiguities in advance we can radically improve matching accuracy.
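Both fixes amount to applying each source system's declared convention rather than guessing. A minimal sketch, assuming one system stores dates as month/day/year and the other as year-month-day:

```python
from datetime import datetime

def standardise_date(raw, source_format):
    """Normalise a date string to ISO format using the source's known layout."""
    return datetime.strptime(raw, source_format).date().isoformat()

# Assumed formats for the example: system A is month/day/year,
# system B is year-month-day. Made explicit, the two dates agree.
a = standardise_date("12/11/09", "%m/%d/%y")
b = standardise_date("09-12-11", "%y-%m-%d")
print(a, b, a == b)

# Simple code-table recoding: map one system's codes onto the other's labels.
GENDER_MAP = {"1": "Male", "2": "Female"}
print(GENDER_MAP["1"])
```

The point is that the mapping tables and formats are agreed up front, so the matcher never has to resolve the ambiguity itself.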
More complex standardisation requires the use of data quality tools.
For example, in South Africa name and address data is commonly stored in both English and Afrikaans language formats. So “ABC Apteek EDMS BK, Kerkstraat 12, Richterspark Uit 7, Potgietersrus” is equivalent to “ABC Pharmacy (Pty) Ltd, 12 Church Street, Richters Park Ext 7, Mokopane”. Our robust name and address parser understands and manages these kinds of issues off the shelf. The parser generates standardised name and address elements – e.g. house number, street name, suburb name – that are used by the matcher.
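To make the idea concrete, here is a toy, dictionary-driven version of the principle: split a street line into standardised elements (house number, street name), allowing for the Afrikaans convention of placing the number after the street name. The real parser referenced above is a full data quality tool; this sketch, with its invented one-entry dictionary, is only an illustration:

```python
import re

# Invented mini-dictionary mapping an Afrikaans street name to English.
STREET_TERMS = {"KERKSTRAAT": "CHURCH STREET"}

def parse_street(line):
    """Split a street line into house number and standardised street name."""
    line = line.strip().upper()
    # Afrikaans style: "KERKSTRAAT 12" (number after the street name)
    m = re.match(r"^(\D+?)\s+(\d+)$", line)
    if m:
        street, number = m.group(1), m.group(2)
    else:
        # English style: "12 CHURCH STREET"
        m = re.match(r"^(\d+)\s+(\D+)$", line)
        number, street = m.group(1), m.group(2)
    return {"house_number": number,
            "street_name": STREET_TERMS.get(street, street)}

print(parse_street("Kerkstraat 12"))
print(parse_street("12 Church Street"))
```

Both inputs parse to the same standardised elements, which is exactly what the matcher needs.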
Similarly, other free text data fields, such as SAP material descriptions, can be parsed and broken into structured, standardised elements: “Groen TYT CRLLA” is equivalent to “Toyota Corolla, Green”.
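The same lookup-table approach applies to material descriptions. The abbreviation and colour tables below are invented for this example; a production data quality tool would carry far richer dictionaries and parsing rules:

```python
# Toy illustration: parse a free-text material description into
# standardised elements using lookup tables.
MAKE_ABBREVIATIONS = {"TYT": "Toyota"}
MODEL_ABBREVIATIONS = {"CRLLA": "Corolla"}
COLOURS = {"GROEN": "Green", "GREEN": "Green"}  # Afrikaans and English

def parse_material(description):
    """Map each token to a standardised element, if the dictionaries know it."""
    elements = {}
    for token in description.upper().split():
        if token in MAKE_ABBREVIATIONS:
            elements["make"] = MAKE_ABBREVIATIONS[token]
        elif token in MODEL_ABBREVIATIONS:
            elements["model"] = MODEL_ABBREVIATIONS[token]
        elif token in COLOURS:
            elements["colour"] = COLOURS[token]
    return elements

print(parse_material("Groen TYT CRLLA"))
# {'colour': 'Green', 'make': 'Toyota', 'model': 'Corolla'}
```

Once both systems' descriptions are reduced to the same structured elements, matching becomes a straightforward comparison.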
There is no magic bullet for Master Data Management. By embedding data governance and data quality principles you will avoid the principal mistakes that cause most MDM projects to run heavily over budget or fail.
It is critical to understand, in advance, that whether you use an existing data source as your master or implement one of the newfangled hub or bus platforms, MDM has to be about the data.