Ease of implementation vs “ease of use” for data quality

Most experienced IT and business managers would agree – it is always cheaper and lower risk to buy an existing, working solution than to build your own. Hence the massive investment in ERP applications, CRM, etc.

Buying an existing application enables you to:

  1. Leverage the best-practice business processes embedded in the application to enhance your business
  2. Reduce the risk of project failure by starting at a 60% to 80% fit to your needs, and adding only a little
  3. Use the time and resources saved to focus on your own core business

At face value all data quality applications look similar – they all provide data profiling, cleansing, matching, etc. They all claim to be easy to use, and to empower the business user, to integrate with your enterprise architecture, and to handle your volumes.

So the only real differentiator is the implementation head start you get when selecting one application over another! We call this ease of implementation!

In South Africa, for example, we have extremely complex address data.

  1. Addresses are typically stored as either English or Afrikaans variations, e.g. “East London”, “Oos-London”
  2. The post office recognises approximately 18 000 suburbs and towns.
  3. In most cases, but not all, each of these has at least two valid postal codes, depending on whether mail is delivered to the street address or to a post office box.
  4. New names are being assigned to towns, suburbs and streets on a regular and ongoing basis – e.g. Potgietersrus is now Mokopane. Data contains both variations depending on its age.
  5. There are more than 12 identified address types, ranging from relatively structured urban street or postal addresses, e.g. “PO Box 12, Centurion, 0082”, to unstructured rural addresses such as “The white house next to the cafe, Mzo Village”.
  6. There are many thousands of rural villages and informal settlements which are not documented and are not recognised as postal addresses – e.g. “Mandelaville Squatter Camp, Joe Slovo T/ship”
  7. There are hundreds of spelling variations of even valid place names – e.g. “East London”, “Aest London”

Projects trying to use a tool that does not provide a significant head start in addressing these complexities will add years, and the associated cost, to their project.
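To give a feel for just two of the items in the list above (renamed places and misspellings), here is a minimal sketch in Python. It is an illustration only, not a description of any product's rules: the `PLACE_NAMES` table, the `correct_place_name` function and the 0.85 similarity cutoff are all assumptions made for this example, and a real reference table would hold many thousands of entries.

```python
import difflib

# Hypothetical reference table: known variant (lower case) -> preferred name.
# Only the examples mentioned in the text are included.
PLACE_NAMES = {
    "east london": "East London",
    "oos-london": "East London",   # Afrikaans variation
    "mokopane": "Mokopane",
    "potgietersrus": "Mokopane",   # older name
}

def correct_place_name(raw, cutoff=0.85):
    """Return the preferred place name, or None if there is no unambiguous match."""
    key = raw.strip().lower()
    if key in PLACE_NAMES:
        return PLACE_NAMES[key]
    # Fuzzy lookup against the known variants (cutoff is an assumed threshold).
    close = difflib.get_close_matches(key, list(PLACE_NAMES), n=5, cutoff=cutoff)
    candidates = {PLACE_NAMES[c] for c in close}
    # Only correct the value when every close variant points to the same place.
    return candidates.pop() if len(candidates) == 1 else None

print(correct_place_name("Aest London"))
print(correct_place_name("Potgietersrus"))
print(correct_place_name("Mzo Village"))
```

Even this toy version needs a reference table, a similarity threshold and a policy for ambiguous results, and each of those has to be tuned against real data.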

A simple example: imagine you wish to identify duplicate client records in a dataset containing the following:

  • “ABC Apteek (EDMS) BPK” at “ATKV Gebou 15delaan 24 Magaliesig”
  • “ABC Pharmacy (Pty) ltd” at “42 Fifteenth Ave, Magaliesview”

Our solution will give you the following head start:

1.) For the name

  • We recognise that we have a business in each line from our comprehensive list of South African business types.
  • We recognise that “(EDMS) BPK” is the Afrikaans equivalent of “(Pty) Ltd” and we handle a large array of common spelling and typing errors e.g. PtyLtd, (Prop) Ltd, (PTY)Ltd, etc
  • We recognise that “Apteek” is an Afrikaans business term and is equivalent to “Pharmacy”
  • We build a standardised English business name for each record – “ABC Pharmacy” – off the shelf and without impacting the original data

To build this from scratch requires a solid understanding of English and Afrikaans business terms, building the translations, resolving spelling errors, etc. Imagine doing this for 100 000 records, or a million! Imagine doing this for individuals – is Bongani a given name, or a surname? Or is it a place name?
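As a rough illustration of what building this from scratch involves, the sketch below handles only the two example records. The `LEGAL_FORMS` and `TERM_TRANSLATIONS` tables and the `standardise_business_name` function are hypothetical names invented for this example; they are assumptions, not an actual rule set.

```python
import re

# Hypothetical lookup tables for illustration; keys are upper case with
# punctuation and spaces removed, so "(PTY)Ltd" and "PtyLtd" share one key.
LEGAL_FORMS = {
    "EDMSBPK": "(Pty) Ltd",   # Afrikaans equivalent
    "PTYLTD": "(Pty) Ltd",
    "PROPLTD": "(Pty) Ltd",
}
TERM_TRANSLATIONS = {"APTEEK": "Pharmacy"}  # Afrikaans -> English business terms

def standardise_business_name(raw):
    """Return (standardised_name, legal_form); the input string is not modified."""
    tokens = raw.split()
    legal_form = None
    # Try the last 3, 2, then 1 tokens as a trailing legal form.
    for n in (3, 2, 1):
        if len(tokens) > n:
            key = re.sub(r"[^A-Z]", "", "".join(tokens[-n:]).upper())
            if key in LEGAL_FORMS:
                legal_form = LEGAL_FORMS[key]
                tokens = tokens[:-n]
                break
    # Translate any known business terms in what remains.
    words = [TERM_TRANSLATIONS.get(t.upper(), t) for t in tokens]
    return " ".join(words), legal_form

print(standardise_business_name("ABC Apteek (EDMS) BPK"))
print(standardise_business_name("ABC Pharmacy (Pty) ltd"))
```

Both example records reduce to the same standardised name and legal form, but only because the tables already contain exactly the variants needed. Every further spelling error or business term means another entry to discover and maintain.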

2.) For the address

  • We recognise that “Magaliesig” is a valid South African postal suburb and we standardise to the correct English variation – “Magalies View”
  • We recognise that “Magaliesview” is a common spelling error and standardise to the correct “Magalies View”
  • We recognise that “15delaan” is an Afrikaans street name and type and standardise to “15th Avenue”
  • We recognise that “Fifteenth Ave” is an English street name and abbreviated street type and standardise to “15th Avenue”
  • We recognise, based on the address format, that the Afrikaans address has a house number of 24, while the English address has a house number of 42.
  • We recognise that the first address has additional information – a building called the “ATKV Building”
  • We can identify and append a valid post code

Once again, building these rules from scratch requires a solid understanding of South African geographies, the ability to translate between English and Afrikaans variations, the ability to recognise a valid location, or an unambiguous close variation, and correct it, etc. In our experience we have found, and built into our standard rules, hundreds of spelling variations and errors for even simple place names. Imagine having to go through this process for a million address lines. What if your data set includes records from other countries – in Africa, or the UK, or Russia? How will you handle these?
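A minimal sketch of the address steps, again covering only the two example records, might look like the following. The `SUBURBS`, `STREETS` and `BUILDING_TERMS` tables and the `standardise_address` function are hypothetical. The sketch also makes simplifying assumptions that real data will break: the suburb is taken to be the last token, and the house number is simply the first all-digit token left over, rather than being derived from the address format.

```python
import re

# Hypothetical reference data for illustration only (keys are lower case).
SUBURBS = {"magaliesig": "Magalies View", "magaliesview": "Magalies View"}
STREETS = {"15delaan": "15th Avenue", "fifteenth ave": "15th Avenue"}
BUILDING_TERMS = {"gebou": "Building"}

def _extract_street(tokens):
    """Find a known street (two-token names first); return it and the other tokens."""
    for size in (2, 1):
        for i in range(len(tokens) - size + 1):
            key = " ".join(tokens[i:i + size]).lower()
            if key in STREETS:
                return STREETS[key], tokens[:i] + tokens[i + size:]
    return None, tokens

def standardise_address(raw):
    """Split a free-text address into standardised components."""
    result = {"building": None, "house_number": None, "street": None, "suburb": None}
    tokens = re.sub(r"[,.]", " ", raw).split()
    if not tokens:
        return result
    # Assumption: the suburb is the final token of the address line.
    result["suburb"] = SUBURBS.get(tokens[-1].lower())
    result["street"], tokens = _extract_street(tokens[:-1])
    # Assumption: the first remaining all-digit token is the house number.
    for i, tok in enumerate(tokens):
        if tok.isdigit():
            result["house_number"] = tok
            tokens = tokens[:i] + tokens[i + 1:]
            break
    # Whatever is left is treated as additional building information.
    if tokens:
        result["building"] = " ".join(BUILDING_TERMS.get(t.lower(), t) for t in tokens)
    return result

print(standardise_address("ATKV Gebou 15delaan 24 Magaliesig"))
print(standardise_address("42 Fifteenth Ave, Magaliesview"))
```

The sketch does not attempt to append a postal code, handle multi-word suburbs, or deal with any of the unstructured address types described earlier, which is where most of the real effort goes.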

3.) Matching – how do we build a rule to identify that these are the same? We know which information we can safely ignore. We know that we have the same business name (standardised), and we know that we have the same street name and suburb name (standardised). We can ignore the fact that we have different building names (one is not populated) and that we do not have a postal code – this is a delivery address. The house numbers are a transposition – a common typing error – and we have a best-practice mechanism for dealing with this.
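The matching logic just described can be sketched as a simple rule over standardised records. This is one possible formulation under stated assumptions, not the mechanism referred to above: the record layout (dictionaries with `name`, `street`, `suburb` and `house_number` keys) and the function names are hypothetical, and treating a missing house number as compatible is a policy choice made for this example.

```python
def is_transposition(a, b):
    """True if b equals a with exactly one pair of adjacent characters swapped."""
    if len(a) != len(b) or a == b:
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )

def is_probable_duplicate(rec1, rec2):
    """Compare two standardised records; building and postal code are ignored."""
    # Name, street and suburb must be populated and identical.
    for field in ("name", "street", "suburb"):
        if not rec1.get(field) or rec1.get(field) != rec2.get(field):
            return False
    # House numbers may be equal, transposed, or (assumption) missing on one side.
    n1, n2 = rec1.get("house_number"), rec2.get("house_number")
    if n1 and n2 and n1 != n2 and not is_transposition(n1, n2):
        return False
    return True

rec_a = {"name": "ABC Pharmacy", "street": "15th Avenue", "suburb": "Magalies View",
         "house_number": "24", "building": "ATKV Building"}
rec_b = {"name": "ABC Pharmacy", "street": "15th Avenue", "suburb": "Magalies View",
         "house_number": "42", "building": None}
print(is_probable_duplicate(rec_a, rec_b))
```

A single true/false rule like this is the simplest case. A production rule set would typically weigh many more conditions and combinations.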

How would you define your matching rules from scratch? What if the house number was missing? What if the data sets had different post codes? There are literally hundreds of combinations and conditions that need to be taken into account if these are not provided off the shelf! Does your team have the experience to handle these correctly without a head start?

Many data quality solutions do not provide meaningful South African localisation – this translates into years of development effort!

I am aware of one organisation that has had a team of 8 consultants building rules like this for over a year using a tool that did not provide this kind of head start. By comparison, we have successfully delivered a number of projects, for different clients, within three months – using a team of one or two consultants. The cost savings are obvious!

More importantly, there must have been a business reason you wanted to clean up your data!

Maybe you want to launch a new product in a specific market segment, or reduce client churn, or cut postal delivery costs, or improve planning through better analytics! Maybe you are migrating to a new application and need to create an accurate, consolidated view of each unique client, supplier, product or employee before you can go live! Maybe you need to comply with regulations and are at risk of penalties or fines in the event of non-compliance!

The head start means you will see returns on your investment almost immediately – rather than waiting for months or years to develop something that may not even work correctly.

Why build from scratch when solutions exist off the shelf?
