Previous attempts at geocoding – adding a longitude and latitude point to each address – had been extremely difficult. The process had delivered limited results and had required months of effort and substantial manual intervention.
In short, the process had been so complex and so unrewarding that it had not been attempted in several years.
We were asked to put our money where our mouth is: we claim to have an off-the-shelf set of South African name and address rules that allows us to quickly and effectively understand and geocode free-format South African addresses.
The client provided us with a set of some 500,000 addresses and asked us to deliver results within a few working days. They wanted to measure both the accuracy of the results and the time to value, as well as the performance of the application.
Quick and accurate geocoding
Within a day we had the following results, based on our standard off-the-shelf rules with some minor, data-set-specific configuration.
Elapsed time: 00:08:42

Record input:

| Count  | Statistic         |
|--------|-------------------|
| 513288 | Records processed |

Record matches:

| Count  | Description                                   |
|--------|-----------------------------------------------|
| 442611 | Records coded with Latitude and Longitude     |
| 70677  | Records not coded with Latitude and Longitude |

Accuracy match levels:

| Count  | Description                                        |
|--------|----------------------------------------------------|
| 167048 | Records matched at Level 5 - Interpolated Rooftop  |
| 40956  | Records matched at Level 4 - Street Level          |
| 210837 | Records matched at Level 3 - Postal Code Centroid  |
| 23619  | Records matched at Level 2 - City Centroid         |
| 151    | Records matched at Level 1 - Region Centroid       |
| 0      | Records matched at Level 0 - Country Centroid      |

Address accuracy match: 86.2%
Geocoding is, in essence, a simple process. All that is needed is to match the address captured for your customer, or supplier, to a reference data set that contains a point.
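The core matching step can be sketched as a simple lookup against a coordinate-bearing reference set. The reference data, field names, and coordinates below are invented for illustration, not a real data set:

```python
# Minimal sketch of the core geocoding step: match a captured address
# to a reference data set that carries a coordinate point.
# All reference entries here are illustrative.

reference = {
    ("12 MAIN ROAD", "CLAREMONT", "7708"): (-33.9833, 18.4667),
    ("1 CHURCH STREET", "PRETORIA", "0002"): (-25.7461, 28.1881),
}

def geocode(street: str, suburb: str, postcode: str):
    """Return (latitude, longitude) on an exact match, else None."""
    key = (street.strip().upper(), suburb.strip().upper(), postcode.strip())
    return reference.get(key)

print(geocode("12 Main Road", "Claremont", "7708"))  # (-33.9833, 18.4667)
print(geocode("12 Main Raod", "Claremont", "7708"))  # None - one typo breaks the match
```

Note how a single misspelling defeats the exact match, which is exactly why data quality dominates the problem in practice.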
Yet, in practice, poor data quality makes this very difficult, as discussed in previous posts such as A guide to South African address data quality.
The most common address quality problems include:
- Missing or inaccurate information – e.g. no post code, or an incorrect post code.
- Misfielded information – e.g. a street name in the suburb field.
- Misspellings and typing mistakes – e.g. “raod” instead of “road”.
- Out-of-date data – e.g. still using an old, replaced place name.
For geocoding, one is also dependent on the accuracy and completeness of one’s reference data set. It is easy to find a spatial location for an urban address in a major city. But many South Africans live in informal or rural settlements that are not well mapped or understood.
In these cases, we may only be able to provide a regional centroid, rather than an exact location.
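That fallback can be pictured as a cascade through progressively coarser match levels, mirroring the accuracy levels in the results above. The lookup tables and coordinates here are invented for illustration:

```python
# Illustrative match-level fallback: try the most precise match first,
# then fall back to coarser centroids. All data below is made up.

STREETS = {("GOVAN MBEKI ST", "GQEBERHA"): (-33.96, 25.61)}  # level 4: street
POSTCODES = {"6001": (-33.95, 25.60)}                        # level 3: postal code centroid
CITIES = {"GQEBERHA": (-33.96, 25.62)}                       # level 2: city centroid
REGIONS = {"EASTERN CAPE": (-32.30, 26.40)}                  # level 1: region centroid

def geocode_with_fallback(street, city, postcode, region):
    """Return (match_level, point), falling back to coarser levels."""
    if (street, city) in STREETS:
        return 4, STREETS[(street, city)]
    if postcode in POSTCODES:
        return 3, POSTCODES[postcode]
    if city in CITIES:
        return 2, CITIES[city]
    if region in REGIONS:
        return 1, REGIONS[region]
    return 0, None  # country centroid / no usable match

# An unmapped street in a known city still yields a city centroid:
print(geocode_with_fallback("UNKNOWN RD", "GQEBERHA", "9999", "EASTERN CAPE"))
# (2, (-33.96, 25.62))
```

A regional or city centroid is far less useful than a rooftop point, but it is honest about its own precision, which is why reporting the match level alongside the coordinate matters.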
Yet, in this particular case, I found a number of curious exceptions – locations that I did not at first understand.
I had addresses such as:
- 15 Govan Mbeki Saint
- 1127 Phalaborwa Saint
- and many more similar examples.
It was only when I was presenting the results to the client that I figured this out.
At some point in the past, a data “cleansing” project had been run in which someone had (probably) decided that ST should be replaced by SAINT.
As a result, thousands of street addresses had been degraded.
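The kind of blanket rule that likely caused this damage is easy to reproduce. The function below is a hypothetical sketch, not the client's actual cleansing logic:

```python
import re

# A sketch of a blanket "cleansing" rule: replace every standalone
# ST with SAINT, with no regard for context. This is the kind of
# rule that degrades street-type suffixes.

def naive_cleanse(address: str) -> str:
    return re.sub(r"\bST\b", "SAINT", address.upper())

print(naive_cleanse("15 Govan Mbeki St"))  # 15 GOVAN MBEKI SAINT - degraded
```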
I frequently see the results of previous cleansing attempts in production data that we are asked to resolve. Often, the data cleansing attempt has made things worse.
Three points for achieving quality data
- Work with someone who understands the problem. Our team would never have made such a rudimentary mistake, as we follow the two additional principles below.
- Context is critical. You cannot make blanket changes to data. You need to understand context and only make changes where they are relevant. In 12 ST WINIFREDS ST, the first ST is SAINT, the second is STREET. We understand and manage this kind of complexity.
- Test before you deploy. Even if an error of the nature identified had been proposed as a cleansing rule, it should have been tested and the issue picked up before it was applied to the live data. I have been under pressure, particularly from IT, to deploy “cleansed” data into production without testing and business validation of the proposed changes. This is always a mistake!
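The context point above can be illustrated with a deliberately simple positional rule: ST before a name expands to SAINT, while ST at the end of an address is a street-type suffix. Real address rules are far more involved; this sketch only shows why context must drive the expansion:

```python
# A context-aware sketch, assuming a simple positional rule:
# ST as the final token is a street-type suffix (STREET);
# ST anywhere else is treated as a name prefix (SAINT).
# Production rules would be much richer than this.

def expand_st(address: str) -> str:
    tokens = address.upper().split()
    out = []
    for i, tok in enumerate(tokens):
        if tok == "ST":
            out.append("STREET" if i == len(tokens) - 1 else "SAINT")
        else:
            out.append(tok)
    return " ".join(out)

print(expand_st("12 ST Winifreds St"))  # 12 SAINT WINIFREDS STREET
```

Even this toy rule handles the 12 ST WINIFREDS ST case correctly where a blanket replacement cannot.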
What strange results have you seen in your data?