At a recent GIS conference I discussed the implications of poor-quality address data for spatial intelligence. Large organisations hold millions of address records that could provide valuable insight if modelled in a spatial intelligence platform. To access this insight, these organisations must bridge the gap between the address data stored in their CRM, billing or supplier databases and the spatial layers stored in their GIS and reporting systems. This is done by assigning a longitude and latitude coordinate, or a polygon, to represent the address – the process of geocoding.
Poor address data quality is a significant obstacle to achieving this.
Address data quality is most commonly impacted by inaccuracy (such as misspelled place names or the wrong street type), by incompleteness (such as missing town names or post codes) or by a lack of timeliness (for example, if a city or street name has changed, this may not be reflected in older data).
However, even accurate data can present challenges for geocoding. Address data is frequently captured in free-format address fields, meaning that identifying individual elements such as the house number or street name can be difficult. Other elements can be represented in different formats – e.g. 2nd Ave is the same street as Second Avenue. Typical addresses, even if accurate, are characterised by a lack of standards that hampers attempts to apply a spatial component to the data.
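As a rough illustration of the "2nd Ave" versus "Second Avenue" problem, the sketch below standardises a street name before any matching takes place. The lookup tables and the function name `standardise` are hypothetical and deliberately tiny; a production rule set would be far larger and would need to handle ambiguous tokens (for instance "St" meaning either "Street" or "Saint").

```python
import re

# Hypothetical, illustrative lookup tables (not a complete rule set).
STREET_TYPES = {"ave": "avenue", "st": "street", "rd": "road"}
ORDINALS = {"1st": "first", "2nd": "second", "3rd": "third", "5th": "fifth"}


def standardise(address: str) -> str:
    """Lower-case the address, drop punctuation and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    expanded = []
    for token in tokens:
        token = ORDINALS.get(token, token)       # "2nd" -> "second"
        token = STREET_TYPES.get(token, token)   # "ave" -> "avenue"
        expanded.append(token)
    return " ".join(expanded)


if __name__ == "__main__":
    # Both spellings reduce to the same standardised form.
    print(standardise("2nd Ave"))
    print(standardise("Second Avenue"))
```

Once both variants reduce to the same standard form, an exact comparison is enough and no fuzzy logic is needed for this case.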
Addressing these issues enables accurate matching of existing address data against a reference data set. At the end of my talk I was asked why one couldn’t just use fuzzy matching.
Address validation is a good example of how imprecise matching based on inconsistent or incomplete data can have unintended consequences.
The risk of fuzzy matching, if improperly applied to poor-quality data, is that of creating false positive matches. For example, ask yourself whether “5th Avenue” is the same place as “5th Street”, or whether “Deal Road” is the same as “Dial Road”. In these cases, a poor use of fuzzy matching could validate or geocode an invalid location – if done en masse this could severely impact the use of the data for spatial analysis.
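The false positive risk described above can be sketched with Python's standard-library `difflib.SequenceMatcher`, used here purely as a stand-in for whatever fuzzy matcher is actually in use. The helper names `similarity` and `is_match`, and the threshold values, are assumptions chosen for illustration rather than recommended settings.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a: str, b: str, threshold: float) -> bool:
    """Treat two strings as the same place if they score at or above the threshold."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    pairs = [("Deal Road", "Dial Road"), ("5th Avenue", "5th Street")]
    for left, right in pairs:
        score = similarity(left, right)
        # Illustrative thresholds only: one fairly strict, one loose.
        print(left, "|", right, "| score:", round(score, 2),
              "| strict:", is_match(left, right, 0.85),
              "| loose:", is_match(left, right, 0.5))
```

“Deal Road” and “Dial Road” differ by a single character, so they score highly and pass even the stricter threshold, although they may be entirely different roads. “5th Avenue” and “5th Street” score lower, but a threshold loosened to tolerate abbreviations and typos would accept them too.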
This kind of issue applies to more than just address data. For example, the serial numbers “#123”, “#000123”, “123” and “000123” may all indicate the same item, captured using different standards, or may be unique serial numbers provided by different suppliers. If fuzzy matching is applied without an understanding of what you are looking at, and of the implications of your match approach, it can have severe consequences.
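The serial number example can be made concrete with a small sketch. The function `naive_key` applies a plausible-looking clean-up rule, while `supplier_key` keeps the raw value and scopes it to a supplier. Both functions and the supplier names are hypothetical; which behaviour is correct depends entirely on what the data actually means.

```python
def naive_key(serial: str) -> str:
    """Strip a leading '#' and any leading zeros (a common clean-up rule)."""
    return serial.lstrip("#").lstrip("0")


def supplier_key(supplier: str, serial: str) -> tuple:
    """Keep the serial exactly as captured and scope it to its supplier."""
    return (supplier, serial)


if __name__ == "__main__":
    serials = ["#123", "#000123", "123", "000123"]

    # The naive rule collapses all four values into a single key.
    print({naive_key(s) for s in serials})

    # Hypothetical suppliers: scoped keys keep the four records distinct.
    records = list(zip(["Supplier A", "Supplier B", "Supplier C", "Supplier D"], serials))
    print({supplier_key(supplier, serial) for supplier, serial in records})
```

If the four values really are one item captured under different standards, the first rule is right. If they come from different suppliers, it silently merges four distinct items.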
Fuzzy matching is a powerful and proven tool when used in a controlled manner on data of a reasonable quality. Fuzzy matching can help to manage small variations in spelling, missing information and other discrepancies that cause similar data to be seen as unique by many systems.
Like any tool, fuzzy matching must be understood, applied sensibly, and tuned and tested. To achieve trusted results, it must also form part of an overall data improvement strategy that includes standardisation and enrichment.