How important is quality data for matching?


At a recent GIS conference I discussed the implications of poor quality address data for spatial intelligence. Large organisations hold millions of address records that could provide valuable insight if modelled in a spatial intelligence platform. To access this insight, these organisations must bridge the gap between the address data stored in their CRM, billing or supplier databases and the spatial layers stored in their GIS and reporting systems. This is done by assigning a longitude and latitude coordinate, or a polygon, to represent the address: the process of geocoding.

Poor address data quality is a significant obstacle to achieving this.

Address data quality is most commonly impacted by inaccuracy (such as misspelled place names or the wrong street type), by incompleteness (such as missing town names or post codes) or by timeliness (for example, if a city or street name has changed, this may not be reflected in older data).

However, even accurate data can present challenges for geocoding. Address data is frequently captured in free format address fields, meaning that identifying individual elements such as the house number or street name can be difficult. Other elements can be represented in different formats, e.g. 2nd Ave is the same street as Second Avenue. Typical addresses, even if accurate, are characterised by a lack of standardisation that hampers attempts to apply a spatial component to the data.
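To make the 2nd Ave example concrete, here is a minimal Python sketch of standardisation. The lookup tables and the `standardise` function are hypothetical and deliberately tiny; real address standardisation relies on much larger, locale-specific rules.

```python
import re

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
# Note that "st" is ambiguous in real data (Street vs Saint); this sketch ignores that.
STREET_TYPES = {"ave": "avenue", "av": "avenue", "st": "street", "rd": "road"}
ORDINALS = {"1st": "first", "2nd": "second", "3rd": "third", "5th": "fifth"}


def standardise(address: str) -> str:
    """Lower-case, replace punctuation with spaces and expand known abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", address.lower()).split()
    return " ".join(ORDINALS.get(t, STREET_TYPES.get(t, t)) for t in tokens)


if __name__ == "__main__":
    print(standardise("2nd Ave"))
    print(standardise("Second Avenue"))
```

Once both forms are reduced to the same string, an exact comparison is enough and no fuzzy logic is needed for this kind of variation.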

Addressing these issues enables accurate matching of existing address data against a reference data set. At the end of my talk I was asked why one couldn’t just use fuzzy matching.

Address validation is a good example of how imprecise matching based on inconsistent or incomplete data can have unintended consequences.

The risk of fuzzy matching, if improperly applied to poor quality data, is that of creating false positive matches. For example, ask yourself whether “5th Avenue” is the same place as “5th Street”, or whether “Deal Road” is the same as “Dial Road”? In these cases, a poor use of fuzzy matching could validate or geocode an invalid location. If done en masse, this could severely impact the use of the data for spatial analysis.
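The Deal Road example can be reproduced with a naive similarity check. This is a sketch only: it assumes Python's standard `difflib.SequenceMatcher` as the similarity measure and an arbitrary threshold of 0.8, neither of which comes from any particular matching product.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def naive_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two strings as the same address if they look similar enough.

    The threshold is an arbitrary assumption chosen for illustration.
    """
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    # Two different roads that differ by a single letter.
    print(similarity("Deal Road", "Dial Road"))
    print(naive_match("Deal Road", "Dial Road"))
```

Here “Deal Road” and “Dial Road” score roughly 0.89 and are accepted as a match, even though they may be two different streets. With a looser threshold, pairs such as “5th Avenue” and “5th Street” would start to match too.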

This kind of issue applies to more than just address data. For example, the serial numbers “#123”, “#000123”, “123” and “000123” may all indicate the same item, captured using different standards, or may be unique serial numbers as provided by different suppliers. If fuzzy matching is applied without an understanding of what you are looking at, and of the implications of your match approach, it can have severe consequences.
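A short Python sketch shows how the same serial numbers can be treated either way. The `normalise_serial` and `match_key` functions, and the supplier names, are hypothetical; whether stripping the “#” and leading zeros is safe depends entirely on the business rules behind the data.

```python
from typing import Tuple


def normalise_serial(raw: str) -> str:
    """Strip a leading '#' and leading zeros.

    Only appropriate if these characters are known to be formatting noise.
    """
    digits = raw.strip().lstrip("#").lstrip("0")
    return digits or "0"


def match_key(supplier: str, raw: str) -> Tuple[str, str]:
    """Scope the serial number to its supplier so that identical numbers
    from different suppliers are not merged."""
    return (supplier.strip().lower(), normalise_serial(raw))


if __name__ == "__main__":
    print({normalise_serial(s) for s in ["#123", "#000123", "123", "000123"]})
    print(match_key("Acme", "#123") == match_key("Bolt", "000123"))
```

Normalising alone collapses all four values into one. Adding the supplier to the key keeps them apart where they come from different sources, which is the kind of decision that requires understanding the data first.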

Fuzzy matching is a powerful and proven tool when used in a controlled manner on data of a reasonable quality. Fuzzy matching can help to manage small variations in spelling, missing information and other discrepancies that cause similar data to be seen as unique by many systems.

Like any tool, fuzzy matching must be understood, must be applied in a sensible manner, must be tuned and tested and must form part of an overall data improvement strategy that should include standardisation and enrichment in order to achieve trusted results.
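As one illustration of what “controlled” might mean in practice, the sketch below accepts a fuzzy match against a reference list only if the best score is high enough and clearly ahead of the runner-up. It assumes the inputs have already been standardised, uses Python's `difflib` as a stand-in similarity measure, and the `threshold` and `margin` values are arbitrary assumptions that would need tuning and testing on real data.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def best_match(candidate: str, reference: List[str],
               threshold: float = 0.85, margin: float = 0.15) -> Optional[str]:
    """Return the best reference entry, or None if the match is weak or ambiguous."""
    scored = sorted(
        ((SequenceMatcher(None, candidate, ref).ratio(), ref) for ref in reference),
        reverse=True,
    )
    if not scored or scored[0][0] < threshold:
        return None  # nothing is similar enough
    if len(scored) > 1 and scored[0][0] - scored[1][0] < margin:
        return None  # two candidates are too close to call; send for review
    return scored[0][1]


if __name__ == "__main__":
    roads = ["deal road", "dial road", "high street"]
    print(best_match("hihg street", roads))  # a clear winner
    print(best_match("deel road", roads))    # ambiguous between two roads
```

The misspelt “hihg street” resolves to “high street”, while “deel road” is rejected because “deal road” and “dial road” are both plausible. Rejected records can then be routed to enrichment or manual review rather than being geocoded to the wrong place.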


2 thoughts on “How important is quality data for matching?”

  1. Thanks for the article. I would like to caution against the blind use of fuzzy matching algorithms (Levenshtein and friends). Fuzzy matching is good for spelling mistakes but not systemic errors. For example, William and Bill cannot be matched using fuzzy matching, the same for University of Cape Town and Univ. CPT.

    In some cases, data matching cannot be resolved using only the value in the field in question. Using information from other fields in the record could be crucial. An obvious example is Paris, France vs Paris, Texas. We cannot resolve our city field without knowing anything about the country in question. This dependency might complicate the cleaning process.

    In short, fuzzy matching is a hammer; occasionally more delicate tools are needed.

  2. Pingback: Ask Dr Gary – Finding the perfect match! | Data Quality Matters
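The two points raised in the first comment can be sketched in Python. Everything here is hypothetical: the alias table, the place identifiers and the function names are made up for illustration and are not taken from any real reference data set.

```python
from typing import Optional

# Hypothetical alias table: systemic variations that string similarity cannot recover.
ALIASES = {
    "bill": "william",
    "univ. cpt": "university of cape town",
}

# Hypothetical place identifiers keyed on (city, country or state).
PLACES = {
    ("paris", "france"): "paris-fr",
    ("paris", "texas"): "paris-us-tx",
}


def canonical(value: str) -> str:
    """Map known aliases to a canonical form; leave anything else unchanged."""
    cleaned = value.strip().lower()
    return ALIASES.get(cleaned, cleaned)


def resolve_city(city: str, region: str) -> Optional[str]:
    """Resolve a city only when the accompanying field disambiguates it."""
    return PLACES.get((city.strip().lower(), region.strip().lower()))


if __name__ == "__main__":
    print(canonical("Bill") == canonical("William"))
    print(resolve_city("Paris", "France"), resolve_city("Paris", "Texas"))
    print(resolve_city("Paris", ""))
```

Known aliases are resolved by lookup rather than by similarity, and “Paris” on its own is left unresolved until another field supplies the missing context.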
