How important is quality data for matching?

At a recent GIS conference, I discussed the implications of poor-quality address data for use in spatial intelligence. Large organisations hold millions of address records that could provide valuable insight if modelled in a spatial intelligence platform.

To access this insight, these organisations must bridge the gap between the address data stored in their CRM, Billing or Supplier databases and the spatial layers stored in their GIS and reporting systems. This is done by assigning each address a longitude and latitude coordinate, or a polygon, to represent it – the process of geocoding.

Understand the roles of MDM, data integration, and data quality in your data strategy. Explore our article ‘MDM, Data Integration, or Data Quality: Making Sense of the Differences’ to gain insights into how each aspect contributes to optimizing your data ecosystem.

The Obstacle of Poor Address Data Quality

One significant obstacle to achieving accurate spatial intelligence is poor address data quality.

Address data quality is most commonly impacted by inaccuracies (such as misspelt place names or the wrong street type), by incompleteness (such as missing town names or postcodes) or by timeliness (for example, if a city or street name has changed, this may not be reflected in older data).

Challenges with Geocoding

However, even accurate data can present challenges for geocoding. Address data is frequently captured in free-format address fields, meaning that identifying individual elements such as the house number or street name can be difficult. Other elements can be represented in different formats: for example, 2nd Ave is the same street as Second Avenue. Typical addresses, even if accurate, are characterised by a lack of standards and consistency that hampers attempts to apply a spatial component to the data.
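As a rough illustration of the kind of standardisation that helps here, the sketch below maps street-type abbreviations and spelled-out ordinals onto a single form before any comparison takes place. The lookup tables and the `standardise_street` function are hypothetical and deliberately tiny; a real reference list would be far larger and locale-specific, and would need to deal with ambiguous tokens (for example, "St" as either Street or Saint).

```python
import re

# Hypothetical, deliberately small lookup tables for illustration only.
STREET_TYPES = {"ave": "avenue", "av": "avenue", "rd": "road", "st": "street"}
ORDINAL_WORDS = {
    "first": "1st", "second": "2nd", "third": "3rd",
    "fourth": "4th", "fifth": "5th",
}


def standardise_street(raw: str) -> str:
    """Return a lower-cased, token-normalised form of a street string."""
    # Split into alphanumeric tokens, dropping punctuation such as "Ave."
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    normalised = []
    for token in tokens:
        token = ORDINAL_WORDS.get(token, token)  # "second" -> "2nd"
        token = STREET_TYPES.get(token, token)   # "ave" -> "avenue"
        normalised.append(token)
    return " ".join(normalised)


if __name__ == "__main__":
    for street in ("2nd Ave", "Second Avenue", "2nd Ave."):
        print(street, "->", standardise_street(street))
```

Once both representations reduce to the same string, an exact comparison is enough, and fuzzy logic can be reserved for genuine spelling variation.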

Accurate Matching with Quality Data

Addressing these issues enables accurate matching of existing address data against a reference data set. At the end of my talk, I was asked why one couldn’t just use fuzzy matching.

Imprecise Matching and Unintended Consequences

Address validation is a good example of how imprecise matching based on inconsistent or incomplete data can have unintended consequences. The risk of fuzzy matching, if improperly applied to poor-quality data, is that of creating false positive matches.

For example, ask yourself whether “5th Avenue” is the same place as “5th Street”, or whether “Deal Road” is the same as “Dial Road”. In these cases, a poor use of fuzzy matching could validate or geocode an invalid location. Done en masse, this could severely impact the use of the data for location intelligence.
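To make the risk concrete, here is a minimal sketch that scores the two pairs above with Python's standard-library `difflib.SequenceMatcher`. This is only one of many possible similarity measures, and the 0.85 threshold is an assumption chosen for illustration, not a recommended setting.

```python
from difflib import SequenceMatcher

# Assumed cut-off for illustration; real thresholds must be tuned and tested.
THRESHOLD = 0.85


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_fuzzy_match(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    """Treat two strings as the same place if they are similar enough."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    pairs = [("5th Avenue", "5th Street"), ("Deal Road", "Dial Road")]
    for left, right in pairs:
        score = similarity(left, right)
        print(f"{left!r} vs {right!r}: {score:.2f} "
              f"match={is_fuzzy_match(left, right)}")
```

With this measure, “Deal Road” and “Dial Road” differ by a single character and score above the assumed threshold, so they would be accepted as the same road. “5th Avenue” and “5th Street” score lower, but a more permissive threshold would accept them too.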

Understanding the Implications of Fuzzy Matching

It’s important to recognize that the risks associated with fuzzy matching extend beyond address data.

For example, the serial numbers “#123”, “#000123”, “123” and “000123” may all indicate the same item, captured using different standards, or may be distinct serial numbers issued by different suppliers. If fuzzy matching is applied without an understanding of what you are looking at, and of the implications of your match approach, it can have severe consequences.
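The sketch below shows how much the outcome depends on the chosen match rule. Both key functions are hypothetical: `naive_key` assumes the “#” prefix and leading zeros carry no meaning, while `scoped_key` assumes serial numbers are only unique within a supplier. Which assumption holds is a business question that the code cannot answer.

```python
SERIALS = ["#123", "#000123", "123", "000123"]


def naive_key(serial: str) -> str:
    """Strip a leading '#' and leading zeros (assumes they are noise)."""
    return serial.lstrip("#").lstrip("0") or "0"


def scoped_key(supplier: str, serial: str) -> tuple:
    """Keep the raw serial and scope it to the supplier that issued it."""
    return (supplier, serial)


if __name__ == "__main__":
    # Under the naive rule, all four values collapse into one item.
    print({naive_key(s) for s in SERIALS})

    # Under the scoped rule, the same digits from two hypothetical
    # suppliers remain two distinct items.
    print({scoped_key("Supplier A", "#123"), scoped_key("Supplier B", "123")})
```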

Fuzzy Matching as a Tool

Fuzzy matching, when used in a controlled manner and applied to data of reasonable quality, is a powerful and proven tool. It can help manage spelling variations, missing information, and other discrepancies that might cause similar data to be considered unique by different systems.

Uncover the significance of data matching in maintaining data integrity and accuracy. Explore our article on Data matching to learn about its role in data governance and quality assurance.

The Importance of Understanding and Application

Like any tool, fuzzy matching must be thoroughly understood, applied sensibly, tuned, and tested. It should be part of an overall data improvement strategy that includes standardization and enrichment to achieve trusted results.
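One way to approach the tuning and testing step is to score a set of hand-labelled pairs at several candidate thresholds and count the errors at each. The sketch below does this with the standard-library `difflib.SequenceMatcher`. The labelled pairs are invented for illustration, and a real test set would need to be much larger and drawn from your own data.

```python
from difflib import SequenceMatcher

# Hypothetical labelled pairs: (candidate, reference, is_same_place).
LABELLED = [
    ("Deal Road", "Dial Road", False),
    ("5th Avenue", "5th Street", False),
    ("Secnd Avenue", "Second Avenue", True),
    ("Main Stret", "Main Street", True),
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def evaluate(threshold: float) -> tuple:
    """Return (false_positives, false_negatives) at the given threshold."""
    false_positives = false_negatives = 0
    for candidate, reference, same in LABELLED:
        predicted = similarity(candidate, reference) >= threshold
        if predicted and not same:
            false_positives += 1
        elif same and not predicted:
            false_negatives += 1
    return false_positives, false_negatives


if __name__ == "__main__":
    for threshold in (0.5, 0.8, 0.9, 0.99):
        fp, fn = evaluate(threshold)
        print(f"threshold={threshold}: false positives={fp}, "
              f"false negatives={fn}")
```

Lower thresholds let false positives through, and very high ones start rejecting genuine spelling variations. A threshold that looks clean on a toy set like this one should still be re-tested against real data, ideally after standardisation.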

Dive into our discussion on deterministic matching versus probabilistic matching to understand when each method is most effective.

By prioritizing data quality and employing the right tools and strategies, organizations can unlock the full potential of their data for accurate matching and spatial intelligence.

Responses to “How important is quality data for matching?”

Adi Eyal (@SoapSudTycoon)

June 18

Thanks for the article. I would like to caution against the blind use of fuzzy matching algorithms (Levenshtein and friends). Fuzzy matching is good for spelling mistakes but not systemic errors. For example, William and Bill cannot be matched using fuzzy matching; the same goes for University of Cape Town and Univ. CPT.

In some cases, data matching cannot be resolved using only the value in the field in question. Using information from other fields in the record could be crucial. An obvious example is Paris, France vs Paris, Texas. We cannot resolve our city field without knowing anything about the country in question. This dependency might complicate the cleaning process.

In short, fuzzy matching is a hammer; occasionally more delicate tools are needed.

Ask Dr Gary – Finding the perfect match! | Data Quality Matters

September 2

[…] For accurate matching we may want to improve our data consistency – for example, we may choose to teach our tool that “CO” is equivalent to “COMPANY” and that “NO” is equivalent to “NUMBER”. While data does not need to be exact, a level of quality needs to be in place to support accurate matching as discussed in the post How important is quality data for matching? […]