AI for Data Quality: A Powerful Tool, Not a Magic Wand

AI can help improve data quality, but it has limitations, particularly if it’s not explainable.

The promise of AI has swept through every industry, and data quality management is no exception. One area where AI is making waves is data matching – automatically identifying and linking similar records across datasets.

However, a recent vendor interaction highlights a crucial point: AI is a powerful tool, but it’s not a magic wand for data quality.

The Vendor Interaction

I accepted a call from a local data scientist wanting to show off his tool for matching product data. The tool certainly performed as described, but the data was not lifelike.

Unlike real data, where product descriptions are often captured in a single, free-format text field, the sample data set demonstrated was highly regular. Individual product attributes were consistently captured in specific fields and followed consistent standards.

When I asked some basic questions about how his engine would handle free-format text product data, or other data quality issues, he responded that we would need a project to manually standardise the data prior to matching.

Let’s delve into the opportunities and limitations of AI in data quality:

AI can be your data cleaning ally

Identifying Data Quality Issues: AI excels at identifying patterns and anomalies. Machine learning algorithms can analyse vast datasets to flag inconsistencies, missing values, and potential duplicates. This helps prioritize data cleansing efforts by focusing on areas with the biggest impact.

Example: Imagine an AI system analysing product descriptions and suggesting rules to standardize units of measurement (e.g., “12oz” vs “354ml”).
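To make this concrete, here is a minimal Python sketch of what such a standardization rule might look like once a human has approved it. The conversion table, the regular expression, and the function name normalize_volume_ml are illustrative assumptions, not taken from any particular tool.

```python
import re

# Hypothetical conversion factors to millilitres ("oz" is treated as US fluid ounces).
ML_PER_UNIT = {"ml": 1.0, "l": 1000.0, "oz": 29.5735}

# A number, optional whitespace, then a known unit, e.g. "12oz" or "354 ml".
VOLUME_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(ml|l|oz)\b", re.IGNORECASE)


def normalize_volume_ml(description):
    """Return the first volume found in the text, in whole millilitres, or None."""
    match = VOLUME_PATTERN.search(description)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).lower()
    return round(value * ML_PER_UNIT[unit])


print(normalize_volume_ml("Cola 12oz can"))   # 355
print(normalize_volume_ml("Cola 354ml can"))  # 354
```

Note that 12 US fluid ounces converts to roughly 355 ml, so the two values in the example above would only be treated as equal under a small tolerance. Deciding on that tolerance is exactly the kind of judgement a human expert still has to make.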

Proposing Data Quality Rules: AI can analyse clean data sets and suggest rules for maintaining data quality. These rules might cover data formatting, value ranges, and identification of suspicious patterns.

Example: An AI system might analyse a customer database and suggest a rule to ensure all email addresses follow a standard format (e.g., [email address removed]).
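A rule like this is easy to express in code once it has been accepted. The sketch below is a deliberately simple illustration: the pattern, the "email" field name, and the function find_invalid_emails are assumptions, and the pattern is far looser than a full standards-compliant validator.

```python
import re

# Deliberately simple: something@something.tld with no spaces.
# This is an illustrative assumption, not a complete email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def find_invalid_emails(records):
    """Return the records whose 'email' value is missing or fails the pattern."""
    return [r for r in records if not EMAIL_PATTERN.match(r.get("email") or "")]


customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "bad-address"},
    {"id": 3, "email": None},
]
print([r["id"] for r in find_invalid_emails(customers)])  # [2, 3]
```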

However, be wary of:

Correlation vs. Causation: AI algorithms can identify correlations in data, but they often struggle to differentiate between correlation and causation. This can lead to proposing large numbers of possible data quality rules based on spurious relationships, each of which must be checked and discarded by an expert. This can be overwhelming.

Example: AI might suggest that all customer records with a specific zip code have a higher probability of containing an error. However, this could be due to a temporary data entry issue in that zip code, not a general rule. Human expertise is needed to understand the context and validate these correlations.

Garbage In, Garbage Out: AI relies heavily on the quality of the data it’s trained on. Poor-quality data, like the free-text descriptions I raised with the vendor, can ironically lead the AI to suggest inaccurate or irrelevant data quality rules.

Example: Matching product descriptions can be challenging for AI. Descriptions like “Red Running Shoe, Men’s Size 10” and “Running Shoe (Crimson), Men’s (US 10)” require understanding synonyms and interpreting context, which can be difficult for AI.
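The sketch below shows one simple, non-AI way to compare the two descriptions: reduce each to a set of normalized tokens and measure the overlap (Jaccard similarity). The synonym table and noise-word list are hypothetical and hand-built, which is the point: someone has to supply that knowledge, whether as lookups like these or as training data.

```python
import re

# Hypothetical, hand-maintained lookups; real catalogues need far larger ones.
SYNONYMS = {"crimson": "red", "mens": "men"}
NOISE_WORDS = {"s", "size", "us"}


def tokens(description):
    """Lowercase, split on non-alphanumerics, map synonyms, drop noise words."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return {SYNONYMS.get(w, w) for w in words if w not in NOISE_WORDS}


def jaccard(a, b):
    """Share of distinct tokens the two descriptions have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


print(jaccard("Red Running Shoe, Men's Size 10",
              "Running Shoe (Crimson), Men's (US 10)"))  # 1.0
```

Remove the "crimson" entry from the synonym table and the score drops, which illustrates how much the result depends on prior standardization.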

Matching Made (Mostly) Easy with AI:

Data Matching Powerhouse: AI excels at pattern recognition, making it a valuable tool for “fuzzy” data matching.

For example, we have used AI systems to match customer and product records across different databases in spite of name variations and missing information. One customer produced wine amongst other products, and we found multiple spelling variations of popular varietals like Cabernet Sauvignon that had to be managed to identify duplicates.
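As a rough illustration of the idea (not the system we used), the following Python sketch maps misspelt varietal names to a canonical list using the standard library's difflib. The reference list, the 0.8 cutoff, and the function name canonical_varietal are assumptions chosen for the example.

```python
from difflib import get_close_matches

# Hypothetical reference list of canonical varietal names.
CANONICAL_VARIETALS = ["Cabernet Sauvignon", "Sauvignon Blanc", "Chardonnay", "Merlot"]


def canonical_varietal(raw, cutoff=0.8):
    """Return the closest canonical name, or None if nothing is similar enough."""
    lookup = {name.lower(): name for name in CANONICAL_VARIETALS}
    hits = get_close_matches(raw.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


print(canonical_varietal("Cabernet Sauvingon"))  # Cabernet Sauvignon
print(canonical_varietal("Cab Sav"))             # None
```

Character-level similarity copes with transposed or dropped letters, but an abbreviation such as "Cab Sav" falls below the cutoff and would need an explicit alias list.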

And, consider the importance of:

Explainable AI: In data matching scenarios, understanding why a particular match is made is crucial. Black-box AI models can struggle with explainability, making it difficult to trust the results and hindering debugging efforts for similar records with different outcomes.

Example: An AI matching engine might suggest that two customer records are a match based on a similar email address with only one character difference. An explainable AI (XAI) approach can explain this decision by highlighting the weight assigned to the email address field compared to other fields like name or address. Without understanding the reasoning behind the match, it’s difficult to assess its validity and improve the matching logic.
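One simple way to make a match explainable is to return the per-field contributions alongside the overall score. The sketch below assumes a transparent weighted-similarity scorer rather than a trained model; the field names, the weights, and the function explain_match are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; in practice these would be tuned or learned.
FIELD_WEIGHTS = {"email": 0.5, "name": 0.3, "address": 0.2}


def explain_match(record_a, record_b):
    """Return (total_score, per-field contributions) for two record dicts."""
    contributions = {}
    for field, weight in FIELD_WEIGHTS.items():
        a = str(record_a.get(field) or "").lower()
        b = str(record_b.get(field) or "").lower()
        # A field missing on either side contributes nothing to the score.
        similarity = SequenceMatcher(None, a, b).ratio() if a and b else 0.0
        contributions[field] = round(weight * similarity, 3)
    return round(sum(contributions.values()), 3), contributions


score, why = explain_match(
    {"email": "jo.smith@example.com", "name": "Jo Smith"},
    {"email": "jo.smyth@example.com", "name": "Robert Brown"},
)
print(score, why)
```

Here the breakdown makes it obvious that the email field is doing almost all the work, so a reviewer can question whether a one-character difference in an email address should outweigh a completely different name.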

Additional Considerations:

Data Drift: AI models trained on a specific dataset might not perform well when the data changes significantly. In effect, this may mean that AI matching models stop working, or start creating false positive matches. Regular retraining and monitoring are essential to ensure the accuracy of AI-driven data quality solutions.
Bias Detection: AI algorithms can perpetuate existing biases in data. It’s crucial to choose unbiased training data and monitor for biased outcomes.
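Monitoring for drift does not have to be elaborate to be useful. As a minimal sketch, the code below compares the share of candidate pairs scored as matches in a recent batch against a baseline and raises a flag when it shifts too far. The 0.8 match threshold, the 0.10 tolerance, and both function names are assumptions for illustration; a production setup would likely use a more robust statistic.

```python
def match_rate(scores, threshold=0.8):
    """Fraction of candidate pairs scored at or above the match threshold."""
    return sum(s >= threshold for s in scores) / len(scores) if scores else 0.0


def drift_alert(baseline_scores, recent_scores, tolerance=0.10):
    """True when the match rate moved by more than `tolerance` from the baseline."""
    shift = abs(match_rate(recent_scores) - match_rate(baseline_scores))
    return shift > tolerance


baseline = [0.90, 0.85, 0.40, 0.95]  # 3 of 4 pairs matched
recent = [0.90, 0.30, 0.20, 0.10]    # 1 of 4 pairs matched
print(drift_alert(baseline, recent))  # True
```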

The Way Forward:

AI holds tremendous potential for improving data quality, but it should be viewed as a collaborative tool, not a replacement for human expertise. Organizations should focus on:

Human-in-the-Loop AI: Leverage AI to identify potential issues and suggest data quality rules, but involve humans in the decision-making process.
Data Governance: Establish clear data governance frameworks to ensure the quality and consistency of data used to train AI models.
Continuous Monitoring and Improvement: Regularly monitor AI performance, review proposed data quality rules, and retrain models as needed.
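In practice, human-in-the-loop matching often comes down to routing: accept confident matches automatically, reject clear non-matches, and send the ambiguous middle band to a person. The following sketch illustrates that pattern; the threshold values and the function name route_match are assumptions and would need tuning against real data.

```python
def route_match(score, auto_accept=0.95, auto_reject=0.60):
    """Decide what happens to a candidate match based on its similarity score."""
    if score >= auto_accept:
        return "accept"
    if score < auto_reject:
        return "reject"
    return "review"  # the ambiguous band goes to a human data steward


for s in (0.97, 0.80, 0.50):
    print(s, route_match(s))
```

Widening the review band increases manual effort but reduces the risk of false positive matches slipping through unchecked.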

By thoughtfully integrating XAI with human oversight, organizations can unlock the true power of data quality and reap the benefits of clean, consistent, and trustworthy information.