We have dirty data because we don’t do the work


Last week I came across Ganes Kesari’s interview with Dr Tom Redman discussing misconceptions about data quality and how these hurt organisations. Dr Redman is well known as a thought leader in the data quality space and offers a number of courses in our data quality training program.

“Everyone wants to do the model work, but not the data work.”

In the video, Tom discusses a Google research paper published last year, which finds that data is the most undervalued asset in high-stakes Artificial Intelligence. Both the paper and the interview discuss the negative downstream effects of data quality issues on AI outcomes – many of which are preventable but are triggered by conventional AI/ML practices that undervalue data quality. As discussed previously, for AI and Machine Learning the differentiator is quality data.

Tom argues that data preparation and data quality training are imperatives for data scientists.

Tom discusses the broad definition of quality when applied to AI – which includes understanding data’s context and relationships, biases and integrity, all of which must be understood for AI/ML models to be reliable.

“As long as there is an out from tackling the problem correctly then people and companies do not!”

Tom also speaks to the trap of believing that AI tools will magically solve the data integrity problem.

He suggests that these “magic wands” give management an out – “we don’t have to get people involved and tackle the problem properly.”

He argues that data quality is, first and foremost, a managerial problem that requires the right people to be involved with the right support.

Technology, he points out, is wonderful for increasing scale and decreasing unit costs, but if the basic processes aren’t in place, it just makes things worse.

People are the key to getting anywhere in the data space.

Tom suggests that ordinary people – the ones without data in their titles – are key to getting value from data. These are the people who capture data, the customers or users of data, the collaborators in data science efforts and so on. Tom explores the trend of engaging large numbers of people in data management, and how the impact of the pandemic has helped to raise awareness of data issues in our everyday lives.

I invite you to listen to the interview. Tom covers a few more points and goes into more depth than I have done here. The video is around 12.5 minutes of easy listening.

You can also download our Precisely whitepaper, Six Steps to Overcoming Data Pitfalls Impacting Your AI and Machine Learning Success.

2 thoughts on “We have dirty data because we don’t do the work”

  1. What a wonderful interview! Especially near the end where Tom calls for the involvement of ordinary people.
