We have dirty data because we don’t do the work

Last week I came across Ganes Kesari’s interview with Dr Tom Redman discussing misconceptions about data quality and how these hurt organisations. Dr Redman is well-known as a thought leader in the data quality space and offers a number of courses in our CIMP data quality training program.

“Everyone wants to do the model work, but not the data work.”

In the video, Tom discusses the findings of a Google research paper published last year, which found that data is the most undervalued asset in high-stakes Artificial Intelligence. The paper and the interview both examine the negative downstream effects of data quality issues on AI outcomes, many of which are preventable but are triggered by conventional AI/ML practices that undervalue data quality. As discussed previously, quality data is the differentiator for AI and Machine Learning.

Tom argues that data preparation and data quality training are imperatives for data scientists.

Tom discusses the broad definition of quality when applied to AI – which includes understanding data’s context and relationships, biases and integrity, all of which must be understood for AI/ML models to be reliable.

“As long as there is an out from tackling the problem correctly then people and companies do not!”

Tom also speaks to the trap of believing that AI tools will magically solve the data integrity problem.

He suggests that these “magic wands” give management an out – “we don’t have to get people involved and tackle the problem properly.”

He argues that data quality is, first and foremost, a managerial problem that requires the right people to be involved with the right support.

Technology, he points out, is wonderful to increase scale and decrease unit costs, but if the basic processes aren’t in place it just makes things worse.

People are the key to getting anywhere in the data space.

Tom suggests that ordinary people, the ones without data in their titles, are key to getting value from data. These are the people who capture data, the customers or users of data, the collaborators in data science efforts, and so on. Tom explores the trend of engaging large numbers of people in data management, and how the pandemic has helped to raise awareness of data issues in our everyday lives.

I invite you to listen to the interview. Tom covers a few more points and goes into more depth than I have done here. The video is around 12.5 minutes of easy listening.

How to Define Data Quality

Defining data quality is essential for ensuring that data meets the needs and expectations of users and stakeholders.

Explore the intricacies of defining data quality by considering factors such as accuracy, completeness, consistency, and timeliness.

By establishing clear criteria and standards for data quality, organizations can effectively assess, measure, and improve the reliability and usefulness of their data assets.
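
As a rough illustration of turning such criteria into measurements, the sketch below scores two of the dimensions named above, completeness and timeliness, over a handful of records. The function names, the field names, and the sample customer records are all assumptions made up for this example; a real assessment would use your own fields and thresholds.

```python
from datetime import date, timedelta


def completeness(records, field):
    """Fraction of records where `field` is present and not empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def timeliness(records, field, as_of, max_age_days):
    """Fraction of records whose date in `field` is no older than max_age_days."""
    if not records:
        return 0.0
    cutoff = as_of - timedelta(days=max_age_days)
    fresh = sum(
        1 for r in records
        if r.get(field) is not None and r[field] >= cutoff
    )
    return fresh / len(records)


# Hypothetical sample data, for illustration only.
customers = [
    {"email": "a@example.com", "updated": date(2024, 1, 10)},
    {"email": "", "updated": date(2023, 1, 5)},
    {"email": None, "updated": date(2024, 1, 2)},
    {"email": "d@example.com", "updated": None},
]

print("email completeness:", completeness(customers, "email"))
print("30-day timeliness:", timeliness(customers, "updated", date(2024, 1, 15), 30))
```

Scores like these only become useful once stakeholders agree what level is acceptable for each field, which is the managerial work Tom refers to.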

Excuses for Bad Data

Despite the importance of data quality, organizations often encounter challenges and excuses for the presence of bad data.

Explore the top excuses for bad data and the implications they have on decision-making, operational efficiency, and customer satisfaction.

From human error to outdated systems, understanding and addressing these excuses is essential for fostering a culture of data quality and accountability within organizations.

How to Plan for Data Quality

Planning for data quality involves defining objectives, identifying stakeholders, and implementing processes to ensure that data meets predefined standards.

Discover valuable insights and lessons from the Rugby World Cup for planning data quality. By leveraging best practices and strategies for data governance, organizations can mitigate risks, improve data reliability, and drive better business outcomes.

4 Steps to Data Quality

Achieving data quality requires a systematic approach that encompasses assessment, improvement, and maintenance processes.

Explore the four fundamental steps to data quality to understand how to identify issues, implement corrective measures, and sustain data quality improvements over time.

From data profiling to data cleansing and monitoring, following these steps enables organizations to unlock the full potential of their data assets.

How to Ensure Accurate Data Entry

Accurate data entry is crucial for maintaining data integrity and reliability. Discover strategies and techniques for ensuring accurate data entry, including the importance of consistency and speed of execution.

Explore valuable insights for ensuring accurate data entry to minimize errors, improve efficiency, and enhance the overall quality of data captured.

How Does Poor Data Verification Impact Accuracy?

Poor data verification processes can have significant implications for data accuracy and reliability.

Delve into the challenges and consequences of inadequate data verification practices, and explore strategies for mitigating risks and ensuring data accuracy.

Understand how poor data verification impacts accuracy and the importance of implementing robust verification protocols to maintain data quality and trustworthiness.

The Importance of Data Standards

Data standards play a critical role in ensuring consistency, interoperability, and usability across diverse systems and applications.

Explore the importance of data standards and their impact on data quality, governance, and integration efforts. Gain valuable insights into data standards and how adhering to established standards enables organizations to exchange data seamlessly, reduce errors, and improve overall data quality.

How to Use Drop Down Lists Effectively

Drop-down lists are a valuable tool for improving data entry accuracy and consistency. Explore best practices for designing and implementing drop-down lists effectively to enhance data quality.

Discover practical tips for using drop-down lists to streamline data entry processes, reduce errors, and improve the overall quality and reliability of data captured.
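
The idea behind a drop-down list is that a field accepts only values from an agreed list. A minimal sketch of the same check applied to data that has already been captured as free text might look like this; the list of provinces and the function name are hypothetical and chosen only for illustration.

```python
# Hypothetical controlled list, as it might appear in a drop-down.
ALLOWED_PROVINCES = {"Gauteng", "Western Cape", "KwaZulu-Natal"}


def invalid_values(values, allowed):
    """Return the distinct captured values that are not in the allowed list."""
    return sorted({v for v in values if v not in allowed})


# Free-text capture tends to produce several spellings of the same thing.
captured = ["Gauteng", "gauteng", "GP", "Western Cape", "GP"]
print(invalid_values(captured, ALLOWED_PROVINCES))
```

Every value this check reports is one that a drop-down would have prevented at the point of capture.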

How to Ensure Quality, Integrity, and Consistency Across Diverse Sources and Systems

Maintaining quality, integrity, and consistency across diverse data sources and systems presents unique challenges for organizations.

Explore strategies and techniques for ensuring data quality across heterogeneous environments. Learn how to implement data governance frameworks, establish data quality metrics, and leverage technology solutions to ensure quality, integrity, and consistency.

How to Achieve Data Integration

Data integration is essential for consolidating disparate data sources and enabling seamless access and analysis.

Explore strategies and best practices for achieving data integration to enhance data quality and decision-making capabilities.

Learn how to design robust integration processes, implement data standards, and leverage integration technologies effectively.

How Does Data Drift Affect Your Machine Learning Model?

Data drift refers to changes in the statistical properties of data over time, which can impact the performance and accuracy of machine learning models.

Explore the implications of data drift on model performance and learn strategies for detecting and mitigating its effects.

Understand how data drift affects machine learning models and the importance of continuous monitoring and adaptation to ensure model effectiveness and reliability.
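
One very simple way to monitor for drift is to compare a summary statistic of new data against the data the model was trained on. The sketch below measures how far the mean of a feature has moved, expressed in standard deviations of the reference data. This is only one of many possible checks, and the function names, the threshold of 2.0, and the sample values are assumptions for illustration rather than a recommended standard.

```python
from statistics import mean, stdev


def mean_shift(reference, current):
    """Distance between the two means, in reference standard deviations."""
    spread = stdev(reference)
    if spread == 0:
        raise ValueError("reference data has no variation")
    return abs(mean(current) - mean(reference)) / spread


def has_drifted(reference, current, threshold=2.0):
    """Flag drift when the mean has moved more than `threshold` deviations."""
    return mean_shift(reference, current) > threshold


# Hypothetical feature values: training data versus two later batches.
training = [10, 12, 11, 13, 9, 11, 10, 12]
stable_batch = [11, 10, 12, 11]
shifted_batch = [18, 19, 17, 20]

print("stable batch drifted:", has_drifted(training, stable_batch))
print("shifted batch drifted:", has_drifted(training, shifted_batch))
```

A check on the mean will miss changes in the shape or spread of the data, so in practice it would be one signal among several.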

What Is the Impact of Bias on Machine Learning?

Bias in machine learning refers to systematic errors or prejudices in the training data that can lead to unfair or inaccurate predictions.

Delve into the impact of bias on machine learning models and learn strategies for identifying and mitigating bias.

Explore valuable insights into the impact of bias on machine learning and the importance of fairness, transparency, and accountability in developing and deploying machine learning algorithms.
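
A first step in identifying bias in training data is often simply to compare outcomes across groups. The sketch below computes the rate of positive labels per group in a small, invented loan dataset. The field names, the groups, and the records are hypothetical, and a gap between groups is a prompt to investigate rather than proof of unfairness.

```python
from collections import defaultdict


def positive_rate_by_group(rows, group_key, label_key):
    """Share of rows with a truthy label, calculated separately per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row[label_key]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


# Hypothetical training records, for illustration only.
loans = [
    {"group": "A", "approved": True},
    {"group": "A", "approved": True},
    {"group": "A", "approved": True},
    {"group": "A", "approved": False},
    {"group": "B", "approved": True},
    {"group": "B", "approved": False},
    {"group": "B", "approved": False},
    {"group": "B", "approved": False},
]

rates = positive_rate_by_group(loans, "group", "approved")
print(rates)
print("gap between groups:", max(rates.values()) - min(rates.values()))
```

A model trained on records like these is likely to reproduce the gap unless someone who understands the context of the data decides whether it is legitimate.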

You can also download our Precisely whitepaper, “Six Steps to Overcoming Data Pitfalls Impacting Your AI and Machine Learning Success”.

Conclusion

Data quality is essential for decision making and AI. A plan to “do the work” should form part of your data quality strategy.

Responses to “We have dirty data because we don’t do the work”

Warwick Taylor

February 21

What a wonderful interview! Especially near the end where Tom calls for the involvement of ordinary people.
