Artificial intelligence and machine learning are the new “big data” – the hottest topics in analytics and decision making.
The premise – that computers that can think and learn like humans can replace humans for many tasks.
Projects like driverless cars; the Internet of Things (IoT), where machines and devices communicate with each other to get things done without human involvement; and investment advising are some examples of the pervasive reach and scope of artificial intelligence and machine learning.
In an article written for Forbes in 2016, Bernard Marr spoke of AI as a revolution that will change everything about the way we produce, manufacture and deliver.
At the same time, he spoke of some of the very real dangers that AI presents – including legal and ethics concerns along with the dangers of plunging headfirst into AI without a clear plan and business case.
Very briefly, he touched on the reality that the fully autonomous, AI-powered, human-free industrial operation is still some way off.
Leadership in AI requires leadership in data quality
In a 2017 survey, over half of the 179 data scientists interviewed cited poor data quality as the biggest challenge hindering AI progress.
Big data is full of holes – missing, inconsistent and downright incorrect data that skews results – and lacking in metadata, the information about the data that gives it context and makes it usable.
This means that data scientists typically spend more than 80% of their time on data engineering tasks – cleaning and preparing data to make it usable.
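To give a flavour of what that cleaning work looks like in practice, here is a minimal sketch in plain Python. The records and field names are hypothetical; the defects mirror the ones described above – missing values, inconsistent formats, duplicates and outright incorrect entries:

```python
# Hypothetical customer records with typical big-data defects.
records = [
    {"customer_id": "001", "age": "34", "country": "UK"},
    {"customer_id": "002", "age": "",   "country": "uk"},    # missing age
    {"customer_id": "003", "age": "-5", "country": "U.K."},  # impossible age
    {"customer_id": "003", "age": "41", "country": "UK"},    # duplicate id
]

def clean(rows):
    seen, out = set(), []
    for row in rows:
        if row["customer_id"] in seen:      # drop duplicate keys
            continue
        seen.add(row["customer_id"])
        # Parse age defensively; reject missing or impossible values.
        age = int(row["age"]) if row["age"].lstrip("-").isdigit() else None
        if age is not None and not (0 < age < 120):
            age = None
        # Normalise inconsistent country codes ("uk", "U.K." -> "UK").
        country = row["country"].replace(".", "").upper()
        out.append({"customer_id": row["customer_id"],
                    "age": age, "country": country})
    return out

cleaned = clean(records)
```

Even this toy example needs a policy decision per defect (drop, null out, or normalise) – multiplied across hundreds of fields, that is where the 80% goes.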
The challenges of managing data preparation and data quality at scale are exacerbated by the fact that big data sets increasingly combine internal data (such as customer, interaction, product and machine logs), over which we have at least nominal control, with external, publicly available data – such as state-sponsored demographics (census data, employment data, etc.) – over which we have little or no control.
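Combining an internal data set with an external one we cannot correct is typically a left join that has to tolerate unmatched keys. A minimal sketch, with entirely hypothetical field names and figures, might look like this:

```python
# Internal customer records keyed by a region code we share with an
# external census table. All names and values are illustrative.
internal = [
    {"customer_id": "001", "region": "NW"},
    {"customer_id": "002", "region": "SE"},
    {"customer_id": "003", "region": "ZZ"},  # code absent from census data
]
census = {"NW": {"median_income": 31000},
          "SE": {"median_income": 35000}}

def enrich(rows, external):
    """Left-join rows to the external table, flagging misses rather than
    dropping them -- we cannot fix the external data, only record the gap."""
    out = []
    for row in rows:
        match = external.get(row["region"])
        out.append({**row,
                    "median_income": match["median_income"] if match else None,
                    "matched": match is not None})
    return out

enriched = enrich(internal, census)
unmatched = [r["customer_id"] for r in enriched if not r["matched"]]
```

The `matched` flag is the important design choice: because the external source is outside our control, gaps are surfaced to downstream users instead of silently discarded.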
What can we do?
- Recognise that not all data is equal (or valuable). Governed data catalogs allow data scientists (and other stakeholders) both to find and understand the context of data and to rate the data in terms of its value and the level of trust for a particular purpose.
- Invest in automating and simplifying data preparation. Data scientists need to spend less time writing code to deliver quality data and more time building AI models.
- Data preparation must include traceability – allowing data scientists and decision makers to understand where data used for models comes from and any changes that have been made to it to adapt it for the model.
- Data quality must be delivered at big data scale – ideally within the Hadoop or Spark cluster – to ensure that AI models are working with the best possible foundation for sound learning and decision making.
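The preparation and traceability points above can be sketched together in a few lines: a toy pipeline that applies each cleaning step through a wrapper recording lineage, so anyone downstream can see where the data came from and which transformations were applied. This is an illustration of the idea, not a real catalog or lineage API – the step names and source label are invented:

```python
# Toy data-preparation pipeline with built-in traceability: every step
# runs through a wrapper that appends a lineage entry.
def run_pipeline(data, source_name, steps):
    lineage = [f"source: {source_name}"]
    for name, fn in steps:
        data = fn(data)
        lineage.append(f"applied: {name}")
    return data, lineage

raw = [" Alice ", "BOB", None, "carol"]  # hypothetical source data
steps = [
    ("drop_missing",      lambda xs: [x for x in xs if x is not None]),
    ("strip_whitespace",  lambda xs: [x.strip() for x in xs]),
    ("normalise_case",    lambda xs: [x.title() for x in xs]),
]
names, lineage = run_pipeline(raw, "crm_export_2017-03", steps)
```

Because the lineage travels with the output, a decision maker questioning a model's input can answer "where did this come from and what was done to it?" without rereading the preparation code.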