A few weeks back I wrote about the 6 dimensions of big data quality:
- Coverage – How well does the data source meet (or fail to meet) the business need?
- Continuity – How well does the data set cover all expected or needed intervals?
- Triangulation – How consistent is data when measured from related points of reference?
- Provenance – Can we validate where the data came from, who gathered it, and what criteria were used to create it?
- Transformation from origin – How has the data changed from its point of origin and how does this affect its accuracy?
- Repetition – Is data from multiple sources identical, which may indicate tampering?
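Two of these dimensions, continuity and repetition, lend themselves to simple automated checks. What follows is a minimal sketch rather than a definitive implementation: the function names `missing_days` and `repeated_records`, the daily interval, and the sample records are all my own assumptions for illustration.

```python
from datetime import date, timedelta


def missing_days(observed, start, end):
    """Return every date in [start, end] that has no record (a continuity gap)."""
    seen = set(observed)
    span = (end - start).days
    all_days = (start + timedelta(days=i) for i in range(span + 1))
    return [day for day in all_days if day not in seen]


def repeated_records(source_a, source_b):
    """Return records that appear verbatim in both sources (a repetition signal)."""
    return sorted(set(source_a) & set(source_b))


# Hypothetical sample data, assumed for illustration only.
observed = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4)]
gaps = missing_days(observed, date(2024, 1, 1), date(2024, 1, 5))

source_a = [("cust-1", 120.0), ("cust-2", 75.5)]
source_b = [("cust-2", 75.5), ("cust-3", 10.0)]
dupes = repeated_records(source_a, source_b)

print("Missing days:", gaps)
print("Identical records in both sources:", dupes)
```

An identical record in two supposedly independent sources is not proof of tampering, but it is a cheap signal that the record deserves a closer look.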
A few days later I read a post by Dr Thomas Redman in Harvard Business Review talking about the link between quality data and machine learning.
His article makes the point that poor data quality hits machine learning twice – “first in the historical data used to train the predictive model and second in the new data used by that model to make future decisions.”
Predictive models are highly dependent on the quality of the historical data used to train the model.
It's like teaching basic math – if I teach my class that “1 + 1 = 3”, every mathematical problem that they try to solve will come out wrong – even if every other step is correct.
For this reason, good data scientists spend up to 80% of their time cleansing data before it is used to build a model.
Similarly, as the model adapts and learns from new (operational) data coming in, this data can skew the outcome of the analysis if it is not also of high quality.
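One rough way to watch incoming operational data is to compare it with the data the model was trained on. The sketch below is only an illustration under my own assumptions: the `mean_shift` helper, the sample values, and the threshold of 3 standard deviations are all hypothetical, and a real pipeline would likely use a more robust drift test.

```python
from statistics import mean, stdev


def mean_shift(training_values, incoming_values):
    """Distance of the incoming mean from the training mean,
    expressed in training standard deviations."""
    baseline = mean(training_values)
    spread = stdev(training_values)
    return (mean(incoming_values) - baseline) / spread


# Hypothetical values for one numeric feature, assumed for illustration only.
training = [48, 50, 52, 50, 50]
incoming = [60, 62, 58, 60]

shift = mean_shift(training, incoming)

# Assumed threshold: flag the batch when the shift exceeds 3 standard deviations.
needs_review = abs(shift) > 3.0

print(f"Shift: {shift:.2f} standard deviations, needs review: {needs_review}")
```

A flagged batch does not have to be rejected, but it should be reviewed before the model is allowed to learn from it.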
How does bias affect your model?
Bias can creep into a model in a number of ways!
- It can be explicit – if I only want to target rich people, my model may only look at data pertaining to households earning more than $200,000 per annum. This may mean that the model is inaccurate for predicting behavior for households where income falls below that amount.
- It may be based on historical bias. If human behavior or choices are a factor in the model then it may continue to deliver results in line with existing biases. For example, women may not be selected for engineering degrees if gender was (consciously or unconsciously) used as a criterion for accepting applications in the past. This bias will be built into historical data, which will in turn teach the model that women should be less likely to be accepted.
- It may be inherent in the data due to incompleteness or inaccuracy. Data that ignores a certain time period, a specific demographic, or a possible touch point or outcome can drive inaccurate results. The difference between this point and the two above is that these biases may be harder to spot.
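Historical bias of the kind described above can sometimes be surfaced by comparing outcomes across groups before the data is used for training. The following is a minimal sketch, assuming each historical record is a dictionary; the `acceptance_rate_by_group` helper, the field names, and the sample records are hypothetical and chosen only to mirror the admissions example.

```python
from collections import defaultdict


def acceptance_rate_by_group(records, group_key, outcome_key):
    """Return the share of positive outcomes for each group in the data."""
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for row in records:
        totals[row[group_key]] += 1
        if row[outcome_key]:
            accepted[row[group_key]] += 1
    return {group: accepted[group] / totals[group] for group in totals}


# Hypothetical historical admissions data, assumed for illustration only.
history = [
    {"gender": "F", "accepted": False},
    {"gender": "F", "accepted": True},
    {"gender": "F", "accepted": False},
    {"gender": "F", "accepted": False},
    {"gender": "M", "accepted": True},
    {"gender": "M", "accepted": True},
    {"gender": "M", "accepted": False},
    {"gender": "M", "accepted": True},
]

rates = acceptance_rate_by_group(history, "gender", "accepted")

# Gap between the most and least favoured group.
gap = max(rates.values()) - min(rates.values())

print("Acceptance rates:", rates)
print("Largest gap between groups:", gap)
```

A large gap does not prove that the historical decisions were biased, but it does show what a model trained on this data is likely to learn, and it is a prompt to investigate before building the model.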
It may be nearly impossible to spot and manage bias in data feeding a predictive model, but the consequences can be significant, particularly when the outcomes of our model confirm our own existing biases.
Removing bias from existing decision-making processes – such as a loan application – can be nearly impossible, but these biases are almost certain to skew existing data, and to drive false outcomes from the model.
An awareness that bias may exist, and robust attempts to mitigate it, are critical to successful predictive analytics.