Is “Bias” the 7th big data quality metric?

Explore the impact of bias as the 7th dimension of big data quality. Learn how bias affects machine learning outcomes, perpetuates societal biases, and discover strategies to address bias for fair and equitable machine learning systems.



A few weeks back I wrote about the 6 dimensions of big data quality.

These are:

  1. Coverage – How well does the data source meet (or fail to meet) the business need?
  2. Continuity – How well does the data set cover all expected or needed intervals?
  3. Triangulation – How consistent is data when measured from related points of reference?
  4. Provenance – Can we validate where the data came from, who gathered it, and what criteria were used to create it?
  5. Transformation from origin – How has the data changed from its point of origin and how does this affect its accuracy?
  6. Repetition – Is data from multiple sources identical, indicating potential copying or tampering?

A few days later I read a post by Dr Thomas Redman in Harvard Business Review talking about the link between quality data and machine learning.

His article makes the point that poor data quality hits machine learning twice – “first in the historical data used to train the predictive model and second in the new data used by that model to make future decisions.”


Machine learning depends on quality data

Predictive models are highly dependent on the quality of the historical data used to train the model.

It’s like teaching basic math.

If I teach my class that “1 + 1 = 3”, every mathematical problem that they try to solve will come out wrong – even if every other step is correct.

For this reason, good data scientists spend up to 80% of their time cleansing data before it is used to build a model.

Similarly, as the model adapts and learns from new (operational) data coming in, this data can skew the outcome of the analysis if it is not also of high quality.
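
As a rough illustration, here is a minimal sketch (in Python, using pandas) of the kind of mechanical quality checks that might run over both a training extract and new operational data before either reaches a model. The column names and values are purely hypothetical, and real cleansing pipelines go much further than this.

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame) -> dict:
    """Summarise simple, mechanical quality issues before data reaches a model."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        # Share of missing values per column
        "missing_share": df.isna().mean().round(3).to_dict(),
        # Numeric columns with impossible negative values (assumes amounts cannot be negative)
        "negative_values": {c: int((df[c] < 0).sum()) for c in df.select_dtypes("number").columns},
    }

# Hypothetical training extract for a credit model, with deliberate problems
train = pd.DataFrame({
    "income": [52000, 61000, None, -1, 75000],
    "approved": [1, 0, 1, 0, 1],
})
print(basic_quality_report(train))
```

The same report can be run on each batch of incoming operational data, so that degradation in the feed is caught before it quietly skews the model’s decisions.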

How does bias affect your model?

Bias can have significant consequences for machine learning, affecting the fairness, accuracy, and overall performance of models. In this context, bias refers to a systematic and consistent deviation of the predictions or decisions made by machine learning algorithms from the true values or desired outcomes.

Here are some examples of how bias can impact machine learning (a short sketch of how such disparities can be measured follows the list):

  1. Bias in Training Data: Machine learning models learn from historical data, and if the training data contains bias, the models may learn and perpetuate that bias. For instance, if a hiring model is trained on historical data that reflects gender or racial biases in past hiring decisions, it may inadvertently discriminate against certain groups when making predictions.
  2. Algorithmic Bias: Bias can be introduced through the algorithm design itself. For example, if a credit scoring model uses features that are correlated with race, such as postal codes or names, it may lead to discriminatory outcomes. Even if race is not explicitly included as a feature, the model can indirectly capture the bias due to these correlated factors.
  3. Lack of Representation: Biased outcomes can occur when certain groups or minority classes are underrepresented in the training data. For instance, facial recognition systems have shown biases in correctly identifying people of colour due to a lack of diverse training examples.
  4. Feedback Loop Bias: Machine learning models can reinforce and amplify existing biases present in society. For instance, if a news recommendation system recommends content based on user preferences, it may lead to an echo chamber effect, where users are exposed only to information that aligns with their existing beliefs, further reinforcing biases.
  5. Inherent Data Bias: The data used for training machine learning models may inherently contain bias due to human-generated labels or annotations. If these labels reflect subjective opinions or societal prejudices, the resulting models can inherit and amplify those biases.
  6. Unintended Bias in Objective Functions: The objective functions used in machine learning algorithms guide the learning process by defining what the model should optimize. If these objectives are not carefully designed, unintended biases can emerge. For example, an online advertisement placement model optimizing for click-through rates may disproportionately target certain demographics, leading to biased outcomes.
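
Several of the items above come down to measurable gaps in how a model treats different groups. As a sketch only, assuming a hypothetical hiring model whose predictions sit in a pandas DataFrame with made-up gender and predicted_hire columns, a simple selection-rate comparison might look like this:

```python
import pandas as pd

def selection_rates(df: pd.DataFrame, group_col: str, outcome_col: str) -> pd.Series:
    """Rate of positive outcomes per group - large gaps are a prompt to investigate."""
    return df.groupby(group_col)[outcome_col].mean()

# Hypothetical predictions from a hiring model
preds = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M", "M", "M"],
    "predicted_hire": [0, 0, 1, 1, 1, 0, 1, 1],
})
rates = selection_rates(preds, "gender", "predicted_hire")
print(rates)
# A common rule of thumb (the "four-fifths rule") flags ratios below 0.8 for review.
print("disparate impact ratio:", round(rates.min() / rates.max(), 2))
```

A check like this does not prove or disprove discrimination on its own, but it turns a vague worry about bias into a number that can be tracked over time.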

Bias can creep into a model in a number of ways!

  1. It can be explicit – if I only want to target rich people, my model may only look at data pertaining to households earning more than $200,000 per annum. This may mean that the model is inaccurate at predicting behaviour for households where income falls below that amount.
  2. It may be based on historical bias. If human behaviour/choices are a factor in the model then it may continue to deliver results in line with existing biases. For example, women may not be selected for engineering degrees if gender was (consciously or unconsciously) used as a criterion for accepting applications in the past. This bias will be built into historical data, which will in turn teach the model that women should be less likely to be accepted.
  3. It may be inherent in the data due to incompleteness or inaccuracy. Data that ignores a certain time period, a specific demographic, or a possible touch point or outcome can drive inaccurate results. The difference between this point and the two above is that these biases may be less obvious to spot – a simple representation check is sketched after this list.
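
As a sketch of that third point, here is one simple way to check whether the groups in a training sample match the population the model will actually be used on. The sample and the reference shares below are invented purely for illustration.

```python
from collections import Counter

def representation_gap(sample_groups, reference_shares):
    """Difference between each group's share of the sample and its expected share."""
    counts = Counter(sample_groups)
    total = sum(counts.values())
    return {
        group: round(counts.get(group, 0) / total - expected, 3)
        for group, expected in reference_shares.items()
    }

# Hypothetical sample of households used to train a marketing model
sample = ["high_income"] * 90 + ["mid_income"] * 8 + ["low_income"] * 2
# Assumed shares of each group in the real customer base
reference = {"high_income": 0.25, "mid_income": 0.50, "low_income": 0.25}
print(representation_gap(sample, reference))
# Large negative gaps mark groups the model will have seen far too little of.
```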

The impact of bias

It may be nearly impossible to spot and manage bias in the data feeding a predictive model, but the consequences can be significant, particularly when the outcomes of our model confirm our own existing biases.

The impact of bias on machine learning is a complex and critical issue, as it can perpetuate and amplify existing societal biases, leading to unfair outcomes and discriminatory practices.

Here are some examples:

  1. Discrimination in Hiring: Bias in training data or algorithm design can lead to discriminatory outcomes in hiring processes. If a model is trained on historical data that favors certain demographics, it may perpetuate those biases and discriminate against underrepresented groups when making predictions about job candidates.
  2. Biased Loan Approvals: Machine learning models used for loan approvals may unintentionally discriminate against certain groups based on race, gender, or other protected characteristics. If historical loan data exhibit bias, such as granting loans more frequently to certain demographics, the model can learn and perpetuate this bias, leading to unfair loan decisions.
  3. Racial Profiling in Law Enforcement: Predictive policing models can exhibit bias and contribute to racial profiling. If historical crime data contains biases, such as over-policing in certain neighbourhoods, the model may unfairly target individuals from those communities, leading to biased law enforcement practices.
  4. Inaccurate Healthcare Diagnoses: Bias in healthcare datasets or algorithms can result in inaccurate diagnoses, especially for underrepresented groups. If medical data used for training models primarily comes from a specific population, the model may not generalize well to other demographics, leading to misdiagnoses or delayed treatments.
  5. Gender or Racial Bias in Facial Recognition: Facial recognition systems have shown biases in correctly identifying individuals from different gender and racial groups. If the training data predominantly consists of certain demographics, the model may struggle to accurately recognize and classify faces from underrepresented groups, leading to potential misidentifications and discriminatory consequences.
  6. Unfair Sentencing in Criminal Justice: Predictive models used in criminal justice systems can exhibit bias and contribute to unfair sentencing. If historical data contains biases in the form of disproportionate arrests or sentencing disparities, the model may perpetuate these biases and result in unjust decisions.

These examples illustrate how bias in machine learning can lead to unequal treatment, reinforce societal prejudices, and perpetuate discrimination. It is crucial to actively address bias in the design, development, and deployment of machine learning systems to ensure fair and equitable outcomes.

How to address bias in data

Addressing bias requires careful data collection, diverse and representative training datasets, algorithmic fairness considerations, and ongoing evaluation and mitigation efforts to ensure equitable and unbiased machine learning systems.
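
One common mitigation, sketched below purely as an illustration, is to reweight training rows so that under-represented groups carry as much influence as over-represented ones. The DataFrame and column names are hypothetical, and reweighting is only one of many possible fairness techniques.

```python
import pandas as pd

def balancing_weights(df: pd.DataFrame, group_col: str) -> pd.Series:
    """Per-row weights so that every group contributes equally during model training."""
    group_sizes = df[group_col].map(df[group_col].value_counts())
    return len(df) / (df[group_col].nunique() * group_sizes)

# Hypothetical loan-application extract, heavily skewed towards group "A"
apps = pd.DataFrame({
    "group": ["A"] * 8 + ["B"] * 2,
    "defaulted": [0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
})
apps["weight"] = balancing_weights(apps, "group")
print(apps.groupby("group")["weight"].sum())  # both groups now carry equal total weight
# Many training libraries accept such weights, for example via the sample_weight
# argument that most scikit-learn estimators take in fit().
```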

Removing bias from existing decision-making processes – such as loan approvals – can be nearly impossible, but these biases are almost certain to skew existing data and to drive false outcomes from the model.

An awareness that bias may exist, and robust attempts to mitigate it, are critical to successful predictive analytics.

