The Importance of Data Quality in the Era of Big Data

Discover the importance of data quality in the era of big data analytics. Learn why “garbage in, garbage out” still holds true and explore how Hadoop’s impact on analytics architecture affects data quality. Find out how big data analytics eliminates sample errors and the business impact of poor data quality. Let data governance principles guide…


Find the signal in the noise

In the world of business intelligence (BI) and analytics, data quality is paramount. As the saying goes, “garbage in, garbage out.”

However, with the advent of Hadoop and its impact on analytics architecture, there seems to be a perception that poor data quality is no longer a concern.

Let’s explore how this misconception can be addressed and why the business case for data quality remains crucial in the age of big data.

Section 1: The Changing Landscape of Data Quality

The Shift to Hadoop: Changing the Rules

Moving from a traditional structured analytics environment to the unstructured architecture of Hadoop reduces the technical impact of poor data quality.

  • Eliminating Technical Constraints: The Unstructured Advantage

We have all been there. Months of planning and development to build a new schema only to realise that the source data does not fit our assumptions. Hadoop’s unstructured nature means that we can easily load data without worrying about a predefined schema.

  • ETL Considerations: Removing Data Load Worries

Load failures in the ETL process are no longer an issue. Of course, data quality should never have been relegated to the ETL process in the first place – data quality should be an enterprise strategy – but these technical issues can, and do, cause real problems for ETL teams.

  • Seamless Integration: Simplifying Data Loading

Using Hadoop, and visual integration and analytics tools, we can quickly and simply load almost any data into our analytics environment without any concern for underlying data quality issues such as missing data, invalid values, or conflicting data types.
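To make this concrete, here is a minimal Python sketch (plain Python, not Hadoop code) of what loading without a schema can look like. The records, field names and helper functions are hypothetical and exist only for illustration. Every line loads without error, yet a simple profile of the types found in each field shows that the missing data, invalid values and conflicting types have travelled with the data.

```python
import json
from collections import Counter, defaultdict

# Hypothetical raw records, as they might land in a schema-free store.
RAW_LINES = [
    '{"customer_id": 1, "age": 34, "income": 52000}',
    '{"customer_id": 2, "age": "unknown", "income": 48000}',
    '{"customer_id": 3, "income": "61,000"}',
    '{"customer_id": "4", "age": -5, "income": null}',
]


def load_without_schema(lines):
    """Parse every line as-is; nothing is rejected for not fitting a schema."""
    return [json.loads(line) for line in lines]


def profile_types(records):
    """Count the value types seen for each field across all records."""
    fields = sorted({key for record in records for key in record})
    profile = defaultdict(Counter)
    for record in records:
        for field in fields:
            if field not in record:
                profile[field]["missing"] += 1
            else:
                profile[field][type(record[field]).__name__] += 1
    return {field: dict(counts) for field, counts in profile.items()}


records = load_without_schema(RAW_LINES)
type_profile = profile_types(records)
for field, counts in type_profile.items():
    print(field, counts)
```

The load step never fails, which is the technical win. The profile is what tells us whether the data is fit for the analysis we have in mind.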

Section 2: Redefining Sample Error with Big Data Analytics

The Problem with Sampling

  • Statistical Subset Bias: The Pitfall of Traditional Analytics Models

Typically, analytics models have had to rely on a statistical sample of the total data set. Errors in sampling have a severe impact as they skew the results of the analysis towards a statistical subset.

  • Full Data Analysis: Eliminating Sample Error

Big data analytics removes sample errors by applying any analysis to all the data. So, although we may build our model using a sample, we apply our model to (and get our result from) the full data set. The quality of our sample becomes largely irrelevant.

Big data analytics removes sample error by applying analysis to the full data set
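The difference between a result read off a sample and a result measured on the full data set can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the customer population, the deliberately skewed sample (online customers only) and the trivial "model" (flag anyone who spends more than the sample mean).

```python
from statistics import mean

# Hypothetical population: (channel, monthly_spend) for every customer.
population = (
    [("online", 120.0)] * 300
    + [("online", 80.0)] * 200
    + [("store", 40.0)] * 400
    + [("store", 90.0)] * 100
)

# A skewed sample: every fifth online customer, and no store customers at all.
biased_sample = [row for row in population if row[0] == "online"][::5]


def build_rule(rows):
    """Toy 'model' built on a sample: flag spend above the sample mean."""
    threshold = mean(spend for _, spend in rows)
    return lambda spend: spend > threshold


def share_flagged(rule, rows):
    """Fraction of rows that the rule flags as high spenders."""
    return sum(rule(spend) for _, spend in rows) / len(rows)


rule = build_rule(biased_sample)
sample_share = share_flagged(rule, biased_sample)  # read off the sample
full_share = share_flagged(rule, population)       # measured on all the data

print(f"share of high spenders in the sample:   {sample_share:.0%}")
print(f"share of high spenders in the full set: {full_share:.0%}")
```

In this toy case the sample suggests that 60% of customers are high spenders, while applying the same rule to every record gives 30%. The reported figure no longer has to be extrapolated from the subset.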

Section 3: The Business Impact of Poor Data Quality

  • The Need for Quality Data: Driving Meaningful Insights

We cannot gain insight from unsuitable data, as discussed in the post “Quality data gives bigger, better insights!”

For example, if we want to measure the impact of age, gender and income on the adoption of channels we would need good quality age, gender and income information in our data set. Missing, inconsistent or inaccurate data still affects us at the business level.
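A quick fitness-for-purpose check for this example might look like the Python sketch below. The records, the valid age range and the gender code list are all assumptions chosen for illustration; in practice these rules would come from the business.

```python
# Hypothetical customer records; field names and rules are assumptions.
customers = [
    {"age": 34, "gender": "F", "income": 52000, "channel": "app"},
    {"age": None, "gender": "M", "income": 48000, "channel": "branch"},
    {"age": 210, "gender": "f", "income": None, "channel": "app"},
    {"age": 45, "gender": "", "income": 61000, "channel": "web"},
]

VALID_GENDERS = {"F", "M", "X"}  # illustrative code list, not a standard


def field_checks(record):
    """Return, per field, whether the value is present and plausible."""
    age = record.get("age")
    income = record.get("income")
    return {
        "age": isinstance(age, int) and 0 <= age <= 120,
        "gender": record.get("gender") in VALID_GENDERS,
        "income": isinstance(income, (int, float)) and income >= 0,
    }


def fitness_report(records):
    """Share of records with a usable value for each field."""
    totals = {"age": 0, "gender": 0, "income": 0}
    for record in records:
        for field, ok in field_checks(record).items():
            totals[field] += ok
    return {field: count / len(records) for field, count in totals.items()}


report = fitness_report(customers)
usable_rows = [r for r in customers if all(field_checks(r).values())]

print(report)
print(f"{len(usable_rows)} of {len(customers)} records usable for the analysis")
```

Here only one of the four records has usable values for all three fields, so any conclusion about channel adoption would rest on a quarter of the customers.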

If BI is about reporting, analytics is about creating insight. Both depend on the data.

Business value is derived when we focus on the business impact of poor data quality, rather than on the technical aspects.

  • Data Governance: Establishing Context and Fit-for-Purpose Data

Good data management principles such as data governance and data quality help to create the business context for our data, allow us to confirm that our data set is fit for purpose and allow us to have confidence in our insights.

Data management is what allows us to find the signal in the big data noise!

And don’t overlook the nuances: big data quality is a multifaceted issue that demands attention at every level of your organization.

Furthermore, the distinction between creepy data and first-person data isn’t just semantic; it’s pivotal for maintaining trust.


Image sourced from https://www.flickr.com/photos/31721843@N07/2969137144/

Responses to “The Importance of Data Quality in the Era of Big Data”

  1. Mushfiq Rahman

    Gary, you couldn’t have said it better. It’s important to tie the aspects and ramifications of data quality to business impact because it’s difficult to convince people to make changes otherwise.

    And like you mentioned, one way to handle poor quality data is to put it through unsupervised algorithms that discover patterns and make sure to not account for outliers in their analysis. It’s called clustering and we use that frequently when we analyze our customer’s data.

    Another aspect of poor quality data is also the lack of integration or coherence between different data sets. For example, some of our customers who didn’t have omnichannel integration didn’t know that the same customer who was visiting their store online was also buying at their store. Being able to link those different data sets our customer gained a whole new perspective on their customers’ shopping journey.

    And of course, there’s also duplicate data. That just messes up the accuracy of customer profiling, calculating ROI, customer count, etc. That’s a simple step that many companies forget to do but really improves the quality of the data. We run a few other steps to clean up the data: http://www.agilone.com/advanced-data-management-data-warehousing

    At AgilOne we take care of most of our clients’ data quality issues before running our predictive algorithms, but it would be good to see people improve their data collection and processing procedures.

    1. Gary Allemann

      Hi Mushfiq

      Thank you for your comment – we had quite a debate on LinkedIn in this regard.

      Some so-called big data experts are quite hostile to the idea that data quality must come into play in the big data world – nice to see that you agree.

      1. Mushfiq Rahman

        Where are we discussing this on LinkedIn? I’d love to see other povs.
