Big Data: When data quality doesn’t matter

If BI is about reporting, analytics is about creating insight.

Find the signal in the noise

The old adage of garbage in, garbage out applies to both BI and analytics.

Yet the introduction of Hadoop into the analytics architecture changes how poor-quality data affects us, and may create the illusion that data quality is no longer a concern.

Moving from a traditional structured analytics environment to Hadoop’s unstructured, schema-on-read architecture reduces the technical impact of poor data quality.

1. The data load

We have all been there: months of planning and development to build a new schema, only to realise that the source data does not fit our assumptions.

Hadoop’s schema-on-read approach means that we can easily load data without worrying about a predefined schema. The ETL load error is no longer an issue. Of course, data quality should never have been relegated to the ETL process in the first place – data quality should be an enterprise strategy – but these technical issues can, and do, cause real problems for ETL teams.

Using Hadoop, together with visual integration and analytics tools, we can quickly and simply load almost any data into our analytics environment without any concern for underlying data quality issues such as missing data, invalid values, or conflicting data types.
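As a quick sketch of that point (hypothetical data, pandas assumed), the load below succeeds with no predefined schema – but the quality problems are simply carried along into the analytics environment:

```python
import io
import pandas as pd

# Hypothetical extract with typical quality problems:
# a missing value, an invalid value, and a conflicting type.
raw = io.StringIO(
    "customer_id,age,income\n"
    "1001,34,52000\n"
    "1002,,61000\n"       # missing age
    "1003,-7,48000\n"     # invalid (negative) age
    "1004,29,unknown\n"   # conflicting type in income
)

# Schema-on-read: the load itself succeeds without complaint...
df = pd.read_csv(raw)

# ...but the quality issues are still there, waiting for us downstream.
print(df.dtypes)               # income is read as object, not a number
print(df["age"].isna().sum())  # one missing age
```

The technical failure (a rejected load) disappears, but every business-level issue survives the load intact.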

2. The sample error

Typically, analytics models have had to rely on a statistical sample of the total data set. Errors in sampling have a severe impact because they skew the results of the analysis towards the characteristics of the sampled subset.

Big data analytics removes sampling error by applying the analysis to all the data. So, although we may build our model using a sample, we apply our model (and get our result) to the full data set. The quality of our sample becomes largely irrelevant.
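A minimal sketch of that build-on-sample, score-on-everything pattern (synthetic numbers, NumPy assumed) – here the “model” is nothing more than a high-spender threshold fitted on a 1% sample:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical full data set: one million customer spend values.
full = rng.lognormal(mean=3.0, sigma=1.0, size=1_000_000)

# Build the model (a "high-spender" threshold) on a 1% sample...
sample = rng.choice(full, size=10_000, replace=False)
threshold = np.percentile(sample, 90)

# ...but apply it to every record, so the result reflects
# the full population rather than the sample.
high_spenders = full[full > threshold]
print(f"{len(high_spenders) / len(full):.1%} flagged as high spenders")
```

Because the threshold is applied to all one million records, a small estimation error in the sample barely moves the final result.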

The business impact of poor data quality remains

A Hadoop architecture allows us to quickly load all the garbage.

We cannot gain insight from unsuitable data, as discussed in the post “Quality data gives bigger, better insights!”

For example, if we want to measure the impact of age, gender and income on channel adoption, we need good-quality age, gender and income information in our data set. Missing, inconsistent or inaccurate data still affects us at the business level.
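One simple way to make that check concrete is to profile the completeness of the fields the analysis actually depends on before any modelling begins (hypothetical records, pandas assumed):

```python
import pandas as pd

# Hypothetical customer records; the analysis depends on age, gender, income.
df = pd.DataFrame({
    "age":     [34, None, 29, 41],
    "gender":  ["F", "M", None, "F"],
    "income":  [52000, 61000, None, 48000],
    "channel": ["web", "store", "web", "mobile"],
})

required = ["age", "gender", "income"]

# Completeness of each required field (share of non-missing values).
completeness = df[required].notna().mean()
print(completeness)

# A simple fitness-for-purpose gate: only records with all
# required fields present can feed this particular analysis.
usable = df.dropna(subset=required)
print(f"{len(usable)} of {len(df)} records usable for this analysis")
```

The load never failed, yet half of these records cannot support the age/gender/income question – which is exactly the business-level impact the post describes.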

Business value is derived when we focus on the business impact of poor data quality, rather than on the technical aspects.

Good data management principles such as data governance and data quality help create context for our data, confirm that our data set is fit for purpose, and give us confidence in our insights.

Data management is what allows us to find the signal in the big data noise!


4 thoughts on “Big Data: When data quality doesn’t matter”

  1. Gary, you couldn’t have said it better. It’s important to tie the aspects and ramifications of data quality to business impact because it’s difficult to convince people to make changes otherwise.

    And like you mentioned, one way to handle poor-quality data is to put it through unsupervised algorithms that discover patterns and are careful not to let outliers distort the analysis. It’s called clustering, and we use it frequently when we analyze our customers’ data.

    Another aspect of poor-quality data is the lack of integration or coherence between different data sets. For example, some of our customers who didn’t have omnichannel integration didn’t know that the same customer who was visiting their store online was also buying in their physical store. By linking those different data sets, our customer gained a whole new perspective on their customers’ shopping journey.

    And of course, there’s also duplicate data. That just messes up the accuracy of customer profiling, ROI calculations, customer counts, etc. Deduplication is a simple step that many companies forget, but it really improves the quality of the data. We run a few other steps to clean up the data:

    At AgilOne we take care of most of our clients’ data quality issues before running our predictive algorithms, but it would be good to see people improve their data collection and processing procedures.

    1. Hi Mushfiq

      Thank you for your comment – we had quite a debate on LinkedIn in this regard.

      Some so-called big data experts are quite hostile to the idea that data quality must come into play in the big data world – nice to see that you agree.
