If BI is about reporting, analytics is about creating insight.
The old adage of garbage in, garbage out applies to both BI and analytics.
Yet introducing Hadoop into the analytics architecture changes how poor-quality data affects us, and may create the illusion that poor data quality is no longer a concern.
Moving from a traditional structured analytics environment to the unstructured architecture of Hadoop reduces the technical impact of poor data quality.
1. The data load
We have all been there: months of planning and development to build a new schema, only to realise that the source data does not fit our assumptions.
Because Hadoop does not enforce a predefined schema when data is loaded, we can load data easily without designing that schema up front. The ETL load error is no longer an issue. Of course, data quality should never have been relegated to the ETL process in the first place (it should be an enterprise strategy), but these technical issues can, and do, cause real problems for ETL teams.
Using Hadoop, and visual integration and analytics tools, we can quickly and simply load almost any data into our analytics environment without any concern for underlying data quality issues such as missing data, invalid values, or conflicting data types.
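To make the point concrete, here is a minimal sketch in plain Python rather than Hadoop itself. The records, field names and the `read_with_schema` helper are hypothetical, invented purely for illustration. The load accepts every record untouched; the missing values and conflicting data types only show up later, when we try to read the data with types applied.

```python
import json

# Hypothetical raw records, as they might land in the environment untouched.
RAW_LINES = [
    '{"customer_id": 1, "age": 34, "income": 52000}',
    '{"customer_id": 2, "age": "unknown", "income": 61000}',
    '{"customer_id": 3, "income": "48,500"}',
]

# Assumed schema, applied only when the data is read.
SCHEMA = {"customer_id": int, "age": int, "income": float}


def load_raw(lines):
    """Schema-free load: every line that parses as JSON is accepted as-is."""
    return [json.loads(line) for line in lines]


def read_with_schema(record, schema):
    """Apply types at read time; return the typed row plus any problems found."""
    row, problems = {}, []
    for field, caster in schema.items():
        value = record.get(field)
        if value is None:
            row[field] = None
            problems.append(f"{field}: missing")
            continue
        try:
            row[field] = caster(value)
        except (TypeError, ValueError):
            row[field] = None
            problems.append(f"{field}: cannot read {value!r} as {caster.__name__}")
    return row, problems


records = load_raw(RAW_LINES)  # the load itself never complains
results = [read_with_schema(record, SCHEMA) for record in records]

for row, problems in results:
    print(row, problems)
```

All three records load, but only the first one reads cleanly. The quality problems have not gone away; they have simply moved from load time to analysis time.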
2. The sample error
Typically, analytics models have had to rely on a statistical sample of the total data set. Errors in sampling have a severe impact as they skew the results of the analysis towards a statistical subset.
Big data analytics removes sample error by applying any analysis to all the data. So, although we may build our model using a sample, we apply our model to, and get our result from, the full data set. The quality of our sample becomes largely irrelevant.
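A small sketch of the sampling point, using made-up figures: the segments, spend values and the `average_spend` helper are assumptions for illustration only. A convenience sample that over-represents one segment gives a very different answer from the same calculation run over every record.

```python
from statistics import mean

# Hypothetical population: (channel, monthly spend) for 1,000 customers.
population = [("online", 120.0)] * 300 + [("branch", 40.0)] * 700


def average_spend(rows):
    """Average spend across whatever rows we are given."""
    return mean(spend for _, spend in rows)


# A convenience sample of the first 200 rows contains only "online" customers.
skewed_sample = population[:200]

sample_estimate = average_spend(skewed_sample)  # skewed towards one segment
full_result = average_spend(population)         # computed over all the data

print(f"sample estimate: {sample_estimate}, full data set: {full_result}")
```

The calculation is identical in both cases; only the data it runs over changes.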
The business impact of poor data quality remains
A Hadoop architecture allows us to quickly load all the garbage
We cannot gain insight from unsuitable data, as discussed in the post "Quality data gives bigger, better insights!"
For example, if we want to measure the impact of age, gender and income on the adoption of channels, we need good-quality age, gender and income information in our data set. Missing, inconsistent or inaccurate data still affects us at the business level.
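One way to check whether a data set is fit for this kind of question is a simple profile of the fields we depend on. The sketch below is illustrative only: the records, the valid ranges and the gender codes are assumptions, and real rules would come from the business, not from the code.

```python
# Hypothetical customer records; values are invented for illustration.
customers = [
    {"age": 34, "gender": "F", "income": 52000},
    {"age": None, "gender": "M", "income": 61000},
    {"age": 212, "gender": "male", "income": None},
    {"age": 45, "gender": "F", "income": 38000},
]

# Assumed validity rules; a real project would agree these with the business.
RULES = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "gender": lambda v: v in {"F", "M"},
    "income": lambda v: isinstance(v, (int, float)) and v >= 0,
}


def profile(rows, rules):
    """Count missing, invalid and usable values for each field."""
    report = {}
    for field, is_valid in rules.items():
        missing = sum(1 for r in rows if r.get(field) is None)
        invalid = sum(
            1 for r in rows
            if r.get(field) is not None and not is_valid(r[field])
        )
        report[field] = {
            "missing": missing,
            "invalid": invalid,
            "usable": len(rows) - missing - invalid,
        }
    return report


def usable_rows(rows, rules):
    """Keep only rows where every field we need is present and valid."""
    return [
        r for r in rows
        if all(r.get(f) is not None and ok(r[f]) for f, ok in rules.items())
    ]


report = profile(customers, RULES)
fit_for_purpose = usable_rows(customers, RULES)

for field, counts in report.items():
    print(field, counts)
print(f"{len(fit_for_purpose)} of {len(customers)} records usable for the analysis")
```

In this toy data set every record loads, yet only half of them can support an analysis that needs all three attributes.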
Business value is derived when we focus on the business impact of poor data quality, rather than on the technical aspects.
Good data management principles such as data governance and data quality help to create context for our data, allow us to confirm that our data set is fit for purpose and allow us to have confidence in our insights.
Data management is what allows us to find the signal in the big data noise!
Image sourced from https://www.flickr.com/photos/31721843@N07/2969137144/