
In the world of business intelligence (BI) and analytics, data quality is paramount. As the saying goes, “garbage in, garbage out.”
However, with the advent of Hadoop and its impact on analytics architecture, there seems to be a perception that poor data quality is no longer a concern.
Let’s explore how this misconception can be addressed and why the business case for data quality remains crucial in the age of big data.
Section 1: The Changing Landscape of Data Quality
The Shift to Hadoop: Changing the Rules
Moving from a traditional structured analytics environment to the unstructured architecture of Hadoop reduces the technical impact of poor data quality.
- Eliminating Technical Constraints: The Unstructured Advantage
We have all been there. Months of planning and development to build a new schema only to realise that the source data does not fit our assumptions. Hadoop’s unstructured nature means that we can easily load data without worrying about a predefined schema.
- ETL Considerations: Removing Data Load Worries
ETL load errors are no longer an issue. Of course, data quality should never have been relegated to the ETL process in the first place (data quality should be an enterprise strategy), but these technical issues can, and do, cause real problems for ETL teams.
- Seamless Integration: Simplifying Data Loading
With Hadoop, and visual integration and analytics tools, we can quickly and simply load almost any data into our analytics environment without any concern for underlying data quality issues such as missing data, invalid values, or conflicting data types, as the sketch below illustrates.
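To make the schema-on-read point concrete, here is a minimal sketch assuming a PySpark environment; the HDFS paths and data set names are hypothetical. Raw files land in the analytics environment without a predefined schema, and messy values do not stop the load.

```python
# A minimal schema-on-read sketch with PySpark.
# The file paths and data set names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema_on_read_sketch").getOrCreate()

# Load raw files without declaring a schema first. Every column arrives
# as a string, so missing values, invalid entries and conflicting types
# do not cause the load itself to fail.
customers = (
    spark.read
    .option("header", "true")          # take column names from the file
    .csv("hdfs:///raw/customers/*.csv")
)

# Semi-structured data loads just as easily; Spark infers a structure
# from whatever it finds rather than enforcing one designed up front.
events = spark.read.json("hdfs:///raw/clickstream/*.json")

customers.printSchema()
events.printSchema()
```

Of course, as the rest of this post argues, the fact that the load succeeds says nothing about whether the data is actually fit for analysis.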
Section 2: Redefining Sample Error with Big Data Analytics
The Problem with Sampling
- Statistical Subset Bias: The Pitfall of Traditional Analytics Models
Typically, analytics models have had to rely on a statistical sample of the total data set. Errors in sampling have a severe impact because they skew the results of the analysis towards the characteristics of that subset rather than the population as a whole.
- Full Data Analysis: Eliminating Sample Error
Big data analytics removes sample error by applying any analysis to all the data. So, although we may build our model using a sample, we apply our model (and get our result) to the full data set. The quality of our sample becomes largely irrelevant.
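As a rough illustration of the build-on-a-sample, score-the-full-set pattern, the sketch below assumes a pandas DataFrame and scikit-learn; the file name, column names and model choice are hypothetical, and the feature columns are assumed to be numeric and populated.

```python
# A minimal sketch of "build the model on a sample, apply it to all the data".
# The file, column names and model are illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression

full_data = pd.read_parquet("channel_adoption.parquet")  # hypothetical file
features = ["age", "income"]                             # hypothetical columns

# Build the model on a 1% sample...
sample = full_data.sample(frac=0.01, random_state=42)
model = LogisticRegression().fit(sample[features], sample["adopted_channel"])

# ...but apply it to every record, so the result is not limited to
# (or biased by) the statistical subset we happened to train on.
full_data["predicted_adoption"] = model.predict(full_data[features])
```

The result is computed over every record, which is why the quality of the training sample matters far less than the quality of the underlying data.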
Big data analytics removes sample error by applying analysis to the full data set #DataQuality #Bias
Section 3: The Business Impact of Poor Data Quality
- The Need for Quality Data: Driving Meaningful Insights
We cannot gain insight from unsuitable data, as discussed in the post “Quality data gives bigger, better insights!”
For example, if we want to measure the impact of age, gender and income on the adoption of channels, we would need good quality age, gender and income information in our data set. Missing, inconsistent or inaccurate data still affects us at the business level.
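A quick profiling pass along these lines shows whether the age, gender and income fields are actually usable before we draw any conclusions from them. The sketch below uses pandas; the file name, column names and validity rules are hypothetical assumptions.

```python
# A small data-profiling sketch for the age / gender / income example above.
# The file, columns and validity rules are illustrative assumptions.
import pandas as pd

df = pd.read_parquet("channel_adoption.parquet")  # hypothetical file

checks = {
    "age missing":          df["age"].isna().mean(),
    "age outside 0-120":    (~df["age"].dropna().between(0, 120)).mean(),
    "gender missing":       df["gender"].isna().mean(),
    "gender not in F/M/Other": (~df["gender"].dropna().isin(["F", "M", "Other"])).mean(),
    "income missing":       df["income"].isna().mean(),
    "income negative":      (df["income"] < 0).mean(),
}

# If a meaningful share of records fail these checks, any conclusion about
# how age, gender and income drive channel adoption is suspect, however
# large the data set is.
for rule, share in checks.items():
    print(f"{rule}: {share:.1%} of records")
```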
If BI is about reporting, analytics is about creating insight. Both depend on the data #DataQuality
Business value is derived when we focus on the business impact of poor data quality, rather than on the technical aspects.
- Data Governance: Establishing Context and Fit-for-Purpose Data
Good data management principles, such as data governance and data quality, create the business context for our data, let us confirm that our data set is fit for purpose, and give us confidence in our insights.
Data management is what allows us to find the signal in the big data noise!
And don’t overlook the nuances: big data quality is a multifaceted issue that demands attention at every level of your organization.
Image sourced from https://www.flickr.com/photos/31721843@N07/2969137144/
