In baseball, a home run (abbreviated HR, also “homer”, “dinger”, “bomb”, or “four-bagger”) is scored when the ball is hit in such a way that the batter is able to circle the bases and reach home safely in one play, without any errors being committed by the defensive team in the process.
For a data quality home run I would suggest that we need our data quality batter to move through all four data quality bases in order to achieve a data quality improvement.
So what are the four data quality bases?
First data quality base – Data Profiling
The first step of any data quality process must be to compare our data to our agreed ideal data set.
We do this by profiling our data set – measuring compliance of our data to our agreed standards and rules.
Basic data profiling can be done using SQL. However, the advanced data profiling and discovery capabilities of tools like Trillium Software Discovery put data profiling into the hands of the business data steward, while providing more detailed insights more quickly than SQL approaches can.
This means that our business data stewards can quickly identify anomalies in data that require intervention, make decisions, and act on them without requiring IT support.
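To make the profiling step concrete, here is a minimal sketch in Python. The records, field names, and validation rules are all hypothetical, invented for illustration; a real profiling tool would run far richer discovery against live data sources.

```python
import re

# Hypothetical sample records; in practice these would come from a database query.
customers = [
    {"id": 1, "email": "ann@example.com", "postcode": "2000"},
    {"id": 2, "email": "not-an-email",    "postcode": "2001"},
    {"id": 3, "email": None,              "postcode": "20X1"},
]

# Agreed standards and rules expressed as simple validation checks (illustrative only).
rules = {
    "email":    lambda v: v is not None and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "postcode": lambda v: v is not None and re.fullmatch(r"\d{4}", v) is not None,
}

def profile(records, rules):
    """Return the fraction of records that comply with each rule."""
    total = len(records)
    return {
        field: sum(1 for r in records if check(r.get(field))) / total
        for field, check in rules.items()
    }

report = profile(customers, rules)
```

The output is a compliance score per field, which is exactly the kind of measurement against agreed standards that the profiling step produces.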
Second data quality base – Monitoring
“You can’t manage what you can’t measure” – Peter Drucker
The second requirement for a successful data quality deployment is the ability to monitor changes to data quality as improvements kick in. Various levels of detail are needed – from governance dashboards to detailed exception reports or issue logs for hands on stewards and operational staff.
Data quality reports allow us to measure the value being driven in the form of data quality improvements, allow us to focus in on problem areas that need more attention, and provide input for root cause analysis.
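The monitoring idea can be sketched in a few lines: track compliance scores over time and flag fields that fall below a governance target, together with their trend. The scores, field names, and threshold below are invented for illustration.

```python
# Hypothetical daily compliance scores per field (most recent last).
history = {
    "email":    [0.91, 0.93, 0.95],
    "postcode": [0.88, 0.84, 0.79],
}

THRESHOLD = 0.90  # illustrative governance target

def exceptions(history, threshold):
    """Flag fields whose latest compliance is below target, with the trend."""
    report = {}
    for field, scores in history.items():
        latest = scores[-1]
        if latest < threshold:
            trend = "worsening" if latest < scores[0] else "improving"
            report[field] = {"latest": latest, "trend": trend}
    return report

issues = exceptions(history, THRESHOLD)
```

A summary like this feeds a governance dashboard, while the per-field detail points hands-on stewards at the problem areas needing root cause analysis.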
Third data quality base – Cleansing
Our steward must now act on those decisions!
Data issues must be allocated to a team for resolution.
Some issues may require manual remediation. A data quality case must be opened and managed to ensure these issues are addressed at source.
Other issues may be resolved through an automated cleaning process. Stewards should be able to quickly and easily define rules to standardize, enrich, and match data.
Wherever possible, stewards should reuse the work done to get to first base – the data profiling and analysis – to ensure that rules can be deployed promptly, and that rules defined by the data steward are not lost in translation during technical implementation.
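As a small sketch of an automated cleansing rule, here is a standardization step in Python. The synonym table and field are hypothetical; real tools let stewards define such rules (plus enrichment and matching) without writing code.

```python
# Illustrative standardization rule: map messy country values to a canonical code.
country_synonyms = {
    "usa": "US", "u.s.a.": "US", "united states": "US",
    "uk": "GB", "united kingdom": "GB",
}

def standardize_country(value):
    """Standardize a free-text country value to a canonical form."""
    if value is None:
        return None
    return country_synonyms.get(value.strip().lower(), value.strip().upper())

cleaned = [standardize_country(v) for v in ["USA", " united kingdom ", "FR", None]]
```

The null is deliberately left untouched: missing data is a case for manual remediation at source, not silent automated repair.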
Fourth data quality base – Deployment
Rules must now be deployed – either as part of a batch ETL process or, increasingly these days, as real time data quality services.
Batch data quality processes clean existing, dirty data. Real time services ensure that new data entering our organisation is cleaned on entry.
An enterprise data quality solution must allow business stakeholders to develop data quality processes that can easily be deployed in both real time and batch.
We must cater for the various core components in our architecture – maybe we are running SAP for ERP, Microsoft for CRM, and have various legacy applications on the mainframe. We may have an enterprise bus from one vendor and an ETL tool from another.
The enterprise data quality platform should provide ease of integration with any and all of these components.
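The "define once, deploy in both modes" point can be sketched as follows. The rule, field names, and wrapper functions are invented for illustration; the idea is simply that the steward's rule is written once and reused unchanged in batch and real-time contexts.

```python
def cleanse_record(record):
    """A single cleansing rule, defined once by the steward."""
    record = dict(record)  # avoid mutating the caller's data
    record["email"] = (record.get("email") or "").strip().lower() or None
    return record

# Batch deployment: run the same rule over an existing, dirty data set.
def cleanse_batch(records):
    return [cleanse_record(r) for r in records]

# Real time deployment: wrap the identical rule as a service entry point,
# e.g. behind an HTTP endpoint or an enterprise bus listener,
# so new data is cleaned on entry.
def cleanse_service(record):
    return cleanse_record(record)

batch_out = cleanse_batch([{"email": "  Ann@Example.COM "}])
live_out = cleanse_service({"email": ""})
```

Because both deployment paths call the same function, there is no drift between what the batch process and the real-time service consider "clean".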
In my experience, far too many data quality implementations get stuck on first or second base.
If your strategy, or your technology platform, does not easily support all four steps, the chances are that you will struggle to deliver quality data.
Image sourced from http://en.wikipedia.org/wiki/Chase_Utley