A number of independent experts have recently commented on the need for an enterprise data quality approach that goes beyond trying to fix data as it is loaded into the data warehouse.
A recent announcement by the American pharmaceutical giant Pfizer Inc. that it is replacing traditional ETL with a data virtualisation approach is a response to the ongoing business need for rapid turnaround on data-critical applications. According to Pfizer BIS team lead Michael Linhares, traditional ETL development took too long and cost too much.
While not everybody is ready to give up their ETL environments, the demand for real-time consolidation of data goes beyond data virtualisation. In her post Judgement Day for Data Quality, Forrester analyst Michele Goetz discusses other technologies, such as Hadoop processes and data appliances, that create and persist new data silos requiring management. I agree with her that these new business demands place an even stronger requirement on data quality tools to “place a higher value on governance enablement and the ability to extend sophisticated and mature processing across the entire data management spectrum.”
Of course, big data analytics, as it matures, is making increasing use of these technologies to provide insight on the fly. Big data analytics is frequently characterised by a large number of small servers, each working in parallel to process a small slice of the total volume; the partial results must then be brought together to provide meaningful insight, as discussed in this insight by Trillium Software VP Nigel Turner.
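To make the pattern concrete, here is a minimal, hypothetical Python sketch of that divide-and-recombine style of processing: each "worker" counts events in its own partition of the data, and the partial counts are then merged into a single view. The function names and the event data are illustrative assumptions, not part of any product discussed above.

```python
from collections import Counter

def process_partition(records):
    """Each worker processes its own small slice of the total volume."""
    counts = Counter()
    for rec in records:
        counts[rec["event"]] += 1
    return counts

def merge_partials(partials):
    """Bring the per-worker partial results together into one view."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Simulate three parallel workers, each handling one partition.
partitions = [
    [{"event": "click"}, {"event": "view"}],
    [{"event": "click"}],
    [{"event": "view"}, {"event": "view"}],
]
partials = [process_partition(p) for p in partitions]
combined = merge_partials(partials)
```

The point is that no single worker sees the whole data set; meaning only emerges once the partial results are consolidated, which is exactly where real-time quality controls have to operate.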
This real-time consolidation of vast data sets requires real-time standardisation and matching of related records in order to derive meaning. Data validation at source must be extended to enable real-time validation of the new virtual data sources that are becoming the norm.
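As a rough illustration of what standardisation and matching involve, the hypothetical Python sketch below normalises a raw field (case, punctuation, whitespace) and then matches records from two sources on the standardised key. Real data quality tools use far more sophisticated parsing and fuzzy matching; the record layouts and company names here are invented for the example.

```python
import re

def standardise(value):
    """Normalise a raw value so that equivalent records compare equal."""
    value = value.lower().strip()
    value = re.sub(r"[^\w\s]", "", value)  # drop punctuation
    value = re.sub(r"\s+", " ", value)     # collapse whitespace
    return value

def match_records(source_a, source_b, key="name"):
    """Pair up records from two sources whose standardised keys agree."""
    index = {standardise(rec[key]): rec for rec in source_a}
    matches = []
    for rec in source_b:
        canonical = standardise(rec[key])
        if canonical in index:
            matches.append((index[canonical], rec))
    return matches

# Two virtual sources describing overlapping entities in different formats.
crm_records = [{"name": "Pfizer, Inc."}, {"name": "Acme Ltd"}]
feed_records = [{"name": "pfizer inc"}, {"name": "Other Co"}]
paired = match_records(crm_records, feed_records)
```

Doing this standardisation as the records stream through a virtual view, rather than in a batch cleansing run, is the shift the post is describing.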
As data integration shifts to real time, so must data quality initiatives. In other words, Pfizer may be leading a trend beyond batch-driven data integration and data cleansing.