In “The small print: You’re going to have to clean up all that big data”, data scientist and Datameer CEO Stefan Groschupf debunks the popular big data myth that garbage in big data sources can be ignored while still delivering good-quality insights.
In fact, Stefan points out that the exponential growth of both data sources and data volumes increases the risk that “data-quality issues can wreak havoc on an organization if the data isn’t vetted at all points of the analytics workflow, from ingest to final visualization.” [Tweet this]
Stefan’s blog provides numerous examples from retail and financial services. For example, inconsistent product codes and descriptions can lead analytics to incorrectly predict product shortages – leading to overproduction and write-downs – or to underproduction, which causes lost sales when key products are out of stock.
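As a toy illustration of the product-code problem, here is a minimal Python sketch that normalises variant codes to a canonical form before demand is tallied. The alias table and code formats are invented for the example; a real system would build such a mapping from profiling the data.

```python
from collections import Counter

# Hypothetical alias table mapping known variant spellings of product
# codes to a single canonical code (illustrative values only).
ALIASES = {"WIDGET-01": "W-001", "wdg001": "W-001"}

def canonical_code(raw):
    """Normalise one raw product code to its canonical form."""
    code = raw.strip()
    return ALIASES.get(code, ALIASES.get(code.upper(), code.upper()))

orders = ["WIDGET-01", "wdg001", "W-001", "B-200"]

# Without normalisation, Counter(orders) would show four distinct
# products, understating true demand for W-001 and skewing any forecast.
demand = Counter(canonical_code(c) for c in orders)
```

Counting raw codes would split one product’s demand across three entries – exactly the kind of inconsistency that drives the over- and underproduction described above.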
Another example: a bank uses big data analytics to flag unusual variations in credit scores linked to location and other indicators. Without quality location data this cannot be done. When missing or invalid location data is flagged early in the credit management process, it can be corrected, saving the bank millions in defaulted loans.
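That early flagging step can be sketched in a few lines. The field names and validation rules below (five-digit postcodes, a short state-code whitelist) are illustrative assumptions, not the bank’s actual schema:

```python
# Minimal sketch: flag missing or invalid location data early in a
# credit-scoring pipeline, before records feed downstream analytics.
VALID_STATE_CODES = {"CA", "NY", "TX", "WA"}  # abbreviated for the sketch

def flag_location_issues(record):
    """Return a list of location problems found in one subscriber record."""
    issues = []
    postcode = record.get("postcode", "")
    state = record.get("state", "")
    if not postcode or not postcode.isdigit() or len(postcode) != 5:
        issues.append("invalid_postcode")
    if state not in VALID_STATE_CODES:
        issues.append("invalid_state")
    return issues

records = [
    {"id": 1, "postcode": "94105", "state": "CA"},
    {"id": 2, "postcode": "", "state": "CA"},
    {"id": 3, "postcode": "10001", "state": "ZZ"},
]

# Route only the problem records back for correction, keeping the rest flowing.
flagged = {r["id"]: flag_location_issues(r)
           for r in records if flag_location_issues(r)}
```

The point is where the check sits, not its sophistication: records 2 and 3 are caught at ingest, when correction is cheap, rather than surfacing as a skewed credit model months later.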
The data quality community has been highlighting the importance of quality data for big data analytics for years. [Tweet this]
It is great to have a big data pioneer like Stefan joining the data quality movement. In his post he says that organizations must manage the data quality risks inherent in big data sources by being “smart about how they’re approaching data from the very beginning of the process and each time new data is added.”
Traditional data quality tools may need to adapt in order to solve the challenges posed by big data. New big data integration and analytics technologies that enable data profiling and visualisation at every step of the analytics process can play a critical role in identifying the anomalies in enormous data sets at an early stage. This allows data quality issues to be resolved before they create false insights.
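The idea of profiling at every step can be illustrated with a toy example. The statistics, column names, and null-rate threshold below are assumptions chosen for the sketch, not any particular tool’s API:

```python
def profile(rows, columns):
    """Compute simple per-column profile stats: null rate and distinct count.
    A toy stand-in for the profiling a big data tool would run at each
    stage of the analytics pipeline."""
    stats = {}
    n = len(rows)
    for col in columns:
        values = [r.get(col) for r in rows]
        nulls = sum(1 for v in values if v in (None, ""))
        distinct = len({v for v in values if v not in (None, "")})
        stats[col] = {"null_rate": nulls / n if n else 0.0, "distinct": distinct}
    return stats

rows = [
    {"product_code": "A-100", "qty": 3},
    {"product_code": "", "qty": 5},
    {"product_code": "A-100", "qty": None},
    {"product_code": "B-200", "qty": 2},
]

stats = profile(rows, ["product_code", "qty"])

# A null rate above some agreed threshold at any stage pauses the
# pipeline for review instead of letting the anomaly reach the charts.
alerts = [c for c, s in stats.items() if s["null_rate"] > 0.2]
```

Running the same cheap profile after every transformation is what catches an anomaly at the step that introduced it, before it hardens into a false insight.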
The earlier data quality issues are identified, the less costly their impact, as the examples above illustrate.
Errors in the data layer trickle through to all the value-added layers above it, where they become increasingly difficult to find and resolve. Conversely, improvements to the source data layer increase value at every point in the data value chain. Waiting until the analytics layer means that poor data quality may never be identified at all, leaving you with poor insights and untrustworthy decisions.
For example, Stefan shows how a telecommunications company took an entirely different approach to data quality analysis in order to plan its spending on new infrastructure more accurately. The company analyzed its customer information to find incorrect subscriber data (invalid email addresses, for example) that skewed results on usage in different areas. By correctly correlating subscriber information with network performance data, the company was able to keep up with existing and forecasted demand. Ultimately, the company said it avoided an estimated $140 million in unnecessary capital spending by not adding new infrastructure that models built on the poor-quality data would have predicted was required.
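A heavily simplified sketch of that kind of cleanup might look like the following. The field names, the deliberately loose email check, and the aggregation are all invented for illustration; they are not the company’s actual method:

```python
import re

# Loose sanity check, not full RFC 5322 validation: something@domain.tld
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

subscribers = [
    {"id": 1, "email": "a@example.com", "region": "north", "gb_used": 12.0},
    {"id": 2, "email": "not-an-email",  "region": "north", "gb_used": 0.0},
    {"id": 3, "email": "c@example.com", "region": "south", "gb_used": 30.0},
]

def regional_usage(subs):
    """Aggregate usage per region, excluding records with invalid emails
    (a proxy here for unreliable subscriber data)."""
    clean = [s for s in subs if EMAIL_RE.match(s["email"])]
    totals = {}
    for s in clean:
        totals[s["region"]] = totals.get(s["region"], 0.0) + s["gb_used"]
    return totals
```

Dropping (or better, correcting) the unreliable records before correlating subscriber data with network performance is what keeps the demand model honest – with bad records left in, the per-region usage figures that drive capacity planning would be diluted.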
Stefan’s view: “Data quality has become an important, if sometimes overlooked, piece of the big data equation. Until companies rethink their big data analytics workflow and ensure that data quality is considered at every step of the big data analytics process — from integration all the way through to the final visualization — the benefits of big data will only be partly realised.” [Tweet this]
I agree – data science and data quality are tightly entwined. How are you handling your big data quality issues?