“Big data” was one of 2012’s buzzwords – spawning a host of new technologies and commentary.
One of the challenges for implementers is that there are conflicting views as to what big data is.
Everyone agrees that big data is about volume.
The principal focus of most commentators, and most technology solutions, is dealing with this volume. Traditional relational databases were primarily designed to enable easy searching and reporting on data, but anyone who has run a query against a large dataset knows that it can take hours, or even days, to return results. Newer technologies, such as the open source Apache Hadoop, or Software AG’s Terracotta platform, are designed to support large volumes via a distributed architecture and in-memory data management.
The real challenge of big data, however, is not volume, but structure (or the lack thereof). Big data comes in a vast variety of formats, from machine generated feeds, to telecommunications call records, to various free format web sources and business communications.
Social intelligence – the ability to mine social media to analyse clients’ or prospects’ feelings about your company, products and brand – is one of the more hyped examples of where big data is expected to add significant value to business analytics. Why extrapolate opinions from focus groups and surveys, the thinking goes, when clients’ opinions are captured on Facebook, Twitter, HelloPeter and similar sources? Analysis of the entire client group removes the need for assumptions – leading to more accurate planning and the ability to respond rapidly to emerging trends.
The real challenge is that valuable, relevant information is buried amidst a massive volume of clutter. Relevant content must be pulled from unstructured text fields and linked together across multiple user profiles and applications. Common sense suggests that filters must be applied to reduce volumes – there is no value in investing in infrastructure to store irrelevant data.
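The filtering step described above can be sketched in a few lines of Python. This is purely illustrative – the brand terms and sample posts are hypothetical, and a production data quality tool would apply far richer matching and record-linkage logic – but it shows the principle of discarding irrelevant records before they ever reach storage.

```python
# Illustrative sketch: filter a stream of free-format social posts down to
# those that mention the brand, before storing anything.
# "AcmeCorp" / "Acme Widgets" are hypothetical brand terms, not real data.
import re

BRAND_TERMS = re.compile(r"\b(acmecorp|acme\s+widgets)\b", re.IGNORECASE)

def is_relevant(post: str) -> bool:
    """Keep only posts that actually mention the brand."""
    return bool(BRAND_TERMS.search(post))

posts = [
    "Loving my new Acme Widgets blender!",
    "Traffic on the N1 is terrible today",
    "AcmeCorp support still hasn't replied after 3 days",
]

# Only the two brand mentions survive; the clutter is discarded
# before it consumes storage or analytics capacity.
relevant = [p for p in posts if is_relevant(p)]
```

In practice the filter would run as close to the source as possible – at ingestion time – so that infrastructure is only ever provisioned for data that can plausibly yield insight.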
The requirement, then, is to ensure that big data is fit for purpose – a data quality problem! Of course, free format text data is not restricted to the Internet.
Business correspondence, such as emails, letters and facsimiles, can hold valuable and time-critical information or instructions, which can easily be overlooked due to sheer volumes. These oversights lead to additional administrative costs, or may even result in legal liability, if charges are not responded to timeously.
Applications such as Trillium Software’s claims data quality solution bridge the gap between traditional business analytics and big data analytics – delivering value today on existing data sets. The same technologies can be deployed tomorrow against social media sets, to ensure confidence that real insight will be gained from investments in big data infrastructure.
In the age of big data, the quantity of data is far outstripping its quality. Data quality tools should be applied to filter and discard useless and irrelevant data, separating it from the data that contains insight. As data volumes continue to grow exponentially, this data governance function will become ever more critical if infrastructure costs are not to spiral out of control.
Whether big data is just about volume, or about volume and variety, data quality will remain critical to deriving value. When planning for big data, don’t forget to plan for data quality.