Andrew C. Oliver’s (@acoliver) recent post “How to create a data lake for fun and profit” is an interesting take on the value of a data lake – an unstructured data warehouse where you pull all your different sources into one large “pool” of data. In contrast to data marts and warehouses, a data lake doesn’t limit the potential use cases by imposing predefined schemas or restricting which data is stored.
Andrew suggests that the architecture for a data lake is simple – a Hadoop Distributed File System (HDFS). The benefits are also simple – getting access to data no longer requires a complex and expensive data integration effort because the data is already there. To start a new project you simply provide appropriate access to the data and start your analysis.
The bulk of the grunt work comes in the form of building feeds from existing systems. You can go the programming route, or you can invest in easy-to-use data management platforms, such as Datameer, that cut the time and effort required to get data into and out of Hadoop by many months. Datameer also simplifies analytics against Hadoop, provides secure access to data, and simplifies many other routine data management tasks.
The article drove an interesting conversation with data governance expert Alan D. Duncan (@Alan_D_Duncan). His point: making data available to more people (the so-called democratisation of data) is a good thing, but without context the data is valueless.
Some of us may have used the term “data attic” to describe a data warehouse where data gets dumped to gather dust and never gets used. A data cesspool may be the data lake equivalent.
Alan’s point is that, without context, a data lake may be unmanageable. As I wrote in my post “What is Data Governance and why is it important”, data governance provides context for data and policies for its appropriate use. Research shows that organisations that base their big data plan on a data governance strategy get value.
I agree with Andrew and Alan that a data lake should be based on some planned use cases – these give context and scope for which data should be stored. What we don’t need to worry about is how we will store the data. The months of planning and development that go into building schemas can be ignored.
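To make the "no up-front schema" idea concrete, here is a minimal sketch of applying a schema only when the data is read. It assumes the raw feed is stored as JSON lines; the sample records, field names, and the `read_with_schema` helper are all hypothetical and are not taken from Andrew's article or from any product.

```python
import json

# Raw records exactly as they landed in the lake (hypothetical sample data).
RAW_EVENTS = [
    '{"customer_id": "42", "page": "/pricing", "ms": "130"}',
    '{"customer_id": "7", "page": "/docs"}',
    'not valid json',
]


def read_with_schema(lines, schema):
    """Project raw lines onto `schema` (field name -> converter) at read time."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the malformed line stays in the lake; this reader skips it
        if not isinstance(record, dict):
            continue
        # Fields missing from a record become None instead of failing the load.
        yield {
            name: convert(record[name]) if name in record else None
            for name, convert in schema.items()
        }


# Two use cases read the same raw data with different schemas.
traffic = list(read_with_schema(RAW_EVENTS, {"page": str}))
latency = list(read_with_schema(RAW_EVENTS, {"customer_id": int, "ms": int}))
```

Each use case decides which fields it needs and how to type them, so nothing has to be agreed before the data is stored.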
What we also need to do is track our lineage as we work with big data.
How did we join that unstructured web log to the structured CRM data?
What filters did we apply to isolate the signal from the noise?
Who should have access to this customer interaction data?
The lineage and governance of big data are no less important, and are arguably more difficult to manage, than the lineage of structured data. Tools such as Datameer answer these kinds of questions with a visual lineage that makes it easy for any user to understand how the data got to them, and what has happened to it along the way.
This ensures confidence that we are working with the right data and helps to stop errors in analysis based on poor data quality, sample bias, or other context-related issues.
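The three lineage questions above (how was the data joined, which filters were applied, who may see it) can be sketched in a few lines of code. This is only an illustration of the idea and not how Datameer or any other tool implements lineage. The `TrackedData` class, the sample rows, the role names, and the rule that a joined dataset is visible only to roles allowed on both inputs are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TrackedData:
    """Rows plus a record of how they were produced (hypothetical helper)."""
    rows: list[dict]
    lineage: list[str] = field(default_factory=list)
    allowed_roles: set[str] = field(default_factory=set)

    def filter(self, description, predicate):
        kept = [row for row in self.rows if predicate(row)]
        step = f"filter: {description} ({len(self.rows)} -> {len(kept)} rows)"
        return TrackedData(kept, self.lineage + [step], set(self.allowed_roles))

    def join(self, other, key):
        # Index the right-hand side by key, then pair up matching rows.
        index = {}
        for row in other.rows:
            index.setdefault(row[key], []).append(row)
        joined = [
            {**left, **right}
            for left in self.rows
            for right in index.get(left.get(key), [])
        ]
        step = f"join on '{key}' ({len(joined)} rows)"
        # Assumed policy: only roles allowed on both inputs may see the result.
        roles = self.allowed_roles & other.allowed_roles
        return TrackedData(joined, self.lineage + other.lineage + [step], roles)


web_logs = TrackedData(
    rows=[
        {"customer_id": 1, "url": "/pricing", "status": 200},
        {"customer_id": 2, "url": "/health", "status": 200},
        {"customer_id": 1, "url": "/signup", "status": 500},
    ],
    lineage=["source: web logs"],
    allowed_roles={"analyst", "marketing"},
)
crm = TrackedData(
    rows=[
        {"customer_id": 1, "name": "Acme"},
        {"customer_id": 2, "name": "Globex"},
    ],
    lineage=["source: CRM export"],
    allowed_roles={"analyst", "sales"},
)

result = web_logs.filter(
    "drop failed requests", lambda row: row["status"] == 200
).join(crm, "customer_id")

for step in result.lineage:
    print(step)
print(sorted(result.allowed_roles))
```

Because every operation returns a new object carrying the accumulated steps, anyone handed `result` can read back where the rows came from, what was filtered out, and which roles are cleared to use them.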