Data Lake vs Data Cesspool

Gary Allemann

Explore the value of a data lake and its impact on data quality in this insightful post. Learn how a data lake, an unstructured data warehouse, can bring together diverse data sources for Hadoop analytics and integration. Understand the importance of data lineage and context to ensure accurate analysis and decision-making.

Datameer Lineage for Hadoop: visual lineage and context for Hadoop analytics and integration

Andrew C. Oliver’s (@acoliver) recent post “How to create a data lake for fun and profit” is an interesting take on the value of a data lake – an unstructured data warehouse where you pull in all your different sources into one large “pool” of data.

Schema-on-Read

In contrast to data marts and warehouses, a data lake doesn’t limit the potential use cases by forcing predefined schemas or limits to existing data.
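As a minimal sketch of the schema-on-read idea, the Python below keeps raw records exactly as they arrived and lets each use case apply its own schema when it reads them. The sample events, the field names, and the `read_with_schema` helper are all hypothetical and only illustrate the principle; they are not taken from Andrew's article or from any particular platform.

```python
import json

# Hypothetical raw events, stored as-is with no schema enforced on write.
RAW_LINES = [
    '{"user": "a1", "page": "/home", "ms": 120}',
    '{"user": "b2", "page": "/pricing", "ms": "340", "referrer": "ad"}',
    '{"user": "a1", "page": "/pricing"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select fields and coerce their types."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field_name, (caster, default) in schema.items():
            value = record.get(field_name, default)
            # Missing values stay None instead of failing the whole read.
            row[field_name] = caster(value) if value is not None else None
        yield row


# Two use cases read the same raw data through different schemas.
PAGE_VIEWS = {"user": (str, None), "page": (str, None)}
LATENCY = {"page": (str, None), "ms": (int, None)}

page_views = list(read_with_schema(RAW_LINES, PAGE_VIEWS))
latency = list(read_with_schema(RAW_LINES, LATENCY))
```

Neither use case had to be known when the events were written: the page view analysis ignores the timing field, and the latency analysis tolerates records where it is missing or stored as text.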

Andrew suggests that the architecture for a data lake is simple – the Hadoop Distributed File System (HDFS). The benefits are also simple – getting access to data no longer requires a complex and expensive data integration effort because the data is already there.

To start a new project you simply provide appropriate access to the data and start your analysis.

Of course, since Andrew wrote his article, the options for data lakes have moved beyond Hadoop, with popular choices including AWS, Databricks, Microsoft Azure, Snowflake, and more. But the principles remain valid.

The bulk of the grunt work comes in the form of building feeds from existing systems. You can go the programming route, or you can invest in easy-to-use data preparation platforms that cut the time and effort required to get data in, and out, of Hadoop by many months.

Democratisation without context is useless

The article drove an interesting conversation with data governance expert Alan D. Duncan (@Alan_D_Duncan). His point – making data available to more people (the so-called democratisation of data) is a good thing, but without context it is valueless.

Some of us may have used the term “data attic” to describe a data warehouse where data gets dumped to get dusty and never get used. A data cesspool may be the data lake equivalent.

Alan’s point: without context, a data lake may be unmanageable.

As I argued in my post “Why do you need data governance”, data governance provides context for data, and provides policies for the appropriate use of data. Research shows that organisations that base their big data plan on their data governance strategy get value.

I agree with Andrew and Alan that a data lake should be based on some planned use cases – these give context and scope for which data should be stored. What we don’t need to worry about is how we will store the data. The months of planning and development that go into building schemas can be ignored.

What we also need to do is track our lineage as we work with big data.

How did we join that unstructured web log to the structured CRM data?

What filters did we apply to isolate the signal from the noise?

Who should have access to this customer interaction data?

The lineage and governance of big data are no less important, and are arguably more difficult to manage, than the lineage of structured data. Tools such as MANTA answer this kind of question with a visual lineage that makes it easy for any user to understand how the data got to them, and what has happened to it along the way.

This ensures confidence that we are working with the right data and helps to stop errors in analysis based on poor data quality, sample bias, or other context-related issues.
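The three lineage questions above (how was the data joined, what filters were applied, who should have access) can be sketched in a few lines of Python. This is an illustration only, assuming a hypothetical `Tracked` wrapper and made-up web log and CRM rows; it does not reflect how MANTA or any other lineage tool works internally.

```python
from dataclasses import dataclass, field


@dataclass
class Tracked:
    """Rows plus a log of every step that produced them (hypothetical helper)."""
    rows: list
    lineage: list = field(default_factory=list)


def load(name, rows, allowed_roles):
    # Record where the data came from and who may use it.
    roles = ", ".join(sorted(allowed_roles))
    return Tracked(list(rows), [f"load {name} (access: {roles})"])


def join(left, right, key):
    # Inner join on a shared key, carrying both lineage histories forward.
    index = {row[key]: row for row in right.rows}
    rows = [{**row, **index[row[key]]} for row in left.rows if row[key] in index]
    return Tracked(rows, left.lineage + right.lineage + [f"join on {key}"])


def keep(data, description, predicate):
    # Filter rows and log a human-readable description of the filter.
    rows = [row for row in data.rows if predicate(row)]
    return Tracked(rows, data.lineage + [f"filter: {description}"])


web_log = load("web_log", [
    {"customer_id": 1, "page": "/pricing"},
    {"customer_id": 2, "page": "/home"},
    {"customer_id": 3, "page": "/pricing"},
], {"analyst"})

crm = load("crm", [
    {"customer_id": 1, "segment": "enterprise"},
    {"customer_id": 2, "segment": "smb"},
], {"analyst", "sales"})

result = keep(
    join(web_log, crm, "customer_id"),
    "page == /pricing",
    lambda row: row["page"] == "/pricing",
)
```

Here `result.lineage` lists the two loads with their access roles, then the join, then the filter, so anyone handed `result.rows` can see how the data reached them.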

Tags:

data lake, data lineage

Responses to “Data Lake vs Data Cesspool”

The Scary Data Lake | Liliendahl on Data Quality

October 30

[…] The idea of having a data lake scares the hell out of data quality people as seen in the title used by Garry Allemann in the post Data Lake vs Data Cesspool. […]

Opinion: Data: A Concise Word For A Tremendous Amount Of Information

November 2

[…] in 2014, we identified that without context the data lake would become unmanageable. In 2018, we talked about how chaotic data lakes were making it impossible to deliver on the goals […]
