Big data or big disaster?

Explore the evolution of big data in South Africa: From the early days to today’s Hadoop data lakes transforming data architecture. Discover the challenges of data quality, governance, and democratization, and learn how a governed data catalog can unlock the potential of your enterprise information asset. Dive into the journey from data chaos to trusted…

When I first started posting about big data, very few users existed in South Africa.

Today, most large organisations have a Hadoop data lake, in many cases replacing traditional ETL and/or acting as a data archive, as well as a feeder to the enterprise data warehouse and various operational data marts.

In a few short years, Hadoop has moved from a new experiment to a core component of the data architecture.

When I asked the question “Data lake or data cesspool?” back in 2014, I raised the importance of data lineage in the data lake.

At the time, conventional wisdom held that the data lake did not need documentation, governance, or quality. This myth, like others common to the early days of big data analytics, has been thoroughly debunked.

The challenges of finding, understanding and delivering trusted data through the data lake are more relevant than ever.

In many cases, the intention of the data lake is to support a more agile approach to data sourcing: empowering IT staff and data scientists to deliver new insights more quickly, and enabling self-service business intelligence for end users.

This democratization of data, making data available to the people who need it, is a good thing.

But early adopters are finding that an ungoverned and undocumented data lake cannot deliver.

In the chaos of the data swamp, legitimate users cannot identify the data they need, or cannot trust it because they cannot verify its source, measure its timeliness or track its quality.

Conversely, without proper governance, it is possible for sensitive data to be exposed to unauthorized users. In the wake of the Facebook / Cambridge Analytica scandal, and with GDPR and PoPIA looming, business is beginning to understand the importance of protecting sensitive data from illegitimate or unethical uses.

In order to be trusted, and useful, the data lake must:

Make it easy to find the data I need
Make it easy to understand the context of the data – i.e. track its origins and technical details (lineage)
Make it possible to assess the quality of the data
Allow me to find similar (or related) data sets to broaden my analytics potential, and to compare various data sets against each other to find the best possible set for my purpose
Govern access to data (e.g. through data sharing agreements) to ensure that sensitive data sets are protected and to measure which data is useful
Recognise that the data lake is part of a broader data ecosystem and provide insights into how and where it fits in

One approach to solving these challenges that fits with the concept of data democratization is the governed data catalog.
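
To make the requirements listed above more concrete, here is a minimal sketch in Python of what a governed catalog might record for each data set. It is an illustration only, not the design of any particular product: the class names (CatalogEntry, DataCatalog), the fields, the 0.0 to 1.0 quality scale and the sample data set names are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one data set in the lake."""
    name: str
    description: str
    source_system: str                                     # origin of the data
    derived_from: list[str] = field(default_factory=list)  # upstream data sets (lineage)
    last_refreshed: date | None = None                     # timeliness
    quality_score: float | None = None                     # assumed scale: 0.0 to 1.0
    tags: set[str] = field(default_factory=set)            # used to find related data
    sensitive: bool = False
    approved_users: set[str] = field(default_factory=set)  # e.g. data sharing agreements


class DataCatalog:
    """Hypothetical in-memory catalog; a real one would sit on a metadata store."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}
        self.access_log: list[tuple[str, str, bool]] = []  # (user, data set, granted)

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[str]:
        """Find data sets whose name, description or tags mention the keyword."""
        kw = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if kw in e.name.lower()
            or kw in e.description.lower()
            or any(kw in t.lower() for t in e.tags)
        )

    def lineage(self, name: str) -> list[str]:
        """Return every upstream data set, nearest first."""
        seen: list[str] = []
        queue = list(self._entries[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent in seen:
                continue  # guard against cycles and shared ancestors
            seen.append(parent)
            if parent in self._entries:
                queue.extend(self._entries[parent].derived_from)
        return seen

    def related(self, name: str) -> list[str]:
        """Return other data sets that share at least one tag."""
        tags = self._entries[name].tags
        return sorted(
            e.name for e in self._entries.values()
            if e.name != name and e.tags & tags
        )

    def can_access(self, user: str, name: str) -> bool:
        """Check access and log the request so usage can be measured."""
        entry = self._entries[name]
        granted = (not entry.sensitive) or user in entry.approved_users
        self.access_log.append((user, name, granted))
        return granted


# Example usage with made-up data sets and users
catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="crm_customers_raw", description="Raw customer extract",
    source_system="CRM", tags={"customer"},
    sensitive=True, approved_users={"analyst_a"},
))
catalog.register(CatalogEntry(
    name="customer_churn_features", description="Churn model features",
    source_system="data lake", derived_from=["crm_customers_raw"],
    last_refreshed=date(2018, 5, 1), quality_score=0.92,
    tags={"customer", "churn"},
))

print(catalog.search("churn"))
print(catalog.lineage("customer_churn_features"))
print(catalog.can_access("analyst_b", "crm_customers_raw"))
```

Logging every access request, granted or not, is a deliberate choice in this sketch: the same record that protects sensitive data also shows which data sets people actually ask for.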