Today, most large organisations have a Hadoop data lake – in many cases replacing traditional ETL and/or acting as a data archive, as well as a feeder to the enterprise data warehouse and various operational data marts.
In a few short years, Hadoop has moved from a new experiment to a core component of the data architecture.
When I asked the question “Data lake or data cesspool?” back in 2014, I raised the importance of data lineage in the data lake.
At the time, conventional wisdom held that the data lake did not need documentation, governance, or quality. This myth, and others common to the early days of big data analytics, has been thoroughly debunked.
The challenges of finding, understanding and delivering trusted data through the data lake are more relevant than ever.
In many cases, the intention of the data lake is to support a more agile approach to data sourcing – empowering IT staff and data scientists to deliver new insights more quickly, and enabling self-service business intelligence for end users.
This democratization of data – making data available to the people who need it – is a good thing.
But early adopters are finding that an ungoverned and undocumented data lake cannot deliver.
In the chaos of the data swamp, legitimate users cannot identify the data they need, or cannot trust it because they cannot verify its source, measure its timeliness or track its quality.
Conversely, without proper governance, it is possible for sensitive data to be exposed to unauthorized users. In the wake of the Facebook / Cambridge Analytica scandal, and with GDPR and PoPIA looming, business is beginning to understand the importance of protecting sensitive data from illegitimate or unethical uses.
To be trusted and useful, the data lake must:
- Make it easy to find the data I need
- Make it easy to understand the context of the data – i.e. track its origins and technical details (lineage)
- Make it possible to assess the quality of the data
- Allow me to find similar (or related) data sets to broaden my analytics potential, and to compare various data sets against each other to find the best possible set for my purpose
- Govern access to data (e.g. through data sharing agreements) to ensure that sensitive data sets are protected and to measure which data is useful
- Recognise that the data lake is part of a broader data ecosystem and provide insights into how and where it fits in.
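To make the requirements above a little more concrete, here is a minimal sketch of what a catalog entry could record and what questions it could answer. It is an illustration only, not a description of any particular product: the `CatalogEntry` and `Catalog` classes, their fields, and the sample data set names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CatalogEntry:
    """One data set registered in a hypothetical data lake catalog."""
    name: str
    description: str
    tags: set = field(default_factory=set)
    upstream: list = field(default_factory=list)      # direct sources (lineage)
    quality_score: Optional[float] = None             # 0.0 to 1.0, None if unmeasured
    sensitive: bool = False
    allowed_roles: set = field(default_factory=set)   # roles covered by a sharing agreement


class Catalog:
    """Illustrative in-memory catalog; a real one would be a governed service."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Find data sets whose name, description or tags mention the keyword."""
        kw = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if kw in e.name.lower()
            or kw in e.description.lower()
            or kw in {t.lower() for t in e.tags}
        )

    def lineage(self, name):
        """All upstream data sets, nearest first (breadth-first walk)."""
        seen = []
        queue = list(self._entries[name].upstream)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                queue.extend(self._entries[current].upstream)
        return seen

    def related(self, name):
        """Other data sets that share at least one tag."""
        tags = self._entries[name].tags
        return sorted(
            e.name for e in self._entries.values()
            if e.name != name and e.tags & tags
        )

    def can_access(self, name, role):
        """Sensitive data sets are visible only to roles with an agreement."""
        entry = self._entries[name]
        return (not entry.sensitive) or role in entry.allowed_roles


# Hypothetical sample data sets
catalog = Catalog()
catalog.register(CatalogEntry(
    "crm_raw", "Raw customer records from CRM", {"customer"},
    sensitive=True, allowed_roles={"data_steward"}))
catalog.register(CatalogEntry(
    "customers_clean", "Deduplicated customer records", {"customer", "curated"},
    upstream=["crm_raw"], quality_score=0.97,
    sensitive=True, allowed_roles={"data_steward", "analyst"}))
catalog.register(CatalogEntry(
    "churn_features", "Features for churn model", {"customer", "ml"},
    upstream=["customers_clean"], quality_score=0.9))

print(catalog.search("customer"))
print(catalog.lineage("churn_features"))
print(catalog.can_access("crm_raw", "analyst"))
```

Even in this toy form, the same record serves both sides of the problem: it helps legitimate users find and trust data, and it gives governance a place to record who may see what.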
One approach to solving these challenges that fits with the concept of data democratization is the governed data catalog.
Two upcoming webinars, hosted by Information Management, will explore how the data catalog adds value.
On Thursday, 26 April, we will explore how data catalogs drive real results with business intelligence and analytics.
On Tuesday, 15 May, the webinar will explore the value of a data catalog in socializing data across the enterprise, and in delivering healthy data processes and a data culture.
Register now to learn more.