Data lineage is generally defined as a kind of data life cycle that tracks the data’s origins and where it moves over time, as well as changes (such as aggregations or transformations) that may happen to it as it moves.
Historically, data lineage has been associated with ETL (extract, transform, load), a process for moving and transforming data. ETL developers use the data lineage features embedded in their ETL tool of choice to visually represent ETL flows, identify issues and understand the impact of changes.
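To make "understanding the impact of changes" concrete, here is a minimal sketch that treats lineage as a directed graph and walks it downstream. The dataset names and transformations are invented for illustration, and this is not how any particular ETL tool stores its lineage.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source, target, transformation).
EDGES = [
    ("crm.customers", "staging.customers", "copy"),
    ("staging.customers", "dw.dim_customer", "deduplicate"),
    ("erp.orders", "dw.fact_sales", "aggregate by day"),
    ("dw.dim_customer", "report.sales_by_region", "join"),
    ("dw.fact_sales", "report.sales_by_region", "join"),
]


def build_graph(edges):
    """Map each source dataset to the datasets it feeds."""
    graph = defaultdict(list)
    for source, target, _transformation in edges:
        graph[source].append(target)
    return graph


def downstream(graph, start):
    """Breadth-first walk returning every dataset reachable from start."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Which datasets could be affected if crm.customers changes?
impacted = downstream(build_graph(EDGES), "crm.customers")
print(impacted)
```

Reversing the direction of the edges gives the opposite question, namely where a given report's data comes from.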
From Data Lineage to Business Traceability
Initially driven by compliance requirements such as BCBS 239, businesses have become increasingly aware of how critical lineage is to understanding the source of data and delivering well-understood, trusted reports.
However, technically oriented views of lineage often do not serve business users, who need a simplified view that shows at a glance that data is sourced from the right place and accurately represents that source.
Pioneered by business-oriented vendors such as our partner, Collibra, business traceability diagrams are designed to add business context to data lineage, allowing users to answer questions such as “Where does my data come from? What policies were used? What standards are applied?”
In a popular post, “Understanding the difference between Lineage and Traceability”, Collibra makes the case that the ability to present both technical lineage and business traceability diagrams is critical to understanding data and using it effectively.
ETL provides an incomplete picture
An unspoken truth, however, is that ETL typically provides an incomplete picture.
- As described above, business traceability adds business context to ETL processes. This context may also include manually captured steps in an ETL process. For example, there may be cases where humans are required to intervene and edit data, or add data points from the source, to meet specific criteria. These manual edits cannot be detected by an ETL process and must be added by hand to complete a lineage view.
- In many environments, formal ETL processes call stored procedures, or other database-level code, to perform transformations or aggregations. This code is not typically included in a lineage view.
- Finally, many organisations run multiple ETL tools for different purposes, perhaps using one of the big vendors for their data warehouse, a specialist like Syncsort Connect for Hadoop, and combinations of SQL code or Microsoft SSIS for departmental work. The lineage view embedded in any one of these tools covers only the flows that tool runs.
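The gaps listed above can be illustrated with a small sketch that merges lineage recorded by different sources and compares the result with what the ETL tool alone would show. Every name here (the `Edge` class, the datasets, the `origin` labels) is hypothetical, and the structure is an assumption made for illustration rather than any vendor's format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    origin: str  # which system or person recorded this step


# All names below are made up for illustration.
etl_tool_edges = [
    Edge("crm.customers", "staging.customers", "etl_tool"),
    Edge("staging.customers_clean", "dw.dim_customer", "etl_tool"),
]
stored_procedure_edges = [
    Edge("staging.customers", "staging.customers_clean", "stored_procedure"),
]
manual_edges = [
    Edge("finance.adjustments_sheet", "dw.dim_customer", "manual"),
]


def merge(*edge_lists):
    """Combine edge lists, dropping exact duplicates but keeping order."""
    merged, seen = [], set()
    for edges in edge_lists:
        for edge in edges:
            if edge not in seen:
                seen.add(edge)
                merged.append(edge)
    return merged


def sources_of(edges, target):
    """Walk upstream from target and return every contributing dataset."""
    parents = {}
    for edge in edges:
        parents.setdefault(edge.target, []).append(edge.source)
    found, stack = set(), [target]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


etl_only = sources_of(etl_tool_edges, "dw.dim_customer")
complete = sources_of(
    merge(etl_tool_edges, stored_procedure_edges, manual_edges),
    "dw.dim_customer",
)
print(sorted(etl_only))
print(sorted(complete))
```

In this toy example the ETL-only view stops at the cleaned staging table, because the stored procedure that produces it and the manual adjustment are invisible to the tool. The merged view traces the same table back to its real origins.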
How do we bring this all together?
Automation is key
A typical large organisation may have tens of thousands of ETL and SQL scripts and stored procedures that must be documented and maintained.
In a case study titled “The only reason developers exist”, our partner, MANTA, discussed the challenges involved in trying to keep this complex landscape up to date at a Fortune 500 bank.
The customer faced three common challenges:
- The inevitability of change means that ETL processes are in a constant state of flux
- The pain of documentation means that developers do not accurately document their processes
- The result is make-believe lineage: a documented view of the ETL world that does not reflect reality
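One way to picture make-believe lineage is as the difference between what the documentation says and what the code actually does. The sketch below compares two hypothetical sets of lineage edges. The dataset names are invented, and it does not represent how MANTA or any other product detects drift.

```python
# Hypothetical (source, target) pairs taken from hand-written documentation.
documented = {
    ("staging.orders", "dw.fact_sales"),
    ("dw.fact_sales", "report.revenue"),
}

# Hypothetical pairs recovered by scanning the code that is actually deployed.
actual = {
    ("staging.orders", "dw.fact_sales_v2"),
    ("dw.fact_sales_v2", "report.revenue"),
}


def lineage_drift(documented, actual):
    """Return edges that are documented but gone, and edges never documented."""
    return {
        "stale": sorted(documented - actual),
        "undocumented": sorted(actual - documented),
    }


drift = lineage_drift(documented, actual)
for kind, edges in drift.items():
    for source, target in edges:
        print(f"{kind}: {source} -> {target}")
```

Any non-empty result means the documented view no longer reflects reality, which is the situation that automated scanning is meant to prevent.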
Using MANTA allows companies to automatically update lineage across a range of ETL tools, data sources and reporting platforms, providing a consolidated and accurate view of their lineage and, if desired, pushing it to tools like Collibra to add the business context and business traceability.
Companies that need to understand the source of their data and get a true picture of how it moves and changes through their organisation should take a look.
More on this topic in future posts