The reality is that many organisations struggle to track how their analytics platforms actually work.
One challenge is that many analytics solutions leverage multiple technologies.
For example, we may source data from multiple applications – each with its own underlying database such as Oracle or MS SQL Server. Each of these platforms may have internal code that makes changes to data, or generates critical fields.
Our target platform (the analytics environment) may be yet another technology – such as Snowflake or Spark. Again, internal code may be used to make changes (such as aggregations) to data once it arrives.
We may use one or more ETL tools, or even custom code written in Java or SQL, to move data from these sources to our target. And once it is there we may add one or more reporting tools – such as SAS, Power BI or Tableau – each of which may again manipulate data.
Size and scale
In many organisations, there may be many thousands of individual scripts manipulating data. Maintaining these scripts is a huge challenge. Unsurprisingly, documenting these scripts often falls by the wayside.
More and more, we are moving our analytics platforms to the cloud. Hybrid cloud solutions combine on-premise data sources with cloud analytics platforms – adding complexity.
Data lineage provides a map of your data
We can think of data lineage as a map of our data – allowing us to easily track where data has come from, who is using it, and where it is going.
For example, when moving to the cloud, we need to understand what data we’re using. Who is consuming that data? And if we take this one system and replace it or move it to the cloud, what flows will need to get adjusted? What rows will need to get adjusted?
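To make these questions concrete, here is a minimal sketch of a lineage map as a directed graph. The table and report names are hypothetical, invented purely for illustration. Walking the graph forwards from a system shows what would be affected if we replaced it, and walking it backwards shows where a report's data came from.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source, target) means data flows from source to target.
EDGES = [
    ("crm.customers", "staging.customers"),
    ("erp.orders", "staging.orders"),
    ("staging.customers", "warehouse.sales_summary"),
    ("staging.orders", "warehouse.sales_summary"),
    ("warehouse.sales_summary", "report.monthly_sales"),
]


def build_graph(edges, reverse=False):
    """Build an adjacency map; reverse=True flips every edge."""
    graph = defaultdict(set)
    for source, target in edges:
        if reverse:
            graph[target].add(source)
        else:
            graph[source].add(target)
    return graph


def reachable(graph, start):
    """Breadth-first search: every node reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def downstream(edges, node):
    """Everything that consumes data from node, directly or indirectly."""
    return reachable(build_graph(edges), node)


def upstream(edges, node):
    """Everything that node's data comes from, directly or indirectly."""
    return reachable(build_graph(edges, reverse=True), node)


if __name__ == "__main__":
    print("Affected if erp.orders moves:", sorted(downstream(EDGES, "erp.orders")))
    print("Sources of report.monthly_sales:", sorted(upstream(EDGES, "report.monthly_sales")))
```

In this toy example, moving erp.orders to the cloud touches the staging table, the warehouse summary and the monthly report. A real landscape has many thousands of edges, which is why the map needs to be built automatically rather than by hand.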
If we want to keep the system safe, we need to have a map. We need to know how the information flows so that when something goes wrong, we know where to look. Not having a map is not really an option: without one, we still have to build it when something goes wrong and we need to search for a solution. The result is that the process takes a lot more time: projects are delayed, migrations take forever, reports are inaccurate, and it’s impossible to manage risk. That’s where data lineage comes in.
Working with our technology partner, MANTA, we automatically scan the code across your data landscape (source systems, stored procedures, ETL processes and BI tools) to extract how data moves and how it is changed along the way, and present this as visual maps.
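As a rough illustration of what scanning code for data movement means, the sketch below pulls table-level lineage out of a simple SQL script using regular expressions. This is a toy under stated assumptions, not a description of how MANTA works: it only understands plain INSERT INTO ... SELECT statements, the table names are hypothetical, and a production scanner would parse each technology's code properly rather than pattern-match it.

```python
import re

# Toy patterns: the table being written to, and the tables being read from.
INSERT_TARGET = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_TABLE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_edges(sql):
    """Return a set of (source_table, target_table) pairs found in the script."""
    edges = set()
    for statement in sql.split(";"):
        target = INSERT_TARGET.search(statement)
        if not target:
            continue  # this toy scanner ignores statements that do not write data
        for source in SOURCE_TABLE.findall(statement):
            edges.add((source.lower(), target.group(1).lower()))
    return edges


# Hypothetical script, invented for illustration.
SCRIPT = """
INSERT INTO warehouse.sales_summary
SELECT c.region, SUM(o.amount)
FROM staging.orders o
JOIN staging.customers c ON c.id = o.customer_id
GROUP BY c.region;
"""

if __name__ == "__main__":
    for source, target in sorted(extract_edges(SCRIPT)):
        print(f"{source} -> {target}")
```

Even this small script shows why documentation falls behind: the aggregation and the join both live inside the code, so the only reliable record of where the summary table comes from is the code itself.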
Best of all, we track changes to these scripts over time so that you can quickly and easily figure out what changed when something breaks. We help BI teams to deliver new reports more quickly, and IT operations teams to keep critical data pipelines running.
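One simple way to picture tracking changes over time is to compare two snapshots of the lineage map. The sketch below is a minimal illustration with hypothetical snapshot data, not a description of any product's internals: it reports which data flows appeared and which disappeared between two scans.

```python
def diff_lineage(before, after):
    """Compare two snapshots of (source, target) edges."""
    before, after = set(before), set(after)
    return {"added": after - before, "removed": before - after}


# Hypothetical snapshots taken before and after a script was changed.
LAST_WEEK = [
    ("staging.orders", "warehouse.sales_summary"),
    ("staging.customers", "warehouse.sales_summary"),
]
TODAY = [
    ("staging.orders", "warehouse.sales_summary"),
    ("staging.customers_v2", "warehouse.sales_summary"),
]

if __name__ == "__main__":
    changes = diff_lineage(LAST_WEEK, TODAY)
    print("Added flows:", sorted(changes["added"]))
    print("Removed flows:", sorted(changes["removed"]))
```

If the sales summary broke this week, the diff points straight at the swapped customer table as the first place to look.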
Listen to Jan Ulrych, VP of Research & Education at MANTA, discuss the business case for data lineage using real-world examples, and how to manage the challenges of size, scale and complexity, in the podcast below.