This post was first published on the Manta blog and is republished with permission
An explosion of data should not scare us. It is an explosion of data processing that might be our doom. MANTA’s CEO Tomas Kratky explains how to deal with the growing complexity of our data pipelines in 2021.
In our end-of-the-year blog post, we usually thank our customers, partners, investors, and other supporters because we would not exist without them, and happy customers are what keeps us running in the first place. We also do it to praise ourselves and to mention a few amazing achievements so everybody knows we had an excellent year 😎
Not Knowing Costs Lives and Money
But that somehow seems irrelevant when we look back at 2020. It was definitely not an easy year for anyone. We, both as a community and as individuals, were tested in many different and challenging ways. Many lives were lost and even more lives were drastically impacted by COVID-19. We had to live (and we still live) with a lot of uncertainty all around us. We did not know how dangerous the virus would be, if there would be any long-term effects on our health, if/how easily we could get sick multiple times, and the list goes on. And that kind of “not knowing” sucks. Many of us decided to stay at home, limit our social interactions, and have our groceries delivered for many months to protect ourselves and the ones we love.
Even though COVID-19 is not my topic today (this is a data lineage blog, after all), the past eight months have made me think about how impactful not knowing actually is, and how easy it is to ignore the fact that we do not know something unless we are in real danger. And this is very relevant to what we do here at MANTA as well. So, let’s talk about uncertainty and the price we pay for it when it comes to our data pipelines.
Complete Visibility into Your Data Pipeline is Critical
When I founded MANTA in 2016, data lineage was not a well-understood concept at all. Things changed a bit in 2018 with more focus on compliance, data governance, and privacy. Understanding the journey of data became critical for almost every organization. Suddenly, it was not enough to know where the data is located and what types of data we have; it was also important to understand and control how different data locations are connected as part of a data lifecycle. And there was a price to pay for “not knowing”—penalties, fees, and potentially, the damaged reputation of a company.
30% to 40% of data engineering resources are wasted on frustrating and repetitive manual tasks.
But there was (and still is) another, more disruptive and less visible change happening, turning data lineage into something absolutely critical for every data-driven company—the explosion of our data stack. Over the past few years (less than 5), most companies have adopted (at least some) new technologies for data streaming, big data, cloud data storage, modern AI/ML, etc. Our data landscape has evolved from a simple architecture with basic ETL/ELT, a data warehouse, and a bunch of reports into a living and breathing ecosystem with tens or hundreds of different technologies.
Our data pipelines today are very complex. They are also woefully misunderstood and poorly managed. And privacy and compliance are no longer the main issues.
- As a manager, you open a report/dashboard to make an important decision, and you have no idea how the numbers there are calculated or where they come from. And no one can answer your questions with a sufficient level of certainty in a reasonable amount of time. That drastically undermines trust in data.
- As a data engineer, you are terrified of making even a small change to the environment because you have no idea what else may be impacted by that change. And if a data incident happens, you spend days, weeks, or sometimes even months (yes, I still remember that stupid project 15 years ago) chasing it through your data pipeline to find the root cause.
The price we pay for limited visibility and observability of our data pipelines, the price we pay for “not knowing”, is enormous: wrong and/or late business decisions that can cost millions of dollars, and 30% to 40% of data engineering resources wasted on frustrating and repetitive manual tasks. I am not even talking about slow data incident resolution and its impact on the daily business of a company.
An Incomplete List of Missteps and Failures
It is not an explosion of data we should be worried about. It is an explosion of data processing, the growing complexity of our data pipelines, that may be our doom if not handled properly. We can do amazing things with data, but only if our data pipelines are under control. And even though we have seen steady growth in the importance of data lineage since we started MANTA, the data lineage boom really started in 2020 (and it is still just the beginning!).
We see traditional data management vendors with new data lineage options, open-source solutions popping up everywhere, and more and more specialized players. Has our “not-knowing problem” been solved? Unfortunately, not really! Data lineage became a buzzword in 2020, and now it’s time to pay the price for that. Here is a list of the most annoying data lineage missteps I saw in the last 12 months.
1. Messing with Scanners
Every data lineage solution starts with automated scanning. The less you do manually, the better for everyone. That is why you see vendors promoting long lists of technologies they support. But there is a catch: in most cases, like data catalog solutions, they scan only data structures (tables, columns, fields, etc.), not data processing logic.
In most cases, data catalog solutions can only scan tables, columns, fields, and similar data structures.
If cataloging data is your goal, OK. But do not expect to see high-quality (or any) lineage. A good example is a vendor claiming to have scanners for MongoDB or Amazon S3. It sounds good, but both are “containers” for data, and there is no logic hidden in them; all the logic typically lives in your ETL tool, Python or Java code, etc. Other cases are less obvious: a scanner for Kafka or Spark, where understanding Java and Python is key, or a scanner for Oracle, where the PL/SQL procedural language is what really matters. This “sales technique” is quite popular among larger, traditional data management vendors.
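To make the difference concrete, here is a toy sketch (with hypothetical table and column names) contrasting what a structure-only catalog scan sees with what a logic-aware scanner must do: the catalog knows the tables and columns exist, but only parsing the ETL statement itself reveals which source columns feed which target columns. Real scanners parse full SQL grammars; this regex handles exactly one statement shape for illustration.

```python
import re

# A structure-only catalog scan sees just this: tables and their columns.
catalog = {
    "stg_orders": ["order_id", "amount", "tax"],
    "fact_revenue": ["order_id", "gross"],
}

# Column-level lineage lives in the processing logic, e.g. this ETL statement:
sql = ("INSERT INTO fact_revenue (order_id, gross) "
       "SELECT order_id, amount + tax FROM stg_orders")

def toy_lineage(statement):
    """Toy logic scan: map each target column to the source columns feeding it."""
    m = re.match(
        r"INSERT INTO (\w+) \(([^)]*)\) SELECT (.+) FROM (\w+)",
        statement,
    )
    target, target_cols, select_list, source = m.groups()
    targets = [c.strip() for c in target_cols.split(",")]
    lineage = {}
    # Split the select list on commas (no nesting in this toy example),
    # then collect every identifier referenced by each expression.
    for col, expr in zip(targets, select_list.split(",")):
        sources = re.findall(r"[A-Za-z_]\w*", expr)
        lineage[f"{target}.{col}"] = [f"{source}.{s}" for s in sources]
    return lineage

print(toy_lineage(sql))
# fact_revenue.gross traces back to stg_orders.amount AND stg_orders.tax --
# a fact no amount of table/column metadata in `catalog` could reveal.
```

The point is not the parser (a real one is vastly more complex) but where the information lives: the `amount + tax` transformation exists only in the code, never in the structures.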
2. Pretending Runtime Lineage is Data Lineage
Doing data lineage well is very hard, and people do not like to hear that. So they look for shortcuts. One of the major themes of 2020 was operational/runtime lineage (there is no formal vocabulary, so let’s not fight over naming conventions), and I expect to hear about it even more in 2021. While data lineage is a map of all possible data flows and is typically derived from the processing logic itself (by analyzing and decoding it), runtime lineage represents information about data flows executed recently and is usually derived from log files and execution plans generated by data processing tools.
One major disadvantage of the runtime approach is that it only gives you high-level information about data flows; you do not see the details of the calculations. Even more important, runtime lineage, by definition, might be good for traditional incident resolution (because you only need to follow recently executed data flows), but it is of no use for incident prevention and impact analysis (because you need to see all possible flows of data from the location you plan to change, not only the flows executed recently).
And even if you collect runtime lineage over a very long period of time, it is still not good enough because your environment is not stable and undergoes constant change, so runtime lineage from two weeks ago may not be relevant today. (Do not get me wrong—runtime metadata is a very important piece of the puzzle to better control your data pipelines, but it is step number two after you have data lineage in place.) This approach is typical for new open-source data lineage / data catalog solutions.
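A small sketch of why this matters for impact analysis, using hypothetical dataset names: the static lineage graph (derived from the processing logic) contains a yearly job that recent execution logs have never seen, so a runtime-only view would let a change to `stg.orders` silently break the yearly report.

```python
# Static lineage: every possible flow, derived from the processing logic itself.
static_edges = {
    ("raw.orders", "stg.orders"),
    ("stg.orders", "mart.daily_revenue"),
    ("stg.orders", "mart.yearly_report"),  # a job that runs only once a year
}

# Runtime lineage: only the flows observed in recent execution logs.
runtime_edges = {
    ("raw.orders", "stg.orders"),
    ("stg.orders", "mart.daily_revenue"),
}

def impacted(edges, node):
    """All datasets reachable downstream of `node` (a basic impact analysis)."""
    seen, frontier = set(), {node}
    while frontier:
        frontier = {dst for src, dst in edges if src in frontier} - seen
        seen |= frontier
    return seen

# Planning a change to stg.orders: the runtime view misses the yearly report.
missed = impacted(static_edges, "stg.orders") - impacted(runtime_edges, "stg.orders")
print(missed)  # {'mart.yearly_report'}
```

The same traversal over both edge sets gives different answers, which is exactly the gap between "flows that ran recently" and "flows that can run".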
3. Faking Lineage with AI and Machine Learning
What is worse than one buzzword (data lineage)? Two buzzwords, obviously (data lineage discovery with advanced AI/ML techniques). With more demand for data lineage, traditional players in data quality, data privacy, and data cataloging see it as a natural area for expanding their product portfolios. And since they are very strong in data discovery, profiling, and classification using AI/ML techniques, many of them have decided to apply the same approach to deciphering data lineage. It sounds nice because you do not need to bother with building scanners for different technologies and programming languages, but the catch is that you have no way of discovering true data lineage by looking at the data alone. First, there are no details about the transformations (because you are only looking at data), and second, if the data logic is more complex, there is no way to find the connection (flow) just by looking at the data. But who cares, it is AI/ML, so it will solve all our problems, right?
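Both failure modes are easy to demonstrate with a toy example (the column names and the similarity metric are my own illustrative assumptions, not any vendor's actual algorithm). Profiling-style discovery compares column contents, so two unrelated columns over the same value domain look "connected", while a genuine flow through a simple transformation looks unrelated:

```python
# Profiling-based "lineage discovery" guesses flows by comparing column contents.
customers_zip = [10001, 10002, 10003, 10004]
stores_zip    = [10001, 10002, 10003, 10004]  # same value domain, but NO data flow
net_amount    = [100.0, 250.0, 80.0]
gross_amount  = [121.0, 302.5, 96.8]          # net * 1.21 -- a REAL flow

def overlap(a, b):
    """Jaccard overlap of column value sets, a typical profiling signal."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

print(overlap(customers_zip, stores_zip))  # 1.0 -> false positive (no real flow)
print(overlap(gross_amount, net_amount))   # 0.0 -> the real flow is invisible
```

Because the `* 1.21` transformation exists only in code, no comparison of the data values can recover it, and that is the structural limit of data-only lineage discovery.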
There Are No Shortcuts to Success
This is definitely not a complete list of all the data lineage missteps of 2020, but these are surely my favorites. As I mentioned in the first half of this article, data lineage is key to restoring the visibility and observability of our data pipelines. It is critical for limiting the scope of “not knowing” and keeping the associated costs to an absolute minimum. 2020 was a very important year for data lineage, and its significance will grow even more in 2021. As Fred Brooks wrote more than 30 years ago: “There is no silver bullet.” And data lineage is no exception. Doing data lineage is far from easy, and if you take a shortcut, you are likely to end up at a dead end.
That being said, I wish all of you health, love, and happiness in 2021. And fewer data lineage missteps!