Navigating Data Complexity in 2021 | The Importance of Data Lineage

Learn how to handle the explosion of data processing and the complexity of data pipelines in 2021 to avoid the high costs of uncertainty. Explore the critical importance of data lineage for data-driven companies and the common missteps to avoid.

An explosion of data should not scare us. It is an explosion of data processing that might be our doom. MANTA’s CEO Tomas Kratky explains how to deal with the growing complexity of our data pipelines in 2021.

The Price of Uncertainty

In uncertain times, the price of not knowing can be steep. 2020 was undoubtedly a challenging year for all, characterized by unprecedented trials and the devastating impact of COVID-19. We faced uncertainty about the virus’s danger, its long-term effects, and our vulnerability to multiple infections. Such uncertainty is never comfortable.

In response, many of us opted for safety measures like staying home, limiting social interactions, and relying on home deliveries. None of this concerns data lineage directly, but it shows what happens when we lack the facts to base decisions on. The same applies to our data pipelines and the uncertainty that surrounds them.

A Clear View of Your Data Pipeline is Crucial

When Kratky founded MANTA in 2016, the concept of data lineage was still in its infancy. Fast forward to 2021, and it’s become a linchpin for almost every organization. Beyond merely knowing the location and types of data, understanding how various data points connect within the data lifecycle has become essential. Lacking this understanding comes at a price: fines, penalties, and damage to a company’s reputation.

The landscape of data processing has evolved rapidly in the last few years, with companies adopting various technologies like data streaming, big data, cloud storage, and modern AI/ML. Our data pipelines have transformed from simple structures into complex ecosystems, often with numerous technologies at play.

[Figure: Modern data stacks are incredibly complex.]

Today, our data pipelines are intricate, misunderstood, and inadequately managed. Trust in data erodes when decision-makers open reports or dashboards without a clear understanding of how the numbers are calculated. Data engineers fear making even minor changes due to the uncertainty of their impact. When data incidents occur, tracking them through our convoluted data pipelines becomes a monumental task.
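To make the impact and incident-tracing problem concrete, here is a minimal Python sketch that models a pipeline as a directed graph and walks it in both directions. It is an illustration only, not a description of how MANTA or any other tool works. The asset names (such as `staging.orders`) and the functions `downstream` and `upstream` are hypothetical, invented for this example.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source, target) means data flows source -> target.
EDGES = [
    ("crm.customers", "staging.customers"),
    ("erp.orders", "staging.orders"),
    ("staging.customers", "dw.customer_orders"),
    ("staging.orders", "dw.customer_orders"),
    ("dw.customer_orders", "report.revenue_dashboard"),
    ("staging.orders", "report.shipping_kpis"),
]


def build_graph(edges, reverse=False):
    """Return an adjacency map; with reverse=True, edges point target -> source."""
    graph = defaultdict(set)
    for source, target in edges:
        if reverse:
            graph[target].add(source)
        else:
            graph[source].add(target)
    return graph


def reachable(graph, start):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}


def downstream(edges, asset):
    """Impact analysis: which assets could be affected if `asset` changes?"""
    return reachable(build_graph(edges), asset)


def upstream(edges, asset):
    """Incident tracing: which assets could a bad value in `asset` have come from?"""
    return reachable(build_graph(edges, reverse=True), asset)


if __name__ == "__main__":
    print(sorted(downstream(EDGES, "staging.orders")))
    print(sorted(upstream(EDGES, "report.revenue_dashboard")))
```

With the edges in hand, both questions reduce to a graph traversal. The hard part in practice is obtaining complete and accurate edges in the first place, which is the subject of the rest of this article.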

The price we pay for limited visibility into our data pipelines, the price we pay for “not knowing,” is staggering—incorrect or delayed decisions that can cost millions of dollars, along with a significant waste of data engineering resources. And let’s not forget the impact on daily business operations.

Common Missteps and Failures

It’s not the sheer volume of data that should concern us but the complexity of data processing within our pipelines. To harness the full potential of data, we must first gain control over our data pipelines. While data lineage has gained prominence over the years, the real data lineage boom began in 2020, and it’s just the beginning.

We’ve witnessed traditional data management vendors introducing new data lineage options, open-source solutions emerging, and specialized players entering the arena. However, has this solved our “not-knowing problem”? Unfortunately, not entirely. Here are some of the most prevalent missteps in data lineage observed over the past year:

“In most cases, data catalog solutions can only scan tables, columns, fields, and similar data structures.”
Tomas Kratky, MANTA

Misguided Scanners: Many data lineage solutions emphasize automated scanning, aiming to minimize manual effort. While they may support various technologies, they typically focus on cataloging data structures like tables and columns and neglect the data processing logic that connects them. Without that logic, high-quality lineage is out of reach.
Runtime Lineage Misconception: 2020 witnessed a surge in operational/runtime lineage, but this approach offers only high-level insights into recent data flows. It fails to provide the level of detail required for incident prevention and impact analysis, making it unsuitable for comprehensive data lineage.
AI and Machine Learning Illusion: Some vendors have attempted to leverage AI and machine learning for data lineage discovery, but this approach falls short in deciphering true data lineage. It lacks details about data transformations and struggles with complex data logic.
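The first misstep above can be illustrated with a small Python sketch. It is a toy example under stated assumptions, not a description of any vendor’s scanner: the `CATALOG` and `TRANSFORMATIONS` structures, the table and column names, and the `column_lineage` function are all hypothetical. The transformation is assumed to be already parsed into target and source columns, which in a real pipeline is the difficult step.

```python
# Hypothetical result of a structure-only scan: tables and their columns.
CATALOG = {
    "orders": ["order_id", "amount", "currency"],
    "fx_rates": ["currency", "rate_to_usd"],
    "revenue_report": ["total_usd"],
}

# Hypothetical processing logic, already parsed into target/source columns.
# In a real pipeline this would have to be extracted from SQL, ETL jobs, or code.
TRANSFORMATIONS = [
    {
        "target": ("revenue_report", "total_usd"),
        "sources": [("orders", "amount"), ("fx_rates", "rate_to_usd")],
        "expression": "SUM(orders.amount * fx_rates.rate_to_usd)",
    },
]


def column_lineage(catalog, transformations):
    """Map each cataloged column to the list of columns it is computed from."""
    lineage = {
        (table, column): []
        for table, columns in catalog.items()
        for column in columns
    }
    for step in transformations:
        # setdefault tolerates targets that the structure scan did not catalog.
        lineage.setdefault(step["target"], []).extend(step["sources"])
    return lineage


if __name__ == "__main__":
    structure_only = column_lineage(CATALOG, [])
    with_logic = column_lineage(CATALOG, TRANSFORMATIONS)
    print(structure_only[("revenue_report", "total_usd")])
    print(with_logic[("revenue_report", "total_usd")])
```

With the catalog alone, every column has an empty source list. The connection between `revenue_report.total_usd` and its inputs appears only once the processing logic is supplied.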

There Are No Shortcuts to Success

While these are not the only missteps in data lineage, they are certainly noteworthy. As we’ve explored earlier, data lineage is pivotal in restoring visibility and observability in our data pipelines. It plays a critical role in minimizing the scope of “not knowing” and the associated costs. Data lineage is no silver bullet; it’s a complex undertaking that can’t be cut short. Shortcuts can, and do, lead us down dead ends.

In conclusion, 2020 marked a pivotal year for data lineage, and its significance will only grow in 2021. Remember, there are no shortcuts to success in the realm of data lineage. It’s a challenging journey, and cutting corners is likely to lead to problems down the road.