Spark is dead – long live …?

Discover the changing landscape of Big Data as Hadoop’s prominence wanes. Explore the potential of scalable storage and analytics capabilities and learn how new technologies like Spark, Flink, and more are reshaping the data analytics market.



When this post was written, Hadoop was largely synonymous with Big Data.

This is because Hadoop had the potential to alter how we analyse data – by providing relatively cheap, highly scalable storage and analytics capabilities that allow businesses to ask, and answer, questions that were previously unanswerable.

The promise of Big Data has raised the profile of data at C-level, created new roles such as the Chief Data Officer, and driven demand for previously unloved disciplines such as data governance.

In spite of the interest and the hype, Gartner research at the time showed that only about 40% of companies had made serious investments in Hadoop, with others expected to begin within the next few years.

The evolving big data landscape

One factor that hindered adoption was that the Hadoop framework was still evolving.

Hadoop pioneer and big data expert Stefan Groschupf discussed this in a webinar titled "The Everchanging Hadoop Ecosystem. What does it mean for you?"

Stefan had been involved with the development of Hadoop from its early days, first as an engineer and data scientist and, more recently, as CEO of the big data discovery platform Datameer.

The Hadoop ecosystem is based on an open-source framework, with commercial variations available from various vendors. The framework consists of multiple components that can, very simplistically, be described as various engines for executing analyses in different ways.

What is the best computing engine within Hadoop?

One of the earliest and most commonly used is MapReduce – an approach for processing and generating large data sets on parallel, distributed clusters. For sheer computing power at a relatively low cost, MapReduce is hard to beat.
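To make the programming model concrete, here is a minimal, hypothetical sketch of the classic word-count job written as Hadoop Streaming-style mapper and reducer scripts in Python. The file names and set-up are assumptions for illustration only; a real job would be submitted to the cluster via the Hadoop Streaming jar.

    # mapper.py - emits a (word, 1) pair for every word read from stdin (the "map" step)
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py - sums the counts per word; Hadoop sorts the mapper output so
    # that identical keys arrive consecutively (the "reduce" step)
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The point is not the code itself but the model: the map and reduce steps are embarrassingly parallel, which is what lets the same two small scripts be spread across a large, relatively cheap cluster.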

But no single computing framework is ideal for every analytic task.

That’s why many big data solutions focus on specific computation frameworks that meet different needs – for example:

  • In-memory machine learning or stream processing frameworks
  • Proprietary in-memory software, where data is analyzed outside the Hadoop ecosystem
  • Large-scale data analytics frameworks, where the analysis must be distributed across a cluster
  • Small data analytics/BI platforms, where the analysis must be done on a single, high-memory machine

Computing engines such as Spark, Mahout, YARN, Tez, and many more each meet specific analytics needs well but may perform poorly at others.

This is where the real challenge, complexity, and cost of Hadoop lie hidden:

  1. Traditionally, Hadoop has required API calls to be written in low-level code – Java, R or Python are common – to load, integrate, prepare, analyse and visualise data.
  2. Low-level code implies scarce, expensive technical resources and long development cycles. Early Hadoop deployments ran over similar timescales to traditional data warehousing projects – 6, 12, even 18 months or more before real value was delivered.
  3. Code does not adapt to changes in the platform. Newer versions of the widely used Mahout machine learning engine, for example, use a completely different API to earlier versions. This means that engineers need to redevelop all their Mahout code in order to upgrade to newer versions of Hadoop.
  4. Code does not cater to emerging (or failing) computing engines. Last year, for example, Apache Spark was being touted as the next big thing – offering superior performance to MapReduce for certain analytics jobs. Now, newer components, such as Flink, are proving to be much quicker than Spark. Having spent 8 months redeveloping your MapReduce code to run on Spark (see the sketch after this list for how different the two APIs look), do you really want to go through that pain and expense again to take advantage of Flink? Only to find that it too may have a limited shelf life?
  5. The best framework for development may not be the best for production. Factors such as the size of the data set, the number of Hadoop nodes available, and even the type of analysis mean that one engine may perform better in development (against a smaller sample data set, or on fewer nodes) while another performs best in production. Data scientists who cannot code for this complexity waste time and resources.
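To illustrate point 4: the word count sketched earlier as a mapper and reducer would, on Spark, typically be rewritten against an entirely different API. The following is an assumed PySpark sketch – paths and configuration are hypothetical – shown only to make the scale of the rework visible; none of the MapReduce-style code carries over.

    # wordcount_spark.py - the same word count, rewritten against the Spark API
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")

    counts = (
        sc.textFile("hdfs:///data/input")         # hypothetical input path
          .flatMap(lambda line: line.split())     # roughly the "map" step
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)        # roughly the "reduce" step
    )
    counts.saveAsTextFile("hdfs:///data/output")  # hypothetical output path
    sc.stop()

Moving this job again, to Flink or to whatever comes next, would mean another rewrite – which is exactly the pain the list above describes.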

The shifting Big Data landscape

Today’s big data landscape is no longer synonymous with Hadoop. Entrants such as Databricks, Snowflake, AWS and Microsoft Azure have reshaped the big data analytics market, offering a wide range of technology choices, mostly in the Cloud, each with its own pros and cons.

Early Hadoop adopters had to learn to abstract their Hadoop development layer from the underlying computing engines. This meant that they could quickly adapt to changes and future-proof their big data development effort.
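One simple way to picture that abstraction – this is a toy sketch, not any vendor's actual API – is to write the analysis once against a small interface and plug engine-specific adapters in behind it:

    # A toy illustration of keeping the analysis independent of the engine.
    # LocalEngine stands in for an adapter that would, in practice, submit the
    # job to MapReduce, Spark or Flink; all names here are hypothetical.
    class LocalEngine:
        def run_word_count(self, lines):
            counts = {}
            for line in lines:
                for word in line.split():
                    counts[word] = counts.get(word, 0) + 1
            return counts

    def analyse(engine, lines):
        # The analysis only talks to the interface, never to engine internals,
        # so swapping engines means swapping one adapter, not rewriting the job.
        return engine.run_word_count(lines)

    if __name__ == "__main__":
        print(analyse(LocalEngine(), ["to be or not to be"]))

Swapping MapReduce for Spark, or Spark for Flink, then becomes a change to one adapter rather than to every analysis built on top of it.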

Modern businesses must similarly make choices to avoid vendor lock-in.

What will your company do?
