The Evolution of Hadoop

As we leave 2020 behind, I felt this was a good opportunity to review the evolution of Hadoop. Happy new year, everyone.

This post was first published on the Precisely blog.

When Hadoop was initially released in 2006, its value proposition was revolutionary: store any type of data, structured or unstructured, in a single repository free of limiting schemas, and process that data at scale across a compute cluster built from cheap, commodity servers. Gone were the days of trying to scale up an on-premises legacy data warehouse built on expensive hardware. Processing more data was as simple as adding a node to the cluster. As the variety and velocity of data continued to grow, Hadoop provided a mechanism to leverage all of that data to answer pressing business questions.

We’ve come a long way since Hadoop first burst onto the scene. As we look at the cloud transformation organizations are embarking on, we at Precisely would like to trace how Hadoop has changed since then, and where we see it going.

Early days

Hadoop’s initial form was quite simple: a resilient distributed file system, HDFS, tightly coupled with a batch compute model, MapReduce, to process the data stored in that file system. Users would write MapReduce programs in Java to read, process, sort, aggregate, and manipulate data to derive key insights. The model was impressive, but the ongoing challenge of finding developers comfortable writing Java MapReduce code, and the inherent complexity of doing so, led to the release of query engines like Hive and Impala. With these technologies, users familiar with SQL could leverage the power of Hadoop without needing to understand MapReduce code.
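To make the batch model concrete, here is a minimal sketch of the map, shuffle, and reduce steps as a word count. It is plain single-process Python, not Hadoop's Java API, and the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions rather than anything Hadoop defines. On a real cluster the framework would run many mappers and reducers in parallel and handle the shuffle itself.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (the framework does this on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: aggregate all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data at scale"]))
```

Even for a job this small, the developer has to think in terms of key/value pairs and separate phases. In a SQL engine, the same aggregation would be roughly a single GROUP BY query over a table of words, which is a large part of why Hive and Impala found an audience.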

Apache Spark joins the party

Hadoop took a significant step forward with the release of YARN in 2012 as an “operating system” of sorts for the platform. With YARN, MapReduce was no longer the only data processing paradigm available on Hadoop. This was a monumental change, as it signaled Hadoop’s shift from being a single product to an ecosystem with a variety of different tools in the stack.

As Hadoop was maturing, Apache Spark was being developed at Berkeley. Designed as a scalable compute framework for memory-intensive workloads, with no native storage, Spark was a natural fit within the Hadoop ecosystem. Paired with Hadoop’s HDFS for data storage, Spark became a compelling compute alternative to MapReduce for workloads within Hadoop. This allowed users to leverage Spark for machine-learning applications, accelerated ETL workloads, and stream processing using Spark Streaming. Clearly, Hadoop was growing to accommodate a wider variety of workloads.

Hadoop in the age of the cloud

This brings us to the cloud transformation of today. While there has been significant consolidation in the Hadoop vendor market over the past five years, there are still a variety of Hadoop offerings available to organizations. AWS, Azure, and GCP all have their own Hadoop-as-a-service offerings (EMR, HDInsight, and Dataproc, respectively).

On the other end of the spectrum, Cloudera offers the Cloudera Data Platform (CDP) across datacenter, private cloud, and public cloud. What’s especially remarkable about CDP and its architecture is that it is Hadoop reimagined in a cloud-native context. Gone are the classic Hadoop zoo animals, and in their place are business use cases and experiences tailored to those use cases. In these curated experiences, users can spin up data warehouses (based on Hive), machine learning experiences (based on Spark), and more. All of these experiences are powered by Kubernetes, offering the scalability, compute isolation, and ease of deployment that users expect in the cloud.

It’s clear that Hadoop and its definition have continued to evolve since the platform’s introduction nearly 15 years ago. What started as a purely on-premises offering built on HDFS and MapReduce is now entirely reimagined within the cloud, with Kubernetes, cloud object storage, Spark, and more now in the ecosystem. Hadoop has grown to meet the needs of the cloud opportunity, and it will be exciting to see where it goes in the next 15 years.

Learn how to unleash the power of data – Read the Precisely eBook: A Data Integrator’s Guide to Successful Big Data Projects