This is because Hadoop has the potential to alter how we analyse data – by providing relatively cheap, highly scalable storage and analytics capabilities that allow businesses to ask, and answer, questions that were previously unanswerable.
The promise of Big Data has raised the profile of data at C-level, created new roles such as the Chief Data Officer, and driven demand for previously unloved disciplines such as data governance.
In spite of the interest, and the hype, Gartner Research shows that only about 40% of companies have made serious investments in Hadoop so far, with others expected to begin within the next few years.
One factor that hinders adoption: the Hadoop framework is still evolving.
Hadoop pioneer and big data expert Stefan Groschupf discussed this in a recent webinar titled "The Everchanging Hadoop Ecosystem. What does it mean for you?"
Stefan has been involved with the development of Hadoop from the early days, first as an engineer and data scientist and, more recently, as CEO of the big data discovery platform Datameer.
The Hadoop ecosystem is based on an open source framework with commercial variations available from various vendors. The framework consists of multiple components that can, very simplistically, be described as various engines for executing analyses in different ways.
What is the best computing engine within Hadoop?
One of the earliest and most commonly used engines is MapReduce – a programming model for processing and generating large data sets on parallel, distributed clusters. For sheer computing power at relatively low cost, MapReduce is hard to beat.
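The model behind MapReduce is simpler than its reputation suggests: a map step turns each input record into key/value pairs, and a reduce step aggregates all the values that share a key. As a purely illustrative sketch (plain Python run on one machine, not the Hadoop Java API), here is the classic word-count job expressed in those two phases:

```python
from collections import defaultdict

def map_phase(records):
    # Map step: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce step: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

word_counts = reduce_phase(map_phase(["Big data", "big clusters"]))
# word_counts == {'big': 2, 'data': 1, 'clusters': 1}
```

On a real cluster, the map calls run in parallel across nodes, and the framework handles shipping each key's pairs to the right reducer – which is exactly where the distributed computing power comes from.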
But no single compute framework is ideal for every analytic task. That’s why many big data solutions focus on specific computation frameworks that meet different needs – for example:
- In-memory machine learning or stream processing frameworks
- Proprietary in-memory software, where data is analyzed outside the Hadoop ecosystem
- Large-scale data analytics frameworks, where analysis must be distributed across a cluster
- Small data analytics/BI platforms, where analysis must be done on a single, high-memory machine
Computing engines such as Spark, Mahout, YARN, Tez, and many more meet specific analytics needs, but may be a poor fit for others.
This is where the real challenge, complexity and cost of Hadoop are hidden.
- Traditionally, Hadoop has required API calls to be written in low-level code – Java, R or Python are common – to load, integrate, prepare, analyse and visualize data.
- Low-level code implies scarce, expensive technical resources and long development cycles. Early Hadoop deployments ran over similar periods to traditional data warehousing projects – 6, 12, even 18 months or more before showing real value.
- Code does not adapt to changes in the platform. Newer versions of the widely used Mahout machine learning engine, for example, use a completely different API from early versions. This means that engineers must redevelop all their Mahout code in order to upgrade to newer versions of Hadoop.
- Code does not cater to emerging (or failing) computing engines. Last year, for example, Apache Spark was being touted as the next big thing – offering superior performance to MapReduce for certain analytics jobs. Now, newer components such as Flink are proving to be much quicker than Spark. Having spent eight months redeveloping your MapReduce code to run on Spark, do you really want to go through that pain and expense again to take advantage of Flink, only to find that it, too, may have a limited shelf life?
- The best framework for development may not be best for production. Factors such as the size of the data set, the number of Hadoop nodes available, and even the type of analysis mean that one engine may perform better in development (against a smaller sample data set, or on fewer nodes) while another performs best in production. Data scientists who cannot code for this complexity waste time and resources.
Early Hadoop adopters have learnt to abstract their Hadoop development layer from the underlying computing engines. This means that they can quickly adapt to changes and future-proof their big data development effort.
What will your company do?