Hadoop works on the principle of schema on read, not schema on write. Any data (structured or unstructured) can be stored in Hadoop without first developing a schema. This shortens development timescales, reduces risk and complexity, and lessens the impact of the poor-quality data that may have caused traditional ETL jobs to fail. Instead, consuming programs determine and apply structure when they access the data.
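The idea can be sketched in a few lines of Python. The raw records land in storage exactly as written, with no schema enforced; each consumer decides the structure, and how to handle bad records, only at read time. The field layout and sample data here are hypothetical, purely for illustration:

```python
import csv
import io

# Raw, schema-less records as they might land in Hadoop storage:
# no schema was declared or enforced when the data was written,
# so the malformed second record did not cause a load failure.
raw = "2024-01-05,web,42\n2024-01-06,mobile,oops\n2024-01-07,web,17\n"

def read_events(text):
    """Schema on read: this consumer imposes its own structure
    (date, channel, integer count) at the moment of access."""
    events = []
    for date, channel, count in csv.reader(io.StringIO(text)):
        try:
            events.append({"date": date, "channel": channel, "count": int(count)})
        except ValueError:
            continue  # a bad record is skipped at read time, not at load time
    return events

events = read_events(raw)
print(len(events))  # the malformed record is simply dropped
```

A different consumer could read the same raw bytes with a different schema, which is the flexibility the paragraph above describes.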
Hadoop runs on commodity hardware. This makes it easily ten times cheaper to deploy than the high-end, specialised hardware used for typical enterprise data warehouse deployments (based on the average cost per terabyte of computing power). Where the average EDW may store and analyse around 15TB of data, a typical Hadoop deployment may store and process a few hundred TB of data for the same cost.
Hadoop is fault tolerant. Hadoop copes with the expected failures of commodity hardware through data replication (each block of data is stored on multiple nodes) and speculative execution. This means that Hadoop may run multiple copies of the same task (assuming resources are available) and take the results of whichever copy finishes first.
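Both mechanisms are configurable. As a minimal sketch, the real Hadoop properties controlling them look like this; the values shown are the common defaults, not recommendations for any particular cluster:

```xml
<!-- hdfs-site.xml: each data block is replicated on this many nodes -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

<!-- mapred-site.xml: allow duplicate attempts of straggling map tasks -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
```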
Hadoop requires a new approach. Traditional BI and ETL tools are designed to work with predefined, structured schemas. While these tools can be made to work with Hadoop, typically through a Hive interface, this approach negates a key benefit of Hadoop – the reduction in development time and cost allowed by schema on read.