Managing the modern data pipeline

A few weeks back I wrote about the emerging role of the data engineer – the group of people responsible for delivering the quality data pipelines that enable the data scientist. I followed it up with this tweet – which I believe summarizes very concisely the changing reality of big data and advanced analytics in 2012…

[Image: ETL / ELT architecture]

What is ETL?

Extract, Transform and Load (ETL) is a standard information management term describing a process for the movement and transformation of data. ETL is commonly used to populate data warehouses and data marts, and in data migration, data integration and business intelligence initiatives. ETL processes can be built by manually writing custom scripts or code, with…
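To make the pattern concrete, here is a minimal sketch of the three ETL stages in Python. The file name `sales.csv`, the `email` and `amount` columns, and the SQLite target are all hypothetical, standing in for whatever source and warehouse a real pipeline would use.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a source CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: normalise fields before loading."""
    for row in rows:
        row["email"] = row["email"].strip().lower()
        row["amount"] = float(row["amount"])
    return rows

def load(rows, db_path):
    """Load: write the transformed rows into the target table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (email TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO sales (email, amount) VALUES (?, ?)",
        [(r["email"], r["amount"]) for r in rows],
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    # Run the three stages as a pipeline: source file in, warehouse out.
    load(transform(extract("sales.csv")), "warehouse.db")
```

In practice each stage would be far richer (incremental extracts, validation, error handling), but the extract → transform → load shape stays the same.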

Four steps to transforming data

Another brief post this week on an area we do not often focus on: data transformation. Data transformation is a relatively mundane yet fundamental data management capability – particularly when dealing with similar data from multiple sources. Three simple examples: System A represents Male and Female as 0 and 1, while System B…
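As a sketch of the kind of mapping this involves, the snippet below reconciles two hypothetical gender encodings into one canonical form. The code tables and system names are illustrative, not taken from the post.

```python
# Hypothetical code tables: each source system encodes gender differently.
SYSTEM_A = {0: "Male", 1: "Female"}       # System A: integer codes
SYSTEM_B = {"M": "Male", "F": "Female"}   # System B: letter codes

def to_canonical(value, source):
    """Map a source-specific gender code to the canonical representation."""
    mapping = SYSTEM_A if source == "A" else SYSTEM_B
    try:
        return mapping[value]
    except KeyError:
        raise ValueError(f"Unknown gender code {value!r} from system {source}")

print(to_canonical(0, "A"))    # Male
print(to_canonical("F", "B"))  # Female
```

Keeping the mappings in explicit lookup tables, rather than scattered if/else logic, makes it easy to add a new source system or spot an unmapped code.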

Why Hadoop

Hadoop: Quick Facts

Hadoop is a highly scalable, distributed framework for storing and processing large volumes of data at high speed. Hadoop works on the principle of schema on read, not schema on write: any data (structured or unstructured) can be stored in Hadoop without first developing a schema. This cuts development timescales and reduces risk and complexity…
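A small sketch of the schema-on-read idea: raw records are written with no declared structure, and a schema is applied only at query time. The file name `events.jsonl` and the field names are hypothetical; Hadoop does the same thing at much larger scale across a cluster.

```python
import json

# Schema on write (a relational database) validates structure before storing.
# Schema on read stores records as-is and imposes structure only when the
# data is queried -- sketched here with a line-delimited JSON file.

def read_with_schema(path, fields):
    """Apply a schema at read time: project only the fields the query
    needs, tolerating records that are missing them."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            yield {field: record.get(field) for field in fields}

# Any structure can be written without declaring a schema up front;
# the "schema" is just the projection the reader asks for.
for row in read_with_schema("events.jsonl", ["user_id", "event", "ts"]):
    print(row)
```

The trade-off is that schema-on-read defers data-quality problems from load time to query time, which is why the pipeline and transformation disciplines above still matter.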