Three tips for laying the groundwork for machine learning


In a post on Information Week, Exasol CTO Mathias Golombek outlines his three tips for creating the infrastructure for machine learning.

Delivering quality data at scale

As discussed in "#AI differentiation is in the data", Mathias stresses the importance of quality data to the delivery of usable machine learning models.

“For ML algorithms to offer informed judgments and recommendations on business decisions, the underlying database must provide a steady supply of clean, accurate, and detailed data. It’s important to remember that more data doesn’t necessarily mean better data,” he says.

A common approach to building machine learning models is to start with a subset of the data to build and test the model. Using a platform like Trillium for Big Data helps ensure that the data cleansing and enrichment applied to the subset can be deployed at scale into the big data environment (Hadoop or Spark, for example) without losing the changes made. This is critical to ensure that the production data model reflects the development and test environments and delivers the expected results.
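As a rough illustration of the idea, and not of how Trillium or any other product works, the sketch below keeps the cleansing rules in a single Python function and applies that same function to both a sample and the full dataset. The record fields, the rules, and the names clean_record and clean_dataset are all hypothetical, chosen only for this example.

```python
import random


def clean_record(record):
    """Hypothetical cleansing rule: tidy the name, normalise the email,
    and reject records that have no email at all."""
    email = (record.get("email") or "").strip().lower()
    if not email:
        return None  # record is unusable without an email
    # Collapse repeated whitespace and apply title case to the name.
    name = " ".join((record.get("name") or "").split()).title()
    return {"name": name, "email": email}


def clean_dataset(records):
    """Apply the same rule to any collection of records, large or small."""
    cleaned = (clean_record(r) for r in records)
    return [r for r in cleaned if r is not None]


# Made-up records standing in for the full production data.
full_data = [
    {"name": "  ada   lovelace ", "email": " ADA@Example.com "},
    {"name": "alan turing", "email": ""},
    {"name": "grace HOPPER", "email": "grace@example.com"},
    {"name": "edsger dijkstra", "email": None},
]

# Development: build and check the rules on a small sample.
random.seed(0)
subset = random.sample(full_data, 2)
subset_clean = clean_dataset(subset)

# Production: run the identical function over everything.
full_clean = clean_dataset(full_data)
```

Because both runs go through one function, any rule refined during development automatically carries over to the full run, which is the property the paragraph above asks of a data quality platform.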

The Syncsort whitepaper "Debugging Data: Why Data Quality is essential for AI and Machine Learning" outlines strategies for dealing with data quality issues as you begin your machine learning journey.

Embrace hyperscale

Hyperscale cloud provides unique opportunities to scale machine learning, as required, without making massive infrastructure investments.

Mathias points out that cloud elasticity allows you to scale your infrastructure at different points in the project life cycle: for example, working on a laptop for development, on internal servers for test, and deploying to the cloud at scale for production runs that may happen only once or twice a month.

Understanding your data governance policies for the cloud, along with the ability to build scalable data pipelines and deploy them both on premises and in the cloud, is critical.

Leverage Python

Lastly, Mathias recommends that your data scientists skill up on Python – one of the world’s most popular programming languages for predictive analytics. Python skills may of course be supplemented by commercial tools – in particular for data engineering and data quality – but will remain at the core of any machine learning initiative.
