Three tips for laying the groundwork for machine learning

Unlock the potential of machine learning with these three foundational tips: prioritize data quality, harness hyperscale cloud capabilities, and empower your data scientists with Python skills. Learn how these principles lay the groundwork for successful machine learning endeavors.


In a post on InformationWeek, Exasol CTO Mathias Golombek outlined three tips for creating the infrastructure for machine learning.


Delivering quality data at scale

As discussed in AI differentiation is in the data, Mathias stresses the importance of quality data to the delivery of usable machine learning models.

“For ML algorithms to offer informed judgments and recommendations on business decisions, the underlying database must provide a steady supply of clean, accurate, and detailed data. It’s important to remember that more data doesn’t necessarily mean better data.”

Mathias Golombek

A common approach to building machine learning models is to start with a subset of data to build and test the model. Using a data quality platform like Trillium for Big Data helps ensure that the cleansing and enrichment applied to the subset can be deployed at scale into your big data environment (Snowflake, Hadoop, or Spark, for example) without losing the changes made. This is critical to ensuring that the production data model reflects the development/test environments and delivers the expected results.
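
Below is a minimal sketch of that subset-first workflow in Python with pandas (an assumption on our part; the post doesn't prescribe tooling, and the file and column names are hypothetical). The key idea is to encode the cleansing rules once, validate them on a sample, and then apply the identical rules to the full dataset.

```python
# Hypothetical subset-first data cleansing sketch using pandas.
# Rules are defined once so the sample and production runs stay in sync.
import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """Reusable cleansing rules (column names are hypothetical)."""
    out = df.copy()
    out["email"] = out["email"].str.strip().str.lower()   # normalize emails
    out = out.dropna(subset=["customer_id"])              # require the key field
    out = out.drop_duplicates(subset=["customer_id"])     # remove duplicate records
    return out[out["order_total"] >= 0]                   # simple range check

full = pd.read_csv("orders.csv")                  # hypothetical source extract
sample = full.sample(frac=0.1, random_state=42)   # subset for model building

cleansed_sample = cleanse(sample)   # develop and test the rules here
cleansed_full = cleanse(full)       # deploy the same rules at scale
```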

The Precisely whitepaper Debugging Data: Why Data Quality is essential for AI and Machine Learning outlines strategies for dealing with data quality issues as you enter into your machine learning journey.

Embrace hyperscale

Hyperscale cloud provides unique opportunities to scale machine learning, as required, without making massive infrastructure investments.

Mathias points out that cloud elasticity allows you to scale your infrastructure at different points in the project life cycle – for example, developing on a laptop, testing on internal servers, and deploying to the cloud for production runs that may happen only once or twice a month.

Understanding your data governance policies for the cloud, along with the ability to build scalable data pipelines that deploy both on premises and in the cloud, is critical.
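
As a hedged illustration of that portability, here is one way to keep a single pipeline definition and push the environment choice into configuration. The PIPELINE_ENV variable, engine names, and worker counts below are all hypothetical.

```python
# Hypothetical configuration-driven pipeline launcher: the same code runs
# on a laptop, on internal test servers, or in the cloud.
import os

TARGETS = {
    "dev":  {"engine": "local",            "workers": 1},
    "test": {"engine": "internal-cluster", "workers": 8},
    "prod": {"engine": "cloud",            "workers": 64},  # elastic scale-up
}

def run_pipeline(env: str) -> None:
    cfg = TARGETS[env]
    print(f"Submitting pipeline to {cfg['engine']} with {cfg['workers']} workers")
    # ...submit the same pipeline definition to the selected engine here...

if __name__ == "__main__":
    # The environment is chosen by configuration, not by changing code.
    run_pipeline(os.getenv("PIPELINE_ENV", "dev"))
```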

Leverage Python

Lastly, Mathias recommends that your data scientists skill up on Python – one of the world’s most popular programming languages for predictive analytics. Python skills may of course be supplemented by commercial tools – in particular for data engineering and data quality – but will remain at the core of any machine learning initiative.
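
To make that concrete, here is a minimal predictive-analytics sketch using scikit-learn (our choice for illustration; the post doesn't name a library). It trains a simple classifier on a bundled dataset and reports accuracy on data the model has not seen.

```python
# Minimal scikit-learn example: train a classifier and measure held-out accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=5000)  # extra iterations help convergence
model.fit(X_train, y_train)

print(f"Held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```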

For example, Data360 Analyze is a commercial data preparation tool that accelerates delivery of self-service data pipelines while including Python and R options for custom extensions or to solve complex problems.

What is the impact of poor data quality on machine learning? Explore how poor-quality data undermines ML models.

Are you hesitant to fully embrace the potential of AI? A lack of trust may be holding you back.
