Data quality remains a bugbear for data analysts

The other day I came across an entertaining pair of posts written last year by Megan Dibble describing her Life as a Data Analyst.

Megan takes a slightly tongue-in-cheek look at the challenges facing data analysts including struggles with statistics and coding, and the ubiquitous politics between IT and business.

The bulk of her challenges are related to data integrity.

Based on conversations with various data scientists and data analysts, finding and cleaning data for analytics takes up anywhere between 60% and 90% of their time.

Megan illustrates some of these issues in an amusing way:

The Challenge of Data Availability & Data Quality

This is definitely the biggest struggle for most projects dealing with a large amount of data: the data everyone wants is not easily available. That’s part of the reason why data engineers and data analysts exist, however, sometimes it would be nice if a simple query — select * from… — was all it took to get data for a project!

A data catalogue is the tool required to make finding data easier. However, if the catalogue is focussed on only the most commonly used data it will end up simply cataloguing the data that is already understood and available. Similarly, if the catalogue captures all enterprise datasets without context it will become overwhelming.

To be useful, a catalogue must be implemented with intention. This requires discipline and, initially, some effort. In particular, as data analysts or data engineers create or use each data set, they should get into the habit of adding some context: where the data came from (recorded when the data set is created) and what it is being used for (recorded for each new analytics function). This makes it easier for other analysts to find useful data for similar use cases.
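As a rough illustration of that habit, the sketch below models a catalogue entry that captures provenance once, at creation, and a usage note each time the data set feeds a new piece of analysis. The class names, field names, and sample values are all hypothetical and do not reflect any particular catalogue product.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class UsageNote:
    """One analytics use of a data set (hypothetical structure)."""
    analyst: str
    purpose: str
    recorded_on: date


@dataclass
class CatalogueEntry:
    """Minimal catalogue record: provenance is captured when the entry is created."""
    name: str
    source: str  # where the data came from
    created_on: date
    usage: list[UsageNote] = field(default_factory=list)

    def record_use(self, analyst: str, purpose: str, on: date) -> None:
        """Add context each time the data set is used for a new analytics function."""
        self.usage.append(UsageNote(analyst, purpose, on))


def find_by_purpose(entries: list[CatalogueEntry], keyword: str) -> list[str]:
    """Return the names of data sets whose recorded uses mention the keyword."""
    kw = keyword.lower()
    return [e.name for e in entries if any(kw in u.purpose.lower() for u in e.usage)]


# Illustrative values only.
entry = CatalogueEntry("customer_orders", "ERP nightly export", date(2023, 1, 15))
entry.record_use("analyst_a", "Customer churn analysis", date(2023, 3, 2))
print(find_by_purpose([entry], "churn"))
```

The point of the design is that the search runs over the usage notes, so a data set only becomes findable for a use case once somebody has bothered to record that use.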

Integration with data quality tools can help to build trust in the data set by scorecarding data quality. These tools should also offer the ability to clean data in a repeatable fashion, and without writing code, to address the other pet peeves mentioned.
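However a tool presents it to the user, repeatable cleaning amounts to a fixed, ordered set of rules that is applied the same way on every run. The sketch below shows that idea in plain Python. The rule names and the "country" column are assumptions made for illustration, not features of any specific product.

```python
from typing import Callable

Row = dict[str, str]
Rule = Callable[[Row], Row]


def trim_whitespace(row: Row) -> Row:
    """Strip leading and trailing whitespace from every value."""
    return {k: v.strip() for k, v in row.items()}


def uppercase_country(row: Row) -> Row:
    """Standardise the (hypothetical) country column to upper case."""
    return {**row, "country": row["country"].upper()}


def blank_to_unknown(row: Row) -> Row:
    """Replace empty values with an explicit marker."""
    return {k: (v if v else "UNKNOWN") for k, v in row.items()}


# The ordered rule list is the repeatable part: same rules, same order, every run.
CLEANING_RULES: list[Rule] = [trim_whitespace, uppercase_country, blank_to_unknown]


def clean(rows: list[Row], rules: list[Rule] = CLEANING_RULES) -> list[Row]:
    """Apply each rule in order to each row and return the cleaned rows."""
    cleaned = []
    for row in rows:
        for rule in rules:
            row = rule(row)
        cleaned.append(row)
    return cleaned


raw = [{"name": "  Ada ", "country": "za"}, {"name": "", "country": " gb "}]
print(clean(raw))
```

Because the rules live in one list rather than in ad hoc edits, running them again on already cleaned data changes nothing, which is what makes the process safe to repeat.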

On Data Quality

The more experience I have as a data analyst, the more I realize just how important data quality is. If you cannot consistently rely on the accuracy of your source data, how can you derive any valuable insights? Well, you can’t.

Megan writes that her relationships with the data engineers have become critical to her ability to do her job and remain sane. Here again, robust, business-friendly data quality and data preparation tools can help tremendously to address data quality issues without waiting for technical support.

On Why My Datasets Aren’t Joining

“Incorrect data joins happen a ton and are part of the normal data processing workflow. However, it gets rough when you can’t figure out quickly why two or more datasets are not joining together the way you expect.”

Again – these kinds of data integrity problems come down to a lack of reference data standards and data quality. One can leverage the catalogue and integrated data quality tools to improve standards and reduce these frustrations.
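To make the link between join failures and key standards concrete, here is a small sketch that separates keys that fail to match only because of formatting (case, stray spaces) from keys that are genuinely missing on one side. The normalisation rule and the sample keys are assumptions. A real reference data standard would define its own rules.

```python
def normalise(key: str) -> str:
    """Apply one shared key standard: trim whitespace and ignore case."""
    return key.strip().upper()


def diagnose_join(left_keys: list[str], right_keys: list[str]) -> dict[str, list[str]]:
    """Explain why left keys fail to match right keys in an exact join."""
    left_raw, right_raw = set(left_keys), set(right_keys)
    left_norm = {normalise(k) for k in left_keys}
    right_norm = {normalise(k) for k in right_keys}

    unmatched_left = left_raw - right_raw
    return {
        # Fails as written, but matches once both sides follow the same standard.
        "format_mismatch": sorted(k for k in unmatched_left if normalise(k) in right_norm),
        # No counterpart on the right even after normalising.
        "missing_in_right": sorted(k for k in unmatched_left if normalise(k) not in right_norm),
        # Right-side keys with no counterpart on the left even after normalising.
        "missing_in_left": sorted(
            k for k in right_raw - left_raw if normalise(k) not in left_norm
        ),
    }


# Hypothetical customer keys from two data sets.
order_keys = ["C001", "c002 ", "C003"]
customer_keys = ["C001", "C002", "C004"]
print(diagnose_join(order_keys, customer_keys))
```

A report like this turns "the join is not working" into two separate questions: which keys need a standard applied, and which records are simply absent from one source.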

Start with Clear Business Objectives

A data catalogue is one part of a broader data governance framework that must also define roles and responsibilities, set standards for data quality, and establish processes for data management.

To illustrate how a company might use its data catalogue, imagine your company is aiming to improve its customer experience. Here is what a five-step process might look like to determine the metadata, goals, and metrics that you would want documented in your data catalogue:

Step 1: Define the desired outcome. What decisions do you need data to help you make? What questions might you be able to answer by using data more effectively? In this case, you might want to better understand the demographic profile of your customers. That, in turn, might reveal opportunities for developing brand messages that speak more directly to your target audience.

Step 2: Discover what customer data you have and need. This step is all about data discovery. Following the example, you need to understand what you already know about your customers, where that information came from, and how it’s currently being used. You might also identify opportunities to add value with data enrichment or location intelligence. Demographic enrichment can tell you more about key lifestyle characteristics of your customer base. Mobile data can reveal important consumer behaviours that help you cater your products and messages to specific target audiences.

Step 3: Locate the data you need. This is about data acquisition and data integration. Identify the data sets you need, then implement reliable, scalable systems to integrate and harmonize the data from different sources. For customers, that might entail integrating transactional data from ERP, service and support information from CRM, clickstream analytics from the web, and third-party demographic data.

Step 4: Verify that the data is usable and trusted. Check for completeness, accuracy, and consistency. Completeness means that all relevant data has been included, and nothing is missing. Accuracy is about the correctness of the data. Consistency means that the data is logically congruent and does not include contradictory information. Customer data is notorious for its tendency to degrade over time as people relocate, undergo name changes, or pass away. Data quality must be proactive and ongoing, and it must be supported by the right data quality tools.
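The three checks in Step 4 can be expressed as a simple scorecard: for each dimension, the share of records that pass. The sketch below is a minimal version under stated assumptions. The field names are hypothetical, the email pattern is only a rough format check standing in for accuracy (true accuracy needs a trusted reference to compare against), and the consistency rule is one example of a logical contradiction.

```python
import re
from datetime import date

# Assumed required fields for a customer record (hypothetical names).
REQUIRED_FIELDS = ("name", "email", "signup_date", "last_order_date")
# Rough format check only: a proxy for accuracy, not proof of a correct address.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_complete(rec: dict) -> bool:
    """Completeness: every required field is present and non-empty."""
    return all(rec.get(f) not in (None, "") for f in REQUIRED_FIELDS)


def is_accurate(rec: dict) -> bool:
    """Accuracy proxy: the email value looks like an email address."""
    email = rec.get("email")
    return bool(email) and EMAIL_PATTERN.match(email) is not None


def is_consistent(rec: dict) -> bool:
    """Consistency: a customer cannot order before signing up."""
    signup, last_order = rec.get("signup_date"), rec.get("last_order_date")
    if signup is None or last_order is None:
        return True  # nothing to contradict; the completeness check covers the gap
    return last_order >= signup


def scorecard(records: list[dict]) -> dict[str, float]:
    """Return the fraction of records passing each data quality dimension."""
    total = len(records)
    checks = {"completeness": is_complete, "accuracy": is_accurate, "consistency": is_consistent}
    if total == 0:
        return {name: 0.0 for name in checks}
    return {name: sum(check(r) for r in records) / total for name, check in checks.items()}


# Made-up sample records for illustration.
records = [
    {"name": "Ada", "email": "ada@example.com",
     "signup_date": date(2022, 1, 10), "last_order_date": date(2023, 5, 1)},
    {"name": "Ben", "email": "ben.example.com",
     "signup_date": date(2022, 6, 1), "last_order_date": date(2022, 3, 1)},
    {"name": "", "email": "",
     "signup_date": date(2021, 2, 2), "last_order_date": None},
    {"name": "Di", "email": "di@example.com",
     "signup_date": date(2020, 1, 1), "last_order_date": date(2020, 1, 1)},
]
print(scorecard(records))
```

Running the same scorecard on a schedule, rather than once, is what makes the approach proactive and ongoing: a falling score flags degradation before an analyst trips over it.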

Step 5: Assess your business outcomes. The final step is to determine whether or not your business users are successful in achieving the desired outcomes. If not, why? Were you asking the wrong questions? Were certain data elements missing from the picture? Were there data integrity problems that weren’t identified and proactively addressed previously?

A data catalogue serves as a single source of truth, providing a comprehensive roadmap to all the data your organization owns and uses. Every effective data quality program and data governance initiative is built upon a clear understanding of that roadmap. This is why a good data catalogue solution is so critical.