To Excel or not to Excel? Tips for avoiding data quality errors

Learn how to avoid data quality errors when using Microsoft Excel as a data source. Discover the risks of calculation errors and auto-formatting issues, and get tips to ensure data integrity.

Why is data quality essential? Reliable data is the foundation of meaningful insights and informed decision-making. Delve into the importance of prioritizing data quality in your organization.

One of the most common sources of data, in almost any environment, is Microsoft Excel.

Yet, through the use of Excel, many organisations inadvertently increase the risk of poor-quality data.

Calculation Errors in Excel

Microsoft Excel is a business tool of choice.

In many cases, complex models such as Commission structures or Risk models may have been developed by a business user using Excel. These business users commonly have no programming training, and the resulting spreadsheets are typically not rigorously tested. In fact, very few are even checked by a second person.

Spreadsheets are prone to error

Research shows that 88% of spreadsheets contain errors.

The European drive for austerity in southern Europe is driven by a conclusion, attributed to an incorrect Excel formula, that there is a link between high government spending and low economic growth. Peer review shows that this conclusion was flawed but austerity, and its consequences, are here to stay.

Errors in spreadsheets can have devastating consequences

Even if initial calculations are correct, Excel applications often rely on manual “copy and paste” operations (or allow other changes to data). The “London Whale” trading debacle cost JPMorgan in excess of $6.2 billion. Internal analysis linked the problem to spreadsheet errors.

In these examples, the line between programming and data quality errors is blurred. Poor quality data is not just caused by errors in capture – it may be system generated.

Auto format errors.

Excel is a default target for data extraction from a number of commercial applications.

If, for example, your ERP provides a default Excel report this may be used to provide an extract for data assessment, rather than having to wait for a more robust IT solution.

By default, Excel changes the format of long numeric values.

For example, Excel converts both “4901081045611230” and “4901081045611440” to “4.90108E+15”.

In this example, we have gone from two unique credit card numbers to a single duplicated number.

This problem can also creep into delimited text files (such as .csv files) if Excel is used to open them (say to view them) and they are then saved.

In most cases, this problem will only affect downstream applications, but if the Excel data is used to update the source, changes can become permanent.

The problem is avoided if numeric data is imported into Excel as text.

However, this is not default behaviour and most users will struggle to get this right.

It is reasonable to assume that most Excel extracts will contain this kind of problem, and this assumption is borne out by our experience dealing with data extracts.

So should we stop using Excel?

Excel will remain a common data source in spite of these issues.

What is important is that we use our awareness of the data quality risks inherent in Excel to avoid exacerbating the problem.

Have more than one person check Excel calculations.
Do month-on-month calculations give similar results (big changes may indicate and error)?
Where possible, avoid Excel use for data extracts.
Where exponential numbers show up in data assume this is an Excel change and compare to the original source.
Avoid overwriting the original source data with data that has been through Excel.

What tips do you have for avoiding Excel data quality errors?

Does manual data cleansing work? Explore the effectiveness of manual data cleansing and its role in improving data quality.

How does data decay impact business? Understand the analogy of data aging like fine wine while software applications age like fish, and explore the impact of data decay on business operations and decision-making.