I recently attended the first Africa Data Forum in Johannesburg – a three-day event focusing on data science and skills development.
Data governance was a key theme of the first day of the event. Research shows that data scientists typically spend an average of 60% of their time simply trying to find and prepare the right data sets for their analysis. Data governance has proven application in reducing this “wasted” time.
A number of the discussions turned to the “science” aspect of data science. The scientific method suggests that one form a hypothesis and then prove or disprove it – in the data scientist’s case, through the analysis of (big) data. This has been discussed in my post Data Scientists must see the story behind the data.
Yet another aspect of “real” science is that papers / proofs are published and available for peer review. Both the method and, in many cases, the data must be made available so that the scientist’s peers may test his or her hypothesis, calculations and conclusions. Without these rigorous checks and balances something like this may happen – Sheldon returns from the North Pole
The data scientist is typically working on hypotheses that are to be used for competitive differentiation. His or her results cannot be published in trade journals.
Yet some kind of trust in the outcomes must be established.
Principles such as data quality, data traceability and consistent definitions of terms, driven by engagement and governance from the appropriate subject matter experts, could go a long way towards ensuring that data science results are reliable and trustworthy.
Could it be that data governance will be the peer review process for data science?