The cost of poor data quality
Bad data makes data scientists work harder, not smarter!
Photo by Artificial Photography on Unsplash
It’s amazing how, nowadays, most of us understand that AI is the way to go for becoming a market leader, regardless of your vertical. But successfully developing and adopting AI solutions requires a path, and that path is not easy! Data is one of the key factors (besides all the technical depth around an ML solution) that dictate whether an AI project will succeed. But are we taking into account that we need data with quality?
That said, I have two questions:
When can I say “I have enough data”?
What is quality data after all?
Let’s dive into these questions! 🚀
Enough is enough!
This is the question that I guess everyone, including Data Scientists, would like answered! But although it sounds simple, it isn’t. “The more the merrier” is not exactly the ideal: you can have decades of data, but if you have been collecting it without a real purpose, it probably won’t hold the answers to the questions your business has!
In reality, there are many aspects that impact the amount of data needed, from the use case to be explored to the complexity of the problem and even the chosen algorithm.
So there is no magic number, but it’s always dangerous to assume that there’s enough or even plenty of data!
💎 The “Crème de la crème” of data
Perfect data does not exist when it comes to records collected from real-life systems! Don’t assume it does, and don’t expect Data Science teams to agree with your assumptions; you’ll probably be wrong 🌝. But we can work towards getting the data as close to perfect as possible before feeding it into a model.
But first, let’s define what high-quality data is after all. Data quality can be measured based on factors such as accuracy, completeness, consistency, reliability, and, above all, whether the data is up to date.
So does this mean that the same data will have the same quality for different use cases?
No. Nevertheless, it is possible to define baseline quality metrics that are independent of the use case, and that already give us a pretty good idea of how much work the data will require.
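To make this concrete, here is a minimal sketch of what such use-case-independent metrics could look like in plain Python. The record fields and thresholds are hypothetical examples, not a standard; real pipelines would typically use a library such as pandas or Great Expectations for this.

```python
from datetime import date

# Hypothetical sample of customer records; None marks a missing value.
records = [
    {"id": 1, "email": "a@x.com", "country": "US", "updated": date(2023, 5, 1)},
    {"id": 2, "email": None,      "country": "US", "updated": date(2021, 1, 9)},
    {"id": 2, "email": None,      "country": "US", "updated": date(2021, 1, 9)},  # exact duplicate
    {"id": 3, "email": "c@x.com", "country": None, "updated": date(2023, 4, 2)},
]

def completeness(rows):
    """Share of non-missing cells across all rows and fields."""
    cells = [v for row in rows for v in row.values()]
    return sum(v is not None for v in cells) / len(cells)

def duplicate_rate(rows):
    """Share of rows that exactly repeat an earlier row."""
    seen, dupes = set(), 0
    for row in rows:
        key = tuple(sorted(row.items()))
        dupes += key in seen
        seen.add(key)
    return dupes / len(rows)

def staleness(rows, today, max_age_days=365):
    """Share of rows older than max_age_days (a proxy for timeliness)."""
    old = sum((today - row["updated"]).days > max_age_days for row in rows)
    return old / len(rows)

today = date(2023, 6, 1)
print(f"completeness:   {completeness(records):.2f}")      # 0.81 (13 of 16 cells filled)
print(f"duplicate rate: {duplicate_rate(records):.2f}")    # 0.25 (1 of 4 rows repeated)
print(f"staleness:      {staleness(records, today):.2f}")  # 0.50 (2 of 4 rows over a year old)
```

None of these numbers depend on what the data will be used for, which is exactly why they make a useful first screen before committing to a specific use case.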
And what is the connection of data quality with Machine Learning?
Due to its nature, a machine learning model is very sensitive to the quality of the data; you’ve probably already heard the expression “garbage in, garbage out”. Because of the huge volume of data required, even the smallest errors in the training data can lead to large-scale errors in the output. I totally recommend having a look at this article about “High-quality datasets are essential for developing machine learning models.”.
Data quality is a must for those looking to start investing in Artificial Intelligence-based solutions. Do you already have a strategy to tackle your data quality issues, or do you still think they don’t exist?
💰How much are you willing to spend?
For starters, from a productivity perspective, the situation appears bleak. Did you know that your Data Scientists spend 80% of their time finding, cleaning, and trying to organize data, leaving only 20% of their time for the development and analysis of ML solutions? That’s a lot of hours spent by highly expensive professionals on work that could be partially automated. Let me just put a price tag here: the average salary of a Data Scientist in the US is around $120k, and you can do little to nothing with just one person (this I’ll leave for another discussion!). Don’t forget that Data Science jobs are highly qualified, and performing data preprocessing, besides being tedious, can lead to frustration and a lot of churn among your data teams.
On the other hand, you can also suffer a lot of direct financial backlash from using poor-quality data.
- First, storing and keeping bad data is both time-consuming and expensive.
- Second, according to Gartner, “the average financial impact of poor data quality on the organization is estimated to be $9.7 million per year,” and IBM also found that in the US alone, businesses lose $3.1 trillion annually due to poor data quality. Bad data, and the poor results of using it, can lead to a loss of confidence from end users and customers. In other words, customer churn caused by bad data is a reality.
- And last, but not least, and this one might be shocking: data inaccuracy and poor quality are inhibiting AI projects. AI projects are often kicked off with no idea whether there’s enough data, or whether the data that exists suits the use case. A lot of assumptions are made without even looking at the data, which leads to a massive investment in a project that is doomed from the beginning. Another fact: the majority of companies fail to integrate external information, either because it’s not accessible (due to privacy) or simply because it’s very time-consuming, and this third-party data can tell you a lot more than you imagine about your own business.
Data quality is a precondition for AI, not the other way around! Meaning: if the quality of your data is bad, analytics and AI initiatives are not worth pursuing.
Poor data quality can cause analytics and AI projects to take longer than expected (around 40% longer), which means they will cost more, or they will eventually fail to achieve the desired results (70% of AI projects). With more than 70% of organizations relying on data to drive their future business decisions, data problems will not only drain resources (financial and human) but also undermine the ability to extract valuable new business insights. So if you are looking to invest in AI, first look into developing, defining, and implementing the right tools for an excellent data quality strategy.