Lately, there has been a lot of discussion about data quality and its impact on model performance, mainly due to this presentation, which highlighted the model-centric vs data-centric topic already discussed in this previous article.
Simply put, the effort you invest in the initial stages of a project, more specifically the time spent diving deep into the available data and deciding what to use and how, converts directly into model performance later on, more so than the actual model tuning. This stage is known as data preprocessing, and it is where data quality can be assessed.
However, what exactly is data quality?
Data Quality for Data Science
The common consensus defines it as how accurate, reliable, and up to date your data is. Yet, I would dare to add how much of that data is relevant to your problem and how faithfully it translates the reality you are trying to model.
At this point you might be wondering: how do we guarantee that our data has quality?
Well, as with everything else in data science: it depends!
Notwithstanding, knowing how the data science life cycle unfolds, and understanding how your data can be shaped to fit the needs of its stages and the maturity of the project, is a great way to start.
Nonetheless, since this is not only highly dependent on your use case but also implies having a solid business understanding, the quickest and most common way to achieve data quality is to clear out the trash. That usually means removing duplicates, filling missing values, handling outliers, removing incorrect attribute values, and so on.
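To make that concrete, here is a minimal cleaning sketch with pandas. The file name and the column handling are assumptions for illustration only; the right choices (median vs mode imputation, clipping vs dropping outliers) always depend on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw table; "events.csv" and its columns are placeholders.
df = pd.read_csv("events.csv")

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Fill missing values: median for numeric columns, a sentinel for the rest.
numeric_cols = df.select_dtypes(include=np.number).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
df = df.fillna("unknown")

# Cap extreme outliers at the 1st/99th percentiles instead of dropping rows.
for col in numeric_cols:
    low, high = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(low, high)
```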
Yet, this is the bread and butter of data wrangling. What we need is to take it to another level, the level that lets us handle real data the way it needs to be handled, and for that we are going to cover a couple of topics that can help you get there.
Knowledge
First and foremost, understand the problem. Seriously, misinterpreting a problem or task can snowball quickly.
If possible, study the ins and outs of the business you are handling, or at least have someone who understands it validate your work thoroughly. If neither is available, brace yourself: you are in for a wild ride, and anything can happen.
At the very least, frame the problem you want to solve properly. This will not only help you, but also allow you to better explain the strengths and weaknesses of the data you have versus what the problem needs.
By doing so you will be able to pre-validate your decisions on the spot, or understand early on whether there are errors in the data, or even in the data ingestion, that need fixing.
One good example of how data without a proper business knowledge assessment scales into nothingness, or worse, causes harm, is explained in this article about why AI tools developed to fight COVID-19 failed.
Once the frame is properly set and understood, it is time to understand how that data is reaching you. This is when you will likely be thankful to have a data engineer by your side.
Validating the data ingestion goes a long way.
Once, in a project I was working on, we discovered that our binary target feature could have a third meaning, simply because we stumbled upon a couple of events that shared the same timestamp (down to the millisecond) but had different target values.
After a thorough investigation and a conversation with some of the developers, we discovered that they had used the same database status, the one we were using to define the target value, to launch a feature request a couple of years before.
This implied redesigning a pipeline that already had two months of work in it. In the end we managed to solve the puzzle within a week, but all those headaches and the time lost could have been avoided if we had simply sat down at the start and asked the meaningful questions, such as the ones below:
How do we collect these features?
Can we guarantee they don’t have other meanings? If so, how can we filter those?
When do we do the collection? Does it align with the events we need to track/register?
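A simple consistency check would have surfaced our third-meaning target much earlier. Below is a sketch of such a check; the "timestamp" and "target" column names are placeholders for whatever your ingestion actually produces.

```python
import pandas as pd

# Hypothetical events table with a timestamp and a supposedly binary target.
events = pd.read_csv("events.csv", parse_dates=["timestamp"])

# Count how many distinct target values share the exact same timestamp.
conflicts = (
    events.groupby("timestamp")["target"]
    .nunique()
    .loc[lambda counts: counts > 1]
)

if not conflicts.empty:
    print(f"{len(conflicts)} timestamps carry conflicting target values")
    print(events[events["timestamp"].isin(conflicts.index)].sort_values("timestamp"))
```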
The previous two topics directly impact your ETL and exploratory phases, since they define how much of reality is properly translated into your data, which in turn impacts your findings and how your ETL needs to be shaped.
Bias
Bias doesn’t come from AI algorithms, it comes from people.
Any algorithm of your choice will simply propagate whatever biases are embedded in your data, so handling them during data preprocessing, before modelling, is essential.
Start by asking yourself what errors your model can generate, what their impacts are, and what you can do to avoid or handle them.
There are no error-free models; there will always be false positives, or simply bad predictions. Yet, criticising and working to solve biases early on goes a long way.
Think of the well-known issues with face recognition algorithms from some of the biggest AI players in the world. If, from the get-go, developers and decision makers had asked themselves whether their datasets properly represented all races, this would never have happened, and a lot of serious, life-impacting issues would have been avoided.
Simply sampling data across all races, or collecting more data until every race was equally represented, would have solved the problem (and that is how things ended up being addressed).
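If you want a starting point, a sketch like the one below can at least tell you whether a group is underrepresented before you train anything. The "ethnicity" column and the downsampling strategy are assumptions for illustration; collecting more data for the underrepresented groups is usually the better fix.

```python
import pandas as pd

# Hypothetical labelled dataset with a demographic attribute.
faces = pd.read_csv("faces.csv")

# How well is each group represented?
print(faces["ethnicity"].value_counts(normalize=True))

# Crude rebalancing: downsample every group to the size of the smallest one.
min_count = faces["ethnicity"].value_counts().min()
balanced = (
    faces.groupby("ethnicity", group_keys=False)
    .apply(lambda group: group.sample(min_count, random_state=42))
)
```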
If you want to strive for fairness, work to annihilate your own ignorance of the consequences of your decisions.
My advice on this particular topic is to start by knowing which types of biases there are and how they work. However, this is a journey of self-discovery, so do not expect a bulletproof solution. Yet, the very fact that you are already concerning yourself with it puts you on the right track.
Sampling
This is where I have a couple of rabbit holes in store for you.
Obviously, sampling implies balancing classes, or even balancing feature representation like the example above.
Yet, there are some twists in this topic:
Have you ever considered that you might be sampling (collecting) real events according to a fixed time window, whereas the reality that captures the variance you want to introduce into your models is based on other units, such as currency, for example?
In other words, you should rethink how you are sampling your observations and whether that frequency fits your use case. Techniques such as the fixed horizon method or the triple-barrier method can be used to solve this issue.
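As an illustration, here is a minimal sketch of fixed-horizon labelling (the horizon and threshold values are arbitrary placeholders); the triple-barrier method extends this idea with profit-taking, stop-loss, and expiry barriers.

```python
import pandas as pd

def fixed_horizon_labels(prices: pd.Series, horizon: int = 10, threshold: float = 0.01) -> pd.Series:
    """Label each observation by its forward return over `horizon` bars:
    1 if it exceeds `threshold`, -1 if it falls below -`threshold`, 0 otherwise."""
    forward_return = prices.shift(-horizon) / prices - 1
    labels = pd.Series(0, index=prices.index)
    labels[forward_return > threshold] = 1
    labels[forward_return < -threshold] = -1
    # The last `horizon` observations have no forward price to compare against.
    return labels.iloc[:-horizon]
```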
Not only that, but have you ever thought that your samples can have information from other samples?
This can be referred to as the "spilled" samples problem, and it often happens in scenarios like financial markets, where samples are substantially cross-correlated.
Nonetheless, even in such chaotic scenarios, you can weight observations by their uniqueness to mitigate the issue.
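A sketch of that idea: give each sample a weight inversely proportional to how many other samples were "alive" over the same window. The minute-level timeline and the `start`/`end` columns are assumptions for illustration, not a prescribed implementation.

```python
import pandas as pd

def uniqueness_weights(label_spans: pd.DataFrame) -> pd.Series:
    """Weight each labelled sample by how little it overlaps with the others.

    `label_spans` holds one row per sample, with 'start' and 'end' timestamps
    marking the window of events that produced its label."""
    # Count how many samples are "alive" at each point on a minute-level timeline.
    timeline = pd.Series(
        0,
        index=pd.date_range(label_spans["start"].min(), label_spans["end"].max(), freq="min"),
    )
    for _, row in label_spans.iterrows():
        timeline.loc[row["start"]:row["end"]] += 1

    # A sample's uniqueness is the average of 1 / concurrency over its own window.
    weights = label_spans.apply(
        lambda row: (1.0 / timeline.loc[row["start"]:row["end"]]).mean(), axis=1
    )
    return weights / weights.sum()
```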
Regardless of your issues, take the time to evaluate the sampling strategy that best fits your use case, and not the other way around.
And while you might think this is unrelated, properly sampling before and after the development of your model goes a long way in monitoring its performance in production. Otherwise, how can you be sure that an unpredictable world event (like the COVID-19 pandemic) is the cause of your model drift?
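One way to do that, sketched below under the assumption that you persisted a reference sample at training time, is to compare feature distributions between that reference and a fresh production sample, for instance with a two-sample Kolmogorov-Smirnov test. The file names are placeholders; tools like evidently wrap this kind of comparison in ready-made reports.

```python
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical files: a sample saved at training time vs recent production data.
reference = pd.read_parquet("training_sample.parquet")
current = pd.read_parquet("production_sample.parquet")

for col in reference.select_dtypes("number").columns:
    stat, p_value = ks_2samp(reference[col].dropna(), current[col].dropna())
    flag = "possible drift" if p_value < 0.01 else "ok"
    print(f"{col:<30} KS={stat:.3f}  p={p_value:.4f}  {flag}")
```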
Conclusion
In the end, achieving excellence in data quality is an art of its own, and what is covered here is just the tip of the iceberg. There are other subjects, like multivariate time series, that would need an entire article of their own just to get a glimpse of how to handle them.
However, if you start questioning not only your data (where does it come from, how is it being processed, how has it been collected) but also yourself (am I introducing any unconscious biases?), you will start improving the quality of your data almost without noticing. For all the rest, you can always count on useful open repositories for data quality across the different stages of data science, such as ydata-quality or evidently.