Just like when assembling a puzzle, working with missing pieces – i.e., missing data – can compromise our ability to fully understand our datasets. Missing data is just one problem in the wide range of data quality issues that can affect machine learning models.
The Data-Centric AI Community has been educating the ML community on the importance of high-quality data through a series of videos and blog posts, including a discussion of imbalanced data.
In this blog post, we'll delve into the world of missing data, understand its significance, and explore why it poses a big deal for machine learning projects!
What is Missing Data?
In essence, Missing Data refers to the absence of certain values in observations or features within a dataset. The reasons behind missing data can be diverse: someone might forget to answer a question in a survey, or a sensor might malfunction during data collection.
These underlying “reasons” for absent observations are called “missing mechanisms” and they describe the relationship between the observed and missing data in a dataset.
The following video explains Missing Data Mechanisms with a set of simple real-world examples:
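As a complement to the examples in the video, the sketch below simulates the three classic mechanisms described in the literature (MCAR, MAR and MNAR) on a toy dataset. The column names, sample size, and missingness probabilities are assumptions chosen purely for illustration:

```python
# A minimal sketch of the three classic missing mechanisms on a synthetic
# dataset. The column names and missingness probabilities are assumptions,
# chosen purely for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "income": rng.normal(50_000, 15_000, n),
})

# MCAR (Missing Completely At Random): every income value has the same
# chance of being missing, independent of any observed or unobserved value.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "income"] = np.nan

# MAR (Missing At Random): missingness depends only on observed values,
# e.g. younger respondents skip the income question more often.
mar = df.copy()
mar.loc[(mar["age"] < 30) & (rng.random(n) < 0.30), "income"] = np.nan

# MNAR (Missing Not At Random): missingness depends on the missing value
# itself, e.g. high earners are more reluctant to disclose their income.
mnar = df.copy()
mnar.loc[(mnar["income"] > mnar["income"].quantile(0.75)) & (rng.random(n) < 0.60), "income"] = np.nan

print(mcar["income"].isna().mean(), mar["income"].isna().mean(), mnar["income"].isna().mean())
```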
Why is Missing Data Important?
Missing data is an issue for any data science project because it may jeopardize the learning process of a classifier, or render it inapplicable altogether.
In fact, several machine learning algorithms cannot handle missing data internally – we frequently need to encode or impute the missing values to proceed with our analysis. And even when they can, missing data can lead to biased and inaccurate results, undermining the reliability of our findings.
A common approach when facing missing data in real-world datasets is to ignore the observations that have missing values and work only with the complete cases. However, this approach comes with serious pitfalls (a short sketch after the list below illustrates the information loss):
- Biased Results: The deletion process may result in a severe loss of information, which in turn biases the prediction results, since classifiers can no longer learn the full population accurately;
- Reduced Sample Size: If missing data accounts for a significant portion of the dataset, deletion may leave a very reduced sample size, which limits the statistical power of our analysis and may render it inconclusive.
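To make these pitfalls concrete, here is a minimal, self-contained sketch of complete-case analysis (listwise deletion) with pandas. The dataset is synthetic, and the MNAR-style missingness pattern is an assumption used only to show how deletion can both shrink the sample and bias simple statistics:

```python
# A minimal sketch of the cost of complete-case analysis (listwise deletion).
# The dataframe below is synthetic; real datasets will behave differently.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
income = rng.normal(50_000, 15_000, n)

df = pd.DataFrame({"income": income})
# Make higher incomes more likely to be missing (an MNAR pattern).
df.loc[(income > np.quantile(income, 0.75)) & (rng.random(n) < 0.6), "income"] = np.nan

complete_cases = df.dropna()  # keep only fully observed rows

dropped = len(df) - len(complete_cases)
print(f"discarded {dropped} of {len(df)} rows ({100 * dropped / len(df):.1f}%)")

# The surviving sample under-represents high incomes, so its mean is biased
# downwards relative to the true (fully observed) mean.
print(f"true mean:          {income.mean():,.0f}")
print(f"complete-case mean: {complete_cases['income'].mean():,.0f}")
```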
Strategies to Handle Missing Data
Addressing missing data is critical to obtaining meaningful insights and accurate predictions. Some commonly used strategies for handling missing data include:
- Case Deletion: In cases where missing data is minimal, dropping incomplete observations may be a viable option. However, as highlighted above, this should be done with caution to avoid losing significant information;
- Data Imputation: Replace the absent information with plausible estimates. Popular imputation strategies are based on statistical analysis (e.g., mean/median or regression imputation) or on machine learning methods (e.g., KNN imputation or neural-network-based imputation); a minimal scikit-learn sketch follows this list;
- Model-based Procedures: The data distribution is modeled explicitly and the missing values are estimated under that model (e.g., via the Expectation-Maximization algorithm);
- Handling Missing Data Internally: Rather than discarding or imputing data, some approaches are able to deal with absent values directly (e.g., ensemble methods, decision trees, fuzzy approaches); the second sketch after this list shows one such model.
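For the imputation strategies mentioned above, here is a minimal sketch using scikit-learn's SimpleImputer (statistical, mean-based) and KNNImputer (ML-based). The tiny feature matrix is an assumption, used only to show the mechanics:

```python
# A minimal sketch of two common imputation strategies with scikit-learn.
# The toy feature matrix is an assumption, purely for illustration.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([
    [1.0,    2.0,    np.nan],
    [3.0,    np.nan, 6.0],
    [7.0,    8.0,    9.0],
    [np.nan, 4.0,    5.0],
])

# Statistical imputation: replace each missing entry with its column mean.
mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X)

# ML-based imputation: estimate each missing entry from the k nearest rows,
# with distances computed on the features observed in both rows.
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X)

print(X_mean)
print(X_knn)
```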
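And as an example of handling missing data internally, the sketch below uses scikit-learn's HistGradientBoostingClassifier, which accepts NaN values natively, so no imputation step is needed. The synthetic dataset and the fraction of missing entries are assumptions for illustration:

```python
# A minimal sketch of a model that handles missing values internally:
# HistGradientBoostingClassifier routes NaNs to whichever side of each
# split works best, so the data can be used as-is.
# The synthetic dataset is an assumption, used only for illustration.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Punch random holes in the features; no imputation is performed afterwards.
X[rng.random(X.shape) < 0.15] = np.nan

clf = HistGradientBoostingClassifier(random_state=0)
clf.fit(X, y)
print(f"training accuracy with NaNs left in place: {clf.score(X, y):.2f}")
```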
Conclusion
Missing Data is a common problem across many domains, and it is imperative to handle it in order to ensure the integrity and accuracy of data-driven projects.
As data scientists, we must acknowledge the significance of missing data and strive to find innovative ways to fill in those gaps. By doing so, we can unlock hidden insights and achieve more meaningful outcomes in the ever-evolving landscape of data science.
We encourage you to join the Data-Centric AI Discord server for more food for thought and interactive discussions on the topic of data quality for data science. Who knows, you may find your missing piece!