What is Missing Data in Machine Learning?

Explaining Missing Data, DCAI


Just like assembling a puzzle with missing pieces, working with missing data can compromise our ability to fully understand our datasets. Missing data is just one problem in the wide range of data quality issues that can affect machine learning models.

The Data-Centric AI Community has been educating the ML community on the importance of high-quality data through a series of videos and blog posts, namely explaining imbalanced data.

In this blog post, we'll delve into the world of missing data, understand its significance, and explore why it is such a big deal for machine learning projects!

What is Missing Data?

In essence, Missing Data refers to the absence of certain values in observations or features within a dataset. The reasons behind missing data can be diverse: someone might forget to answer a question in a survey, or a sensor may malfunction during data collection.

These underlying “reasons” for absent observations are called “missing mechanisms” and they describe the relationship between the observed and missing data in a dataset.
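As a rough sketch, the three classic mechanisms (MCAR, MAR, and MNAR) can be simulated on a hypothetical survey dataset. The column names and probabilities below are purely illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "income": rng.normal(50_000, 15_000, n).round(2),
})

# MCAR (Missing Completely At Random): each income value is dropped
# with the same probability, regardless of any value in the data.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.1, "income"] = np.nan

# MAR (Missing At Random): income is more likely to be missing for
# younger respondents -- missingness depends on an *observed* column.
mar = df.copy()
mar.loc[(df["age"] < 30) & (rng.random(n) < 0.4), "income"] = np.nan

# MNAR (Missing Not At Random): high earners withhold their income --
# missingness depends on the (unobserved) value itself.
mnar = df.copy()
mnar.loc[(df["income"] > 65_000) & (rng.random(n) < 0.5), "income"] = np.nan
```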

The following video explains Missing Data Mechanisms with a set of simple real-world examples:



Why is Missing Data Important?

Missing data is an issue for any data science project because it may jeopardize the learning process of a classifier, or render it inapplicable altogether. 

In fact, several machine learning algorithms cannot handle missing data internally – we frequently need to encode or impute the missing values to proceed with our analysis. And even when they can, missing data can lead to biased and inaccurate results, undermining the reliability of our findings.

A common approach when facing missing data in real-world datasets is to ignore the observations that have missing values, and work only with the complete set of data. However, this approach comes with serious pitfalls:

  • Biased Results: The deletion process may result in severe loss of information, which in turn biases the prediction results, since classifiers cannot learn the entire population accurately;

  • Reduced Sample Size: If missing data represents a significant portion of the dataset, the deletion process may result in a greatly reduced sample size, which limits the statistical power of our analysis, potentially rendering it inconclusive.
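The sample-size pitfall is easy to underestimate: even modest per-column missingness compounds across columns under listwise deletion. A quick sketch with simulated data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=list("abcd"))

# Knock out ~10% of the values in each column independently.
df = df.mask(rng.random(df.shape) < 0.1)

# Listwise deletion: keep only fully observed rows. With 4 columns at
# 10% missingness each, only about 0.9**4 = 66% of rows survive.
complete = df.dropna()
print(f"rows before: {len(df)}, after: {len(complete)}")
```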


Strategies to Handle Missing Data

Addressing missing data is critical to obtaining meaningful insights and accurate predictions. Some commonly used strategies for handling missing data include:

    • Case Deletion: In cases where missing data is minimal, dropping incomplete observations may be a viable option. However, as highlighted above, this should be done with caution to avoid losing significant information;

    • Data Imputation: Replace the absent information with plausible estimates. Popular imputation strategies are based on statistical analysis (e.g., mean/median or regression) or machine learning methods (e.g., KNN or NN-based imputation);

    • Model-based Procedures: The data distribution is modeled directly by means of dedicated procedures (e.g., the Expectation-Maximization algorithm);

    • Handling Missing Data Internally: Rather than discarding or imputing data, some approaches are able to deal with absent values (e.g., ensemble methods, decision trees, fuzzy approaches).



Missing data is a common problem found in several domains, and it is imperative to handle it in order to ensure the integrity and accuracy of data-driven projects.

As data scientists, we must acknowledge the significance of missing data and strive to find innovative ways to fill in those gaps. By doing so, we can unlock hidden insights and achieve more meaningful outcomes in the ever-evolving landscape of data science.

We encourage you to join the Data-Centric AI Discord server for more food for thought and interactive discussion on the topic of data quality for data science. Who knows, you may find your missing piece!
