
Understanding Missing Data Mechanisms: Types and Implications

 

Missing data is a common data quality challenge, and it can arise for various “reasons”, known as “missing data mechanisms”. Understanding the mechanism behind the missing values is crucial, because it determines how strongly the gaps can affect the validity and reliability of statistical analyses and of the conclusions drawn from a dataset.

Following our previous blogpost on missing data, we will explore the three types of missing data mechanisms: MCAR, MAR, and MNAR!

 

What are MCAR, MAR, and MNAR?

Missing data is characterised by absent values in some observations. Although all missing values may look the same to the untrained eye, they can arise from three main mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).
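
For readers who prefer a formal statement, the three mechanisms are commonly written in terms of a missingness indicator. The notation below is an assumption introduced here for illustration and does not come from the original post: R marks which values are missing, Y_obs denotes the observed values, and Y_mis the values that are missing.

```latex
\begin{align*}
\text{MCAR:} \quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}  \quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{MNAR:} \quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \neq P(R \mid Y_{\text{obs}})
\end{align*}
```

Read top to bottom, each line relaxes the previous one: missingness depends on nothing, then only on what was observed, then also on the missing values themselves.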

If you haven’t caught up with the theory behind missing data and you’d like to learn more, check this 2-min video from the Data-Centric AI Community, which explains what missing data is, why it matters for machine learning models, and possible strategies to overcome it:

 

 

Now, let's delve into each mechanism and its implications for data analysis!

 

MCAR: Missing Completely At Random

In MCAR, the missingness is completely unrelated to both observed and unobserved values in the dataset. In simpler terms, the absence of data occurs randomly, without any discernible pattern. 

A classic example of MCAR is a survey participant unintentionally skipping a question. The likelihood of data being missing is independent of any information in the dataset, so the observed values remain a random sample of the full data. This mechanism is considered the most desirable for data analysis, as analyzing the observed values alone does not introduce bias.

 

Implications of MCAR for Machine Learning:

  • MCAR data can be handled with simple methods: listwise deletion yields unbiased estimates (at the cost of a smaller sample), while mean imputation preserves the mean of a feature but shrinks its variance and weakens its correlations with other features;

  • Statistical inferences and results derived from the observed values of MCAR data are generally unbiased and reliable.
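
The two simple strategies mentioned above can be compared on simulated MCAR data. The following is a minimal sketch using only the Python standard library; the smoking values, the sample size, and the 20% missingness rate are made-up assumptions for illustration, not figures from any real study.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: cigarettes smoked per day for 10,000 survey participants.
full = [random.gauss(10, 3) for _ in range(10_000)]

# MCAR: every value has the same 20% chance of being missing,
# independent of the value itself and of any other feature.
observed = [x if random.random() > 0.2 else None for x in full]

# Listwise deletion: keep only the values that were actually reported.
deleted = [x for x in observed if x is not None]

# Mean imputation: replace each missing value with the observed mean.
obs_mean = statistics.fmean(deleted)
imputed = [x if x is not None else obs_mean for x in observed]

print(f"true mean     : {statistics.fmean(full):.2f}")
print(f"deletion mean : {statistics.fmean(deleted):.2f}")
print(f"imputed mean  : {statistics.fmean(imputed):.2f}")
print(f"true std      : {statistics.pstdev(full):.2f}")
print(f"deletion std  : {statistics.pstdev(deleted):.2f}")
print(f"imputed std   : {statistics.pstdev(imputed):.2f}")
```

In this simulation both strategies recover a mean close to the true one, but the mean-imputed column has a visibly smaller standard deviation, because every filled-in value sits exactly at the mean.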

 

MAR: Missing At Random

In MAR, the probability that a value is missing can be explained by other, observed features in the dataset. The data is missing systematically, but it is still called “at random” because, once those observed features are taken into account, the missingness no longer depends on the unobserved values themselves.

For instance, in a tobacco study, younger participants might report their smoking habits less often (regardless of how much they smoke), leading to missingness that is systematic but fully explained by age, an observed feature.


Implications of MAR for Machine Learning:

  • MAR data requires more sophisticated handling techniques like multiple imputation or maximum likelihood estimation;
  • Failure to account for MAR properly may introduce bias and affect the validity of statistical analyses.
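
To see why ignoring the mechanism can bias results, the tobacco example can be simulated. The sketch below is illustrative only: the group means, missingness rates, and the assumption that younger participants smoke less are invented for the demonstration. Group-wise mean imputation is used as a deliberately simplified stand-in for multiple imputation or maximum likelihood; it conditions on the observed feature (age group) in the same spirit, but does not propagate imputation uncertainty.

```python
import random
import statistics

random.seed(1)

# Hypothetical tobacco study: (age_group, cigarettes_per_day).
# Assumption for illustration: younger participants smoke less on average.
rows = []
for _ in range(20_000):
    young = random.random() < 0.5
    smoking = random.gauss(6 if young else 12, 2)
    rows.append(("young" if young else "older", smoking))

# MAR: the chance of a missing answer depends only on the observed age group
# (60% for young, 10% for older), not on the smoking value itself.
p_missing = {"young": 0.6, "older": 0.1}
observed = [(g, s if random.random() > p_missing[g] else None) for g, s in rows]

true_mean = statistics.fmean(s for _, s in rows)

# Naive approach: ignore age and average whatever was reported.
naive_mean = statistics.fmean(s for _, s in observed if s is not None)

# Conditioning on age: impute each missing value with its own group's mean.
group_mean = {
    g: statistics.fmean(s for gg, s in observed if gg == g and s is not None)
    for g in p_missing
}
adjusted_mean = statistics.fmean(
    s if s is not None else group_mean[g] for g, s in observed
)

print(f"true mean     : {true_mean:.2f}")
print(f"naive mean    : {naive_mean:.2f}")
print(f"adjusted mean : {adjusted_mean:.2f}")
```

Here the naive mean overestimates smoking because the under-reporting group also smokes less, while the age-aware estimate lands close to the true mean. Like any single imputation, it still understates variability, which is one reason multiple imputation is preferred in practice.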

 

MNAR: Missing Not At Random

MNAR occurs when the missingness is related to the unobserved values themselves, even after accounting for everything that was observed. In this case, the missing data is not random: whether a value is missing depends on what that value would have been.

Referring to the tobacco study example, participants who smoke the most might intentionally withhold their smoking habits, so the missingness depends on the very values that are missing.

 

Implications of MNAR for Machine Learning:

  • MNAR data is the most challenging to handle, as the reasons for missingness are not captured within the observed data;
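  • Traditional imputation methods may not be suitable for MNAR data; specialized techniques that explicitly model the reasons for missingness, such as selection models or pattern-mixture models, are required, usually together with a sensitivity analysis.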
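
The difficulty can be illustrated by simulating the heavy-smoker scenario. In this sketch, all numbers (the distribution of smoking, the threshold of 12 cigarettes, and the missingness rates) are hypothetical, and age group is generated independently of smoking so that it carries no information about the missing values.

```python
import random
import statistics

random.seed(2)

# Hypothetical tobacco study: (age_group, cigarettes_per_day).
# Assumption for illustration: smoking is unrelated to age group here.
rows = []
for _ in range(20_000):
    age_group = "young" if random.random() < 0.5 else "older"
    smoking = max(0.0, random.gauss(10, 4))
    rows.append((age_group, smoking))

def p_missing(smoking):
    # MNAR: heavy smokers withhold their answer far more often, so the
    # chance of missingness depends on the value that goes missing.
    return 0.7 if smoking > 12 else 0.1

observed = [(g, s if random.random() > p_missing(s) else None) for g, s in rows]

true_mean = statistics.fmean(s for _, s in rows)
naive_mean = statistics.fmean(s for _, s in observed if s is not None)

# Conditioning on the observed feature does not help: age says nothing
# about why the values are missing.
group_mean = {
    g: statistics.fmean(s for gg, s in observed if gg == g and s is not None)
    for g in ("young", "older")
}
adjusted_mean = statistics.fmean(
    s if s is not None else group_mean[g] for g, s in observed
)

print(f"true mean     : {true_mean:.2f}")
print(f"naive mean    : {naive_mean:.2f}")
print(f"adjusted mean : {adjusted_mean:.2f}")
```

Both estimates fall well below the true mean in this simulation, because nothing in the observed data reveals that the largest values are the ones most likely to be absent. Correcting this requires outside knowledge or explicit assumptions about how missingness depends on the missing values.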

 

Conclusion

Understanding the mechanisms behind missing data is crucial for any data scientist or analyst. Each mechanism – MCAR, MAR, MNAR – presents unique challenges and implications for data analysis.

As data scientists, it is essential to identify the appropriate mechanism and employ suitable imputation or handling methods accordingly. Failing to address missing data properly can compromise the integrity of analyses and may lead to erroneous conclusions.

By utilizing appropriate techniques, we can mitigate the impact of missing data and enhance the reliability of our findings.

We encourage you to join the Data-Centric AI Discord server to learn more about missing data and other issues that impact data quality for data science.

 

 

Cover Photo by Emily Morter on Unsplash
