Missing data is a common data quality challenge and can occur for various reasons, known as "missing data mechanisms". Understanding the mechanism behind missing data is crucial, as it can significantly affect the validity and reliability of statistical analyses and of the conclusions drawn from a dataset.
Following our previous blogpost on missing data, we will explore the three types of missing data mechanisms: MCAR, MAR, and MNAR!
What are MCAR, MAR, and MNAR?
Missing data is characterised by absent values in some observations. Although all missing values may look the same to the untrained eye, they can arise from three main mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).
If you haven’t caught up with all the theory behind missing data and you’d like to learn more about it, check the 2-min video from the Data-Centric AI Community that explains what missing data is, why it matters for machine learning models, and possible strategies to overcome it.
Now, let's delve into each mechanism and its implications for data analysis!
MCAR: Missing Completely At Random
In MCAR, the missingness is completely unrelated to both observed and unobserved values in the dataset. In simpler terms, the absence of data occurs randomly, without any discernible pattern.
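A classic example of MCAR is a survey participant unintentionally skipping a question: the likelihood of a value being missing is independent of any information in the dataset, observed or not. This mechanism is considered the most desirable for data analysis, because the observed records remain a random sample of the full data. Discarding incomplete records therefore does not introduce bias; it only reduces the sample size.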
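To make this concrete, here is a minimal simulation sketch using only the Python standard library. The survey values, the 20% skip rate, and all variable names are assumptions made up for illustration; the point is only that, under MCAR, the mean of the observed answers stays close to the mean of all answers.

```python
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical survey: cigarettes smoked per day for 10,000 participants.
cigarettes = [rng.gauss(10, 3) for _ in range(10_000)]

# MCAR: every answer has the same 20% chance of being skipped,
# regardless of who the participant is or what the answer would be.
observed = [x for x in cigarettes if rng.random() >= 0.2]

full_mean = fmean(cigarettes)
observed_mean = fmean(observed)
share_lost = 1 - len(observed) / len(cigarettes)

print(f"mean of all answers:      {full_mean:.2f}")
print(f"mean of observed answers: {observed_mean:.2f}")
print(f"share of answers lost:    {share_lost:.1%}")
```

The two means should differ only by sampling noise, while roughly a fifth of the answers are lost: less data, but no systematic distortion.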
Implications of MCAR for Machine Learning:
MAR: Missing At Random
In MAR, the missingness can be explained by some observed features in the dataset. Although the data is missing systematically, it is still considered "random" because, once those observed features are accounted for, the missingness is not related to the unobserved values themselves.
For instance, in a tobacco study, younger participants might report their smoking habits less often (regardless of how much they smoke), leading to systematic missingness that is explained by age, an observed feature.
Implications of MAR for Machine Learning:
- MAR data requires more sophisticated handling techniques like multiple imputation or maximum likelihood estimation;
- Failure to account for MAR properly may introduce bias and affect the validity of statistical analyses.
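The sketch below illustrates both points on simulated data. It assumes, purely for illustration, that younger participants smoke less on average and skip the question more often; the numbers and names are hypothetical. Instead of multiple imputation or maximum likelihood, it uses a much simpler stand-in (estimating each age group separately and weighting by group size) to show that conditioning on the observed feature removes the bias that a naive complete-case analysis suffers from.

```python
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical tobacco study: (is_young, true_cigarettes, reported_cigarettes).
rows = []
for _ in range(20_000):
    young = rng.random() < 0.5
    # Assumption for illustration: younger participants smoke less on average.
    cigarettes = rng.gauss(6 if young else 12, 2)
    # MAR: the chance of a missing answer depends only on the observed
    # age group, not on how much the participant actually smokes.
    p_missing = 0.6 if young else 0.1
    reported = None if rng.random() < p_missing else cigarettes
    rows.append((young, cigarettes, reported))

true_mean = fmean(c for _, c, _ in rows)

# Naive complete-case analysis: drop every row with a missing answer.
complete_case_mean = fmean(r for _, _, r in rows if r is not None)

def group_mean(flag):
    """Mean of the reported answers within one age group."""
    return fmean(r for y, _, r in rows if y == flag and r is not None)

def group_share(flag):
    """Share of participants in one age group (age is fully observed)."""
    return sum(1 for y, _, _ in rows if y == flag) / len(rows)

# Conditioning on age: per-group means weighted by group size.
adjusted_mean = sum(group_share(f) * group_mean(f) for f in (True, False))

print(f"true mean:          {true_mean:.2f}")
print(f"complete-case mean: {complete_case_mean:.2f}")
print(f"age-adjusted mean:  {adjusted_mean:.2f}")
```

In this setup the complete-case mean overestimates consumption because young, lighter smokers are underrepresented among the respondents, while the age-adjusted estimate lands close to the true mean.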
MNAR: Missing Not At Random
MNAR occurs when the missingness itself is related to the unobserved data. In this case, the probability that a value is missing depends on the value that would have been recorded, even after accounting for all observed features.
Referring to the tobacco study example, participants who smoke the most might intentionally withhold their smoking habits, leading to systematic missingness that depends on the very values that are missing.
Implications of MNAR for Machine Learning:
- MNAR data is the most challenging to handle, as the reasons for missingness are not captured within the observed data;
- Traditional imputation methods may not be suitable for MNAR data, and specialized techniques that consider the reasons for missingness are required.
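A small simulation can show why this is hard. The sketch below assumes a hypothetical tobacco study in which participants smoking more than 10 cigarettes a day (an arbitrary cut-off chosen for illustration) withhold their answer much more often. Both a naive complete-case mean and an estimate that conditions on age underestimate the true mean, because the cause of the missingness is the unobserved value itself.

```python
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical tobacco study: (is_young, true_cigarettes, reported_cigarettes).
rows = []
for _ in range(20_000):
    young = rng.random() < 0.5
    cigarettes = rng.gauss(6 if young else 12, 2)
    # MNAR: heavy smokers withhold their answer far more often, so the
    # missingness depends on the value that ends up missing.
    p_missing = 0.7 if cigarettes > 10 else 0.1
    reported = None if rng.random() < p_missing else cigarettes
    rows.append((young, cigarettes, reported))

true_mean = fmean(c for _, c, _ in rows)
complete_case_mean = fmean(r for _, _, r in rows if r is not None)

def group_mean(flag):
    """Mean of the reported answers within one age group."""
    return fmean(r for y, _, r in rows if y == flag and r is not None)

def group_share(flag):
    """Share of participants in one age group (age is fully observed)."""
    return sum(1 for y, _, _ in rows if y == flag) / len(rows)

# Conditioning on the observed feature (age) is not enough under MNAR.
adjusted_mean = sum(group_share(f) * group_mean(f) for f in (True, False))

print(f"true mean:          {true_mean:.2f}")
print(f"complete-case mean: {complete_case_mean:.2f}")
print(f"age-adjusted mean:  {adjusted_mean:.2f}")
```

Adjusting for age narrows the gap but does not close it: within each age group, the heaviest smokers are still the ones most likely to be missing.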
Conclusion
Understanding the mechanisms behind missing data is crucial for any data scientist or analyst. Each mechanism – MCAR, MAR, MNAR – presents unique challenges and implications for data analysis.
As data scientists, it is essential to identify the most plausible mechanism and employ suitable imputation or handling methods accordingly. Failing to address missing data properly can compromise the integrity of analyses and may lead to erroneous conclusions.
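One simple first check, sketched below on simulated data, is to compare an observed feature between the records with and without a missing answer. The dataset, the age threshold, and the helper `welch_t` are hypothetical and written for illustration. A clear difference is evidence against MCAR; note, however, that observed data alone cannot distinguish MAR from MNAR, so that judgement usually relies on domain knowledge.

```python
import random
from math import sqrt
from statistics import fmean, variance

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical dataset: age is always recorded, the answer may be missing.
ages, answers = [], []
for _ in range(5_000):
    age = rng.uniform(18, 70)
    # Assumed mechanism for this sketch: under-30s skip the question more often.
    p_missing = 0.5 if age < 30 else 0.1
    ages.append(age)
    answers.append(None if rng.random() < p_missing else rng.gauss(10, 3))

# Split the observed feature by whether the answer is missing.
age_missing = [a for a, y in zip(ages, answers) if y is None]
age_observed = [a for a, y in zip(ages, answers) if y is not None]

def welch_t(x, y):
    """Welch's t statistic for the difference between two sample means."""
    return (fmean(x) - fmean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

t_stat = welch_t(age_missing, age_observed)

print(f"mean age, answer missing:  {fmean(age_missing):.1f}")
print(f"mean age, answer observed: {fmean(age_observed):.1f}")
print(f"Welch t statistic:         {t_stat:.1f}")
```

A t statistic far from zero indicates that missingness is associated with age, which rules out MCAR for this variable and suggests that age should at least be included in any imputation model.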
By utilizing appropriate techniques, we can mitigate the impact of missing data and enhance the reliability of our findings.
We encourage you to join the Data-Centric AI Discord server to learn more about missing data and other issues that impact data quality for data science.
Cover Photo by Emily Morter on Unsplash