Data-Centric AI — A Statistician’s View

How data improves models by lessening uncertainty

It’s not every day that I read an academic paper that does a perfect job of balancing philosophical rigor and technical depth. I love deeply technical and applied ML research that drives breakthroughs that improve the world. But I also love getting into juicy and thoughtful discussions with other statisticians, data scientists, and machine learning engineers about the philosophical underpinnings of the practices that we employ.

That’s why I was delighted to read the preprint of “Sources of Uncertainty in Machine Learning -- A Statisticians' View” from researchers at LMU Munich. Although the title of the paper is about uncertainty, as I dug deeper into the paper, I realized that the authors were writing as much about data-centric AI (although they didn’t use those words) as uncertainty.

The authors started off by defining uncertainty in the context of machine learning, dividing it into “aleatoric” (often referred to as irreducible) and “epistemic” (often thought of as reducible) uncertainty. After debunking some common misconceptions about the topic and presenting formal, mathematical definitions for these concepts, the authors dove into the role that data plays in improving models and decreasing uncertainty.

I highly recommend that anybody interested in this topic read the original paper for themselves. In this blog post, I’m going to summarize various aspects of the paper and intersperse the summary with my own comments and connections.


On Uncertainty

The authors start by discussing uncertainty, quickly splitting it using the common framework of aleatoric uncertainty and epistemic uncertainty.

  • Aleatoric uncertainty is irreducible. This means that even the “best possible model” wouldn’t be able to make error-free predictions
  • Epistemic uncertainty is reducible. It’s the difference between our actual model and that theoretical “best possible model”

Epistemic uncertainty can be further divided into model uncertainty and parametric uncertainty.

  • Model uncertainty is diminished by selecting the right model architecture. For instance, if a phenomenon is generated by a linear process, linear regression is an appropriate model architecture, while if a phenomenon is generated by a step function, a tree-based model might be more appropriate
  • Parametric uncertainty is diminished by finding the right model parameters (e.g. in our simple linear regression, y = mx+b, m and b are our parameters). This, fundamentally, is a function of having the right data with which to tune our model
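The split between reducible and irreducible uncertainty can be illustrated with a small simulation. This is only a sketch using the Python standard library: the true slope, intercept, noise level, sample sizes, and all function names are assumptions chosen for illustration and do not come from the paper. It repeatedly fits y = mx + b on small and large samples and compares how much the estimated slope varies (parametric uncertainty) with how large the leftover residual noise is (the aleatoric part).

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b; returns (m, b)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def simulate(n, rng, true_m=2.0, true_b=1.0, noise_sd=1.0):
    """Draw n points from an assumed linear process with Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [true_m * x + true_b + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def slope_spread_and_residual_sd(n, repeats=200, seed=0):
    """Refit the model on fresh samples of size n.

    Returns (spread of the slope estimates, average residual std dev).
    """
    rng = random.Random(seed)
    slopes, residual_sds = [], []
    for _ in range(repeats):
        xs, ys = simulate(n, rng)
        m, b = fit_line(xs, ys)
        slopes.append(m)
        residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
        residual_sds.append(statistics.pstdev(residuals))
    return statistics.stdev(slopes), statistics.fmean(residual_sds)


small = slope_spread_and_residual_sd(n=20)
large = slope_spread_and_residual_sd(n=2000)
print(f"n=20:   slope spread={small[0]:.4f}, residual sd={small[1]:.2f}")
print(f"n=2000: slope spread={large[0]:.4f}, residual sd={large[1]:.2f}")
```

Under these assumptions, the spread of the slope estimates should shrink markedly as the sample grows, while the residual standard deviation stays close to the noise level built into the simulation. More rows reduce parametric uncertainty; they do nothing about the noise the available features cannot explain.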

The authors use the classic example of rolling a die. While we can gather data about the frequency of outcomes from rolling the die and thereby minimize our epistemic uncertainty, without modeling the physics of the die in motion (its momentum and the various forces acting on it), there will remain a certain amount of “aleatoric uncertainty”.

In the words of the authors:

"If no additional variables are gathered, it is not possible to further reduce the uncertainty concerning the fair dice roll and all uncertainty is of aleatoric nature. If however, all relevant physical quantities are measured… the process of rolling a dice becomes deterministic and no aleatoric uncertainty is present"

Intuitively, we can already see the connection that the authors are going to make to data. Although they don’t spell it out yet, we can consider that improving our data helps to diminish uncertainty in our model. Specifically, collecting more attributes about our data (in the die example, collecting data about the initial position of the die and how it was thrown) moves uncertainty from being aleatoric to being epistemic. Meanwhile, collecting more and higher-quality data (e.g. less noisy, with fewer errors) helps us to select a better model and better parameters for that model, thus diminishing epistemic uncertainty.

As the authors emphasize in the conclusion:

"uncertainty has multiple sources and ignoring it can have severe consequences on the validity of trained machine learning models."

These sources of uncertainty all stem from data, which is the next topic of the authors’ discussion.

On Data

Next, the authors move from talking about data indirectly to focusing on data as the key to improving model performance.

As the authors write:

"Data for training models are often deficient, leading analysts to end up with suboptimal models, often even without considering potentially superior alternatives"

We can understand the deficiency of data for training models as coming from five places:
  • Missing features
  • Imperfect observations
  • Incorrect labels
  • Wrong data
  • Changing processes

Let's dive into each of them with a bit more depth.

Missing features

"Omitted variables or missing features… are a relevant source of a model's uncertainty"

Missing features are variables that could have improved the model if they had been included. In the case of our die-throwing example, if we failed to take into account the properties of the surface that the die was being thrown onto, this missing feature would limit the predictive power of our model.

As discussed in the section on uncertainty, we can conceptualize missing features as turning what could’ve been epistemic uncertainty into aleatoric uncertainty. Collecting data on these missing features (or engineering new features) is a key part of the model improvement process practiced in data-centric AI.
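A toy simulation can make this shift visible. The sketch below is hypothetical throughout: the `surface` feature, its effect sizes, and the noise level are invented for illustration and are not taken from the paper. It compares the spread of prediction errors when the feature is unavailable (predict the overall mean) with the spread when it is recorded (predict the per-surface mean).

```python
import random
import statistics
from collections import defaultdict

rng = random.Random(42)

# Hypothetical data-generating process: the outcome depends on a feature
# ("surface") that the analyst may or may not have recorded.
SURFACE_EFFECT = {"felt": 1.0, "wood": 4.0, "glass": 7.0}

rows = []
for _ in range(3000):
    surface = rng.choice(list(SURFACE_EFFECT))
    outcome = SURFACE_EFFECT[surface] + rng.gauss(0, 0.5)
    rows.append((surface, outcome))

outcomes = [y for _, y in rows]

# Without the feature, the best constant prediction is the overall mean,
# so the error spread is the spread of the outcome itself.
sd_without_feature = statistics.pstdev(outcomes)

# With the feature, predict the mean outcome observed for each surface.
by_surface = defaultdict(list)
for surface, y in rows:
    by_surface[surface].append(y)
group_mean = {s: statistics.fmean(ys) for s, ys in by_surface.items()}
residuals = [y - group_mean[s] for s, y in rows]
sd_with_feature = statistics.pstdev(residuals)

print(f"error sd without surface: {sd_without_feature:.2f}")
print(f"error sd with surface:    {sd_with_feature:.2f}")
```

In this setup, an analyst who never collected `surface` would see a large error spread and have to treat it as irreducible. Once the feature is recorded, most of that spread becomes explainable, and only the simulated noise remains.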


Imperfect observations

But even if we collect all of the potentially relevant features, we can still introduce uncertainty into our model from errors in our data collection.

"Imperfect measurement instruments, usage of proxy variables, and subjectivity in labeling decisions are among the sources of errors in data"

Imperfect observations are issues with the data about the independent variables that have been collected. We can imagine that if we failed to measure the velocity of a throw with enough accuracy or precision, we wouldn’t be able to predict the outcome of our die roll.

By improving the quality of our observed data, we can improve the models that we train on our data.
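One well-known way that measurement error hurts a model is attenuation (sometimes called regression dilution): noise in a feature pulls the fitted slope toward zero. The sketch below is an illustration under assumed values (a true slope of 2.0, unit-variance velocity, and unit-variance measurement error); the variable names are hypothetical and nothing here is taken from the paper.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(1)
TRUE_SLOPE = 2.0  # assumed relationship between velocity and outcome
n = 5000

true_velocity = [rng.gauss(0, 1) for _ in range(n)]
outcome = [TRUE_SLOPE * v + rng.gauss(0, 0.5) for v in true_velocity]

# An imperfect instrument: each reading is the true value plus random error.
measured_velocity = [v + rng.gauss(0, 1) for v in true_velocity]

slope_precise = ols_slope(true_velocity, outcome)
slope_noisy = ols_slope(measured_velocity, outcome)

print(f"slope with precise measurements: {slope_precise:.2f}")
print(f"slope with noisy measurements:   {slope_noisy:.2f}")
```

With these settings the theoretical attenuation factor is var(x) / (var(x) + var(error)) = 0.5, so the slope fitted on noisy readings should land near 1.0 rather than 2.0. The model trained on imperfect observations systematically understates the effect of velocity, no matter how many rows are collected.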


Incorrect labels

"Errors may occur not only in the features but also in the outcome"

Incorrect labels are the natural complement of imperfect observations, but instead of the independent variables being mismeasured, our dependent variable is mislabeled.

Stretching the die example to its breaking point, we can imagine that sometimes we make a mistake when recording the outcome of a roll. When these mistakes make their way into training data, a model won’t properly learn the relationship between the inputs and the outcome.

"Labellers can differ in their level of uncertainty or reliability (variance) and their central tendency (bias), which carries forward to the data"

High-quality data labeling is a key part of data-centric AI. There are excellent data labeling platforms and services that work to minimize the chances of incorrect labels.
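The bias and variance of labellers mentioned in the quote can be estimated whenever a gold-standard subset is available. The following sketch assumes such a subset exists and simulates two hypothetical labellers; their names, bias values, and spreads are invented for illustration.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical gold-standard scores for 2,000 items (e.g. a 0-10 rating).
truth = [rng.uniform(0, 10) for _ in range(2000)]

# Two simulated labellers; the bias and sd values are assumptions.
LABELLERS = {
    "careful": {"bias": 0.0, "sd": 0.3},
    "lenient_and_noisy": {"bias": 1.0, "sd": 1.5},
}


def label(score, bias, sd):
    """Return a labeller's shifted, noisy version of the true score."""
    return score + bias + rng.gauss(0, sd)


report = {}
for name, params in LABELLERS.items():
    errors = [label(t, **params) - t for t in truth]
    report[name] = {
        # Central tendency of the error: the labeller's bias.
        "estimated_bias": statistics.fmean(errors),
        # Spread of the error: the labeller's (un)reliability.
        "estimated_sd": statistics.stdev(errors),
    }

for name, stats in report.items():
    print(f"{name}: bias={stats['estimated_bias']:+.2f}, "
          f"sd={stats['estimated_sd']:.2f}")
```

A report like this suggests different remedies for different labellers: a consistent bias can be corrected by recalibration, while a large spread calls for clearer guidelines or multiple labels per item.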

Wrong data

Data can be wrong, not only in the crude sense that an observation can be imperfect or a label can be incorrect but also in the broader sense that the data can be wrong for the particular problem that a practitioner is trying to solve.

"Data are omnipresent in many machine learning applications, and high data quality is key…. To properly deploy machine learning models in real-world settings, data scientists need some knowledge about the data production process itself. Are the data suitable for the purpose intended by the analyst?"

This is a real-world problem that many data analysts and data scientists have likely encountered before. Because there is often a division between data and machine learning engineers on the one hand and data scientists and analysts on the other, valuable context about the data that models are trained on gets lost.

There’s no technological fix for this problem, no tool or platform that can solve it. Making sure that the data collected align with the problems being solved is an organizational and process change, as well as a key part of enacting a successful data-centric AI strategy.

An important insight that the authors bring up is that this is fundamentally because data isn’t some natural abstraction that exists and then is plucked out of the air by data engineers. Data are artifacts, literally artificial, and require selection and design. In the authors' words:

"When faced with data it is easy to overlook the fact that data are always designed, deliberately or unintentionally, although the eventual data analyst might be uninvolved or even unaware of decisions going into the data generating process"

Or, as Vicki Boykis puts it in one of my all-time favorite blog posts:
“All numbers are made up, some are useful.”


Changing processes

The final topic is an old favorite of mine because it requires something that I’ve been advocating for years: monitoring ML models. But this is putting the cart before the horse.

The issue at hand is changing data. Real-world processes change and evolve, and the data change alongside them. While countless blog posts and papers have dissected the different types of data change (training-serving skew, concept drift, covariate shift, etc.), the fundamental fact remains that data change, leaving models stale.

"When machine learning models are deployed in real-world applications, possible shifts in the data need to be dealt with. For example, new measurement protocols may be in place, or in a new data source the relations learned by the model may no longer be valid, for instance, due to changes in true relationships or due to changes in… data deficiencies and errors"
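One simple form of monitoring is to compare the distribution of a feature at training time with its distribution in production. The sketch below is a minimal, standard-library illustration and not a method described in the paper. It computes the two-sample Kolmogorov-Smirnov statistic by hand and compares it with the usual large-sample 5% critical value. The data, the size of the shift, and the function names are assumptions.

```python
import bisect
import math
import random


def ks_statistic(reference, current):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    gap = 0.0
    for value in ref + cur:
        cdf_ref = bisect.bisect_right(ref, value) / len(ref)
        cdf_cur = bisect.bisect_right(cur, value) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def drift_threshold(n_ref, n_cur, c_alpha=1.358):
    """Approximate large-sample critical value at the 5% level."""
    return c_alpha * math.sqrt((n_ref + n_cur) / (n_ref * n_cur))


rng = random.Random(3)
training = [rng.gauss(0.0, 1.0) for _ in range(1000)]
serving_stable = [rng.gauss(0.0, 1.0) for _ in range(1000)]
serving_shifted = [rng.gauss(0.5, 1.0) for _ in range(1000)]  # mean has moved

threshold = drift_threshold(len(training), len(serving_stable))
stat_stable = ks_statistic(training, serving_stable)
stat_shifted = ks_statistic(training, serving_shifted)

for name, stat in [("stable", stat_stable), ("shifted", stat_shifted)]:
    print(f"{name}: KS={stat:.3f}, threshold={threshold:.3f}, "
          f"drift flagged={stat > threshold}")
```

A check like this only catches shifts in the marginal distribution of a single feature. Changes in the relationship between features and the outcome, the kind the authors attribute to "changes in true relationships", need labelled production data or performance monitoring to detect.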


In conclusion, as the authors imply without stating it outright, the key to improving ML model performance is improving the data that those models are trained on. This is the core tenet of data-centric AI and aligns with the experiences of many practitioners. Although the authors say little about strategies for mitigating these problems, we at YData make it our mission to help data scientists improve their models by improving their data, and so are happy to provide support and guidance for data scientists looking to leverage data-centric AI (DCAI).

I hope that this summary and synthesis of “Sources of Uncertainty in Machine Learning -- A Statisticians' View” has been as interesting for you to read as it has been for me to write. If you want to explore any of the topics that I’ve highlighted here, I highly recommend checking out the original paper, which goes into much greater depth and detail than I do. If you want advice on how to mitigate the data problems that the paper brings up, reach out to the YData team or start your journey into data-centric AI today with Fabric.
