Using synthetic data to overcome bias in Machine Learning

Machine Learning models are excellent tools to analyze large data sets and can have incredible accuracy in challenging tasks, from face recognition to credit scoring. Unfortunately, these models are not entirely free from bias and can easily reinforce these biases, especially in sensitive areas like crime prevention and healthcare.

Barocas et al. state in their "Fairness and Machine Learning: Limitations and Opportunities" book that fairness is most often conceptualized as equality of opportunity. Therefore, a Machine Learning model would be unfair if it was biased toward a specific person or group.

Let’s start with an example. If a person lives in a poor neighborhood, an ML model may rate their credit as risky simply because that is the likely rating of most individuals living in the area. This type of bias is called guilt by association.

The key to solving this bias is not to remove features from the data (in this case, location) or other sensitive ones like race or gender, as this would lead to a less accurate model and likely have side effects. An example of this discouraged approach is excluding gender from car insurance pricing, which penalizes women since, statistically, they have fewer accidents and could otherwise pay a lower premium.

One possible solution is adding more features to make the model less biased. By including other personal attributes, like education level and owned property, location will be less relevant as other factors will start to balance the decision in other directions regarding credit risk.

By definition, all machine learning models are biased, since they are built to discriminate regardless of the attribute involved: to separate dogs from cats, or those who will repay a loan from those who won’t. However, some biases contribute little to the accuracy of the model and should be mitigated.
Bias is more common in underrepresented classes that deviate considerably from “normal” behavior. In the credit scoring case, an example is someone with a high income who lives in a poor neighborhood. Since these cases are rare, the model will struggle to classify them properly, and its answer will mostly be biased towards the majority class.

A key element to be aware of is that the model is trained against the group, not at the individual level. So, statistically, it may be accurate even while making a few biased predictions. Accuracy may thus not be the best (or only) metric to be considered. Another possibility is the reduction of false positives (FP) - even if this may lead to a decrease in accuracy.
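To see why a single accuracy figure can hide group-level problems, here is a minimal sketch in plain Python. The function name `group_report`, the toy labels, and the group names "A" and "B" are made up for illustration and do not come from any real dataset.

```python
def group_report(y_true, y_pred, groups):
    """Accuracy, recall, and false positive rate, overall and per group."""

    def metrics(pairs):
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        tn = sum(1 for t, p in pairs if t == 0 and p == 0)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        return {
            "accuracy": (tp + tn) / len(pairs),
            # Recall and FPR are undefined when there are no positives / negatives.
            "recall": tp / (tp + fn) if tp + fn else None,
            "fpr": fp / (fp + tn) if fp + tn else None,
        }

    rows = list(zip(y_true, y_pred, groups))
    report = {"overall": metrics([(t, p) for t, p, _ in rows])}
    for g in sorted(set(groups)):
        report[g] = metrics([(t, p) for t, p, gg in rows if gg == g])
    return report


# Hypothetical toy data: group "A" is the majority, group "B" the minority.
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]
groups = ["A"] * 8 + ["B"] * 2

report = group_report(y_true, y_pred, groups)
for name, values in report.items():
    print(name, values)
```

In this toy case the overall accuracy looks acceptable, while every positive case in group "B" is missed. Reporting the metrics per group makes that visible.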

How many types of data bias are there?

There are several types of bias: statistical bias, data bias, and model bias. Here, let’s explore data bias further.

Data bias can be divided into 3 categories:

  • Undersampling: minority classes are underrepresented in the data.

  • Labeling errors: mislabeled data.

  • User-generated bias: when the analyst unintentionally increases the bias while processing the data and training the algorithm.

What’s the impact of data bias?

The outcome of an algorithm is only as good as the data used to train it.

Most Machine Learning algorithms minimize a loss function averaged over all training samples. The simple fact that a specific category, or combinations of categories, are rare in the data could make the model unreliable or biased because it was exposed to a limited number of samples that have little relative impact on the loss function. 
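The effect of averaging can be sketched with a few lines of plain Python. The per-sample loss values below are invented for illustration, and `loss_share` is a hypothetical helper, not part of any library.

```python
def mean_loss(losses):
    """The quantity most training procedures minimize: the average per-sample loss."""
    return sum(losses) / len(losses)


def loss_share(losses, groups):
    """Fraction of the total loss that each group contributes."""
    total = sum(losses)
    shares = {}
    for loss, group in zip(losses, groups):
        shares[group] = shares.get(group, 0.0) + loss / total
    return shares


# Hypothetical numbers: 95 majority samples with a small loss each,
# and 5 minority samples whose loss is five times higher per sample.
losses = [0.1] * 95 + [0.5] * 5
groups = ["majority"] * 95 + ["minority"] * 5

shares = loss_share(losses, groups)
print("mean loss:", round(mean_loss(losses), 3))
print("share of loss by group:", {g: round(s, 3) for g, s in shares.items()})
```

Even though each minority sample is fitted five times worse here, the minority group accounts for only about a fifth of the objective, so an optimizer gains more by improving the majority slightly than by fixing the minority.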

Much attention has been given to gender and race bias, but bias may be exposed through many other unsuspected ways. How problematic the bias is depends on the importance of that feature to the model’s output.

Looking at another example, if 95% of employees are male, any algorithm trained on existing data will probably give a lower score to female applicants simply because it was exposed to many more male than female candidates. This happened recently with the Amazon algorithm that was supposed to automatically screen job applicants. The algorithm may have generated what is called a spurious correlation, reinforced through confirmation bias.

This correlation may very well be authentic, but we need to properly validate this hypothesis by training the algorithm on a less gender-biased population. However, this is not easy to do, as it would require collecting vast amounts of data and testing for the counterfactuals. Removing the gender attribute is also not a solution, as gender may contain important information and/or may be inferred from other attributes (this phenomenon is called proxying).

In a recent publication, economists Laura Blattner at Stanford University and Scott Nelson at the University of Chicago show that differences in mortgage approval between minority and majority groups are not just due to bias but also to the fact that minority and low-income groups have less data in their credit histories.

This means that when this data is used to calculate a credit score, and that score is used to predict loan default, the prediction will be less precise, thus leading to even higher inequality.

Having understood the context of data bias through various examples, this article further explains how synthetic data can be used to alleviate data bias and increase fairness.

How to reduce data bias/unfairness?

Broadly there are two ways to reduce data bias: fix the existing data or use synthetic data to mitigate the problems.

Fixing the existing data consists of the following elements:

  • Include more data to create a more representative set.

  • Undersample the majority class with respect to the sensitive attribute(s).

  • Remove noisy samples.

  • Correct noisy labels.

Removing or correcting data can have drawbacks: it may reduce the size of the dataset or introduce undesirable noise, and collecting more annotated data could be prohibitively expensive. A better solution is to use synthetic data to balance the dataset.

Synthetic data offers a promising alternative for reducing bias in the data. With synthetic data, we can increase the size of the minority class by augmenting the data and inserting new (synthetic) samples containing the minority class or combinations of rare and sensitive attributes - a process called data balancing.

Why synthetic data?

Synthetic data is a new concept that still confuses many practitioners: it is not fake data and should not be confused with test or mock data. 

In other words, synthetic data is data that preserves all the statistical properties of the original data (distributions, correlations, etc.) but that is generated, thus not matching any existing real record.

Some may think that synthetic data introduces more noise than signal. This is not the case, since the data is produced by machine learning algorithms with strong constraints that tie it closely to the real data: it will look real without ever having been observed.

With YData, one can generate as much data as required through the use of a Synthesizer.

A use-case example: How to reduce bias in data

We will now show how to tackle fairness bias in a well-known public dataset - the Adult Census Income. This dataset is a collection of census data from 1994 mainly used for prediction tasks where the goal is to identify if a person makes over 50K a year. Each person is described by 14 features focused on personal information, including sensitive attributes such as race and sex.

After exploring the data, we can identify several sensitive categorical features that show high levels of imbalance in their representativeness. Some examples are sex, race, relationship, and marital status. We focused our use case on the race variable, considering its relevance in fairness bias. When further analyzing this variable, we find a clear dominance of the "White" class, which covers 85.4% of the instances (we will refer to this class as the dominant one). All the remaining classes are considerably less represented, with "Other" being the one with the fewest instances - only 0.8% of the rows (we will refer to this class as the minority one). Although any of the underrepresented classes could be tackled in this use case, we focused on "Other" since it covers different minorities, which aggravates the lack of representation.

Representativeness (i.e., number of rows) for each class of the race variable.

The imbalance found between classes of the race variable is also visible when considering the outcome variable. Only 9.2% of the people in the "Other" class make over 50K a year, which will highly bias any prediction for this group.

Class imbalance of the outcome variable.
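The two quantities discussed above (share of rows per class, and the rate of the positive outcome within each class) can be computed with pandas along these lines. This is only a sketch on a tiny invented table: the column names `race` and `income`, the label `">50K"`, and the helper `class_summary` are assumptions for illustration, and the numbers are not those of the Adult Census Income dataset.

```python
import pandas as pd


def class_summary(df, sensitive, target, positive):
    """Share of rows and positive-outcome rate for each class of `sensitive`."""
    share = df[sensitive].value_counts(normalize=True).rename("share_of_rows")
    positive_rate = (
        (df[target] == positive).groupby(df[sensitive]).mean().rename("positive_rate")
    )
    summary = pd.concat([share, positive_rate], axis=1)
    return summary.sort_values("share_of_rows", ascending=False)


# Invented toy data, only to show the shape of the output.
toy = pd.DataFrame(
    {
        "race": ["White"] * 8 + ["Other"] * 2,
        "income": [">50K"] * 2 + ["<=50K"] * 6 + ["<=50K"] * 2,
    }
)

summary = class_summary(toy, sensitive="race", target="income", positive=">50K")
print(summary)
```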

We tackle this fairness issue by using a YData Synthesizer to oversample this particular group, aiming to make it better represented in the dataset. We start by filtering our interest group from the training data, and we train a Regular Synthesizer on this filtered data. We then perform oversampling on all classes of the race variable (except the dominant one) using the new samples. The imbalance ratio must not surpass 1 (i.e., perfect balance); otherwise, we could create a new imbalance issue, this time against the dominant class. To avoid this, we subsample the new samples so that the imbalance ratio stops at 1. We can see in the image below that the imbalance gap between people of the race "Other" who make more and less than 50K was considerably reduced after the oversampling.

Class imbalance of the outcome variable after oversampling with YData's Synthesizer.
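The capping step can be sketched as follows. This is not the YData Synthesizer: `stand_in_generator` simply bootstrap-resamples real rows so that the example has some "new" rows to work with, and `oversample_with_cap` is a hypothetical helper. We also assume here that the imbalance ratio is the minority row count divided by the dominant row count, which the text does not define explicitly.

```python
import pandas as pd


def stand_in_generator(group_df, n, seed=0):
    """Placeholder for a trained synthesizer (assumption, NOT a real synthesizer).

    It only resamples existing rows with replacement to produce n rows.
    """
    return group_df.sample(n=n, replace=True, random_state=seed)


def oversample_with_cap(train, new_samples, sensitive, minority, dominant, max_ratio=1.0):
    """Add new minority rows without letting minority/dominant exceed max_ratio."""
    n_dominant = int((train[sensitive] == dominant).sum())
    n_minority = int((train[sensitive] == minority).sum())
    # Number of rows we can still add before reaching the cap.
    room = max(int(max_ratio * n_dominant) - n_minority, 0)
    # Subsample the generated rows down to the available room.
    kept = new_samples.sample(n=min(room, len(new_samples)), random_state=0)
    return pd.concat([train, kept], ignore_index=True)


# Invented toy training set: 10 dominant rows, 2 minority rows.
train = pd.DataFrame(
    {
        "race": ["White"] * 10 + ["Other"] * 2,
        "income": ["<=50K"] * 7 + [">50K"] * 3 + ["<=50K", ">50K"],
    }
)

minority_rows = train[train["race"] == "Other"]
new_rows = stand_in_generator(minority_rows, n=20)  # deliberately too many
balanced = oversample_with_cap(train, new_rows, "race", "Other", "White")
counts = balanced["race"].value_counts()
print(counts)
```

Generating more rows than needed and then subsampling keeps the cap logic independent of how the new rows were produced.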

We also apply undersampling as a baseline for comparison. In this case, the imbalance ratio is 1 between all classes, and the instances to keep are selected at random.
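A random undersampling baseline of this kind can be sketched with pandas by keeping, for every class, as many randomly chosen rows as the smallest class has. The helper name `undersample_to_balance` and the toy data are assumptions for illustration.

```python
import pandas as pd


def undersample_to_balance(df, sensitive, seed=0):
    """Randomly keep the same number of rows for every class of `sensitive`."""
    n_smallest = int(df[sensitive].value_counts().min())
    return df.groupby(sensitive).sample(n=n_smallest, random_state=seed)


# Invented toy data with three classes of different sizes.
toy = pd.DataFrame(
    {
        "race": ["White"] * 10 + ["Black"] * 3 + ["Other"] * 2,
        "age": list(range(20, 35)),
    }
)

undersampled = undersample_to_balance(toy, "race")
under_counts = undersampled["race"].value_counts()
print(under_counts)
```

The cost of this baseline is visible in the row count: most of the dominant class is discarded to match the smallest class.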

We then train two classifiers (Random Forest and AdaBoost) on the original, balanced, and undersampled data. Before training the classifiers, we preprocess the data: all the categorical features are encoded into a numeric representation, and the missing values are replaced by the feature's mean or mode (depending on whether the variable is continuous or categorical). We evaluate the accuracy, F1 score, and recall on the test set. The metrics are evaluated independently for the overall data, the dominant class ("White"), and the minority one ("Other"). The results for a single run are available in the figure below, and for 10 independent runs in the table below.
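The preprocessing and per-group evaluation described above could look roughly like the sketch below, written with pandas and scikit-learn. It runs on randomly generated toy data, not on the Adult Census Income dataset, and the helper names (`fit_preprocessor`, `transform`, `evaluate`), the hyperparameters, and the integer encoding are our assumptions; the original experiment may have used different choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score


def fit_preprocessor(train):
    """Learn fill values (mean or mode) and category lists from the training data."""
    fills, categories = {}, {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fills[col] = train[col].mean()
        else:
            fills[col] = train[col].mode().iloc[0]
            categories[col] = sorted(train[col].dropna().unique())
    return fills, categories


def transform(df, fills, categories):
    """Impute missing values, then integer-encode the categorical columns."""
    out = df.fillna(fills)
    for col, values in categories.items():
        out[col] = pd.Categorical(out[col], categories=values).codes
    return out


def evaluate(model, X, y, groups):
    """Accuracy, F1, and recall for the overall data and for each group."""
    pred = model.predict(X)
    masks = {"overall": np.ones(len(y), dtype=bool)}
    masks.update({g: groups == g for g in np.unique(groups)})
    rows = {}
    for name, mask in masks.items():
        rows[name] = {
            "accuracy": accuracy_score(y[mask], pred[mask]),
            "f1": f1_score(y[mask], pred[mask], zero_division=0),
            "recall": recall_score(y[mask], pred[mask], zero_division=0),
        }
    return pd.DataFrame(rows).T


# Randomly generated toy data (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 600
data = pd.DataFrame(
    {
        "age": rng.integers(18, 70, n).astype(float),
        "education": rng.choice(["HS", "Bachelors", "Masters"], n),
        "race": rng.choice(["White", "Other"], n, p=[0.85, 0.15]),
    }
)
target = ((data["age"] > 40) & (data["education"] != "HS")).astype(int).to_numpy()
data.loc[rng.choice(n, 30, replace=False), "age"] = np.nan
data.loc[rng.choice(n, 30, replace=False), "education"] = np.nan

train, test = data.iloc[:450], data.iloc[450:]
y_train, y_test = target[:450], target[450:]
groups_test = test["race"].to_numpy()

fills, categories = fit_preprocessor(train)
X_train = transform(train, fills, categories)
X_test = transform(test, fills, categories)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}
results = {
    name: evaluate(model.fit(X_train, y_train), X_test, y_test, groups_test)
    for name, model in models.items()
}
for name, table in results.items():
    print(name)
    print(table)
```

Learning the fill values and category lists from the training split only, and then applying them to the test split, keeps the encoding consistent between the two and avoids leaking test statistics into training.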

The results show that performing oversampling with the Synthesizer provides the best metrics for the minority class while maintaining stable results for the dominant class and the overall data. This pattern is particularly visible in the F1 score and recall metrics: on average, for the minority class, we get a 2.5% improvement in accuracy, 27.3% in the F1 score, and 58.7% in the recall. Recall increases when the number of false negatives decreases, which suggests that synthetic data helps reduce the false negatives for the minority class, thus reducing the bias in the prediction and mitigating any initial prejudice against individuals of minority races. Undersampling achieves minor improvements for the minority class but deteriorates the results for the dominant class and the overall data.

Accuracy, F1 score, and recall metrics for two classifiers (Random Forest and AdaBoost) trained on the original, balanced, and undersampled data.



By now, everyone in the AI industry understands that bias is a known issue across many machine learning problems. Data bias is more prevalent than you may realize: whenever a specific group (or combination of groups) is underrepresented in the data, the solutions built using this data tend to be biased towards the over-represented group. Following these fundamentals of machine learning, bias is bound to occur in practice. However, we can still proactively identify and mitigate it.

While solutions such as collecting more data points and undersampling can be explored, synthetic data remains one of the most reliable and cost-effective options, and it improves the performance of the overall solution. Using synthetic data also means that the original data is used in full, avoiding the loss of information that occurs when part of the data is dropped.

The benefits of using synthetic data were very well demonstrated in the Adult Census use case we saw, resulting in a reduction in bias and performance improvements in key metrics. Behind the scenes, our powerful synthesizers can learn patterns in the data and accurately synthesize more data points for the minority class, thus mitigating the bias by oversampling the minority class. 

If you want to try the Synthesizers for yourself and see whether they mitigate bias for your own specific use case, we invite you to get in touch with us and experience it yourself!
