How to deal with bias in data?

Reducing your AI bias with synthetic data

In recent days, countries have been swept by protests around a topic that we do not always give the attention we should: inequality and discrimination in our society towards black people. Regardless of whom this discrimination targets, one thing is for sure: it exists! But how does that link to data and AI? Through (algorithmic) bias! For that reason, today we will focus on how to deal with bias in datasets using synthetic data.


The dataset

For today’s topic, there could hardly be a better dataset than the Census dataset (you can easily find it on Kaggle). The data was collected by the US Census Bureau and extracted in 1994, so it is probably not the most up-to-date picture of demographics. The overall goal for this dataset is to classify whether a person has an income over 50k a year.

The dataset contains 41 variables, of which 32 are categorical, with a number of around 30 observations. A variable to pay attention to is “race”, which has the following categories in this dataset: White, Black, Asian or Pacific Islander, Amer Indian Aleut or Eskimo, and Others. For the purpose of this exercise, we will keep only two of them: “White” and “Black”. As you can observe in the image below, the dataset is imbalanced:

Race category highly imbalanced in the Census dataset

The white individuals represent around 87% of the dataset, whereas the black individuals account for only 13%. With a ratio of nearly 7 to 1 white individuals to black, this can have a huge impact on the models trained using the dataset: they will tend to learn the characteristics of the white population, which can result in poor predictions for any underrepresented race. As we are not able to collect more records at this point, how can we achieve an equal representation in order to reduce the bias present in the input data as much as possible?
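To make the imbalance concrete, here is a minimal sketch of how the share of each category and the imbalance ratio can be computed. The `race` list is a toy stand-in that only mirrors the 87% / 13% split quoted above; it is not the real Census data, and `class_shares` is a helper name introduced here for illustration.

```python
from collections import Counter


def class_shares(labels):
    """Return each category's share of the total as a fraction in [0, 1]."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# Toy stand-in for the "race" column: 87 White and 13 Black records,
# mirroring the 87% / 13% split (assumption: not the real dataset).
race = ["White"] * 87 + ["Black"] * 13

shares = class_shares(race)

# How many majority records exist for every minority record.
imbalance_ratio = shares["White"] / shares["Black"]

print(shares)
print(round(imbalance_ratio, 1))
```

The same check is worth repeating after any augmentation step, to confirm that the category shares actually moved towards parity.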


Synthetic data to the rescue!

How can I use synthetic data to solve my dataset’s bias issues?

To reduce the bias, we will generate synthetic data for the black population. For that purpose, we will use YData’s synthetic data generator lib. You can find the process followed for the data generation in this notebook. Before we proceed, let’s double-check the “sex” variable:
Sex variable ratio between Female/Male individuals

In this subset of the population, we can observe that males and females are equally represented. Now that we have the population from which we want to generate 3000 new samples to augment our training data, we are good to go and use YData’s synthetic data lib.
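The overall workflow (filter the under-represented group, generate new records from it, append them to the training data) can be sketched as follows. This is not YData’s generator and does not use its API: the hypothetical `naive_sampler` below is a deliberately simple stand-in that samples each column independently, and the records are toy values rather than Census data. The toy example generates just enough records to reach parity, whereas this post uses a fixed number of samples.

```python
import random


def naive_sampler(rows, n_samples, seed=0):
    """Stand-in generator: sample every column independently from `rows`.

    This ignores correlations between columns, which a real synthetic data
    model is meant to preserve. It only illustrates the workflow.
    """
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    return [
        {col: rng.choice(rows)[col] for col in columns}
        for _ in range(n_samples)
    ]


# Toy records (hypothetical values, not the Census data).
data = [
    {"race": "White", "sex": "Male", "age": 40},
    {"race": "White", "sex": "Female", "age": 35},
    {"race": "White", "sex": "Male", "age": 52},
    {"race": "Black", "sex": "Female", "age": 29},
    {"race": "Black", "sex": "Male", "age": 47},
]

# 1. Split out the under-represented group.
majority = [row for row in data if row["race"] == "White"]
minority = [row for row in data if row["race"] == "Black"]

# 2. Generate new records from the minority subset only
#    (here: as many as are needed to match the majority count).
synthetic = naive_sampler(minority, n_samples=len(majority) - len(minority))

# 3. Append the generated records to the original training data.
combined = data + synthetic
```

Training the generator only on the minority subset is what keeps the new records inside that group; the quality of those records then depends entirely on the generator used.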

 The adult census dataset


The synthetic data model trains quite fast (less than 1 minute), since in this case we are dealing with a pretty small amount of data with few variables. The resulting synthetic data is already in good shape, with a score of around 98.78%.

General stats from the new 1000 generated samples


The final results

To compare the original dataset against the new combined data (original + synthetic), we are going to train a set of classification models. As per the images below, combining the real data with the synthetic augmentation of the black individuals has resulted in an overall increase in both the models’ accuracy (2% on average) and f1_score (1.4% on average), with the biggest improvements observed for the SVM and K-Nearest Neighbors models.
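For reference, the two metrics used in this comparison can be computed as in the sketch below. The labels and predictions are made-up toy values used only to show the calculation; they are not the results of this experiment, and `pred_original` / `pred_combined` are hypothetical names for the predictions of a model trained on each dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = income over 50k, 0 = otherwise (hypothetical values).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
pred_original = [1, 0, 0, 1, 0, 1, 0, 0]  # model trained on original data
pred_combined = [1, 0, 1, 1, 0, 1, 0, 0]  # model trained on combined data

for name, pred in [("original", pred_original), ("combined", pred_combined)]:
    print(name, accuracy(y_true, pred), round(f1_score(y_true, pred), 3))
```

Both models should be scored on the same held-out set of real records, so that any difference comes from the training data rather than from the evaluation data.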


Average improvement of 2% accuracy for all the tested classifiers


Overall models accuracy for the original and combined datasets 


Overall models f1_score for the original and combined datasets




At YData we are concerned not only with the existing bias in our day-to-day lives, but more importantly, with the bias present in most of the datasets that Data Scientists handle every single day. We can all do our part in reducing what is a massive problem today, and we are excited to show how synthetic data can help you out. Improve your model’s results while helping your business become fairer.

There are a ton of exciting applications for synthetic data, from reducing your organization’s privacy debt to helping your data science teams move faster and augmenting your data for DL and ML model training. We’re looking forward to hearing about the use cases you’re struggling with, so feel free to reach out at

Fabiana Clemente, Chief Data Officer at YData.
