A practical guide to generating synthetic data using open-source GAN implementations
Advances in technology have made it possible to generate millions of gigabytes of real-world data in a single minute, a potential boon for any organization or individual. However, cleaning, processing, and extracting vital information from such mounds of data consumes a large amount of time and resources.
One answer to this problem is generating synthetic data.
What is Synthetic Data?
The definition of synthetic data is quite straightforward: artificially generated data that mimics real-world data. Organizations and individuals can tailor synthetic data to their needs, generating as much data as they require, according to their own specifications.
Synthetic data is highly beneficial for preserving privacy in information-sensitive domains: patients' medical records and banking customers' transactional details are two examples where synthetic data can mask the real data, enabling sensitive datasets to be shared among organizations.
A small amount of well-labelled data can be used to generate a large amount of synthetic data, cutting the time and energy needed to process massive real-world datasets.
There are many ways of generating synthetic data: SMOTE, ADASYN, Variational AutoEncoders, and Generative Adversarial Networks are a few techniques for synthetic data generation.
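To make one of these techniques concrete, the core idea behind SMOTE can be sketched in a few lines of NumPy: a new minority-class sample is created by interpolating between an existing minority sample and one of its nearest neighbours. This is a minimal illustrative sketch on hypothetical toy data, not the article's demonstration; in practice you would use a library such as imbalanced-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(minority, k=3):
    """Generate one synthetic sample by SMOTE-style interpolation (sketch)."""
    i = rng.integers(len(minority))
    x = minority[i]
    # distances from x to every other minority sample
    d = np.linalg.norm(minority - x, axis=1)
    d[i] = np.inf                                  # exclude the point itself
    neighbour = minority[rng.choice(np.argsort(d)[:k])]
    # the new point lies on the segment between x and its chosen neighbour
    return x + rng.random() * (neighbour - x)

minority = rng.normal(size=(10, 2))                # toy 2-D minority class
synthetic = smote_sample(minority)
```

Because the synthetic point is a convex combination of two real minority samples, it always falls inside the region the minority class already occupies.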
This article focuses on using Generative Adversarial Networks to generate synthetic data, including a practical demonstration using open-source libraries.
A Brief Introduction to GANs
Generating photorealistic faces using GANs based on StyleGAN3 research. Image from [1].
Many machine learning and deep learning architectures are prone to adversarial manipulation: the models fail when fed data that differs from the data they were trained on. Generative Adversarial Networks (GANs) were introduced by Ian Goodfellow [2] to address this adversarial problem, and GANs are currently very popular for generating synthetic data.
A typical GAN consists of two components, a generator and a discriminator, which compete with each other.
The generator is the heart of the GAN: it attempts to generate fake data that looks real by learning the features of the real data.
The discriminator compares the generated data against the real data, classifies whether the generated data looks real or not, and provides feedback that the generator uses to improve its data generation.
The goal of the generator is to generate data that can trick the discriminator.
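The two competing objectives can be written down as a pair of loss functions. The following is a minimal NumPy sketch of the standard (binary cross-entropy) GAN losses, using made-up discriminator scores rather than a real network, to show how tricking the discriminator translates into a lower generator loss:

```python
import numpy as np

def bce(pred, target):
    # binary cross-entropy; eps guards against log(0)
    eps = 1e-12
    return -np.mean(target * np.log(pred + eps)
                    + (1 - target) * np.log(1 - pred + eps))

def discriminator_loss(d_real, d_fake):
    # the discriminator wants real samples scored 1 and fakes scored 0
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    # the generator wants the discriminator to score its fakes as real (1)
    return bce(d_fake, np.ones_like(d_fake))

# As the fakes become more convincing (the discriminator's score on them
# rises), the generator's loss falls:
unconvincing = generator_loss(np.array([0.1]))   # D is sure these are fake
convincing = generator_loss(np.array([0.9]))     # D is nearly fooled
assert convincing < unconvincing
```

In a real GAN, the two networks are updated alternately: one step minimizing `discriminator_loss`, then one step minimizing `generator_loss`.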
A Vanilla GAN architecture. Image from [3].
Mode Collapse
Mode collapse is a common problem that GAN-based architectures face during adversarial training, where the generator repeatedly produces one specific type of data. Once the generator discovers that one kind of output reliably fools the discriminator, it keeps generating that same output.
This problem can easily go undetected: the training metrics may suggest that everything is running smoothly, while the generated results indicate otherwise.
An example of mode collapse in image-based GANs. Image from [4].
Wasserstein GAN (WGAN)
The main problem in a standard GAN is the mismatch in complexity between what the generator must produce and the feedback the discriminator returns.
A standard vanilla GAN uses the Binary Cross-Entropy (BCE) loss function [5] to evaluate whether the generated data looks real; the discriminator's output is a single probability between 0 and 1. The generator, however, must produce synthetic data that may have many features and values, and this single scalar is not enough for the generator to learn from. Due to this lack of guidance, the generator can easily fall into mode collapse.
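A well-known symptom of this weak feedback is the vanishing gradient of the original "saturating" generator loss, log(1 − D(G(z))): when the discriminator confidently rejects the fakes, almost no gradient flows back to the generator. A small NumPy sketch (illustrative, not from the article's demonstration) makes this visible:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Gradient of the saturating generator loss log(1 - D(G(z))) with respect
# to the discriminator's pre-sigmoid logit a works out to -sigmoid(a).
def saturating_grad(a):
    return -sigmoid(a)

# When the discriminator confidently rejects a fake (very negative logit),
# the gradient reaching the generator is nearly zero:
print(saturating_grad(-10.0))  # ~ -4.5e-05: almost no learning signal
print(saturating_grad(0.0))    # -0.5: useful signal only while D is undecided
```

This is one of the motivations for replacing the BCE objective with the Wasserstein loss described next.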
WGAN [6] alleviates the problem by replacing the discriminator with a critic: instead of a probability, the critic compares the distribution of the real data with the distribution of the generated data and outputs a score indicating how real the generated data looks. The Wasserstein loss used in WGAN measures the difference between the real and generated distributions via the Earth Mover's Distance.
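The resulting losses are strikingly simple. The sketch below shows the WGAN objectives on made-up critic scores; note that, unlike the BCE setup, the critic's scores are unbounded real numbers rather than probabilities:

```python
import numpy as np

def critic_loss(c_real, c_fake):
    # the critic maximizes the score gap between real and generated batches,
    # so as a loss we minimize the negative gap
    return -(np.mean(c_real) - np.mean(c_fake))

def wgan_generator_loss(c_fake):
    # the generator tries to raise the critic's score on its samples
    return -np.mean(c_fake)

# With real samples scored higher than fakes, the critic's loss is negative
# (it is winning), and the generator's loss shrinks as its scores improve:
print(critic_loss(np.array([3.0]), np.array([1.0])))   # -2.0
print(wgan_generator_loss(np.array([2.0, 4.0])))       # -3.0
```

For these losses to approximate the Earth Mover's Distance, the critic must be constrained to be 1-Lipschitz; the original WGAN paper enforced this by clipping the critic's weights to a small range, and the later WGAN-GP variant uses a gradient penalty instead.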