How to Leverage Data Profiling for Synthetic Data Quality
Leverage EDA to beat GANs challenges
Machine Learning has mainly evolved as a model-centric practice, where achieving quality results is pursued not through careful data processing but through experimenting with families of models and performing exhaustive parameter tuning routines. Recognizing the need for a more wholesome approach to ML, the data-centric movement has emerged, attempting to put data and data quality back on the main stage.
The main reason for this shift is simple: while good models are important, so is the data used to train them. As the old Computer Science adage goes, “garbage in, garbage out”; would we expect data used in ML to be any different? While good models and good data are often task-dependent, good data is typically more scalable, since effective data preprocessing should serve not one but a multitude of ML tasks. This translates to shorter iteration cycles towards the desired results, with benefits that compound across multiple projects.
Have you ever tried or experimented with Generative Adversarial Networks for data synthesis on a dataset of choice and wondered how you would guarantee the quality of the produced data?
You would have good reasons to be concerned, since the literature indeed points to a multitude of difficulties in GAN training:
- Mode collapse — A special kind of overfitting where the model produces very similar outputs regardless of the input it is given
- Training instability — Non-overlapping support between the distribution of the produced samples and the samples provided to the discriminator often leads to unstable training, which can completely jeopardize model learning
- Complex convergence dynamics — Either the Generator or the Discriminator can individually behave optimally in practical terms and still lead to undesirable overall results. The ideal outcome is a balanced match between the two models, but there is no straightforward way to ensure that this occurs.
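One cheap, practical way to spot the first of these failure modes is a diversity check: a collapsed generator produces batches whose pairwise distances are far smaller than those of real data. The sketch below is only illustrative (the `mean_pairwise_distance` helper is ours, not part of any package, and the "collapsed" batch is simulated), assuming real and generated samples share the same numeric feature space:

```python
import numpy as np

def mean_pairwise_distance(samples: np.ndarray) -> float:
    """Average Euclidean distance between all distinct pairs of rows."""
    diffs = samples[:, None, :] - samples[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    n = len(samples)
    return dists.sum() / (n * (n - 1))  # self-distances are zero, so just exclude them from the count

rng = np.random.default_rng(42)
# A diverse "real" batch vs. a simulated collapsed generator that emits
# near-identical outputs around a single point.
real = rng.normal(size=(200, 5))
collapsed = rng.normal(size=(1, 5)) + 0.01 * rng.normal(size=(200, 5))

# A collapsed generator shows drastically lower sample diversity.
print(mean_pairwise_distance(real) > 5 * mean_pairwise_distance(collapsed))
```

A large gap between the two diversities is a red flag worth investigating before any deeper quality analysis.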
As if the above points were not enough, in the end, GANs are still a black-box model. Input is fed to the model, and the output can be retrieved in the intended format. However, that barely scratches the surface of what it means to have synthetic data with good quality.
If you have ever experimented with creating synthetic data, then you probably have thought about this problem; we can summarise it with the following dilemma:
- There are many aspects of synthetic data that are related to its quality, so you want to be thorough
- Performing a deep analysis can be very time-consuming, and you don’t want to spend too long on this step
Having the best of both is tricky; this is where open-source packages can be of use. These can help you get to your results faster and often teach you new effective ways to handle old pains.
In this article, we will show you how to integrate the Data Profiler package into a ydata-synthetic project. ydata-synthetic is an open-source Python package that provides a series of generative model implementations for tabular and time-series data synthesis. Data Profiler is an open-source solution from Capital One that uses machine learning to help companies monitor big data and detect private customer information so that it can be protected. By combining these two packages, you can easily synthesize data and profile it to assess the quality of the generated data.
When assessing the quality of the synthetic outputs, we could focus on a combination of the following aspects:
- Privacy, does the generated data include any sensitive information? (Names, addresses, combinations of features that can make a real-world record identifiable)
- Fidelity, how well does the synthetic data preserve the original data properties? (Marginal distributions, correlations)
- Utility, how well does the synthetic data behave when put into use in a downstream ML application? (Train Synthetic Test Real, Feature importance distribution comparisons)
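To make the fidelity aspect concrete, one of its simplest building blocks is comparing the marginal distribution of a single numeric column between real and synthetic data, for example with a two-sample Kolmogorov–Smirnov test. The helper below is a hypothetical illustration of ours (not part of Data Profiler or ydata-synthetic), shown here with simulated columns:

```python
import numpy as np
from scipy.stats import ks_2samp

def marginal_fidelity(real_col: np.ndarray, synth_col: np.ndarray) -> float:
    """1 minus the KS statistic: ~1.0 for matching marginals, ~0.0 for disjoint ones."""
    statistic, _pvalue = ks_2samp(real_col, synth_col)
    return 1.0 - statistic

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=1000)
good_synth = rng.normal(loc=0.0, scale=1.0, size=1000)  # drawn from the same distribution
bad_synth = rng.normal(loc=3.0, scale=1.0, size=1000)   # shifted distribution

print(marginal_fidelity(real, good_synth))  # close to 1.0
print(marginal_fidelity(real, bad_synth))   # much lower
```

In practice you would run such checks per column and combine them with correlation comparisons, which is exactly the kind of summary a data profile gives you for free.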
For this study, we are concerned with fidelity. To analyze it, we will take the following steps:
- Install DataProfiler and import required packages
- Read a dataset
- Profile the dataset
- Data processing
- Define and fit a synthesizer
- Sample and post-process synthetic data
- Profiling synthetic data and comparing samples
- Saving/loading a profile for later analysis
We will cover basic Data Profiler usage in a data synthesis flow, focusing on the data pre-processing and its impact on the fidelity of the produced samples. We will leverage Data Profiler and its outputs to understand which pre-processing leads to the best results.
1. Install Data Profiler and import required packages
In this demo, we will fire up a conda environment called ydata_synth, then install and use the slimmer version of Data Profiler to explore the fidelity of our synthetic data.
conda create -n ydata_synth python=3.8
conda activate ydata_synth
pip install ydata-synthetic==0.7.1
pip install DataProfiler[reports]
Import all our required packages:
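A plausible import block for this flow is shown below. The aliases are our choice, and since exact import paths can differ across ydata-synthetic versions, the non-standard packages are guarded so the snippet still runs before the installation step above has completed:

```python
import pandas as pd
import numpy as np

# The two packages installed in the previous step; guarded imports keep
# the snippet runnable in environments where they are not (yet) installed.
try:
    import dataprofiler as dp  # Capital One's Data Profiler
except ImportError:
    dp = None

try:
    # Hyperparameter containers used by ydata-synthetic's synthesizers;
    # path assumed for the 0.7.x series and may vary between releases.
    from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
except ImportError:
    ModelParameters = TrainParameters = None
```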