Data is the foundation of modern machine learning models. However, data privacy issues, high costs, and the difficulty in obtaining large datasets make it challenging to develop robust and efficient models. This is where synthetic data generation steps in as a game-changer.
Synthetic data is information that is artificially created rather than generated by actual events. When generated well, it preserves the statistical properties of the original data, so models trained on it can be expected to perform comparably.
Today, we’re exploring how to generate synthetic tabular data using the ydata-synthetic library, a powerful Python package developed by YData. This open-source library allows us to generate high-quality synthetic data, preserving the intrinsic properties of the original dataset.
To begin, we need to install the ydata-synthetic package. This can be done with pip, by running pip install ydata-synthetic.
The ydata-synthetic package offers various models for different data types, including time-series and tabular data. For this tutorial, we will focus on tabular data, using the GAN-based synthetic data generators provided by the library.
The process has three main steps: loading the data and declaring its numeric and categorical features, configuring and training the model, and sampling synthetic records from the trained model.
For this tutorial, we will use the Adult Census Income dataset and a popular model for tabular data synthesis: the Conditional Tabular Generative Adversarial Network (CTGAN).
ydata-synthetic provides preprocessing tools that help transform our data into a form suitable for training. We just need to make the necessary imports.
We can then load the data and define the numeric and categorical features:
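As a minimal sketch of this step, the snippet below declares the feature lists and adds a small sanity check before training. It assumes the dataset is fetched with the third-party pmlb package and that the column names match PMLB's copy of the Adult dataset; neither is stated in this article. The load_adult and check_features helpers are hypothetical names introduced here for illustration.

```python
# Column names as found in PMLB's copy of the Adult dataset (an assumption).
num_cols = ["age", "fnlwgt", "capital-gain", "capital-loss", "hours-per-week"]
cat_cols = [
    "workclass", "education", "education-num", "marital-status", "occupation",
    "relationship", "race", "sex", "native-country", "target",
]


def load_adult():
    """Fetch the Adult dataset as a pandas DataFrame (needs `pip install pmlb`)."""
    from pmlb import fetch_data  # imported lazily: third-party dependency

    return fetch_data("adult")


def check_features(columns, num_cols, cat_cols):
    """Return a list of problems with the declared feature lists (empty if none)."""
    problems = []
    overlap = set(num_cols) & set(cat_cols)
    if overlap:
        problems.append(f"declared as both numeric and categorical: {sorted(overlap)}")
    missing = (set(num_cols) | set(cat_cols)) - set(columns)
    if missing:
        problems.append(f"not present in the data: {sorted(missing)}")
    return problems
```

With the data loaded via data = load_adult(), calling check_features(data.columns, num_cols, cat_cols) should return an empty list if the declarations line up with the DataFrame.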
ydata-synthetic provides multiple pre-built GAN models, including VanillaGAN, CGAN, WGAN-GP, and CTGAN. You can find more examples in the documentation, and learn more about GANs in this blog post.
To leverage CTGAN, we need to specify the model and training parameters. Then, we train it on our data. The preprocessing is done internally!
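The training step could look roughly like the sketch below. The class and argument names (RegularSynthesizer, ModelParameters, TrainParameters, and the fit keywords) reflect the 1.x releases of ydata-synthetic as best we recall, so check them against the version you have installed. The hyperparameter values are illustrative assumptions, not recommendations, and train_ctgan is a hypothetical wrapper introduced here.

```python
# Illustrative hyperparameters (assumed values; tune them for your own data).
CTGAN_SETTINGS = {"batch_size": 500, "lr": 2e-4, "betas": (0.5, 0.9), "epochs": 501}


def train_ctgan(data, num_cols, cat_cols, settings=CTGAN_SETTINGS):
    """Fit a CTGAN synthesizer on a DataFrame and return the fitted object."""
    # Imported lazily so this module still loads without ydata-synthetic installed.
    from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
    from ydata_synthetic.synthesizers.regular import RegularSynthesizer

    model_args = ModelParameters(
        batch_size=settings["batch_size"],
        lr=settings["lr"],
        betas=settings["betas"],
    )
    train_args = TrainParameters(epochs=settings["epochs"])

    synth = RegularSynthesizer(modelname="ctgan", model_parameters=model_args)
    # Preprocessing happens inside fit; we only declare the column roles.
    synth.fit(
        data=data,
        train_arguments=train_args,
        num_cols=num_cols,
        cat_cols=cat_cols,
    )
    return synth
```

Keeping the settings in a plain dictionary makes it easy to log or change them between runs without touching the training function.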
Once the model is trained, generating synthetic data is straightforward:
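A minimal sketch of the sampling call follows, assuming the fitted synthesizer exposes a sample method that takes the number of rows, as it does in the library releases we are aware of. The sample_synthetic wrapper is a hypothetical helper that only adds an argument check.

```python
def sample_synthetic(synth, n_rows=1000):
    """Draw n_rows synthetic records from an already fitted synthesizer."""
    if n_rows <= 0:
        raise ValueError("n_rows must be a positive integer")
    # Delegates to the synthesizer; for ydata-synthetic this returns a DataFrame.
    return synth.sample(n_rows)
```

For example, synth_data = sample_synthetic(synth) would request 1000 records from a synthesizer fitted earlier.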
This creates a new DataFrame with 1000 synthetic records. Post-processing is also handled internally, so the data is returned in the original format and range, and it should preserve the statistical properties of our original data.
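One rough way to check how well the statistical properties are preserved is to compare simple per-column summaries of the real and synthetic data. The sketch below is not part of ydata-synthetic; compare_columns is a hypothetical helper that reports the absolute difference in means for numeric columns and the total variation distance between category frequencies for categorical ones. It takes any mapping from column name to a sequence of values, such as the result of DataFrame.to_dict("list").

```python
from collections import Counter


def compare_columns(real, synth, num_cols, cat_cols):
    """Return a per-column gap between real and synthetic data (0.0 means identical)."""
    report = {}
    for col in num_cols:
        # Absolute difference between the column means.
        real_mean = sum(float(v) for v in real[col]) / len(real[col])
        synth_mean = sum(float(v) for v in synth[col]) / len(synth[col])
        report[col] = abs(real_mean - synth_mean)
    for col in cat_cols:
        # Total variation distance between the two category distributions.
        real_freq, synth_freq = Counter(real[col]), Counter(synth[col])
        n_real, n_synth = len(real[col]), len(synth[col])
        categories = set(real_freq) | set(synth_freq)
        report[col] = 0.5 * sum(
            abs(real_freq[c] / n_real - synth_freq[c] / n_synth) for c in categories
        )
    return report
```

Values close to zero suggest the marginal distributions are similar. This check looks at one column at a time, so it says nothing about whether relationships between columns are preserved.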
Synthetic data generation is a powerful tool for overcoming issues related to data availability and privacy. The ydata-synthetic library makes this process easier by providing pre-built GAN models and preprocessing utilities. This allows us to generate synthetic data that closely matches our original data’s statistical properties, providing a valuable resource for machine learning model development.
While this tutorial focused on tabular data, the ydata-synthetic library also supports other data types like time-series. If you’d like to learn more about ydata-synthetic, you can refer to these Frequently Asked Questions!
We encourage you to explore these options and leverage the power of synthetic data in your projects. You can also join us at the Data-Centric AI Community to connect with other data enthusiasts.
This article was originally published in the YData Medium publication.