
Generative AI for Tabular Data


Data is the foundation of modern machine learning models. However, data privacy issues, high costs, and the difficulty in obtaining large datasets make it challenging to develop robust and efficient models. This is where synthetic data generation steps in as a game-changer.

Synthetic data is information that is artificially generated rather than collected from real-world events. When generated well, it preserves the statistical properties of the original data, so models trained on it can be expected to behave much as they would when trained on the real data.

Today, we’re exploring how to generate synthetic tabular data using the ydata-synthetic library, a powerful Python package developed by YData. This open-source library allows us to generate high-quality synthetic data, preserving the intrinsic properties of the original dataset.

Getting Started

To begin, we need to install the ydata-synthetic package. This can be done with pip by running pip install ydata-synthetic in your terminal.

Understanding the ydata-synthetic package

The ydata-synthetic package offers various models for different data types, including time-series and tabular data. For this tutorial, we will focus on tabular data, using the GAN-based synthetic data generators provided by the library.

The main steps in the process are as follows:

  • Preprocessing: This is where we scale our data to ensure it fits the model’s expectations. The package includes utilities for this process.
  • Training: We use the data to train a generative adversarial network (GAN). The library has several pre-defined GAN architectures for synthetic data generation.
  • Generation: Once the model is trained, we can generate synthetic data.
  • Reverse Preprocessing: The final step is to convert the generated data back to its original form, undoing the scaling performed during the preprocessing stage. In most cases, this is done seamlessly by the package as well.
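To make the first and last steps concrete, here is a minimal sketch of scaling a numeric column and then undoing that scaling. It uses plain min-max scaling as an assumed stand-in; it is not the transformer ydata-synthetic applies internally, and the helper names and sample values are made up for illustration.

```python
from typing import List, Tuple


def minmax_fit(values: List[float]) -> Tuple[float, float]:
    """Record the minimum and maximum of a numeric column."""
    return min(values), max(values)


def minmax_transform(values: List[float], lo: float, hi: float) -> List[float]:
    """Scale values to [0, 1]; a constant column maps to 0.0."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def minmax_inverse(scaled: List[float], lo: float, hi: float) -> List[float]:
    """Undo the scaling, mapping [0, 1] back to the original range."""
    return [s * (hi - lo) + lo for s in scaled]


# Toy values standing in for a numeric column (illustrative only).
ages = [17.0, 25.0, 38.0, 52.0, 90.0]
lo, hi = minmax_fit(ages)
scaled = minmax_transform(ages, lo, hi)     # what a model would train on
restored = minmax_inverse(scaled, lo, hi)   # back in the original units
```

The same fitted minimum and maximum are reused for the inverse step, which is why generated values can be mapped back to the original range.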

Walkthrough

For this tutorial, we will use the Adult Census Income dataset and a popular model for tabular data synthesis: the Conditional Tabular Generative Adversarial Network (CTGAN).

Preprocessing and Model Training

ydata-synthetic provides preprocessing tools that transform our data into a form suitable for training. After making the necessary imports, we load the data and define the numeric and categorical features.
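A sketch of this step is shown below. It assumes the dataset is fetched through the pmlb package (installed separately) and that the columns carry the names used in PMLB's copy of the Adult dataset; both are assumptions, so adjust them to match your own data source. The helper name load_adult is hypothetical.

```python
# Column names as assumed for the PMLB copy of the Adult dataset.
NUM_COLS = ["age", "fnlwgt", "capital-gain", "capital-loss", "hours-per-week"]
CAT_COLS = [
    "workclass", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "native-country", "target",
]


def load_adult():
    """Fetch the Adult dataset as a pandas DataFrame (needs the pmlb package)."""
    # Imported lazily so this module still loads when pmlb is not installed.
    from pmlb import fetch_data

    return fetch_data("adult")


# Typical use, once pmlb is installed:
# data = load_adult()
```

Every column should appear in exactly one of the two lists, since the synthesizer treats numeric and categorical features differently.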

Training the Model

ydata-synthetic provides multiple pre-built GAN models, including the VanillaGAN, CGAN, WGAN-GP, and CTGAN. You can check more examples in the documentation, and learn more about GANs in this blogpost.

To leverage CTGAN, we need to specify the model and training parameters. Then, we train it on our data. The preprocessing is done internally!
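The sketch below shows one way this could look, based on the library's documented quickstart at the time of writing; class and argument names may differ between versions, so check the documentation for the release you have installed. The hyperparameter values are illustrative assumptions rather than tuned settings, and train_ctgan is a hypothetical helper.

```python
# Illustrative hyperparameters (assumptions, not tuned values).
CTGAN_HPARAMS = {"batch_size": 500, "lr": 2e-4, "betas": (0.5, 0.9)}
EPOCHS = 501


def train_ctgan(data, num_cols, cat_cols, hparams=None, epochs=EPOCHS):
    """Fit a CTGAN synthesizer on a DataFrame and return the fitted object."""
    # Imported lazily: these pull in ydata-synthetic and its dependencies.
    from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
    from ydata_synthetic.synthesizers.regular import RegularSynthesizer

    model_args = ModelParameters(**(hparams or CTGAN_HPARAMS))
    train_args = TrainParameters(epochs=epochs)

    synth = RegularSynthesizer(modelname="ctgan", model_parameters=model_args)
    # Scaling and encoding of the listed columns happen inside fit().
    synth.fit(
        data=data,
        train_arguments=train_args,
        num_cols=num_cols,
        cat_cols=cat_cols,
    )
    return synth


# Typical use, given a DataFrame and its column lists:
# synth = train_ctgan(data, num_cols, cat_cols)
```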

Synthetic Data Generation

Once the model is trained, generating synthetic data is straightforward:
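As a sketch, sampling can be wrapped in a small helper. It assumes the fitted synthesizer exposes a sample method that takes the number of rows, as in the library's documented quickstart; generate_rows itself is a hypothetical name.

```python
def generate_rows(synth, n_rows=1000):
    """Draw n_rows synthetic records from a fitted synthesizer."""
    if n_rows <= 0:
        raise ValueError("n_rows must be a positive integer")
    # sample() is assumed to return the records in the original column format.
    return synth.sample(n_rows)


# Typical use, given a fitted synthesizer:
# synth_data = generate_rows(synth, 1000)
```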

This creates a new DataFrame with 1000 synthetic records. The post-processing is also done internally, so the data is automatically returned in the original format and range, and it should maintain the statistical properties of our original data.
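A quick way to check that claim is to compare simple summary statistics of a real column against its synthetic counterpart. The sketch below uses only the standard library and made-up toy values; the function name is hypothetical, and a real evaluation would also look at distributions and correlations between columns.

```python
from statistics import mean, stdev


def compare_column(real, synthetic):
    """Return mean and standard deviation for a real and a synthetic column."""
    return {
        "real_mean": mean(real),
        "synth_mean": mean(synthetic),
        "real_std": stdev(real),
        "synth_std": stdev(synthetic),
    }


# Toy values standing in for a numeric column such as "age" (not real data).
real_age = [23, 35, 41, 52, 29, 60]
synth_age = [25, 33, 44, 50, 31, 57]
report = compare_column(real_age, synth_age)
```

With DataFrames, the columns can be passed as lists, for example data["age"].tolist() and synth_data["age"].tolist(). Large gaps between the real and synthetic values suggest the model needs more training or different parameters.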

Conclusion

Synthetic data generation is a powerful tool for overcoming issues related to data availability and privacy. The ydata-synthetic library makes this process easier by providing pre-built GAN models and preprocessing utilities. This allows us to generate synthetic data that closely matches our original data’s statistical properties, providing a valuable resource for machine learning model development.

While this tutorial focused on tabular data, the ydata-synthetic library also supports other data types like time-series. If you’d like to learn more about ydata-synthetic, you can refer to these Frequently Asked Questions!

We encourage you to explore these options, leverage the power of synthetic data in your projects, and join us at the Data-Centric AI Community to connect with other data enthusiasts.

 

Cover Photo by eMotion Tech on Unsplash
This article was originally published in the YData Medium publication.

 
