1. What is ydata-synthetic and what does it do?
ydata-synthetic is an open-source Python package developed by YData’s team that allows users to experiment with several generative models for synthetic data generation.
The main goal of the package is to serve as a way for data scientists to get familiar with synthetic data and its applications in real-world domains, as well as the potential of Generative AI. To get started using the package, continue to “How can I start using it?”. You’ll see how simple it is to explore it, even if you’re not that into “code”.
2. How can I start using it?
If you’re a rockstar developer, then there’s no secret sauce here: all you need to do is install the package through PyPI and get started! Check “How do I generate tabular synthetic data using the ydata-synthetic package?” to see a minimal working example for tabular data! For more information on time-series data, go to “Does ydata-synthetic handle time-series data?”.
If you’re more into the results and visualization than fighting against Python’s quirks, then here’s some great news! Our latest release comes fully served with a user-friendly UI! Check “How to run the Streamlit app?” to see how you can create synthetic data with an effortless and efficient no-code flow.
3. How do I generate tabular synthetic data using the ydata-synthetic package?
Let’s talk code. First, you need to install the latest version through PyPI by running pip install ydata-synthetic in your terminal.
Now, let’s go through a minimal working example. We’ll start by importing the pmlb library (a wrapper for the Penn Machine Learning Benchmark data repository), from which we will later fetch the adult census income dataset. We also need to import our synthesizers, of course!
Next, we load the data.
Then we define the model parameters (batch_size, learning_rate, beta_1 and beta_2) and the training parameters (epochs).
To learn how you can tweak these parameters like a pro, check “How can I tune a Synthesizer?”
With the parameters specified, we can create the objects corresponding to the chosen architecture (check “Which machine learning algorithms does ydata-synthetic use for generating synthetic data?” for more information) and the training details. As in scikit-learn, you just need to call .fit to train the synthesizer and .sample to generate synthetic data.
Easy, breezy, beautiful. But for some, it is not as beautiful as the brand-new UI. Check “How to run the Streamlit app?” before you go!
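The steps above can be sketched as follows. This is a minimal sketch that assumes the ydata-synthetic 1.x API as shown in the project README (RegularSynthesizer, ModelParameters, TrainParameters) and the pmlb column names for the adult dataset; names and arguments may differ in other releases, so check them against your installed version. The helper name train_ctgan_on_adult and the hyperparameter values are illustrative only. The third-party imports live inside the function so the snippet can be loaded before the packages are installed.

```python
# Column roles for the pmlb "adult" dataset (assumed names; verify with data.columns).
NUM_COLS = ["age", "fnlwgt", "capital-gain", "capital-loss", "hours-per-week"]
CAT_COLS = ["workclass", "education", "education-num", "marital-status",
            "occupation", "relationship", "race", "sex",
            "native-country", "target"]


def train_ctgan_on_adult(epochs=501, n_samples=1000):
    """Hypothetical helper: fit CTGAN on the adult dataset, return synthetic rows."""
    # Third-party imports; assumes `pip install ydata-synthetic pmlb` was run.
    from pmlb import fetch_data
    from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
    from ydata_synthetic.synthesizers.regular import RegularSynthesizer

    # Load the adult census income dataset as a pandas DataFrame.
    data = fetch_data("adult")

    # Model parameters (batch size, learning rate, Adam betas) and training parameters.
    model_args = ModelParameters(batch_size=500, lr=2e-4, betas=(0.5, 0.9))
    train_args = TrainParameters(epochs=epochs)

    # Create the synthesizer for the chosen architecture, train it, then sample.
    synth = RegularSynthesizer(modelname="ctgan", model_parameters=model_args)
    synth.fit(data=data, train_arguments=train_args,
              num_cols=NUM_COLS, cat_cols=CAT_COLS)
    return synth.sample(n_samples)
```

Calling train_ctgan_on_adult() downloads the dataset and trains the model, which can take a while, so you may want to start with a small epochs value.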
4. How can I tune a Synthesizer?
Tuning a synthesizer refers to the selection or optimization of model and training parameters. Essentially, it means that you’re able to specify the parameters of the available models to better fit the characteristics of your data.
This is typically done through the ModelParameters class. Generically, all GAN models implemented in ydata-synthetic work with batch_size, learning_rate, beta_1, beta_2, noise_dim, data_dim, and layers_dim, among others. Here’s a quick explanation of these parameters:
- batch_size refers to how many records you’d like to use to adjust the model’s training at each step.
- epochs is the number of full passes over the training data.
- learning_rate determines how much the weights of the model are updated in response to the estimated error during training.
- beta_1 and beta_2 are the exponential decay rates that the Adam optimizer uses for its running averages of the gradient and of the squared gradient.
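To see what learning_rate, beta_1 and beta_2 actually control, here is a minimal sketch of a single Adam update on one scalar weight, written in plain Python. It assumes the synthesizers are trained with the Adam optimizer and that one weight update happens per batch. The helpers adam_step and steps_per_epoch are illustrative names, not part of ydata-synthetic.

```python
import math


def adam_step(weight, grad, m, v, step,
              learning_rate=2e-4, beta_1=0.5, beta_2=0.9, eps=1e-8):
    """Apply one Adam update to a single scalar weight."""
    m = beta_1 * m + (1 - beta_1) * grad        # running average of gradients
    v = beta_2 * v + (1 - beta_2) * grad ** 2   # running average of squared gradients
    m_hat = m / (1 - beta_1 ** step)            # bias correction for early steps
    v_hat = v / (1 - beta_2 ** step)
    weight = weight - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return weight, m, v


def steps_per_epoch(n_records, batch_size):
    """Number of batches in one pass over the data (last batch may be smaller)."""
    return math.ceil(n_records / batch_size)


# One update on f(w) = w**2 starting from w = 1.0, where the gradient is 2 * w.
w, m, v = adam_step(weight=1.0, grad=2.0, m=0.0, v=0.0, step=1)
```

A larger learning_rate moves the weight further on each step, while beta_1 and beta_2 set how quickly older gradients are forgotten. A larger batch_size means fewer, less noisy updates per epoch.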
Additional parameters can be set for our CTGAN model, such as l2_scale. If you’re a curious fella, you can always check the complete CTGAN implementation to learn a little more about additional parameters of the model. You can also check this article to understand a bit more about the internal workings of GAN architectures.
5. What are the supported types of data for synthetic data generation in ydata-synthetic?
The package is equipped to handle both tabular (comprising numeric and categorical features) and time-series data.
Interesting datasets for you to experiment with are the Adult Census, Credit Card Fraud, or Cardiovascular Disease (tabular data), and the Stock Market dataset (time series).
This folder contains several examples of our currently supported GAN architectures such as CGAN, WGAN, WGANGP, DRAGAN, CRAMERGAN, CWGANGP, CTGAN, and TimeGAN.
If you’re looking for a fast and furious way of getting your hands dirty, you can use the Google Colab examples provided directly in the repository’s README.
For tabular data, you may refer to the Tabular synthetic data generation with CTGAN on Adult Census Income dataset. For time-series data, take a look at the TimeGAN synthesization on the Stock Market dataset and its companion blog post.
6. Which machine learning algorithms does ydata-synthetic use for generating synthetic data?
ydata-synthetic specializes in generative model architectures, which include Generative Adversarial Networks (GANs) but are not limited to them.
Generative models, in essence, aim to learn the underlying distribution of the input data and create new examples that are likely to have been drawn from that distribution.
Take for instance ChatGPT: it learns from tons and tons of text data and then generates new text by predicting the most likely word to come next, given the context of the input it receives (the query you make). Seems very “hollywood-ish AI”, but it is in fact a generative model (oops, did I just burst your bubble?).
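To make the idea of learning a distribution and then sampling from it concrete, here is a deliberately tiny sketch in plain Python. It is not how GANs work internally: it simply fits a normal distribution to a handful of made-up ages and draws new values from it. The values in real_ages and the helper names are invented for illustration.

```python
import random
import statistics


def fit_gaussian(values):
    """'Train' the toy model: estimate the mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def sample_gaussian(mu, sigma, n, seed=0):
    """'Generate': draw n new values from the fitted distribution."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Made-up "real" data, used only to illustrate the fit/sample pattern.
real_ages = [23, 35, 41, 29, 52, 38, 47, 31, 44, 36]

mu, sigma = fit_gaussian(real_ages)
synthetic_ages = sample_gaussian(mu, sigma, n=1000)
```

None of the synthetic values needs to appear in the original data, yet together they follow the same overall shape. GANs apply the same principle to much richer, multi-column distributions.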
Currently, ydata-synthetic supports the following generative architectures:
- GAN
- CGAN (Conditional GAN)
- WGAN (Wasserstein GAN)
- WGAN-GP (Wasserstein GAN with Gradient Penalty)
- DRAGAN (On Convergence and Stability of GANs)
- Cramer GAN (The Cramer Distance as a Solution to Biased Wasserstein Gradients)
- CWGAN-GP (Conditional Wasserstein GAN with Gradient Penalty)
- CTGAN (Conditional Tabular GAN)
- TimeGAN (specifically for time-series data)
The latest architecture for tabular data is CTGAN, which has proven to be a model with quite good generalization capabilities. See “How can I start using it?” to learn how to try it (or any of the available models!). If you’re curious about these generative models, particularly GANs, you may want to give this blog post a read: it’s a two-part series that goes over different GAN architectures!
7. How to run the Streamlit app?
To try ydata-synthetic using the Streamlit app, you need to install it using the [] notation that encodes the extras that the package incorporates. In this case, you can simply create your virtual environment and install ydata-synthetic with pip install "ydata-synthetic[streamlit]".
Note that Jupyter or Colab Notebooks are not yet supported, so you need to run the app from a regular Python environment. Once the package is installed, you can start the app from Python by importing streamlit_app from ydata_synthetic and calling streamlit_app.run().
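As a minimal sketch, the launch snippet can be wrapped in a small function. This assumes the streamlit_app module and its run() function as shown in the project README, so check them against your installed version. The wrapper name launch_app is illustrative, and the import is kept inside it so the file loads even when the extra is not installed.

```python
def launch_app():
    """Start the ydata-synthetic Streamlit UI."""
    # Imported lazily; assumes `pip install "ydata-synthetic[streamlit]"` was run.
    from ydata_synthetic import streamlit_app

    streamlit_app.run()
```

Save this in a script, add a call to launch_app() at the end, and run the script with your Python interpreter.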
And that’s it! After running this command, the console will output the URL from which you can access the app!
8. How to generate synthetic data in Google Colab and Python Environments?
Most installation issues are associated with unsupported Python versions or misalignment between Python environments and package requirements. Let’s see how you can get both right:
Python Versions:
Note that ydata-synthetic currently requires Python >=3.9, <3.11, so if you’re trying to run our code in Google Colab, you need to update your Google Colab’s Python version accordingly. The same goes for your development environment.
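A quick way to confirm that your interpreter falls inside that range is a small check like the one below. It uses only the standard library; the helper name is_supported is illustrative and the bounds simply mirror the requirement stated above.

```python
import sys

MIN_VERSION = (3, 9)   # inclusive lower bound
MAX_VERSION = (3, 11)  # exclusive upper bound


def is_supported(version=None):
    """Return True if (major, minor) satisfies >=3.9, <3.11."""
    if version is None:
        version = sys.version_info  # default to the running interpreter
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION


if not is_supported():
    print("Unsupported Python version:", sys.version.split()[0])
```

Running this in a Colab cell or in your local environment tells you right away whether you need to switch Python versions before installing.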
Virtual Environments:
A lot of troubleshooting stems from misalignments between environments and package requirements. If you’re new to data science development, maybe you just install packages into your global Python environment. This can cause a lot of headaches when project requirements conflict.
Virtual environments are ideal to overcome this issue: they isolate your installations from the “global” environment so that you don’t have to worry about conflicts. In short: if you’re in data science, virtual environment tools like pyenv or conda should become your best friends.
Using conda, creating a new environment is as easy as running conda create -n synth-env python=3.10 followed by conda activate synth-env on your shell.
Now you can open up your Python editor or Jupyter Lab and use the synth-env as your development environment, without having to worry about conflicting versions or packages between projects!
9. Does ydata-synthetic handle time-series data?
Yes, ydata-synthetic uses the TimeGAN architecture to generate synthetic time-series data.
Generating time-series data is simple, although some tweaks are required before passing your input data into the network. Here’s an example using the Yahoo Stock Price dataset!
First, we install the package and import the necessary modules.
In this case, we’re importing not only ModelParameters and TimeGAN but also real_data_loading, since we will need to apply some preprocessing to the data, as explained below. Let’s first load the data. We already have it in our repository, so the call is simple.
Now we have the data_df, but in order to feed it to TimeGAN, we need to preprocess it so that all features are scaled to the [0,1] interval. We can use sklearn’s MinMaxScaler for that. Additionally, it helps to split the data into windows of a fixed sequence length, so that the model knows which time windows you want to consider. That is what real_data_loading implements.
Note that windowing is an important part of the TimeGAN architecture, as the model is not designed to generate full sequences of time-series events. Check “Does TimeGAN replicate my full sequence of data?” to fully understand this behavior.
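The two preprocessing ideas described above, scaling to [0,1] and cutting the series into windows, can be sketched in plain Python. This is an illustration of the concept under simplifying assumptions, not the actual real_data_loading implementation, which may differ in details such as ordering and shuffling. The helper names and toy_series values are made up.

```python
def min_max_scale(rows):
    """Scale every column of a list of rows to the [0, 1] interval."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    # A constant column has zero range; use 1.0 to avoid dividing by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(x - lo) / span for x, lo, span in zip(row, lows, spans)]
            for row in rows]


def sliding_windows(rows, seq_len):
    """Cut a series into overlapping windows of seq_len consecutive rows."""
    return [rows[i:i + seq_len] for i in range(len(rows) - seq_len + 1)]


# Made-up series: 30 time steps with 6 features each.
toy_series = [[float(t + f) for f in range(6)] for t in range(30)]

scaled = min_max_scale(toy_series)
windows = sliding_windows(scaled, seq_len=24)  # each window is 24 x 6
```

Each element of windows is one training example for the model: a short frame of 24 consecutive time steps across all 6 features.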
After the data is processed, we can feed it to TimeGAN and train the synthesizer.
To generate new synthetic stock data, it is as simple as calling the sample method.
Note that the new synthetic data will consist of several windows, each a plausible sequence of 24 time steps by 6 features according to your input data. To learn more about this, see “Does TimeGAN replicate my full sequence of data?”
10. Does TimeGAN replicate my full sequence of data?
Most people experimenting with TimeGAN expect their output delivered as one continuous series that mirrors the full length of their input data.
However, this is an unrealistic expectation simply because the TimeGAN architecture is not meant to replicate the long-term behavior of your data.
TimeGAN works with the concept of “windows”. Essentially, it learns to map the data distribution of short-term frames of time, within the time windows you provide.
It also considers those windows to be independent of each other, so it cannot return the temporal pattern most people expect. That’s not supported by this architecture itself, but there are others that allow for both short-term and long-term synthesization, such as those available in our Fabric.
In sum, TimeGAN does not allow you to replicate your existing time series in full, but it does allow you to generate “likely to occur” time windows, i.e., time windows whose values are plausible, as they were generated according to the characteristics and distribution of the input data.
However, there is no temporal relationship between the returned windows, nor does there need to be. To leverage TimeGAN, you need to have your end goal in mind: what do you need synthetic data for? If the objective is to perform data augmentation to feed an LSTM model, for instance, then TimeGAN is a suitable strategy: it helps you increase the representation of existing time windows. If the objective is to recreate a temporal trend or sequence, then you need to go for different architectures.
11. How can I get support if I run into any trouble?
Throughout this article, we have covered a broad spectrum of topics regarding synthetic data and how you can explore ydata-synthetic during your learning journey. Yet, one question is missing: “What if I have additional questions?”.
The best way to get fast and personalized support is to join our Discord Server. We have dedicated spaces for data discussions and open-source troubleshooting. You can go straight to the “🔐 YData Synthetic” category and post your question or feature request in “❓questions” or “🎤feature-requests”, respectively. We’re very dedicated to supporting and learning from our community, so a moderator will be right on top of your issue!