A tutorial on how you can combine ydata-synthetic with Great Expectations
With the rapid evolution of machine learning algorithms and coding frameworks, the lack of high-quality data is the real bottleneck in the AI industry.
VentureBeat's Transform 2019 predicted that 87% of AI projects would never make it into production. According to a 2017 Harvard Business Review study, only 3% of companies’ data meets basic quality standards.
This won’t come as a surprise if you work in the AI industry; we have all experienced it first hand. Thankfully, we have seen a paradigm shift in the recent past, pioneered by Andrew Ng, moving from a model-centric approach to a data-centric approach.
When used right, synthetic data is one such element that can aid in the data-centric approach. Synthetic data is artificially generated data that is not collected from real-world events. It replicates the statistical components of actual data without containing any identifiable information, ensuring individuals’ privacy. Our open-source library for synthetic data, ydata-synthetic, focuses solely on this. One of the common concerns we heard from our users was: how do we ensure that the synthetic data follows all the quality standards of the original data? Ensuring that it follows a rigid data quality standard is as essential as preserving the statistical properties of the original data.
Luckily, our friends at Great Expectations have built and open-sourced a framework they call a shared, open standard for data quality. When your data team adopts the framework, you know what to expect from your new data.
In this article, we take you through an end-to-end use case where you use the original data to train a model that synthesizes data, and then validate the synthetic data's quality standards against the original data using the Great Expectations library.
Great Expectations is a tool for validating, documenting, and profiling your data to maintain quality and improve communication between teams. It allows the user to create ‘expectations’ based on a good sample of data and use these expectations to validate if the new data meets the data quality standards.
Of course, we kept the introduction simple for the scope of this article, and there’s more to these libraries, which are evolving as we speak. The idea is that both libraries are working towards the same goal, high-quality data at scale, on different fronts.
A 10-Step Guide: How to Use Both Tools Hand-in-Hand for Your Next Project
The Integrated Project Flow Diagram. Great Expectations in green and YData Synthetic in red.
Imbalanced class data is a common problem when building models, and generating synthetic data for the minority class can help develop more accurate models. However, we need to ensure that the newly generated data points maintain the same quality standards as the original minority class data points.
We will solve this problem by generating synthetic data using YData Synthetic and validating it through Great Expectations.
Before we get started, to follow along with the tutorial, we recommend you create a virtual environment and pip install both ydata-synthetic and great_expectations by running the following on your terminal.
pip install ydata-synthetic great-expectations
Now that you have the environment ready, let’s get started. If you want to follow along step by step, you can download the materials from the ydata-synthetic GitHub.
Step 1: Set up the project structure through a Data Context
In Great Expectations, your Data Context manages the project configuration. There are multiple ways to create the Data Context; however, the simplest one is by using the CLI that comes along when you install the great_expectations package.
Open your terminal and navigate to the project directory and type in the following:
great_expectations init
Press enter to complete the creation of the Data Context, and that’s about it.
Setting up a Data Context.
If you’re curious about the modified project structure, here’s an excerpt from the GE documentation:
great_expectations.yml contains the main configuration of your deployment.
The expectations/ directory stores all your Expectations as JSON files. If you want to store them somewhere else, you can change that later.
The plugins/ directory holds code for any custom plugins you develop as part of your deployment.
The uncommitted/ directory contains files that shouldn’t live in version control. It has a .gitignore configured to exclude all its contents from version control.
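As a quick sanity check after initialization, the layout described above can be verified with a few lines of Python. This is only a sketch: the entry names come from the excerpt, while the root folder name great_expectations and the helper missing_entries are assumptions made for illustration, so adjust them to whatever your CLI version created.

```python
from pathlib import Path

# Entries named in the excerpt above.
EXPECTED_ENTRIES = ["great_expectations.yml", "expectations", "plugins", "uncommitted"]


def missing_entries(root):
    """Return the expected entries that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    # Assumption: the CLI created a folder called "great_expectations"
    # in the current working directory.
    print(missing_entries("great_expectations"))
```

An empty list means every entry from the excerpt is in place.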
Step 2: Download/Extract the actual data set we use to create synthetic data
We can download the data we use for this example from Kaggle. If you inspect the classes, you’ll notice that the “fraud” class is much smaller than the “not fraud” class, which is the case in real life.
Hence we can extract the fraud class data points only and consider it the actual dataset. We aim to create synthetic data points similar to fraud occurrences.
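The extraction can be sketched with pandas as follows. The label column name Class, the value 1 for fraud, and the file path in the comment are assumptions about the Kaggle file rather than something stated here; a tiny stand-in frame keeps the example self-contained.

```python
import pandas as pd


def extract_minority_class(df, label_col="Class", minority_value=1):
    """Keep only the minority-class rows and drop the label column."""
    minority = df[df[label_col] == minority_value]
    return minority.drop(columns=[label_col]).reset_index(drop=True)


# Stand-in for the real file. With the downloaded data you would load it
# with something like pd.read_csv("data/creditcard.csv") (path is an assumption).
toy = pd.DataFrame(
    {
        "V1": [0.1, -1.2, 0.3, 2.4, -0.5],
        "Amount": [10.0, 250.0, 3.5, 99.0, 12.0],
        "Class": [0, 1, 0, 0, 1],  # assumed encoding: 1 = fraud, 0 = not fraud
    }
)

print(toy["Class"].value_counts())  # makes the class imbalance visible
fraud = extract_minority_class(toy)
print(fraud)
```

Saving the resulting frame into the data directory would make it available to the Datasource configured in the next step.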
Step 3: Configure a Data Source to connect our data
In Great Expectations, Datasources simplify connections by managing configuration and providing a consistent, cross-platform API for referencing data.
Let’s configure our first Datasource: a connection to the data directory we’ve provided in the repo. This could just as well be a database connection or another kind of source.
great_expectations datasource new
Creating a new DataSource.
As shown in the image above, you would be presented with different options. Select Files on a filesystem (for processing with Pandas or Spark) and Pandas. Finally, enter the directory as data (where we have our actual data).
Once you’ve entered the details, a Jupyter notebook will open. This is how Great Expectations provides templated code, which lets us get started with only a few code changes.
Let’s change the Datasource name to something more specific.
Edit the second code cell as follows: datasource_name = "data__dir"
Then execute all cells in the notebook to save the new Datasource. If successful, the last cell will print a list of all Datasources, including the one you just created.
Step 4: Create an Expectation Suite using the built-in Great Expectations profiler
An expectation is nothing but a falsifiable, verifiable statement about data. Expectations provide a language to talk about data characteristics and data quality — humans to humans, humans to machines, and machines to machines.
The idea here is that we assume that the actual data has the ideal quality of the data we want to be synthesized, so we use the actual data to create a set of expectations which we can later use to evaluate our synthetic data.
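To make the idea concrete, here is a minimal plain-pandas sketch of the same principle: derive simple range expectations from the actual data, then check the synthetic data against them. It deliberately does not use the Great Expectations API; the helpers build_range_expectations and validate and the toy values are hypothetical, and the profiler used below generates a much richer set of expectations.

```python
import pandas as pd


def build_range_expectations(real_df):
    """Derive one 'values lie between min and max' expectation per column."""
    return {
        col: {
            "min_value": float(real_df[col].min()),
            "max_value": float(real_df[col].max()),
        }
        for col in real_df.columns
    }


def validate(df, expectations):
    """Return, per column, whether every value satisfies its expectation."""
    results = {}
    for col, bounds in expectations.items():
        if col not in df.columns:
            results[col] = False  # a missing column fails its expectation
            continue
        in_range = df[col].between(bounds["min_value"], bounds["max_value"])
        results[col] = bool(in_range.all())
    return results


# Toy data: 'real' plays the role of the actual data, 'synthetic' the generated data.
real = pd.DataFrame({"Amount": [1.0, 50.0, 120.0], "V1": [-2.0, 0.0, 1.5]})
synthetic = pd.DataFrame({"Amount": [10.0, 300.0], "V1": [-1.0, 1.0]})

expectations = build_range_expectations(real)
print(validate(synthetic, expectations))
```

Here the Amount column fails because one synthetic value exceeds the largest value seen in the actual data, which is exactly the kind of falsifiable statement an expectation encodes.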
The CLI will help create our first Expectation Suite. Suites are simply collections of Expectations. We can use the built-in profiler to automatically generate an Expectation Suite called creditcard.quality.
Type the following into your terminal:
great_expectations suite new
Again select the options as shown in the image above. We create expectations using the automatic profiler and point it to use the actual dataset.
Once again, a Jupyter notebook opens with boilerplate code for creating a new expectation suite. The code is pretty standard; however, please note that all columns are added to the list of ignored columns in the second cell. We want to validate every column in our example; hence we should remove these columns from the ignored_columns list.
Executing the notebook will create an expectation suite against the actual credit card fraud dataset.
Step 5: Transform the real data for modelling
Now that we have created the expectation suite, we shift our focus back to creating the synthetic data.
We follow the standard process of transforming the data before training the GAN. We apply PowerTransformation, which makes the data distribution more Gaussian-like.
Feel free to experiment with more pre-processing steps, as they may yield better results.
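A minimal sketch of this transformation, assuming that scikit-learn's PowerTransformer is what is meant by PowerTransformation here. The skewed toy column and the skewness helper are illustrative stand-ins, not part of the tutorial materials.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer


def skewness(values):
    """Third standardized moment; close to 0 for symmetric data."""
    centered = values - values.mean()
    return float((centered ** 3).mean() / centered.std() ** 3)


rng = np.random.default_rng(0)
# Right-skewed stand-in for one numeric column of the real data.
skewed = rng.exponential(scale=2.0, size=(1000, 1))

# Yeo-Johnson handles zero and negative values; standardize=True also
# rescales the output to zero mean and unit variance.
transformer = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = transformer.fit_transform(skewed)

print("skewness before:", skewness(skewed))
print("skewness after: ", skewness(transformed))

# Samples generated in the transformed space can be mapped back with the
# same fitted transformer.
restored = transformer.inverse_transform(transformed)
```

Keeping the fitted transformer around matters: a synthesizer trained on the transformed data produces samples in the transformed space, so they need the inverse transformation before being compared with the original data.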
Step 6: Train the synthesizers and create the model
Since we have pre-processed our data, it’s time to put our advanced ydata-synthetic GAN models to work.
For this example, we train a kind of GAN called WGAN-GP, which provides much-needed training stability.