Synthetic data is a cornerstone of Data-Centric AI, an approach that focuses primarily on data quality rather than on models. Over the past few years, synthetic data has gained attention thanks to a wide range of applications such as data augmentation, rebalancing, bias and fairness adjustment, or privacy preservation, to name a few. However, most of the literature focuses on either images or speech, leaving a tremendous number of datasets and application domains aside.
In this paper, we present a highly configurable benchmark suite for comparing different data synthesizers according to several metrics and across various tabular datasets. The purpose of such a suite is to enable a fair and systematic comparison between synthesizers on various datasets. We do not attempt to introduce yet another set of metrics; instead, we let the user select which metrics to use. In particular, as a first experiment, we ran the suite to compare the Fabric synthesizer with the different synthesizers provided by the Synthetic Data Vault (SDV), using SDV's evaluation metrics.
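At its core, a configurable suite like this iterates over every (synthesizer, dataset, metric) combination and collects the scores. The sketch below illustrates that loop structure only; the `NaiveSampler` baseline and `mean_gap` metric are hypothetical stand-ins, not the actual Fabric or SDV APIs, which would be plugged in behind the same fit/sample interface.

```python
import random
from statistics import mean

class NaiveSampler:
    """Toy baseline synthesizer: resamples each column independently
    from the real data (real suites would plug in SDV or Fabric here)."""
    def fit(self, rows):
        # Store a column-wise view of the training rows.
        self.columns = list(zip(*rows))
    def sample(self, n):
        return [tuple(random.choice(col) for col in self.columns)
                for _ in range(n)]

def mean_gap(real, synthetic):
    """Toy metric: average absolute difference of per-column means."""
    real_cols, synth_cols = zip(*real), zip(*synthetic)
    return mean(abs(mean(r) - mean(s)) for r, s in zip(real_cols, synth_cols))

def run_benchmark(synthesizers, datasets, metrics):
    """Score every (synthesizer, dataset, metric) combination."""
    results = {}
    for s_name, synth in synthesizers.items():
        for d_name, rows in datasets.items():
            synth.fit(rows)
            fake = synth.sample(len(rows))
            for m_name, metric in metrics.items():
                results[(s_name, d_name, m_name)] = metric(rows, fake)
    return results

random.seed(0)
data = [(float(i), float(i % 3)) for i in range(100)]
scores = run_benchmark({"naive": NaiveSampler()},
                       {"demo": data},
                       {"mean_gap": mean_gap})
print(scores)
```

Because synthesizers, datasets, and metrics are passed in as plain dictionaries, the user fully controls which metrics are used, which is the configurability the suite aims for.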
Download this case study to learn more about:
- The synthetic data generation capabilities of open-source SDV vs Fabric
- The ecosystem needed to run synthetic data successfully for different use cases
- Buy vs build