What you’ve always wanted to know about this exciting AI trend
Synthetic data has been a buzzword on the tip of everybody’s tongue for the last few months. While its benefits for organizations seem tremendous, there are still controversies and concerns around responsibility, compliance, privacy, and even the overall concept and usability of synthetic data.
If you’re new to the field, then look no further: you’ve come to the right place! In this article, we will highlight the top 10 most frequently asked questions on synthetic data that are driving a heated discussion in the data science community and lay out the answers from experts in the field.
You’ll see how, although it seems a little like “black magic”, synthetic data has no secrets. Except, of course, when we’re talking about the privacy of real-world data.
1. What is Synthetic Data?
Rather than being collected from real sources (i.e., “real data”), synthetic data is artificially generated by a computer algorithm. Synthetic data is not “real” in the sense that it does not correspond to actual events, but if it is generated in a data-driven way (i.e., attending to the properties of the original data), it holds real data value, mimicking real behavior and providing comparable insights.
2. How is Synthetic Data generated?
Synthetic data can be generated through a variety of techniques. Among the most common nowadays are deep generative models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). However, there are simpler methods to generate synthetic data, such as the well-known Synthetic Minority Oversampling Technique (SMOTE) or random sampling. Depending on the purpose for which the synthetic data is required, different models can be considered. Please refer to “What is the best method to generate Synthetic Data?” to read more about choosing among them.
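To make the simpler end of that spectrum concrete, here is a minimal sketch of SMOTE-style oversampling: each new point is placed on the line segment between a minority-class sample and one of its nearest minority neighbours. The function name `smote_like_oversample`, its parameters, and the toy data are our own assumptions for illustration. This is not the reference SMOTE implementation, which adds further details.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a sampled
    minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class samples with two numeric features.
minority_class = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like_oversample(minority_class, n_new=6)
```

Because every new point is a convex combination of two existing samples, the synthetic records stay inside the region already covered by the minority class, which is also why this approach cannot invent genuinely new patterns.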
3. What are the benefits of using Synthetic Data?
Synthetic data can be used for several purposes. It can augment or replace real-world data where data access or the collection of additional data is impossible, impractical, expensive, or ethically problematic, thus saving time and money for organizations. It also enables an additional layer of privacy and security, easing the burden of regulatory compliance and breaking data silos. Finally, it boosts AI development, as teams are able to generate large amounts of data with specific characteristics, giving them access to the data they need to train and test their machine learning models.
4. What are common use cases for Synthetic Data?
Synthetic data has a plethora of use cases. In general machine learning tasks, synthetic data can be used for data augmentation (increasing the amount of data without collecting additional real data), model training and testing (improving the accuracy of models and addressing issues such as overfitting), and bias mitigation (increasing the representation of minority classes or subgroups). Within or between organizations, synthetic data enables data sharing and privacy protection: it can be used to substitute real data in situations where real data cannot be shared (e.g., in healthcare, financial services, and social sciences) or to test and improve other privacy-enhancing techniques without risking data leakage.
5. How can Synthetic Data be used if it is not “Real Data”?
Although we associate it with “fake” data, synthetic data can be generated in a way that resembles realistic data, i.e., preserving the characteristics of real data: its structure, statistical properties, dependencies, and correlations. Synthetic data is very similar to the original data in terms of granularity as well: synthetic data records are not real, but they carry the same kind of information. It is also worth noting that, although the concept of “synthetic data” might be new to many people in the data science community, the idea of creating new, artificial data for some purpose is not. In fact, strategies for creating (i.e., “synthesizing”) new values based on some observed information about the data have been used extensively in the past for AI applications: sampling from a statistical distribution fitted to the data, performing data imputation to fill in missing values, or using data oversampling techniques. Finally, recall that synthetic data generation can be performed using several techniques: choosing the most appropriate one depends on the objective for which synthetic data is needed. Please refer to “What is the best method to generate Synthetic Data?”.
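As an illustration of the oldest of these strategies, sampling from a statistical distribution fitted to the data, the sketch below fits a two-column Gaussian to a toy “real” dataset and draws new rows that keep the fitted means, spreads, and correlation. The function names, the age and income columns, and the Gaussian assumption itself are ours and purely illustrative. Real data is rarely this well behaved.

```python
import math
import random
import statistics


def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations and Pearson correlation of two columns."""
    xs, ys = zip(*rows)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_bivariate_gaussian(params, n, seed=0):
    """Draw n synthetic rows from the fitted Gaussian."""
    mx, my, sx, sy, rho = params
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mixing z1 into y reproduces the fitted correlation rho.
        y = my + sy * (rho * z1 + math.sqrt(max(0.0, 1 - rho ** 2)) * z2)
        rows.append((x, y))
    return rows


# Toy stand-in for real data: income loosely depends on age (made-up numbers).
toy_rng = random.Random(42)
real_rows = []
for _ in range(500):
    age = toy_rng.gauss(40, 10)
    real_rows.append((age, 1000 + 50 * age + toy_rng.gauss(0, 300)))

params = fit_bivariate_gaussian(real_rows)
synthetic_rows = sample_bivariate_gaussian(params, n=2000)
refit = fit_bivariate_gaussian(synthetic_rows)
```

No synthetic row corresponds to a real one, yet refitting the synthetic sample gives statistics close to those of the original, which is the sense in which the new data holds the same information.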
6. What are the limitations of using Synthetic Data?
Since synthetic data is generated based on real data, the same concerns that are posed by real data apply. In other words, if the quality of real data is not audited and validated, synthetic data may replicate undesired properties of the original data, such as biases, class imbalance, and missing data, among other inconsistencies. If the goal is to mimic the characteristics and behavior of the original data, these issues may not need to be solved before the synthetization process. However, if the goal is to improve the performance of machine learning models, it is essential to profile your data and address these quality issues before moving on to synthetic data generation.
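A very small profiling pass might look like the sketch below, which reports the share of missing values per column and the class balance of a label column. The `profile` helper, the column names, and the records are hypothetical. A dedicated profiling tool would cover far more checks.

```python
from collections import Counter


def profile(records, label):
    """Return the share of missing values per column and the class balance of `label`."""
    if not records:
        raise ValueError("cannot profile an empty dataset")
    columns = sorted({key for row in records for key in row})
    missing = {
        col: sum(row.get(col) is None for row in records) / len(records)
        for col in columns
    }
    # Class balance is computed over rows where the label is present.
    counts = Counter(row[label] for row in records if row.get(label) is not None)
    total = sum(counts.values())
    balance = {cls: n / total for cls, n in counts.items()}
    return missing, balance


# Made-up records: None marks a missing value.
records = [
    {"age": 34, "income": 52000, "churned": "no"},
    {"age": None, "income": 48000, "churned": "no"},
    {"age": 51, "income": None, "churned": "no"},
    {"age": 29, "income": 39000, "churned": "yes"},
]
missing, balance = profile(records, label="churned")
```

Running the same check on the synthetic output afterwards is a cheap way to see whether imbalance or missingness was carried over from the original.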
Another common concern with synthetic data is ensuring that it accurately reflects the characteristics of the real data it is intended to augment or replace. To address this concern, it is important to define the purpose for which the synthetic data will be used. Depending on the objective, different models can be more appropriate to perform the data generation, and different metrics can be analyzed and optimized. Please refer to “How to evaluate Synthetic Data?” to learn more about this topic.
7. What is the best method to generate Synthetic Data?
Since there could be a multitude of use cases for synthetic data, this is not a question that can be answered directly. Choosing the best model for synthesization depends on several factors:
the intended use case (data-sharing, privacy preservation, or machine learning development), the characteristics of the data (the amount of data available to train a synthesizer, the type of data at hand such as structured, unstructured, tabular, time-series), and the downstream application (e.g., supervised or unsupervised learning) or even the constraints associated with the domain where the synthetic data is to be used (e.g., telecommunications, healthcare applications, financial services, retail).
Depending on the final purpose of the synthetic data, there is a trade-off to weigh between how closely the data must match the original and how much effort and computation the generation requires. As an example, if synthetic data is needed to test a privacy-enhancing technology, then it might not be necessary to ensure that all the data distributions and relations are exactly mimicked. In this case, a simpler and computationally cheaper model might do the trick. Conversely, if the synthetic data is designed for machine learning development, then the choice of synthesizer must be more careful, possibly going for state-of-the-art generative models that are able to capture the behavior of the original data more fully.
8. How can biases be addressed with Synthetic Data?
If the original data is properly profiled and validated prior to synthetic data generation, the synthetization process can be tailored to mitigate fairness issues by focusing on underrepresented concepts in the data (e.g., gender or race categories). It is also important to profile the generated data to ensure that it still reflects the properties of the original data and that it does not inadvertently introduce new biases. A well-fitted synthesizer should not introduce them, but beyond data profiling, evaluating and validating synthetic data against real-world data remains a fundamental step for a successful synthesization process. See “How to evaluate Synthetic Data?” and read more about how synthetic data can help overcome data bias.
9. How to evaluate Synthetic Data?
As with choosing the most appropriate model for synthesization, choosing the most appropriate metrics also depends on the goal for which synthetic data will be used. Depending on the use case, we may prioritize one metric while relaxing another (e.g., favoring privacy over utility). Nevertheless, we may define three essential pillars for synthetic data quality: privacy, fidelity, and utility. Privacy refers to the ability of synthetic data to withhold any personal, private, or sensitive information, avoiding connections being drawn to the original data and preventing data leakage. Fidelity concerns the ability of the new data to preserve the properties of the original data (in other words, how faithful and precise the synthetic data is in comparison to real data). Finally, utility relates to the downstream application where the synthetic data will be used: if the synthetization process is successful, the same insights should be derived from the new data as from the original data. For each of these components, several specific statistical measures can be evaluated. Learn more about Synthetic Data Quality metrics here.
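To give a feel for what one measure per pillar could look like, the sketch below computes a two-sample Kolmogorov-Smirnov statistic for fidelity, the distance from each synthetic record to its closest real record for privacy, and a train-on-synthetic, test-on-real accuracy for utility. These three measures, the nearest-centroid classifier, and the toy data are our own choices for illustration, one possible option per pillar among many.

```python
import bisect
import math


def ks_statistic(real_col, synth_col):
    """Fidelity: largest gap between the two empirical CDFs (0 = identical)."""
    real_sorted, synth_sorted = sorted(real_col), sorted(synth_col)
    gap = 0.0
    for v in real_sorted + synth_sorted:
        cdf_real = bisect.bisect_right(real_sorted, v) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, v) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def min_distance_to_real(real_rows, synth_rows):
    """Privacy: distance from each synthetic row to its closest real row.
    A distance of 0 means a real record was copied verbatim."""
    return [min(math.dist(s, r) for r in real_rows) for s in synth_rows]


def nearest_centroid_accuracy(train_rows, train_labels, test_rows, test_labels):
    """Utility: fit class centroids on one dataset and score them on another."""
    sums, counts = {}, {}
    for row, label in zip(train_rows, train_labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, v in enumerate(row):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {lbl: [v / counts[lbl] for v in acc] for lbl, acc in sums.items()}
    hits = sum(
        min(centroids, key=lambda lbl: math.dist(row, centroids[lbl])) == label
        for row, label in zip(test_rows, test_labels)
    )
    return hits / len(test_rows)


# Made-up real and synthetic datasets with two features and two classes.
real_rows = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
real_labels = ["a", "a", "a", "b", "b", "b"]
synth_rows = [(1.1, 1.0), (0.9, 1.2), (1.0, 0.8), (3.2, 3.0), (2.8, 3.1), (3.0, 2.8)]
synth_labels = ["a", "a", "a", "b", "b", "b"]

fidelity_gap = ks_statistic([r[0] for r in real_rows], [s[0] for s in synth_rows])
closest = min_distance_to_real(real_rows, synth_rows)
utility = nearest_centroid_accuracy(synth_rows, synth_labels, real_rows, real_labels)
```

A small fidelity gap, closest-record distances that stay above zero, and a utility score close to what a model trained on real data achieves would together suggest a reasonable balance between the three pillars.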
10. Is synthetic data legal and ethical to use?
Since synthetic data records are not “real” data (they do not encode actual real-life events, but rather mimic them), they generally do not qualify as personal data. For that reason, regulations and laws designed to protect personal data, such as GDPR or CCPA, generally do not apply. Additionally, as synthetic data records do not have a one-to-one match with original records, identity disclosure is considered much harder, or impossible in some cases. For more information, please refer to this comparison between anonymization and synthetic data.
And as a bonus question …
How to get started with Synthetic Data?
If you’re up for getting your hands dirty but you’re still unsure about how to approach this new concept, start by testing our no-code synthetic data generation experience with Fabric! If you are more into code, we have recently launched ydata-sdk: you can quickstart synthetic data generation in Google Colab or Jupyter Notebook.
The Data-Centric AI Community is also a great place to get more familiar with the benefits of synthetic data. You can get help with troubleshooting or with any additional questions that you may have. That’s definitely a plus when learning a new topic! Please consider joining us and sharing your learnings in this exciting new field!