Frequently Asked Questions

We've put together some commonly asked questions to give you more information about the YData solution.
If you have a question that you can't find the answer to, please contact us at hello@ydata.ai.
How does YData ensure quality in the generated synthetic data?
YData runs an automated quality and privacy control process on every generated dataset, with the goal of controlling the quality, utility, and privacy of the newly generated data.

For quality, we use divergence metrics, correlation measures, and non-parametric tests; for utility, we apply the TSTR (Train Synthetic, Test Real) methodology. As for measuring privacy leakage, we perform various tests, such as inference attacks.
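As an illustration of what a divergence metric means in this context, here is a minimal sketch (not YData's actual implementation) that compares the empirical distributions of one categorical column in the real and synthetic datasets using the Jensen-Shannon divergence:

```python
import math
from collections import Counter

def js_divergence(real, synth):
    """Jensen-Shannon divergence between the empirical distributions of a
    categorical column in the real vs. synthetic data.
    0.0 means identical distributions; the maximum, log(2), means disjoint."""
    categories = set(real) | set(synth)
    p_counts, q_counts = Counter(real), Counter(synth)
    # Empirical probability of each category in each dataset
    P = {c: p_counts.get(c, 0) / len(real) for c in categories}
    Q = {c: q_counts.get(c, 0) / len(synth) for c in categories}
    M = {c: 0.5 * (P[c] + Q[c]) for c in categories}  # mixture distribution

    def kl(a, b):  # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(a[c] * math.log(a[c] / b[c]) for c in categories if a[c] > 0)

    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)
```

A score near 0 for every column is one piece of evidence that the synthetic data reproduces the real marginal distributions.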
How does YData ensure synthetic data is compliant with privacy regulations?

YData's approach to privacy safety has a strong foundation in the literature. To date, there is no certification nor known process that certifies a solution as compliant with privacy regulation, and traditional anonymization tools do not have any form of certification either. Synthetic data copes with the GDPR and other privacy regulations in the sense that it is generated from random noise as input, making it impossible to trace a synthetic record back to a record in the original data, which is exactly what the GDPR defines as privacy by design.

How does YData ensure the data never leaves your infrastructure?

YData's platform is deployed on your infrastructure (either cloud or on-premises), ensuring that there's never a data transfer between your company and YData.

Is synthetic data safe to share or sell?

Yes. It is not proprietary and does not contain PII (Personally Identifiable Information). In addition, YData generates an automated, detailed report on both the quality and the privacy level of the newly generated dataset.

What if I want to know which record a synthetic one refers to? Can I trace it back to the original one?

You can't. If you want to perform operational activities, business as usual, or single-record operations, you'll need real data. In general, being able to trace a synthetic record back to an original one goes against the concept of privacy by design, which is foundational to synthetic data.

How different is synthetic data compared with other PETs (Privacy-Enhancing Technologies)?

There are several trending PETs besides synthetic data, such as Differential Privacy, Federated Learning, and Homomorphic Encryption, and each was created for a different purpose. Differential privacy is the best option for private analytics over a big dataset. Why? It works really well for big data and has low computational cost.

Synthetic data, on the other hand, although it requires significant GPU computational power, provides data scientists with data in the same granular format and solves problems such as data augmentation and balancing, which are common in data science projects. Moreover, synthetic data can be combined with differential privacy.
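To make the contrast concrete, here is a minimal sketch of the Laplace mechanism, the standard building block of differentially private analytics. The function name and interface are hypothetical illustrations, not part of YData's product:

```python
import math
import random

def laplace_count(true_count, epsilon):
    """Release a count with epsilon-differential privacy by adding Laplace
    noise of scale 1/epsilon (the sensitivity of a counting query is 1).
    Hypothetical helper for illustration only."""
    u = random.random() - 0.5                  # uniform on [-0.5, 0.5)
    scale = 1.0 / epsilon
    # Inverse-CDF sampling of Laplace(0, scale)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise
```

Note how cheap this is: one noise draw per query, regardless of dataset size, which is why differential privacy scales so well for aggregate analytics, while it returns noisy aggregates rather than the record-level data that synthetic data provides.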

Do you transform the data? Preprocessing, cleaning, or other?

Data synthesization is the process of generating new data, not transforming the existing data. Before YData generates new data, the original data does go through a preprocessing step, so that the newly created data has higher quality. The generated data, however, is not a transformation of the real dataset, nor are its records traceable.

How does YData deal with bias in the data?

This is a very controversial question, because bias can happen at several different levels: during data collection, during data processing, or while building a model. In the first case, the data itself is collected in a biased manner; for example, a class of certain types of events is deliberately not collected. That problem is very hard to solve unless the collection process is changed.

The other two levels are the ones that can be fixed or that influence the analysis. If the original data already contains any type of bias, the synthetic data will contain that same bias, but will not create it or make it worse. Nevertheless, when domain knowledge is available, it is possible to fix bias through the process of synthesizing data.

How does your synthetic data differ from data generated using SMOTE or ADASYN?

SMOTE and ADASYN are popular algorithms that data scientists use to deal with highly imbalanced datasets. While they have proven highly beneficial for low-dimensional data, the same does not apply to high-dimensional data. In fact, both methods increase the bias towards the majority classes.
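For reference, SMOTE's core idea is to interpolate between a minority instance and one of its nearest minority neighbours. The simplified sketch below illustrates that idea only; it is not the reference implementation:

```python
import random

def smote_samples(minority, k, n_new, seed=0):
    """Simplified SMOTE sketch: each synthetic sample lies on the segment
    between a minority instance and one of its k nearest minority neighbours.
    Illustration only; the real algorithm has more machinery."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    new = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        new.append([x + lam * (y - x) for x, y in zip(base, neighbour)])
    return new
```

Every synthetic point is a convex combination of two existing minority points, which works reasonably in low dimensions but degrades as dimensionality grows.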

Our solution has been shown to improve classifier results on highly imbalanced datasets, and it handles the high-dimensional datasets that are far more common in real life than simpler ones.

What kind of automated data preprocessing do you perform on the data?

We take care of all the data preprocessing that does not depend on business insights. Our preprocessing covers the following steps: encoding, inconsistency detection, missing-value imputation, and normalization, all while ensuring scalability. For time-series datasets, the missing-value imputation and normalization processes vary slightly.
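For tabular data, the imputation and normalization steps can be pictured with a minimal sketch. This is an illustration only; YData's actual pipeline also handles encoding and inconsistency detection, which are omitted here:

```python
def preprocess_column(values):
    """Sketch of two of the steps above for a numeric tabular column:
    mean imputation of missing values, then min-max normalization to [0, 1].
    Illustration only; the actual pipeline is more elaborate."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    # Impute each missing value with the column mean
    imputed = [v if v is not None else mean for v in values]
    # Min-max normalize the column to the [0, 1] range
    lo, hi = min(imputed), max(imputed)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [(v - lo) / span for v in imputed]
```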

Time-series
Missing-value imputation for time-series: imputation also has a time dependency; unlike with tabular data, we have to take particular care with the time lags in the dataset.

Normalization process for time-series: in this case, there's a need to invest in strategies to encode and compress the time-series. This makes the process not only computationally less expensive but also improves overall understanding of the input time-series.
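The time dependency of imputation can be illustrated with a small sketch: instead of filling a gap with the column mean, a time-aware approach interpolates between the nearest observed neighbours. This is an illustrative helper, not YData's implementation:

```python
def impute_series(series):
    """Time-aware imputation sketch: fill each gap by linear interpolation
    between the nearest observed values before and after it, falling back
    to the closest observed value at the edges. Illustration only."""
    n = len(series)
    result = list(series)
    for i in range(n):
        if series[i] is None:
            # Nearest observed indices before and after the gap
            left = next((j for j in range(i - 1, -1, -1)
                         if series[j] is not None), None)
            right = next((j for j in range(i + 1, n)
                          if series[j] is not None), None)
            if left is not None and right is not None:
                frac = (i - left) / (right - left)
                result[i] = series[left] + frac * (series[right] - series[left])
            else:
                result[i] = series[left if left is not None else right]
    return result
```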

What metrics do you use to evaluate data quality of the generated data?

The metrics vary depending on the type of data we are working with. As we support both tabular data with no time dependence and time-series data, we've created a distinct validation flow for each type of data.

In general, we apply several methods that can be classified into structural similarity measures, dissimilarity measures, and Train Synthetic, Test Real (TSTR). The last one confirms the assumptions about the validity and quality of the data for model development: it consists of training a set of classifiers or regressors on synthetic data and afterwards benchmarking the results on unseen real data.
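The TSTR flow can be sketched in a few lines. The example below uses a deliberately simple nearest-centroid classifier so it stays self-contained; in practice, any classifier or regressor would be trained on the synthetic data and scored on real data:

```python
def tstr_accuracy(synth_X, synth_y, real_X, real_y):
    """Train Synthetic, Test Real with a toy nearest-centroid classifier:
    fit on synthetic data only, then score on unseen real data."""
    # "Training": one centroid per class, computed from synthetic data only
    by_class = {}
    for x, y in zip(synth_X, synth_y):
        by_class.setdefault(y, []).append(x)
    centroids = {y: [sum(col) / len(pts) for col in zip(*pts)]
                 for y, pts in by_class.items()}

    def predict(x):
        return min(centroids, key=lambda y: sum(
            (a - b) ** 2 for a, b in zip(x, centroids[y])))

    # "Testing": accuracy on real data the model never saw
    hits = sum(predict(x) == y for x, y in zip(real_X, real_y))
    return hits / len(real_y)
```

A TSTR score close to the train-real/test-real baseline suggests the synthetic data preserved the signal needed for modelling.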

What metrics do you use to evaluate the data privacy of the generated data?

We also use different methods to measure the level of privacy ensured by the newly generated data, ranging from simple methods such as Neighbouring Observation Removal to Data Source Prediction and membership inference attacks.
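One simple building block behind such tests is the distance to the closest record (DCR): if synthetic rows sit systematically closer to the training data than to a holdout set, the synthesizer may be memorizing records. A minimal sketch, for illustration only:

```python
def dcr(synthetic, reference):
    """Distance to the closest record: for each synthetic row, the Euclidean
    distance to its nearest row in a reference set. Comparing DCRs against
    the training data vs. a holdout set flags potential memorization."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return [min(dist(s, r) for r in reference) for s in synthetic]
```

A DCR of zero for many synthetic rows would mean exact copies of reference records, the opposite of privacy by design.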

How does YData tackle Bias and Fairness in data?

There are many forms of bias and fairness issues across the data science lifecycle. YData helps organizations by giving data science teams an easy way to understand the data they're working with, through automated exploratory data analysis, and by letting them leverage the data synthesization tool to balance those datasets. We've published an article on how to fix race bias in a census dataset by generating synthetic records for the African-American population.


Would you like to see our solution in action?