Skip to content
Back

How to evaluate the re-identification risk in Synthetic Data?

overall-fabric-privacy-score

While allowing for meaningful data behavior, it is crucial that synthetic data safeguards individual privacy. Therefore, ensuring the efficacy of synthetic data applications also requires a strong assessment of re-identification risks.

Consider for instance an organization that wants to use synthetic data to train machine learning models using sensitive data. The challenge lies in achieving optimal results while also maintaining people’s records private and safe.

In a previous article, we covered how YData Fabric evaluates the privacy of synthetic data by determining the closeness between data points in synthetic and real data. In this article, we’ll focus more deeply on how to assess the re-identification risks associated with synthetic data.

What is the re-identification risk of my synthetic data?

To ensure that the synthetic data aligns with the privacy requirements of your specific use cases, Fabric’s synthetic data quality PDF report outputs the Identifiability Score, which indicates the likelihood of malicious actors using the information in the synthetic data to re-identify individuals in real data.

The Identifiability Score ranges from [0, 1], where a score of 0 indicates minimal risk, and scores closer to 1 imply a higher likelihood of re-identification. For instance, in the example below, the synthetic data returns an identifiability score of 0.27, indicating a low risk of re-identification.

sdq_identifiability_score

How robust is my synthetic data to privacy attacks?

Beyond the risk of re-identification, there is also the possibility that a malicious attack can determine whether a particular record of the real data was used to train the synthesizer. Our synthetic data quality report measures such risk using the Membership Inference Score, also bounded within the [0,1] interval.

A lower score would suggest that the synthetic data is subjected to a potential private vulnerability and that the synthesization process may need to be adjusted to mitigate this risk. In turn, high scores indicate a robust level of protection against this type of attack, as indicated in the example below where the membership inference score is close to 1 (0.98).

sdq_membership_inference_score

At this point, you might be wondering “Which privacy metric should I take into account?” when evaluating the privacy of your synthetic dataset. In reality, these scores measure different privacy properties, and factors such as the chosen synthesis method and the characteristics of your real data contribute to each score differently. 

To guide the user through this evaluation, Fabric provides a global view of the privacy of the synthetic data by combining these privacy metrics into a single weighted score, which helps determine the suitability of your artificial data for privacy use cases. The example below shows this computation, where the synthetic data returns an overall privacy score of 98%, with a meter plot indicating a very good synthesization result:sdq_overall_privacy_score

Conclusion

As organizations increasingly embrace synthetic data for various applications, the need to comprehensively evaluate the risk of re-identification becomes indispensable.

If you’re looking to leverage synthetic data for your privacy use cases, try Fabric Community to explore the available synthetic generation methods, or contact our team to understand how Fabric can help in your specific needs and use cases.

Back
Artificial Intelligence on Financial Inclusion

The impact of Artificial Intelligence on Financial Inclusion

How AI assists banks and financial institutions in enhancing financial inclusivity The advancement of technology has paved the way for banks and financial institutions to reach out to all corners of the Earth. However, according to the...

Read More
Synthetic data in Retail; Data profiling in Retail; Machine Learning in Retail

How to successfully adopt AI in Retail

The Power of Data Quality, Orchestration, Profiling, and Synthetic Data Retail is not only a fast-paced but also a highly competitive landscape, demanding from the players to be always ahead of the competition. The adoption of AI and...

Read More
LLMs Impact Data Science Projects

How Large Language Models Impact Data Science Projects

The advent of Large Language Models (LLMs) is undeniably leaving its mark across several fields and industries. In the realm of Data Science, they can also prove extremely transformative in the way technical teams manage and analyze their...

Read More