A set of techniques to ensure privacy while exploring data
In recent years, we have witnessed numerous breakthroughs in machine learning across a wide variety of domains, such as computer vision, natural language processing, reinforcement learning, and many more.
Many of these applications involve training models on sensitive private data: disease diagnosis from medical records, location-based user activity prediction, e-commerce behavior prediction, and so on all require accessing and storing large amounts of personal data.
When it comes to protecting privacy, it is not only the raw data that needs to be secured. Many studies have shown that sensitive information can be recovered from published machine learning (ML) models.
Therefore, privacy-preserving techniques are crucial for deploying ML applications.
Broadly speaking, there are two main ways an attacker can try to recover private information:
1. White-box attack: the attacker has full access to the model's weights and parameters. They can even tamper with data during execution, producing wrong predictions that can cause serious harm in real-life applications.
2. Black-box attack: the attacker has no access to the weights, but issues repeated, strategically crafted queries to learn the model's behavior. That behavior can later be exploited to extract sensitive information; a minimal sketch of one such attack follows this list.
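To make the black-box setting concrete, here is a minimal sketch of a confidence-threshold membership inference attack. The attacker only calls the model's prediction API and guesses that inputs predicted with very high confidence were part of the training set. The model, dataset, and threshold below are illustrative assumptions, not a reference attack implementation.

```python
# Black-box membership inference sketch: the attacker never sees the weights,
# only prediction probabilities returned by repeated queries.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "victim" model the attacker can only query as a black box.
victim = RandomForestClassifier(random_state=0).fit(X_train, y_train)

def membership_guess(model, samples, threshold=0.95):
    """Guess membership in the training set from prediction confidence alone."""
    confidence = model.predict_proba(samples).max(axis=1)
    return confidence >= threshold

# Training members tend to be predicted with higher confidence than
# unseen points, which is exactly the signal this attack exploits.
print("flagged as members (train):", membership_guess(victim, X_train).mean())
print("flagged as members (test): ", membership_guess(victim, X_test).mean())
```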
To protect ML models against such attacks and ensure that data remains private, many defense techniques have been proposed. Below, we discuss some of them.
Private Aggregation of Teacher Ensembles
Private Aggregation of Teacher Ensembles (PATE) builds on the idea that if multiple models trained on disjoint portions of the data agree on an input, then no individual training example leaks into the prediction, because every model reached the same conclusion independently. This provides an intuitive form of privacy. Moreover, the PATE framework adds a further step to ensure that no attack, white box or black box, can be mounted against the teachers' private data: a "student" model is trained on public data that has been labeled by the teachers. This removes the need for the teachers in subsequent queries and ensures the student only learns the generalization the teachers provide.
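The following is a minimal sketch of the PATE aggregation step under assumed toy data: teachers are trained on disjoint slices of the private data, their votes on each public input are perturbed with Laplace noise, and the student learns only from these noisy labels. The dataset, classifier choice, and epsilon value are assumptions; a real PATE deployment also tracks the privacy budget spent across queries.

```python
# PATE sketch: disjoint teachers, noisy vote aggregation, student on public data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=5500, n_features=20, random_state=0)
X_private, y_private = X[:5000], y[:5000]   # sensitive data, split among teachers
X_public = X[5000:]                          # unlabeled public data for the student

n_teachers, n_classes, epsilon = 10, 2, 0.5

# Train each teacher on its own disjoint slice of the private data.
teachers = [
    LogisticRegression(max_iter=1000).fit(X_part, y_part)
    for X_part, y_part in zip(np.array_split(X_private, n_teachers),
                              np.array_split(y_private, n_teachers))
]

def noisy_vote(x):
    """Aggregate teacher votes with Laplace noise and return the winning label."""
    votes = np.bincount([t.predict(x.reshape(1, -1))[0] for t in teachers],
                        minlength=n_classes)
    noisy_counts = votes + rng.laplace(scale=1.0 / epsilon, size=n_classes)
    return int(np.argmax(noisy_counts))

# The student only ever sees public inputs and the noisy aggregate labels,
# so it can be published and queried without exposing the teachers.
y_student = np.array([noisy_vote(x) for x in X_public])
student = LogisticRegression(max_iter=1000).fit(X_public, y_student)
```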
Differential Privacy
In this technique, we add noise to the original data before it is sent to the server. How the noise is distributed over the collected data is determined by a probabilistic mechanism such as the Laplace mechanism. Because of this noise, it is impossible to tell whether any individual's record is accurate, yet taken as a whole the data still lets us train a model and learn the underlying generalization.
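Here is a minimal sketch of the Laplace mechanism in a local setting: each user perturbs their own value before sharing it, and the server can still recover a useful aggregate. The toy data, the [0, 1] bound, and the epsilon value are assumptions made only for illustration; a full private training pipeline would be considerably more involved.

```python
# Local differential privacy with the Laplace mechanism: noise is added
# on the user's side, so only perturbed values ever reach the server.
import numpy as np

rng = np.random.default_rng(0)
true_values = rng.uniform(0.0, 1.0, size=10_000)   # one bounded value per user

epsilon = 1.0          # privacy budget per user (smaller = more private, noisier)
sensitivity = 1.0      # values lie in [0, 1], so one value can change by at most 1

# Each user adds Laplace noise locally before sending their value.
noisy_values = true_values + rng.laplace(scale=sensitivity / epsilon,
                                         size=true_values.shape)

# Individually the reports are unreliable, but the aggregate stays close to the truth.
print(f"true mean:  {true_values.mean():.4f}")
print(f"noisy mean: {noisy_values.mean():.4f}")
```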
Synthetic Data
According to the McGraw-Hill Dictionary of Scientific and Technical Terms, synthetic data is "any production data applicable to a given situation that is not obtained by direct measurement". Synthetic data is artificially generated to replicate the statistical properties of real-world data while containing no sensitive information. Starting from a small amount of real data, we can build algorithms that reproduce its characteristics in newly generated records, and then train the ML model on this computer-generated data. The sensitive data is thus never revealed.
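As an illustration, the sketch below fits a very simple generative model (a multivariate Gaussian) to a small real dataset and samples artificial records that mimic its statistics. Production synthetic data tools (GANs, copula models, and so on) are far more sophisticated; the dataset and the choice of model here are assumptions made to keep the example short.

```python
# Synthetic data sketch: estimate the statistics of real data, then sample
# entirely new, artificial records from the fitted distribution.
import numpy as np
from sklearn.datasets import load_iris

rng = np.random.default_rng(0)
real = load_iris().data                     # stand-in for sensitive tabular data

# Estimate mean and covariance of the real data...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...and generate synthetic records with similar statistical structure.
synthetic = rng.multivariate_normal(mean, cov, size=len(real))

print("real means:     ", np.round(mean, 2))
print("synthetic means:", np.round(synthetic.mean(axis=0), 2))
```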
Conclusion
As they say, "with great power comes great responsibility". If you are leveraging the power of ML on sensitive data, you must ensure the privacy of the individuals behind it. We hope this article has shown you a first step toward data privacy in machine learning.