
Privacy-Preserving Machine Learning


A set of techniques to ensure privacy while exploring data

In recent years, we have witnessed numerous breakthroughs in machine learning techniques in a wide variety of domains, such as computer vision, language processing, reinforcement learning, and many more.

Many of these applications involve training models on sensitive private data. For example, diagnosing diseases from medical records, predicting user activity from location data, and predicting e-commerce behavior all involve accessing and storing huge amounts of personal data.


Because of this, most of us have already heard something about data privacy. It is considered to be “the most important issue of the next decade”.

But in ensuring data privacy, we have to be aware that securing the data alone is not enough. Many studies have shown that sensitive data can be recovered from published Machine Learning (ML) models.

Therefore, privacy-preserving techniques are crucial for deploying ML applications.

Broadly speaking, there are two main types of attack used to recover private information:

  1. White Box attack: The attacker has full access to the model weights and parameters. They can even change the data during execution, resulting in wrong predictions that may cause serious trouble in real-life applications.

  2. Black Box attack: The attacker does not have access to the weights, but sends repeated, strategically crafted queries to the model to learn its behavior. This behavior can later be used to infer sensitive information.

To protect ML models against attacks and ensure that data remains private, many defense techniques have been proposed. Below, we discuss some of them.

Private Aggregation of Teacher Ensembles

Private Aggregation of Teacher Ensembles (PATE) builds on the idea that if multiple models trained on disjoint data agree on the prediction for the same input, then that prediction leaks little about any single training example, because every model reached the same conclusion independently. This gives an intuitive notion of privacy. Moreover, the PATE framework contains an additional step to guard against attacks on the teachers’ private data, whether white box or black box. For this, a “student” model is trained on public data that has been labeled by the teachers. The teachers are then no longer needed for subsequent queries, and the student model only learns the generalization provided by the teachers.
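To make the labeling step more concrete, here is a minimal sketch of how teacher votes could be aggregated into a single label for the student. It assumes the noisy-vote aggregation described in the PATE paper (Laplace noise added to each vote count), which this article does not spell out. The function names, the votes, and the `epsilon` value are illustrative assumptions, not a reference implementation.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_aggregate(teacher_votes, num_classes, epsilon, rng):
    """Return the class with the highest noisy vote count."""
    counts = Counter(teacher_votes)
    noisy_counts = [
        counts[label] + laplace_noise(1.0 / epsilon, rng)
        for label in range(num_classes)
    ]
    return max(range(num_classes), key=lambda label: noisy_counts[label])


rng = random.Random(0)
# Hypothetical votes from 10 teachers for one public input.
votes = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
student_label = noisy_aggregate(votes, num_classes=2, epsilon=1.0, rng=rng)
print("label given to the student:", student_label)
```

When the teachers agree strongly, the noise rarely changes the winning label. When they are split, the noise hides which way any single teacher (and therefore any single training partition) voted.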

Differential Privacy

This method was popularized by the 2006 paper “Calibrating Noise to Sensitivity in Private Data Analysis” (Dwork et al.). It is particularly helpful for large datasets.

In this technique, we add some noise to the original data before sending it to the server. The noise is drawn from a probability distribution, for example the Laplace distribution used by the Laplace mechanism. Because of this noise, nobody can tell whether an individual’s reported value is the true one. Taken as a whole, however, the noise largely averages out, so we can still train our model on this data and learn the generalization.
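As a rough illustration of adding noise before the data leaves the client, the sketch below applies the Laplace mechanism to each individual value and then compares the true and noisy averages. The age data, the bounds, and `epsilon` are made-up assumptions for the example. Production code should rely on a vetted library rather than this hand-rolled sampler.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(value, lower, upper, epsilon, rng):
    """Clip a value to [lower, upper] and add Laplace noise on the client."""
    clipped = min(max(value, lower), upper)
    sensitivity = upper - lower  # largest change one person's value can cause
    return clipped + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)
# Hypothetical private values held by 10,000 users.
true_ages = [rng.randint(18, 90) for _ in range(10_000)]
# Each user reports only a noisy version of their age.
reported = [privatize(age, 18, 90, epsilon=1.0, rng=rng) for age in true_ages]

true_mean = sum(true_ages) / len(true_ages)
noisy_mean = sum(reported) / len(reported)
print("true mean: ", round(true_mean, 2))
print("noisy mean:", round(noisy_mean, 2))
```

Each reported value can be far from the real one, but the aggregate stays useful because the noise has zero mean. A smaller `epsilon` means more noise and stronger privacy, at the cost of accuracy.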

If you’re interested in applying and exploring Differential Privacy, IBM has made available a library with a set of methods.

Encrypted Learning

Photo by Markus Winkler on Unsplash


In this technique, instead of training the model on raw data, we first encrypt the data, so the model is trained on encrypted data. Even though the data is encrypted, it is still possible to perform computations on it using multi-party computation (MPC). The main idea behind MPC is to split a piece of data into multiple encoded parts called secret shares. Individually, the shares do not reveal anything about the original data. But if each party performs the same operation on its own share and the results are then combined, the outcome is the same as if the operation had been performed on the original data. Thus the deep learning model can still learn from the encrypted data.
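The secret-sharing idea can be sketched with additive shares modulo a large prime. This is one common construction, chosen here for illustration. The modulus, the function names, and the salary figures are assumptions, and real MPC frameworks handle many more operations than addition.

```python
import secrets

# All arithmetic is done modulo this prime (an illustrative choice).
MODULUS = 2**61 - 1


def share(secret, num_parties):
    """Split `secret` into random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(num_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Combine all shares to recover the secret."""
    return sum(shares) % MODULUS


def add_shares(a_shares, b_shares):
    # Each party adds the two shares it holds, without seeing the secrets.
    return [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]


# Two hypothetical private values, each split among three parties.
alice_shares = share(52_000, 3)
bob_shares = share(61_000, 3)

# The parties compute on shares only; the sum is revealed at the end.
total = reconstruct(add_shares(alice_shares, bob_shares))
print("sum computed on shares:", total)
```

No single share says anything about the value it came from, yet the locally computed sums recombine into the true total.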

Federated Learning

In Federated Learning (FL), we bring the model to the data instead of bringing the data to the model. First, the server initializes the weights of the ML model. Then the server sends the model to the client devices that hold the data. Each client trains the model locally and computes updated weights, but does not update the global model itself. Instead, the server receives the updates and computes their weighted average, with each weight based on the size of the training set used by that client. Then the global model is updated with this average, using a stochastic optimization step (like SGD). Several such rounds are performed until the model achieves good accuracy. In this way, the original data never leaves the clients’ devices. Moreover, because the model is trained on disjoint data, it tends to learn the generalization rather than memorize the sensitive data.
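The rounds described above can be sketched as a small federated-averaging loop. This is a simplified, single-process simulation under stated assumptions: the model is a one-variable linear regression, the three clients and their data are synthetic, and the server simply replaces the global weights with the weighted average of the client weights.

```python
import random


def local_train(w, b, data, lr=0.05, epochs=5):
    """Run a few epochs of SGD on one client's private (x, y) pairs."""
    for _ in range(epochs):
        for x, y in data:
            error = (w * x + b) - y
            w -= lr * error * x
            b -= lr * error
    return w, b


def federated_average(updates):
    """Average client models, weighted by each client's number of examples."""
    total = sum(n for _, _, n in updates)
    w = sum(cw * n for cw, _, n in updates) / total
    b = sum(cb * n for _, cb, n in updates) / total
    return w, b


rng = random.Random(0)


def make_client(n):
    # Hypothetical private data following y = 3x + 1 plus small noise.
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, 3 * x + 1 + rng.gauss(0, 0.1)))
    return data


clients = [make_client(n) for n in (20, 50, 100)]

w, b = 0.0, 0.0  # the server initializes the global model
for _ in range(20):  # communication rounds
    updates = []
    for data in clients:
        # Only the trained weights leave the client, never `data`.
        cw, cb = local_train(w, b, data)
        updates.append((cw, cb, len(data)))
    w, b = federated_average(updates)

print("global model: w =", round(w, 3), "b =", round(b, 3))
```

The server only ever sees weights and example counts. Note that weights alone can still leak information, which is why FL is often combined with the other techniques in this article.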

An interesting example of FL in a production system comes from Google: improving Google keyboard query suggestions.

Synthetic Data

Synthetic data by YData


According to the McGraw-Hill Dictionary of Scientific and Technical Terms, synthetic data is “any production data applicable to a given situation that is not obtained by direct measurement”. Synthetic data is artificially generated to replicate the statistical properties of real-world data without containing any sensitive information. Starting from a small amount of real data, we can build algorithms that replicate the features of the real data in newly generated data. We can then train our ML model on the computer-generated data. Thus the sensitive data is never revealed.
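As a toy illustration of the idea, the sketch below fits a two-column Gaussian to some "real" records and then samples brand-new rows from the fitted distribution. This is a deliberately simple stand-in, not how any particular synthetic data product works. The column meanings and numbers are invented, and a plain statistical fit like this carries no formal privacy guarantee on its own.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate means, standard deviations and correlation of (x, y) rows."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    rho = cov / math.sqrt(var_x * var_y)
    return mean_x, mean_y, math.sqrt(var_x), math.sqrt(var_y), rho


def sample_synthetic(params, n, rng):
    """Draw n new rows that follow the fitted means, spreads and correlation."""
    mean_x, mean_y, std_x, std_y, rho = params
    residual = math.sqrt(max(0.0, 1 - rho**2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        y = mean_y + std_y * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


rng = random.Random(7)
# Hypothetical "real" records: (age, income), with income tied to age.
real = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    real.append((age, 1000 * age + rng.gauss(0, 5000)))

params = fit_gaussian(real)
synthetic = sample_synthetic(params, 5000, rng)

print("real correlation:     ", round(params[4], 3))
print("synthetic correlation:", round(fit_gaussian(synthetic)[4], 3))
```

None of the synthetic rows is copied from the real records, yet a model trained on them sees roughly the same relationship between the two columns.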


As they say, “with great power comes great responsibility”. If you are leveraging the power of ML on sensitive data you must ensure the privacy of the individuals. We hope this article showed you the first step toward data privacy in Machine Learning.

