The impact of Machine Learning in data privacy

As the world moves towards digitalization, more and more personal and private information is gathered every day. To innovate, organizations must be able to process and explore this data.

In a new information era, where data is the new oil, data privacy is emerging as one of the biggest concerns for governments and society. Against this backdrop, new regulations and privacy laws are appearing worldwide, reshaping the landscape of data projects and creating challenging new realities.

It’s time to rethink Data Privacy!

Masking … a solution?

One of the most common solutions for those who need to manipulate sensitive data, or perform research and development with it, is data masking. Data masking hides data elements that are considered sensitive and that cannot be shown to the individuals working with the data. Typically, it replaces these elements with similar-looking fake data, protecting the vital parts of personally identifiable information while preserving the data's format. Unlike encryption, data masking is designed to be irreversible, making the masked values useless to attackers who might try to undo the transformation. But is it enough to prevent the re-identification of the individuals in a database? The answer is simple: it's not.
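As a rough illustration of the idea (a minimal sketch with invented field names and masking rules, not a production technique), masking a customer record might look like this:

```python
import hashlib
import re

def mask_record(record: dict) -> dict:
    """Replace sensitive fields with similar-looking fake values.
    A one-way hash keeps the mapping irreversible, unlike encryption."""
    masked = dict(record)
    # Replace the name with an irreversible token of roughly the same shape.
    masked["name"] = "user_" + hashlib.sha256(record["name"].encode()).hexdigest()[:8]
    # Keep only the email domain; the local part is sensitive.
    masked["email"] = "masked@" + record["email"].split("@")[1]
    # Preserve the card number's format but hide every digit except the last four.
    masked["card"] = re.sub(r"\d(?=\d{4})", "*", record["card"])
    return masked

print(mask_record({"name": "Ana Silva",
                   "email": "ana.silva@example.com",
                   "card": "4111111111111111"}))
```

Notice that the masked record still looks and behaves like real data, which is exactly what makes it useful for development and testing.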

The risk of re-identification is real, regardless of whether masking or encryption has been applied. These techniques make the attacker's work much harder, but re-identification is still possible. The topic is well studied in the privacy community:

  • In 2016, Australia's Federal Department of Health published online the medical billing records of about 2.9 million Australians. These records came from the Medicare Benefits Scheme (MBS) and the Pharmaceutical Benefits Scheme (PBS) and covered around 10 percent of the population. After the release of this potentially sensitive data, researchers tested its protection against re-identification attacks. Using only publicly available information, they were able to decrypt the information within the MBS dataset (Big Data and the Risk of Re-Identification).

  • In the US, it was found that 87% of the population can be uniquely identified from the combination of 5-digit ZIP code, gender, and date of birth; 53% are likely to be uniquely identified using only place (city, town, or municipality), gender, and date of birth. Even at the county level, it is still possible to re-identify 18% of the US population (Simple Demographics Often Identify People Uniquely).

  • The Netflix Prize dataset is another example of how masking and encryption can easily be “reversed”. Using only data from 2005, researchers from MIT were able to re-identify Netflix users by combining the dataset with Amazon's open product database. Based on this matching, it is possible to uncover not only the shopping habits but also the full names and even the political beliefs of supposedly anonymous individuals (Who’s Watching?).
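The uniqueness effect behind these studies is simple to demonstrate: count how many records share each combination of quasi-identifiers, and any combination that occurs only once can be linked to an outside dataset such as a voter roll. A minimal sketch on a toy dataset (the records below are made up):

```python
from collections import Counter

# Toy records: (ZIP code, gender, birth date) are the quasi-identifiers.
records = [
    ("02139", "F", "1985-03-14"),
    ("02139", "F", "1985-03-14"),  # shares its combination with the row above
    ("02139", "M", "1990-07-01"),
    ("94103", "F", "1978-11-30"),
    ("94103", "M", "1978-11-30"),
]

# Count how often each quasi-identifier combination occurs.
counts = Counter(records)
unique = sum(1 for r in records if counts[r] == 1)
print(f"{unique}/{len(records)} records are unique on (ZIP, gender, birth date)")
```

Here three of the five records are unique on those three fields alone, with no name or ID involved; at real population scale, this is exactly the 87% effect described above.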



Photo by freestocks on Unsplash


Does Machine Learning amplify data privacy issues?

As explained in our previous post Deep Learning and its applications, Machine Learning (ML) is a subset of AI that requires large datasets to “learn” patterns with high levels of accuracy. But how does it affect data privacy?

The same data privacy problems that were pointed out with the rise of Big Data are also relevant for ML:

  • The possibility of re-identifying personal information from large datasets;

  • The availability of high-dimensional data due to falling storage costs;

  • The mining of unstructured data using Deep Learning techniques, and the possibility of incorporating high-dimensional data into a single model.

This leads to a whole new level of available data and new possibilities for re-identifying private information, even when only minimal personal characteristics are released. Here is a simple example:

Suppose that a company is performing market analysis based on customer feedback. Due to privacy concerns, all personal information such as name, age, and gender was removed from the dataset before analysis. It seems legitimate to think that it is now impossible, for example, to know the age or gender of the customer behind a given piece of feedback, correct? Wrong! Gender can be re-identified entirely from subtle differences in word choice (Gender classification on Twitter).
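To make the point concrete, here is a toy sketch of the underlying idea: a trivial word-frequency classifier that guesses a hidden attribute from text alone. The training snippets, labels, and vocabulary below are invented and far too small to be meaningful; real studies like the Twitter one use large corpora and proper models.

```python
from collections import Counter

# Invented toy corpus: feedback snippets labeled with a hidden attribute.
training = [
    ("absolutely love the cute packaging", "A"),
    ("so adorable my favourite purchase", "A"),
    ("solid build quality decent specs", "B"),
    ("good performance for the price", "B"),
]

# Count how often each word appears under each label.
word_counts = {"A": Counter(), "B": Counter()}
for text, label in training:
    word_counts[label].update(text.split())

def guess(text: str) -> str:
    """Score each label by how often its training words appear in the text."""
    scores = {label: sum(counts[w] for w in text.split())
              for label, counts in word_counts.items()}
    return max(scores, key=scores.get)

# The feedback contains no explicit attribute, yet word choice leaks it.
print(guess("such a cute and adorable gift"))   # leans towards label "A"
print(guess("great specs and performance"))     # leans towards label "B"
```

The dataset was "anonymized", yet a statistical model recovers a hidden attribute from word choice alone, which is precisely why removing explicit fields is not enough.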


In summary, data that has undergone pseudonymisation, meaning the masking or removal of personally identifiable data, is no longer sufficient to comply with the new legal frameworks for data privacy.


Due to the increased risk of re-identification (higher volumes of personal data being shared, increasing computational power, a growing amount of available data, etc.), new data privacy regulations have been published: in 2016, Australia's Federal Attorney-General introduced the Privacy Amendment (Re-identification Offence) Bill; the European General Data Protection Regulation (GDPR) was adopted in 2016; and the California Consumer Privacy Act was signed in 2018. These are some of the measures taken to guarantee data privacy.

With these new regulations, a new definition of truly private data has been established. As stated in Recital 26 of the GDPR:

The principles of data protection should apply to any information concerning an identified or identifiable natural person.

Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person.

The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.



Overall, the concepts of privacy and security have changed since the arrival of Big Data and Machine Learning, and organizations need to adapt to ensure the best protection of their customers' data.

New privacy regulations are striving for new levels of data privacy through regulation of how data can be used, ensuring that collected data is handled in a more transparent, fair, and secure way.

Organizations need to review their data policies for both internal and external processes, including their use of anonymization and other privacy methods, to stay innovative and leverage the latest advances in technology.
