The impact of Machine Learning in data privacy
As the world moves toward digitalization, more and more personal and private information is gathered every day. Organizations must process and explore this data in order to innovate.
In a new information era, where data is the new oil, data privacy is emerging as one of the biggest concerns for governments and society. In this landscape, new regulations and privacy laws are appearing worldwide, reshaping data projects and creating challenging new realities.
It’s time to rethink Data Privacy!
Masking … a solution?
Does Machine Learning amplify data privacy issues?
As explained in our previous post Deep Learning and its applications, Machine Learning (ML) is a subset of AI that requires large datasets in order to “learn” patterns with high levels of accuracy. But how does it affect data privacy?
The same problems regarding data privacy that were pointed out with the rise of Big Data are also relevant for ML:
Possibility to re-identify personal information from large datasets;
Availability of high dimensional data due to the reduction of storage costs;
Mining of unstructured data using Deep Learning techniques and the possibility to incorporate high dimensional data in one single model.
This leads to a whole new level of available data and of possibilities to re-identify private information, even when only minimal personal characteristics are exposed. Here is a simple example:
Suppose a company is performing market analysis based on customers' feedback. Due to privacy concerns, all personal information such as name, age, and gender was removed from the dataset to be analyzed. It seems reasonable to assume that it is now impossible to know, for example, the age or even the gender of the customer behind a certain piece of feedback, correct? Wrong! There are ways to re-identify gender based purely on subtle differences in word choice (see Gender classification on Twitter).
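To make the risk concrete, here is a deliberately toy sketch of the idea behind such attacks: a naive Bayes classifier over word counts, trained on a handful of invented, labeled feedback snippets. The snippets, labels, and signal words are all made up for illustration; the cited Twitter study uses far richer features and real data.

```python
from collections import Counter
import math

# Hypothetical, invented training data: feedback snippets labeled with
# the (supposedly removed) gender of their author.
train = [
    ("love the soft fabric so cute", "F"),
    ("absolutely adorable packaging thanks", "F"),
    ("solid build decent specs good value", "M"),
    ("fast shipping works as advertised", "M"),
]

def fit(docs):
    # Count words per class and documents per class.
    word_counts = {"F": Counter(), "M": Counter()}
    class_counts = Counter()
    for text, label in docs:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def predict(text, word_counts, class_counts):
    vocab = set().union(*word_counts.values())
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, float("-inf")
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        logp = math.log(class_counts[label] / total_docs)
        for word in text.split():
            # Laplace smoothing so unseen words do not zero out the score.
            logp += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

word_counts, class_counts = fit(train)
guess = predict("so cute love it", word_counts, class_counts)
```

Even this crude model recovers the “removed” attribute from word choice alone, which is exactly the re-identification problem regulators are reacting to.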
In summary, data that has undergone pseudonymisation (the masking or removal of personally identifiable fields) is no longer sufficient to comply with the new legal frameworks for data privacy.
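As a minimal sketch of why such masking falls short, consider pseudonymising identifiers with a hash, a common tactic (the email addresses here are invented). Because the transformation is deterministic, an attacker who can enumerate plausible identifiers simply hashes each guess and matches:

```python
import hashlib

def pseudonymise(value: str) -> str:
    # Replace the raw identifier with its SHA-256 digest.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# The "pseudonymised" dataset that gets shared.
released = [pseudonymise(e) for e in ["alice@example.com", "bob@example.com"]]

# An attacker with a list of candidate identifiers (a leak, a public
# directory, ...) hashes each candidate and looks for matches.
candidates = ["alice@example.com", "bob@example.com", "carol@example.com"]
lookup = {pseudonymise(c): c for c in candidates}
reidentified = [lookup.get(digest) for digest in released]
```

Since the input space is guessable, the digests act as stable pseudonyms rather than anonymous values, which is precisely why the GDPR still treats pseudonymised data as personal data.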
Due to the increased risk of re-identification (higher volumes of personal data being shared, increasing computational power, the growing amount of available data, etc.), new regulations on data privacy have been published: in 2016, Australia's Federal Attorney-General introduced the Privacy Amendment (Re-identification Offence) Bill; the European General Data Protection Regulation (GDPR), adopted in 2016, and the California Consumer Privacy Act, signed in 2018, are some of the measures taken to guarantee data privacy.
With these new regulations, a new definition of truly private data has been established. As stated in Recital 26 of the GDPR:
The principles of data protection should apply to any information concerning an identified or identifiable natural person.
Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person.
The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.
Overall, the concepts of privacy and security have changed since the arrival of Big Data and Machine Learning, and organizations need to adapt to ensure the best protection of their customers' data.
New privacy regulations are striving for new levels of data privacy through regulation of how data can be used, ensuring that collected data is handled in a more transparent, fair, and secure way.
Organizations need to review their data policies for both internal and external processes, including their use of anonymization and privacy-preserving methods, in order to stay innovative and leverage the latest advances in technology.