Published on July 26th, 2019
Current methods for anonymizing data leave individuals at risk of being re-identified
With the first large fines for breaching GDPR upon us, researchers from UCLouvain and Imperial College London have shown that even anonymized datasets can be traced back to individuals using machine learning.
The researchers say their paper demonstrates that allowing data to be used (to train AI algorithms, for example) while preserving people’s privacy requires much more than simply adding noise, sampling datasets, and other de-identification techniques.
They have also published a demonstration tool that lets people see how likely they are to be traced, even if the dataset they are in is anonymized and only a small fraction of it is shared.
Companies and governments both routinely collect and use our personal data. Our data and the way it’s used are protected under laws like the EU’s GDPR and the US’s California Consumer Privacy Act (CCPA).
Data is “sampled” and anonymized, which includes stripping the data of identifying characteristics like names and email addresses, so that individuals cannot, in theory, be identified. After this process, the data is no longer subject to data protection regulations, so it can be freely used and sold to third parties like advertising companies and data brokers.
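The de-identification step described above can be sketched in a few lines. The field names and record below are purely illustrative; the point is that dropping direct identifiers still leaves quasi-identifiers (age, gender, ZIP code) intact, and it is those that later enable re-identification.

```python
# Minimal sketch of de-identification: drop direct identifiers,
# keep everything else. Record contents are made up for illustration.
records = [
    {"name": "Alice Smith", "email": "alice@example.com",
     "age": 34, "gender": "F", "zip": "10027", "diagnosis": "asthma"},
]

DIRECT_IDENTIFIERS = {"name", "email"}

deidentified = [
    {k: v for k, v in r.items() if k not in DIRECT_IDENTIFIERS}
    for r in records
]
# The remaining fields (age, gender, zip, ...) are quasi-identifiers:
# individually common, but rare in combination.
```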
The new research shows that once bought, the data can often be reverse engineered using machine learning to re-identify individuals, despite the anonymization techniques.
This could expose sensitive information about the re-identified individuals and allow buyers to build increasingly comprehensive personal profiles of them.
The research demonstrates for the first time how easily and accurately this can be done, even with incomplete datasets. The researchers say their findings should be a wake-up call for policymakers on the need to tighten the rules for what constitutes truly anonymous data.
In the research, 99.98% of Americans would be correctly re-identified in any available “anonymized” dataset using just 15 characteristics, including age, gender, and marital status.
First author Dr Luc Rocher of UCLouvain said: “While there might be a lot of people who are in their thirties, male, and living in New York City, far fewer of them were also born on 5 January, are driving a red sports car, and live with two kids (both girls) and one dog.”
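Dr Rocher’s example can be made concrete with a back-of-envelope calculation: each added attribute multiplies down the number of people expected to match. The attribute frequencies below are rough guesses for illustration only (they are not figures from the paper), and treating the attributes as independent is a simplification.

```python
# Back-of-envelope: how attribute combinations shrink an anonymity set.
# Frequencies are illustrative guesses, not data from the paper.
population = 8_400_000  # roughly New York City (illustrative)

attribute_freqs = {
    "male": 0.48,
    "in their thirties": 0.17,
    "born on 5 January": 1 / 365,
    "drives a red sports car": 0.005,
    "two kids, both girls": 0.08,
    "owns a dog": 0.3,
}

expected_matches = population
for name, freq in attribute_freqs.items():
    expected_matches *= freq  # assume independence (a simplification)
    print(f"+ {name}: ~{expected_matches:,.1f} people expected to match")
```

With these made-up numbers the expected count drops below one person well before all six attributes are used, which is exactly why a handful of seemingly innocuous characteristics can single someone out.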
To demonstrate this, the researchers developed a machine learning model to evaluate the likelihood that an individual’s characteristics are precise enough to describe only one person in a population of billions.
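The published model is a copula-based generative model fitted to the data; as a much simpler stand-in, the core quantity (the chance that nobody else in the population shares a given attribute combination) can be sketched with a binomial approximation that treats individuals as independent draws:

```python
def uniqueness_probability(combination_freq: float, population: int) -> float:
    """Probability that no other person in the population shares a given
    attribute combination, under a simple binomial model with independent
    individuals. This is a simplified stand-in, not the paper's
    copula-based method."""
    return (1.0 - combination_freq) ** (population - 1)

# A combination occurring in 1 of 10 million people, population of 1 million:
p = uniqueness_probability(1e-7, 1_000_000)
print(f"probability the record is unique: {p:.3f}")
```

As more attributes are added, `combination_freq` shrinks and the uniqueness probability climbs toward 1, which is the intuition behind re-identification from only 15 characteristics.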
They also developed an online tool, which doesn’t save data and is for demonstration purposes only, to help people see which characteristics make them unique in datasets.
Senior author Dr Yves-Alexandre de Montjoye, of Imperial’s Department of Computing and Data Science Institute, said: “This is pretty standard information for companies to ask for.
“Although they are bound by GDPR guidelines, they’re free to sell the data to anyone once it’s anonymized. Our research shows just how easily, and how accurately, individuals can be traced once this happens.
“Companies and governments have downplayed the risk of re-identification by arguing that the datasets they sell are always incomplete. Our findings contradict this and demonstrate that an attacker could easily and accurately estimate the likelihood that the record they found belongs to the person they are looking for.”
Co-author Dr Julien Hendrickx from UCLouvain said: “We’re often assured that anonymization will keep our personal information safe. Our paper shows that de-identification is nowhere near enough to protect the privacy of people’s data.”
The researchers say policymakers must do more to protect individuals from such attacks, which could have serious ramifications for careers as well as personal and financial lives.
Dr Hendrickx added: “It is essential for anonymization standards to be robust and account for new threats like the one demonstrated in this paper.”

Dr de Montjoye said: “The goal of anonymization is so we can use data to benefit society. This is extremely important but should not and does not have to happen at the expense of people’s privacy.”