Blog #2 in our Series on The Business of Responsible AI: The Problem with Anonymization
This is the second post in my series on the Business of Responsible AI (Artificial Intelligence), where I explore topics related to responsible AI: artificial intelligence techniques and methodologies, as applied to security software, that are ethical and protective of human rights such as data privacy. Responsible AI is not only the ethical thing to do; it can also make good business sense!
I’m happy that notions of responsible AI are already being widely discussed, debated, and even turned into regulation. In my first blog post, I mentioned several initiatives worldwide to provide a regulatory framework for data privacy, such as the European General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) in the US. Both the GDPR and the CCPA emphasize de-identification. For example, Recital 26 of the GDPR defines anonymized data as “data rendered anonymous in such a way that the data subject is not or no longer identifiable.” De-identification is important globally, not just in the EU and the US; for example, it was also a large focus of discussion in my conversations with the Privacy Commissioner of Canada. Many regulatory systems propose punitive fines if a corporation does not properly de-identify its datasets.
De-identification is important... but limited
De-identification is the most important technique for anonymization. The idea behind de-identification is straightforward: strip away any information from someone’s data that could otherwise be used to re-identify the individual, and you have something that is anonymous. De-identification as an important component of data privacy and responsible AI makes sense: nearly all AI systems require data, and machine learning in particular often requires large amounts of data that may contain information about individuals that needs to be protected.
There are many ways to de-identify data. One common method in the world of security is to anonymize sensitive fields with tokenization methods that turn plain text strings like “Stephan Jou” into inscrutable sequences like “f6d5f70811dfccf693f4873931b0516ac50f85e4” or even format-preserving strings like “Gabel Johnson”.
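As a rough sketch of how this kind of tokenization might look (hypothetical code, not the exact scheme any particular product uses), here is a keyed-hash approach in Python that maps the same plain-text value to the same opaque token every time:

```python
import hashlib
import hmac

# A secret key kept separate from the data; without it, tokens are much
# harder to reverse via dictionary attacks. (Illustrative value only.)
SECRET_KEY = b"replace-with-a-securely-stored-secret"

def tokenize(value: str) -> str:
    """Turn a sensitive plain-text value into an opaque, repeatable token."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# The same input always yields the same inscrutable token, so analytics
# can still group by user without ever seeing the real name.
print(tokenize("Stephan Jou"))
```

Format-preserving tokenization works on the same principle, except the output is made to look like the original field (e.g. a realistic-looking name) instead of a hex string.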
Unfortunately, while de-identification is important, it has limitations that can render it insufficient, on its own, to guarantee data privacy. These problems are especially worrisome in the context of the large datasets that go along with AI and machine learning systems. In other words, responsible AI requires us to take a good hard look at de-identification and consider other approaches to building AI systems that are respectful of data privacy.
Let’s dig in.
There’s no effective way to anonymize location data
Studies have shown that you can identify 95% of all individuals using just four random locations, and over 50% of all individuals with just two random locations[1]. That might seem surprising at first, but it’s easy to understand intuitively: for many of us, if all an attacker had was your home address and your work address (two locations), they’d be able to figure out who you were. The combination of a work location and a home location is sufficiently unique that this information can be used to re-identify someone in an otherwise-anonymized dataset. Also remember: home and work addresses are easily obtainable from data brokers and even public data sources.
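To see why so few points are needed, here is a toy Python sketch (with entirely made-up data) of how an attacker who already knows someone’s home and work areas could single that person out of a “de-identified” mobility dataset:

```python
# Toy illustration: even with names replaced by tokens, the pair
# (home area, work area) is often unique enough to single someone out.
anonymized_rows = [
    {"user": "token_a1", "home": "cell_17", "work": "cell_92"},
    {"user": "token_b2", "home": "cell_17", "work": "cell_45"},
    {"user": "token_c3", "home": "cell_63", "work": "cell_92"},
]

# A clue the attacker might already have about a specific person
# (e.g. from a data broker or public records): home and work areas.
known_home, known_work = "cell_17", "cell_92"

matches = [r for r in anonymized_rows
           if r["home"] == known_home and r["work"] == known_work]

if len(matches) == 1:
    print("Unique match -> re-identified as", matches[0]["user"])
```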
De-identification becomes increasingly difficult with wider datasets
The more columns we have in a dataset, the harder it becomes to de-identify. Scientists describe such wide tables as “high-dimensional”, and high-dimensional datasets are now the norm. In the past, when personal information was primarily obtained through surveys, a survey might have, say, 30 questions, so the columns of information stored about individuals would be limited to those 30 data points. In today’s world of AI systems, the internet, and social media, it is not unusual for datasets to have hundreds, thousands, or even more dimensions.
Here’s the problem: the more columns of data you have on an individual, the easier it becomes to uniquely identify someone based on a combination of the non-anonymized columns. Even if, say, the names and addresses are scrambled, there’s a good chance that you can figure out who someone is using the other columns. In other words, de-identification isn’t effective with high-dimensional datasets.
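Here is a small, hypothetical Python illustration of the effect: as you consider more columns together, the fraction of rows that are uniquely identifiable climbs quickly, even though no single column is identifying on its own:

```python
from collections import Counter

# Made-up "de-identified" records: no names, but several descriptive columns.
records = [
    {"zip": "90210", "age": 34, "device": "iPhone", "browser": "Safari"},
    {"zip": "90210", "age": 34, "device": "Pixel",  "browser": "Chrome"},
    {"zip": "90210", "age": 41, "device": "iPhone", "browser": "Chrome"},
    {"zip": "10001", "age": 34, "device": "iPhone", "browser": "Safari"},
]

def fraction_unique(rows, columns):
    """Fraction of rows whose combination of `columns` appears exactly once."""
    combos = Counter(tuple(r[c] for c in columns) for r in rows)
    unique = sum(1 for r in rows if combos[tuple(r[c] for c in columns)] == 1)
    return unique / len(rows)

# Each added column makes more rows uniquely identifiable.
for cols in (["zip"], ["zip", "age"], ["zip", "age", "device", "browser"]):
    print(cols, fraction_unique(records, cols))
```

With one column, only a quarter of these toy rows are unique; with all four, every row is.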
Netflix used to run a competition called the Netflix Prize: Netflix released customer movie rating information to the data science community and challenged them to build a better recommendation algorithm, to improve its movie and TV recommendations. At the time of the Netflix Prize competition in 2010, the average customer had over 200 movie ratings, each with a rating date. That’s over 400 dimensions for every Netflix user in the movie rating dataset alone, ten years ago.
So even though the Netflix dataset contained no name or location information at all, just movie ratings and dates, was it still possible to re-identify someone in the data?
Sadly, yes.
Anonymized datasets are subject to linkage attacks
Datasets no longer live in a vacuum. Instead, there is a large population of datasets out there that are publicly available for download, can be queried through open APIs, or can be purchased from data brokers. Unfortunately, while open data, government transparency, and open integration are well-intentioned movements, the more datasets that exist, the greater the chance that a malicious actor can re-identify individuals in an anonymized dataset by employing a linkage attack.
In a linkage attack, an attacker combines your anonymized dataset with other datasets to re-identify individuals. That’s where the linking happens: link something anonymous with enough clues from other datasets, and you can defeat the anonymization.
Let’s go back to the Netflix Prize. The competition was cancelled because of privacy concerns, triggered by a linkage attack in which researchers at the University of Texas at Austin linked Netflix’s movie rating dataset with public movie reviews from the Internet Movie Database (IMDb)[2]. In other words, they were able to match people who posted positive and negative reviews on IMDb with individuals in the Netflix dataset, re-identifying some of Netflix’s customers.
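As a drastically simplified sketch (made-up data, and not the researchers’ actual statistical method), the core of a linkage attack is little more than a join on shared attributes such as movie, date, and rating:

```python
from collections import defaultdict

# "De-identified" dataset: user IDs are opaque tokens, no names.
anonymized_ratings = [
    {"user": "token_42", "movie": "Movie A", "date": "2007-03-01", "stars": 5},
    {"user": "token_42", "movie": "Movie B", "date": "2007-03-09", "stars": 1},
    {"user": "token_77", "movie": "Movie A", "date": "2007-04-02", "stars": 3},
]

# Public dataset: reviews posted under real names.
public_reviews = [
    {"name": "Jane Doe", "movie": "Movie A", "date": "2007-03-01", "stars": 5},
    {"name": "Jane Doe", "movie": "Movie B", "date": "2007-03-09", "stars": 1},
]

# Count how many public reviews line up with each opaque token.
overlap = defaultdict(int)
for a in anonymized_ratings:
    for p in public_reviews:
        if (a["movie"], a["date"], a["stars"]) == (p["movie"], p["date"], p["stars"]):
            overlap[(a["user"], p["name"])] += 1

# Multiple matching reviews is strong evidence the records are the same person.
for (token, name), hits in overlap.items():
    if hits >= 2:
        print(f"{token} is probably {name} ({hits} matching reviews)")
```

The real attack was more statistically robust (tolerating fuzzy dates and ratings), but the principle is the same: enough overlapping clues collapse the anonymity.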
The Netflix example is famous, but maybe it feels obvious. After all, it’s not surprising that you can perform a linkage attack using two movie-related datasets. But it’s also possible to perform a linkage attack on datasets that may not seem to be related.
Here’s a surprising example: computer scientists were able to take a public NYC taxicab database and combine it with paparazzi photographs of celebrities to figure out which celebrities took which cab rides and whether they were good tippers! It’s a fascinating example of a surprising linkage attack that is worth a read. Spoiler: Bradley Cooper and Jessica Alba are poor tippers!
You probably wouldn’t have expected taxicab data to be combined with celebrity photos in a linkage attack. And that’s the challenge: even if you have a perfectly anonymized dataset, there is no way to guarantee that it can’t be linked to another dataset in the future to re-identify someone.
The good news
These might feel like challenging problems. And they are! But here’s the good news: it is possible for AI systems that require large datasets to provide a valuable service while still being responsible and respectful of data privacy. Techniques like differential privacy and federated learning can help, and a culture of privacy by design can go a long way toward ensuring that sensitive information is protected and respected.
We’ll cover that in the next article in this series.
[1] Yves-Alexandre de Montjoye et al., “Unique in the Crowd: The privacy bounds of human mobility,” Scientific Reports 3 (2013).
[2] Arvind Narayanan and Vitaly Shmatikov, “Robust De-anonymization of Large Sparse Datasets,” IEEE Symposium on Security and Privacy (2008).