When a business or government reassures you that the data they share is “anonymous”, that anything that could identify you has been removed, just laugh. Or cry. “Many current datasets can be re-identified with no more than basic programming and statistics skills,” Princeton University researchers warned in June.
Take, for example, New York’s supposedly anonymous database of all 174 million taxi rides taken in 2013, which was made public after a freedom-of-information request. Software engineer Vijay Pandurangan re-identified the drivers and their plate numbers using just a few hours of computer time — in part because the original de-identification was done poorly. Then Anthony Tockar of Neustar Research dug out photos of celebrities getting into taxis where the licence plate was visible, cross-referenced them with the taxi data, and revealed where those celebrities had gone next.
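How poor was the de-identification? Pandurangan reportedly found that the medallion and licence numbers had simply been run through an unsalted hash function, even though their format is public and the space of possible values is tiny. When that's the case, an attacker can hash every possible identifier and invert the mapping. Here's a toy sketch with an invented four-character format (not the real medallion scheme):

```python
import hashlib

def build_lookup(candidates):
    """Hash every candidate identifier, keyed by hash for reverse lookup."""
    return {hashlib.md5(c.encode()).hexdigest(): c for c in candidates}

# Invented keyspace: one letter plus three digits = 26,000 values.
# Real medallion formats are similarly small, so hashing hides nothing.
candidates = [f"{ch}{n:03d}"
              for ch in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              for n in range(1000)]
lookup = build_lookup(candidates)

# A "de-identified" record: the plate replaced by its hash.
hashed = hashlib.md5("B492".encode()).hexdigest()
print(lookup[hashed])  # prints "B492" -- the original identifier, recovered
```

Hashing only protects identifiers drawn from a space too large to enumerate; for a few million plate numbers, a laptop can precompute every hash in seconds.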
It’s not just celebrities whose privacy is at risk. Tockar could also figure out the home addresses of frequent visitors to Larry Flynt’s Hustler Club. From there, it would be easy enough to cross-reference those addresses with property records, voter registrations and other public information to get names.
“Holy shit, can you imagine someone just plotting all the trips from a single gay bar? Listing off all the connected residential addresses? And not only that, any subsequent trips home from those addresses the next morning? Taking the walk of shame to a whole new level!” wrote user ‘abalone’ at Hacker News.
“Likewise trips could be used to deduce affairs and other deceptions by fellow residents. ‘You said you were working late, but the only taxi trip to our building that night was from a bar.’”
Location data is particularly revealing. Our smartphones are effectively tracking devices. That’s why law enforcement and intelligence agencies are so keen to access this telecommunications metadata.
“As most people spend the majority of their time at either their home or workplace, an adversary who knows those two locations for a user is likely to be able to identify the trace for that user — and to confirm it based on the patterns of movement,” the Princeton researchers wrote.
It’s easy. According to research by Yves-Alexandre de Montjoye and others, more than 50% of mobile phone users can be identified from just two randomly chosen location data points. With four points, the figure rises to 95%. Most people reveal vastly more than that through social media — either by stating their location directly, or giving it away indirectly by posting photos of what they see.
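The intuition behind that result can be sketched with entirely synthetic data (this is an illustration of the idea, not a reproduction of the study): give each simulated user a trace of (place, hour) points, sample a few points, and check how often those points match only one user.

```python
import random

random.seed(0)
PLACES, HOURS, USERS, POINTS = 40, 24, 200, 30

# Synthetic traces: each user is a set of (place, hour) observations.
traces = {u: {(random.randrange(PLACES), random.randrange(HOURS))
              for _ in range(POINTS)}
          for u in range(USERS)}

def unique_fraction(k):
    """Fraction of users whose k sampled points match no other trace."""
    hits = 0
    for u, trace in traces.items():
        sample = set(random.sample(sorted(trace), k))
        matches = [v for v, t in traces.items() if sample <= t]
        hits += (matches == [u])
    return hits / USERS

f2, f4 = unique_fraction(2), unique_fraction(4)
print(f2, f4)  # more points -> a larger fraction uniquely identified
```

Even in this crude model, a handful of points usually suffices, because human-scale location traces are sparse in a huge space of possible (place, time) combinations.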
It gets worse.
“Many de-identified datasets are vulnerable to re-identification by adversaries who have specific knowledge about their targets. A political rival, an ex-spouse, a neighbour, or an investigator could have or gather sufficient information to make re-identification possible,” the Princeton researchers wrote.
“As more datasets become publicly available or accessible by (or through) data brokers, the problems with targeted attacks can spread to become broad attacks. One could chain together multiple datasets to a non-anonymous dataset and re-identify individuals present in those combinations of datasets.”
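Chaining datasets in this way is mechanically trivial: it is just a join on shared quasi-identifiers, the classic trio being postcode, birth date and sex. A hypothetical sketch (all names and records invented):

```python
# "Anonymous" records: no names, but quasi-identifiers remain.
anonymous_rides = [
    {"zip": "10014", "birth": "1975-03-02", "sex": "M", "dropoff": "strip club"},
    {"zip": "11201", "birth": "1980-07-19", "sex": "F", "dropoff": "airport"},
]

# A public, non-anonymous dataset sharing the same fields.
voter_roll = [
    {"name": "J. Doe", "zip": "10014", "birth": "1975-03-02", "sex": "M"},
    {"name": "A. Roe", "zip": "11201", "birth": "1980-07-19", "sex": "F"},
]

KEYS = ("zip", "birth", "sex")
# Index the public dataset by its quasi-identifiers.
index = {tuple(rec[k] for k in KEYS): rec["name"] for rec in voter_roll}

# Join: attach names to the "anonymous" records.
for ride in anonymous_rides:
    name = index.get(tuple(ride[k] for k in KEYS))
    print(name, "->", ride["dropoff"])
```

Anyone who can write a database join can mount this attack; the only hard part is obtaining the datasets, and data brokers solve that.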
Here’s the key policy problem.
Most privacy law, including that of the US, is based on the concept of protecting personally identifiable information (PII). Definitions vary, but that of the US National Institute of Standards and Technology (NIST) is typical: “any information about an individual … including (1) any information that can be used to distinguish or trace an individual’s identity, such as name, social security number, date and place of birth, mother’s maiden name, or biometric records; and (2) any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information”.
Australia’s updated Privacy Act, which came into force on March 12, broadens the definition of personal information to include information where an individual’s identity is “apparent, or can reasonably be ascertained”.
But as the research demonstrates, an individual’s identity can be “reasonably ascertained” from all manner of data with ever-decreasing effort — perhaps not from one dataset, but certainly by cross-referencing it with others.
It’s not just political rivals or disgruntled ex-partners who’d be interested. Insurance companies and credit providers are always on the lookout for indications of risk.
New Zealand’s privacy commissioner has floated the idea of making the re-identification of anonymised data illegal. Perhaps he’s onto something.