So basically it's saying that removing the names and addresses isn't sufficient to anonymize a data set. Didn't we already know that 10 years ago?
They are saying you just need to know 4 times and places that a person has been, and you have a 90% chance of identifying their entire history via the user_id in the data set. And if you know the price of a transaction they made, the probability goes up. Is that very surprising?
So basically it's saying that removing the names and addresses isn't sufficient to anonymize a data set. Didn't we already know that 10 years ago?
Yes, but businesses that trade in aggressive data tracking and/or analysis have been quite successful at downplaying the meaningfulness of their data in the common person's eyes. Anyone who doesn't understand how to work with these datasets has to rely on the press, and the press has largely listened to the fox standing next to the henhouse. Why wouldn't they? He's got a nice suit and a mastery of language that makes naysayers look like conspiracy theorists.
But it requires the attacker to know a lot of information in the first place: 4 time/place pairs.
In other words, if you already know a lot of information about a person, you can get even more information from the "anonymous" data set. Why is that surprising?
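To make the attack being discussed concrete, here is a rough sketch of how a few known time/place points narrow an "anonymous" data set down to one user_id. The records, place names, and function names are all made up for illustration; this is not the study's code or data, just the matching idea.

```python
from collections import defaultdict

def build_traces(records):
    """Group (user_id, time, place) records into a set of points per user."""
    traces = defaultdict(set)
    for user_id, when, where in records:
        traces[user_id].add((when, where))
    return traces

def matching_users(traces, known_points):
    """Return the user_ids whose trace contains every known (time, place) point."""
    known = set(known_points)
    return {uid for uid, points in traces.items() if known <= points}

# Hypothetical toy data set: names removed, only opaque user_ids remain.
records = [
    ("u1", "mon-09", "cafe"), ("u1", "mon-13", "gym"),  ("u1", "tue-09", "cafe"),
    ("u2", "mon-09", "cafe"), ("u2", "mon-13", "mall"), ("u2", "tue-09", "cafe"),
    ("u3", "mon-09", "cafe"), ("u3", "mon-13", "gym"),  ("u3", "tue-18", "bar"),
]
traces = build_traces(records)

# Each extra point the attacker knows shrinks the candidate set.
one_point = matching_users(traces, [("mon-09", "cafe")])
two_points = matching_users(traces, [("mon-09", "cafe"), ("mon-13", "gym")])
three_points = matching_users(
    traces, [("mon-09", "cafe"), ("mon-13", "gym"), ("tue-09", "cafe")]
)
print(len(one_point), len(two_points), len(three_points))
```

Once the candidate set is down to a single user_id, everything else stored under that id comes along for free, which is the whole point of the complaint.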
You only need 33 bits of information to uniquely identify a living person (and probably no more than 38 bits to identify any person who ever lived).
Everything provides some information, 0.01 bits here, 3 bits there - e.g. a fact such as "understands English" is already worth about 2.5 bits. It's just a matter of integrating all those observations into one coherent estimate.
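The arithmetic behind those numbers, as a quick sketch. The 8 billion population figure is an assumption, and the 17.7% fraction is simply what 2.5 bits works out to, not a sourced statistic. Adding bits across facts also assumes the facts are independent, which real attributes rarely are.

```python
import math

def bits_to_identify(population: int) -> float:
    """Bits needed to single out one member of a population of this size."""
    return math.log2(population)

def bits_from_fact(fraction: float) -> float:
    """Bits revealed by a fact that holds for this fraction of the population."""
    return -math.log2(fraction)

world_population = 8_000_000_000  # assumed round figure
needed = bits_to_identify(world_population)
print(round(needed, 1))  # 32.9

# A fact shared by about 17.7% of people carries about 2.5 bits.
per_fact = bits_from_fact(0.177)

# If every fact were independent and worth that much, this many would suffice.
facts_needed = math.ceil(needed / per_fact)
print(facts_needed)
```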
EFF.org did something similar a while back. Browsing habits have clearly been a "fingerprint" for years, same as keystrokes, monitor depth, etc., which were found to be fingerprints more recently.
Even better (and creepier) was the government study 15 years ago that could identify people solely by how they walk, using surveillance video.
Or, remember the AOL study? That was only IP addresses, and many people were "unmasked." It was only metadata / search logs, and it still identified individuals. That was a decade back. This article seemingly only uses citations that are 0-4 years old, yet this has been a well-trod issue for a while, even in journalistic circles.