
You’re easy to track even when your data has been anonymized - SkyMarshal
https://www.technologyreview.com/s/613996/youre-very-easy-to-track-down-even-when-your-data-has-been-anonymized/
======
merricksb
Different article about same study discussed here 3 months ago:

[https://news.ycombinator.com/item?id=20513521](https://news.ycombinator.com/item?id=20513521)
(261 points/94 comments)

------
jandrewrogers
It is worse than anonymized data: you are easy to track in databases that are
not storing data about you at all, only measuring the broader environment you
operate in (typically for innocuous purposes). This data is inherently
"anonymous" in that it was never designed to be associated with or track a
person but you can nonetheless reconstruct identity and other information
about people from this data.

Individual anonymity in a technical sense is impossible in an environment with
network connected sensors. Above a critical mass of sensors, which we have
_far_ exceeded in most urbanized areas, there are no technical measures that
can keep a person from being tracked.

~~~
SlowRobotAhead
Example of what you are talking about specifically?

~~~
jandrewrogers
When you move through any environment, it leaves a discernible trace in a (to
most people) surprisingly large swath of sensor platforms that exist solely
for boring industrial purposes like building management, measuring the
operating environment of equipment to improve efficiency, etc, never mind
systems actually designed to indirectly capture people (like cameras). Your
existence perturbs the environment and leaves a faint footprint in the data.
As a trivial example, transient proximity of people creates small fluctuations
in measured temperatures. There are analytical techniques that reliably and
systematically isolate and amplify those traces so that you can fingerprint
and track a person using them. The typical urbanized environment is _littered_
with these sensors and it has been repeatedly demonstrated that the
measurements coming off these sensors can be used to constructively identify
specific individuals in the environment.

I think the gap for most people is not the existence of these sensors, which
capture nothing about a person in any kind of direct way, or that people
perturb their environment in some abstract way, but the existence of analytic
techniques that allow someone to reconstruct detailed personal information
from large collections of extremely oblique measurements of the broader
environment.

The analytic methods for doing this type of reconstruction are quite clever
and non-obvious, which I guess would need to be the case for it to be
surprising. It is nothing at all like typical web or enterprise analytics --
you are using measured physics and constraints on that physics to infer
environmental dynamics that you can't measure directly.

~~~
pacala
The other comprehension gap is that, thanks to Moore's law, these methods can
be deployed at scale. Everyone is now a target, 24/7. In the good old days of
the 20th century and Bond movies, it took a highly paid analyst to target
someone personally, which economically limited the intrusion to a tiny sliver
of the population.

------
dictum
For many Internet related phenomena, you can find an example where it already
happened a long time ago with AOL.

[https://en.m.wikipedia.org/wiki/AOL_search_data_leak](https://en.m.wikipedia.org/wiki/AOL_search_data_leak)

------
SlowRobotAhead
I’m not sure if I’m missing the point. Going by the reference link, I think
they’re saying there is a 75-85% chance you are the only person in your zip
code with your gender and your birthdate.

This does not seem that surprising or a new technological development.
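
A minimal sketch of the uniqueness point above, on made-up records (the field
names, values and the `unique_fraction` helper are assumptions for
illustration, not taken from the study). It measures what fraction of rows is
the only one with its combination of zip code, gender and birthdate:

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
records = [
    {"zip": "02139", "gender": "F", "dob": "1985-03-14"},
    {"zip": "02139", "gender": "F", "dob": "1985-03-14"},
    {"zip": "02139", "gender": "M", "dob": "1990-07-01"},
    {"zip": "94110", "gender": "F", "dob": "1972-11-30"},
    {"zip": "94110", "gender": "M", "dob": "1972-11-30"},
]

def unique_fraction(rows, keys):
    """Fraction of rows whose combination of `keys` appears exactly once."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    unique = sum(1 for r in rows if counts[tuple(r[k] for k in keys)] == 1)
    return unique / len(rows)

# Each attribute alone is shared; the combination singles most people out.
print(unique_fraction(records, ["zip"]))                   # 0.0
print(unique_fraction(records, ["zip", "gender", "dob"]))  # 0.6
```

No single column identifies anyone here, but three of the five rows are unique
once the columns are combined.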

~~~
EsssM7QVMehFPAs
The Nature paper is about the use of ML to augment regular statistical methods
with a higher degree of match quality. This is because the neural network is
able to identify dataset features beyond mere obvious group intersects.

------
JimmyRuska
Let's say with web logs. What is "Anonymized"? Even when the datasets get
"anonymized", at scale, information leaks.

Some of those IP addresses are static IP addresses, how could you tell whose?

Some of those URLs may end up with some form of PII in the query string from
some forgotten backend service

Some of those user agents ended up having a Kaspersky unique identifier

Someone saves your website, and when they open it some tag re-fires, capturing
that they opened it from /Users/John.Smith/yourpage.html

Facebook and Google and others add link decoration tracking in the URL, and
suddenly a unique identifier appears across various hits, even if it wasn't
added by the site owner.

There may be account identifiers, hashes or tokens linked to emails or phone
numbers. So while the log dataset is marked as low risk if lost, losing some
of the mapping tables at any point in the future would turn it into full PII.
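
A rough sketch of the hashed-identifier case, assuming the log stores an
unsalted SHA-256 of the email (the scheme, the addresses and the `pseudonym`
helper are all hypothetical). Anyone holding a list of candidate emails can
rebuild the mapping table themselves:

```python
import hashlib

def pseudonym(email: str) -> str:
    """Unsalted SHA-256 of the lowercased email (assumed, weak scheme)."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

# Hypothetical "low risk" log: only hashed account identifiers are stored.
log = [
    {"user": pseudonym("john.smith@example.com"), "path": "/pricing"},
    {"user": pseudonym("jane.doe@example.com"), "path": "/account/delete"},
]

# A leaked mapping table or any list of candidate emails is enough to
# recompute the hashes and join them back onto the log.
candidates = ["jane.doe@example.com", "someone.else@example.com"]
lookup = {pseudonym(e): e for e in candidates}

reidentified = [
    (lookup[row["user"]], row["path"]) for row in log if row["user"] in lookup
]
print(reidentified)  # [('jane.doe@example.com', '/account/delete')]
```

The log never contained an email address, yet one row is tied back to a
person as soon as the candidate list exists.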

------
brenden2
The only way to properly anonymize data is to aggregate it in such a way that
you can't undo the aggregation. Anonymized data can nearly always be
de-anonymized if you have either a) sufficient volume of data or b) access to
the raw non-anonymized source data.

The problem is that most data surveillance systems store the raw source data
instead of just keeping metrics in aggregate form. Thus, it's almost always
possible to de-anonymize data.
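
As a sketch of what keeping only aggregates could look like, assuming a simple
count per zip code with small groups suppressed (the `MIN_GROUP_SIZE`
threshold, the data and the `aggregate` helper are made up for illustration):

```python
from collections import Counter

MIN_GROUP_SIZE = 3  # hypothetical threshold; a real policy picks its own

def aggregate(rows, key, min_size=MIN_GROUP_SIZE):
    """Count rows per `key` value and drop groups smaller than `min_size`."""
    counts = Counter(r[key] for r in rows)
    return {k: n for k, n in counts.items() if n >= min_size}

# Hypothetical raw events; only the aggregate would be retained.
visits = [
    {"zip": "02139"}, {"zip": "02139"}, {"zip": "02139"}, {"zip": "02139"},
    {"zip": "94110"},
]

print(aggregate(visits, "zip"))  # {'02139': 4}
```

The lone visitor from 94110 disappears from the output. This only helps if the
raw rows are discarded afterwards, and suppression alone does not rule out
differencing across several overlapping releases.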

~~~
icebraining
Yes. For example, the GDPR distinguishes full datasets where identifiers have
been removed (calling it "pseudonymisation") from truly anonymized data. The
former is still subject to the regulation, just like any other personal data.

------
aiiane
This is yet another article that conflates de-identified data with
anonymization.

There are ways to create anonymous datasets, but they generally involve
aggregation, not just removing the identifiers.

It's unfortunate that the general layperson's understanding of the concepts at
work here doesn't tend to extend to this distinction. It would help drive
privacy conversations if this were more commonly understood.

------
gwright
From the article:

> It isn’t all bad news. These same reidentification techniques were used by
> journalists working at the New York Times earlier this year to expose Donald
> Trump’s tax returns from 1985 to 1994.

Flippant comments like that make it hard to take the authors seriously. Their
concern for privacy apparently evaporates when the techniques are applied
against people they don't like.

~~~
pmoriarty
The argument has been made that the public deserves access to every
President's and Presidential candidate's tax records, so they can make
informed decisions about whether to elect or re-elect them.

Things such as conflicts of interest, crimes, and lies about where/how they
got their money and whether they're really as wealthy as they claim to be,
whether they've cheated on their taxes or paid unfairly low taxes considering
their enormous wealth are all things that could influence these critically
important decisions on the part of the public.

A further argument is that officials serving in public office don't have the
same expectation of privacy that private citizens do.

In view of these two arguments and others it's not difficult to see why the
authors of this article need not consider the revelation of Trump's tax
returns a good thing merely because they don't like him.

Further, there is no evidence in the article that its authors would not be
concerned about the privacy rights of other people they don't like who aren't:
1 - the President of the US, and 2 - public officials.

~~~
gwright
I'm not unaware of the argument that candidates should release their tax
returns, but it is _not_ the law right now.

Up until the point that access to Trump's tax returns was mentioned the
article was warning about the false privacy associated with anonymizing
identity.

I can understand the argument that candidates should reveal their financial
history. But that doesn't mean otherwise reasonable concerns about false
anonymity should be suspended when talking about the anonymity of one
particular person who has explicitly asserted their privacy rights.

Even if you think the authors were making a more general statement about all
candidates and not just Trump, that seems like a terrible argument to me. In
the cases of candidates for office, voters are free to penalize candidates who
don't reveal enough information about themselves by not voting for them. There
is no need to soften any privacy concerns about anonymized identities.

~~~
pmoriarty
_" voters are free to penalize candidates who don't reveal enough information
about themselves by not voting for them"_

Compare these two hypothetical scenarios:

1 - Voters don't have access to the candidate's tax records

2 - Due to released tax records, the voters know for certain all of the below
facts about the candidate: A - The candidate paid no taxes, B - The candidate
cheated on their taxes, C - The candidate is not as rich as they claim to be,
D - The candidate's businesses lost money so they're not as good a
businessperson as they claim to be

In the first hypothetical scenario the voters know there's a possibility that
the candidate might be hiding something. In the second hypothetical scenario
the voters know for certain that the candidate is a lawbreaking, tax cheating,
lying hypocrite.

In which of these hypothetical scenarios do you think the voters are going to
penalize the candidate more?

~~~
gwright
This is a false choice and a different one than I suggested. You would have to
consider this scenario also:

A released tax records, B released tax records, C didn't release any records.

FWIW, I've talked to an accountant about the idea of Trump revealing his tax
records and the bottom line is that they would be sufficiently complicated
that there is no possibility that the average person would be able to
interpret them accurately, so you'll be left with the spin from the various
media organizations, hardly a source of objective truth.

So I would assert that requiring candidates to release their tax records
doesn't actually provide any useful information for a voter.

Remember that Trump's tax records are already examined by the IRS and, I
believe, have been audited. So there shouldn't be any question of illegal
activity being hidden, unless you want to assert that the IRS can't be trusted
either.

There are also other concerns about tax records revealing information about
3rd parties. And finally tax records aren't really a useful way to understand
the intricacies of a business. If you are really interested in that you would
want the audit report for the underlying business and not just tax records.

