
Stanford quantifies the privacy-stripping power of metadata - zbjornson
http://techcrunch.com/2016/05/17/stanford-quantifies-the-privacy-stripping-power-of-metadata/
======
jrcii
An MIT project reached the same conclusion a couple years ago
[http://www.independent.co.uk/life-style/gadgets-and-
tech/mit...](http://www.independent.co.uk/life-style/gadgets-and-tech/mits-
immersion-project-reveals-the-power-of-metadata-8695195.html)

Someone from MIT also contributed to this study in the same vein
[http://www.nature.com/articles/srep01376](http://www.nature.com/articles/srep01376)
From the abstract, "[I]n a dataset where the location of an individual is
specified hourly, and with a spatial resolution equal to that given by the
carrier's antennas, four spatio-temporal points are enough to uniquely
identify 95% of the individuals."
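
To make the "four points are enough" claim concrete, here is a rough sketch of how such a uniqueness figure could be estimated. The toy traces, the `unicity` name and the sampling scheme are my own assumptions for illustration, not the paper's code or data.

```python
import random

def matches(trace, points):
    """True if every sampled (hour, antenna) point appears in the trace."""
    return points <= trace

def unicity(traces, k, trials=200, seed=0):
    """Fraction of random k-point samples that match exactly one trace."""
    rng = random.Random(seed)
    users = sorted(traces)
    unique = 0
    for _ in range(trials):
        user = rng.choice(users)
        # Draw k known points from this user's own trace.
        points = set(rng.sample(sorted(traces[user]), k))
        candidates = [u for u in users if matches(traces[u], points)]
        if candidates == [user]:
            unique += 1
    return unique / trials

# Hypothetical toy traces: user -> set of (hour, antenna_id) points.
traces = {
    "a": {(8, 1), (9, 2), (12, 3), (18, 1)},
    "b": {(8, 1), (9, 2), (12, 4), (18, 5)},
    "c": {(8, 6), (9, 2), (12, 3), (18, 5)},
}

print(unicity(traces, k=1), unicity(traces, k=4))
```

On this toy data a single point often matches several people, while four points always pin down one person.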

~~~
notsnowden
Those projects are related, but definitely different. The first is about email
metadata. The second is about re-identifying cellphone location data. This
paper is about telephone call and text metadata, which is what the NSA
collects.

------
rayiner
> The law currently treats call content and metadata separately and makes it
> easier for government agencies to obtain metadata, in part because it
> assumes that it shouldn’t be possible to infer specific sensitive details
> about people based on metadata alone.

This quote from the Stanford News article is incorrect. Metadata is carved out
because there is Supreme Court precedent carving it out:
[https://en.wikipedia.org/wiki/Smith_v._Maryland](https://en.wikipedia.org/wiki/Smith_v._Maryland).
That case has nothing to do with what can or cannot be inferred from
metadata.[1] It distinguishes call content from call metadata because the
latter is routinely recorded and used by phone companies for various purposes:

> First, we doubt that people in general entertain any actual expectation of
> privacy in the numbers they dial. All telephone users realize that they must
> "convey" phone numbers to the telephone company, since it is through
> telephone company switching equipment that their calls are completed. All
> subscribers realize, moreover, that the phone company has facilities for
> making permanent records of the numbers they dial, for they see a list of
> their long-distance (toll) calls on their monthly bills.

[1] Because that's totally irrelevant to the 4th amendment.

~~~
tripzilch
I know this is not how the law and precedents work in the USA, but for the
sake of common sense, someone needs to call this out:

The quoted line of reasoning is absolutely disingenuous when considered in a
modern and realistic setting.

What is being wilfully ignored is the change in _quality_ of the information
gleaned from the analysis of data as it is being done in bulk.

Nobody in 1979 could have foreseen the sort of information that can be
extracted from the unimaginably large fire-hose of metadata we generate today.

It's a _completely_ different thing if you had a list of who-called-who-when
in 1979. How was this data kept? Well, for starters it probably wasn't
centralized. Was it even digital? Probably, yes? Even if it was digital, the
computers of that day could only handle trivial amounts of data. Factor in the
ubiquity of phones and phone usage today versus back then, and to even consider
the two _vaguely comparable_ is ridiculous.

It's the difference between getting records from the electricity company so
that you know which parts of your house were illuminated when, versus getting
"records" from all the individual CCD elements of various cameras installed in
your house so that you know the same thing, which (tiny) parts of your house
were illuminated when. That's the same thing right? It's just a _tiny_ bit
more fine-grained (/s).

Just because people might be okay with the former (say because you can see
what lights are on from behind the curtains, on the street), doesn't mean
they'd be fine with the latter.

The _quality_, and therefore the privacy-expectations, of the information
extracted from the data changes as you blow up the number of records by some
orders of magnitude. Almost nobody from 1979 could have imagined what that
would mean. Almost nobody could even have fathomed the amount of computational
power a desktop computer could throw at it. Hell, not even most people today
can grasp that.

So there's "innocent" metadata that via some unfathomable process can be
transmuted into some rather more detailed and revealing information. It's
really quite hard to get a proper perspective on it (our brains aren't made
for reasoning about graphs this size). I think it's fairer to call this
process "magic" than anything else. And in that case it really doesn't matter
where the information came from. Say magic works: does it matter whether the
government knows all the details of your life from divining tea leaves or from
divining phone metadata?

Now add to this that, thanks to our sufficiently advanced technology,
it is _also_ possible, using similar techniques of "magic", for the phone
companies to keep records for billing purposes in a really clever
(magic/encrypted) way that _shields_ the data from those kinds of divination
while still letting them do their billing. The only reasonable expectation I
have is for these paradigm shifts to be applied on both sides, equally.

------
stepvhen
Five years or so ago I was on a jury in a case concerning sexual misconduct,
and spent a good hour or so of the deliberation scanning the submitted call
records, looking for gaps in correspondence between the two relevant phone
numbers. It was pretty easy to identify moments when the two parties were
together: that's when they stopped texting. I checked that those gaps matched
up with all of the testimonies. They did, and it solidified a guilty verdict
with my fellow jurors (sans one, on one count, but we had 7 already).

~~~
tsunamifury
What a tragedy of justice that you used an absence of evidence as a deciding
factor. There is so much logically wrong with that, even in correlation.

~~~
Terr_
Suppose somebody's alibi is "I was driving between these two cities at the
time, and then I spent the night working."

But their car-odometer _didn't change_, and their home-electricity usage
dropped to zero that day.

Both of those observations are surely "evidence". Similarly, a digital signal
still carries information, even though the zeroes are an "absence" of voltage.

~~~
woodman
The logic isn't even close:

Odometers increment on driven cars. The odometer did not increment. The car
was driven. Unsatisfied.

People having sex are together. People who are together don't text each other.
During a certain hour a person didn't text 68 contacts. At this certain hour,
this certain person had sex with 68 people. Satisfiable.

~~~
stepvhen
It's more like: these two numbers are in constant contact all other hours of
the day. At this hour midday, when others, including one of the relevant
parties, say those two numbers are together, there is no correspondence.
Later, the correspondence picks up as before.
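
What I did by eye amounts to something like the sketch below: sort the message times and flag any unusually long silence. The timestamps, the two-hour threshold and the `find_gaps` name are made up for illustration, not taken from the actual records.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, min_gap):
    """Return (start, end) pairs where consecutive messages are at least min_gap apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a >= min_gap]

# Hypothetical message times between the two numbers on one day.
day = datetime(2011, 5, 1)
times = [
    day + timedelta(hours=h, minutes=m)
    for h, m in [(9, 0), (9, 20), (9, 45), (10, 10), (13, 30), (13, 50), (14, 15)]
]

gaps = find_gaps(times, min_gap=timedelta(hours=2))
for start, end in gaps:
    print(f"silent from {start:%H:%M} to {end:%H:%M}")
```

A gap on its own proves nothing; it only becomes useful once it is compared against what the witnesses say about that window.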

~~~
woodman
You seem to be aware, at least on a subconscious level, of the larger point -
as you've demonstrated by including additional evidence. The larger point
being that the absence of phone records can, at best, disconfirm. There are
far too many alternative explanations for it to be used alone or as
confirmation.

~~~
rayiner
Remember, in court we're not talking about logical proof (eliminating every
alternative but the desired conclusion), but statistical proof (eliminating
alternative conclusions with 51% or 95% certainty). Given an assumed or
established statistical model of the probability of related events (e.g.
knowing that a secretary at a business makes a log entry 99.9% of the time
when she sends out a mailing), a non-happening (say the absence of a log
entry) can easily establish a conclusion to the desired level of certainty.
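
A worked version of the secretary example, using Bayes' rule. The 99.9% figure is from above; the 50/50 prior and the assumption that no log entry is ever made without a mailing are mine, purely for illustration.

```python
def posterior_sent_given_no_log(prior_sent, p_log_given_sent, p_log_given_not_sent=0.0):
    """P(mailing was sent | no log entry exists), by Bayes' rule."""
    no_log_and_sent = (1 - p_log_given_sent) * prior_sent
    no_log_and_not_sent = (1 - p_log_given_not_sent) * (1 - prior_sent)
    return no_log_and_sent / (no_log_and_sent + no_log_and_not_sent)

# Assumed prior: before looking at the log, sent and not-sent are equally likely.
p_sent = posterior_sent_given_no_log(prior_sent=0.5, p_log_given_sent=0.999)
p_not_sent = 1 - p_sent

print(f"P(not sent | no log entry) = {p_not_sent:.4f}")
```

Under those assumptions the missing entry alone clears both the 51% and the 95% thresholds; a different prior or a sloppier log would change the answer.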

~~~
woodman
Darn, I already replied to this point moments ago. There is a difference in
the application of this logic in the court room where it can be challenged,
and the juror room where it cannot - right?

------
shas3
I am reminded of an interesting observation by the mathematician Terence Tao
about how our anonymity on the Internet and in a connected world is so fragile
[1]. Basically, because there are only 3 billion internet users, every person
can be uniquely identified by a 32-bit number (log2 of 3 billion is about
31.5). The uncovering of each bit gets one closer to the identity of the
person. Seen in this light, one would expect metadata to uncover quite a few of
those bits!

[1]
[https://plus.google.com/+TerenceTao27/posts/8vmpA9fgRMq?iem=...](https://plus.google.com/+TerenceTao27/posts/8vmpA9fgRMq?iem=4&gpawv=1&hl=en-
US)
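
The bit count is just log2 of the population. A rough check, assuming the 3 billion figure above (the function names are mine):

```python
import math

def bits_to_identify(population):
    """Bits of information needed to single out one member of a population."""
    return math.log2(population)

def remaining_candidates(population, known_bits):
    """Expected size of the anonymity set after learning known_bits bits."""
    return population / 2 ** known_bits

users = 3_000_000_000  # assumed number of internet users
bits = bits_to_identify(users)
whole_bits = math.ceil(bits)

print(f"{bits:.2f} bits, so a {whole_bits}-bit number suffices")
print(f"after 20 bits: about {remaining_candidates(users, 20):.0f} candidates left")
```

Each independent bit halves the pool, so a couple of dozen bits of metadata leave only a few thousand candidates.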

------
UVB-76
Link to the actual article:
[http://www.pnas.org/content/early/2016/05/10/1508081113.full](http://www.pnas.org/content/early/2016/05/10/1508081113.full)

~~~
jegoodwin3
Thanks for this. I would have been happier if the authors had phrased their
findings in terms of differential privacy rather than the effectiveness their
algorithms were able to achieve.

[https://en.wikipedia.org/wiki/Differential_privacy](https://en.wikipedia.org/wiki/Differential_privacy)

When setting policy, it is better to have theoretical mathematical results
rather than empirical effectiveness, since you can bet the technological
frontier of privacy violation is a moving one. As with cryptography, you want
solid foundations in unbreakable maths -- not 'we can't break this cipher with
what we know today'. Probably, someone can.
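
For anyone unfamiliar with the term, here is a minimal sketch of the textbook Laplace mechanism for a counting query. The call list and the names are invented for illustration; nothing here comes from the paper.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(values, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    A counting query changes by at most 1 when one record is added or
    removed (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)

# Hypothetical callee labels from one person's call records.
calls = ["clinic", "pizza", "clinic", "bank", "clinic"]
rng = random.Random(0)

print(private_count(calls, lambda c: c == "clinic", epsilon=1.0, rng=rng))
```

The guarantee holds whatever the attacker knows or computes later, which is the sort of foundation I would like policy to rest on.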

------
ErikAugust
This was particularly striking to me:

"We kill people based on metadata" - General Michael Hayden

~~~
e12e
I'd probably strengthen that to "We kill bystanders based on metadata".

------
aandon
Great Snowden interview (by Neil deGrasse Tyson) where he explains why
collecting "only metadata" is no excuse:
[https://soundcloud.com/startalk/a-conversation-with-
edward-s...](https://soundcloud.com/startalk/a-conversation-with-edward-
snowden-part-1)

------
ljk
> _cross-referenced with social networking information and other public data
> sets, such as Yelp and Google Places_

Probably would be safer not to publicize your every move too, even if
it'd only slow the process down a little

~~~
arca_vorago
The real problem with issues like this is that, while I have chosen to remove
myself from certain social media (I have a facebook but it's poisoned data),
friends and family aren't as paranoid or knowledgeable about privacy
implications as I am, so I have to remind people to not tag me, not upload
photos of me, etc.

Friends and family are reporting on their friends and family without even
understanding the implications of what they are doing.

It may seem benign at the moment, but given the nature of the turn-key
totalitarian state, it's when that key gets turned and the cat starts getting
walked back that this sort of information leakage from unexpected sources
really becomes an issue.

~~~
ljk
Good point, and facebook generates "phantom profiles" for people without an
account too, to make the data gathering even easier for them

~~~
arca_vorago
It does, hence why I suggest people manage their own but spike the data... and
be sure to use tor or similar to connect. (I only check mine every six months
or so, just to make sure facebook doesn't try to pull another "oh hey, we
made your entire profile public" stunt.)

~~~
dredmorbius
How and what do you check for?

~~~
arca_vorago
Just review privacy settings, check for tags from friends, simple stuff.

------
dimino
> in part because it assumes that it shouldn’t be possible to infer specific
> sensitive details about people based on metadata alone.

> One of the government’s justifications for allowing law enforcement and
> national security agencies to access metadata without warrants is the
> underlying belief that it’s not sensitive information.

Is this true? I didn't think this was an actual part of the argument for using
metadata, but that metadata wasn't covered under current laws, and was
therefore easier to get.

I was working under the assumption that it was an unintentional oversight, not
an intentional hole in legislation.

~~~
DanBC
In England the government say (when they want to expand laws to make it easier
for them to get metadata) that it doesn't contain any content, it's just about
who you call or who calls you and for how long. They never say that
individuals cannot be identified from this - information about identified
individuals is the point of gathering the data.

So they seem clear about the difference between content and meta, and that
metadata will identify people.

They're less clear about the further de-anonymisation aspects of "just"
metadata, and it's hard to know if that's because they don't know or don't
care.

------
drallison
This paper will be read by two of the authors in the Stanford EE Computer
Systems Colloquium, EE380,
[http://ee380.stanford.edu](http://ee380.stanford.edu). EE380 is a public
lecture--anyone is welcome to attend or watch the live stream video. The talk
video will be posted to YouTube the day following the presentation. For
details, see the announcement at
[http://ee380.stanford.edu/Abstracts/160518.html](http://ee380.stanford.edu/Abstracts/160518.html).

------
zipwitch
There was an interesting and informative blog post along similar lines (that I
think was linked here not that long ago), titled "Using Metadata to find Paul
Revere".

[https://kieranhealy.org/blog/archives/2013/06/09/using-
metad...](https://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-
find-paul-revere/)
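
The core trick in that post, as I recall it, is turning a person-by-group membership table into a person-by-person network. A rough sketch with made-up people and groups (not the post's data or code):

```python
from collections import Counter
from itertools import combinations

def shared_memberships(memberships):
    """Count, for each pair of people, how many groups they have in common."""
    by_group = {}
    for person, groups in memberships.items():
        for group in groups:
            by_group.setdefault(group, []).append(person)
    pairs = Counter()
    for members in by_group.values():
        for a, b in combinations(sorted(members), 2):
            pairs[(a, b)] += 1
    return pairs

def connection_counts(pairs):
    """Number of distinct people each person is linked to."""
    links = Counter()
    for a, b in pairs:
        links[a] += 1
        links[b] += 1
    return links

# Hypothetical membership lists: who belongs to which group, nothing else.
memberships = {
    "p1": {"g1", "g2", "g3"},
    "p2": {"g1"},
    "p3": {"g2"},
    "p4": {"g3"},
}

pairs = shared_memberships(memberships)
links = connection_counts(pairs)
print(links.most_common(1))
```

Nobody's conversations are read, yet the person bridging every group stands out immediately.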

------
lucb1e
You can draw social graphs from who calls who, just like Facebook can? No
shit. (And Facebook is scary right?)

You can find a person's city of residence in 57% of the cases? That's pretty
bad, I'd almost feel oddly relieved my data is saying so little, but I'm
afraid the NSA would do a better job.

You can predict who is pregnant? And who owns a rifle? Alright now we are
getting somewhere. The article didn't mention how many cases succeed here so
I'm not sure if I should be impressed, since the rifle hotline or a licensing
agency thing (I don't know how that works) would be pretty obvious.

------
asdf333
There are studies like this from 20-30 years ago using medical data. It
isn't new, but it's good that people are aware of it.

------
beefsack
I feel the media has taken a useful word in the technology world and made it
almost useless for general usage. I'm scared to even mention "metadata" to
people even in relevant technical context as the word has become politicised
and loaded, just like I can't call myself a "hacker" any more.

------
alexchantavy
> In combination with independent reviews that have found bulk metadata
> surveillance to be an ineffective intelligence strategy, our findings should
> give policymakers pause when authorizing such programs.

If metadata has such power, why do they say that it is an "ineffective
intelligence strategy"?

~~~
jmcgough
I think they mean that it's not as useful in discovering and preventing
unknown threats, but it's great for tracking a particular person (so it's a
better tool for surveillance of citizens).

