
Differential Privacy - sr2
https://privacytools.seas.harvard.edu/differential-privacy
======
eddyg
This[0] video from Apple's WWDC gives a nice overview of how Differential
Privacy is being used in iOS. Basically, Apple can collect and store its
users’ data in a format that lets it glean useful info about what people do,
say, like and want. But it _can't_ extract anything about a single specific
one of those people that might represent a privacy violation. And neither can
hackers or intelligence agencies.

[0]
[https://developer.apple.com/videos/play/wwdc2016/709/?time=812](https://developer.apple.com/videos/play/wwdc2016/709/?time=812)
(the "Transcript" tab has the text of the video if you want to read instead of
watch.)

~~~
devsquid
It's cool that they are using DP for some analytics. But it's not quite the
holy grail that Apple and its fans have been selling it as: any analytics
campaign using DP will eventually either average out to pure noise or stop
being anonymous, because the privacy budget that keeps it private is finite.

Here's a great interview with the Microsoft researcher who invented the technique:
[http://www.sciencefriday.com/segments/crowdsourcing-data-while-keeping-yours-private/](http://www.sciencefriday.com/segments/crowdsourcing-data-while-keeping-yours-private/)

One of the quotes I always liked from it is "any overly accurate estimates of
too many statistics is blatantly non-private"
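
(To make that concrete: under the basic composition theorem, epsilons add up,
so a campaign that keeps asking questions under a fixed total budget has to
add more noise to each answer. A rough sketch -- the budget and sensitivity
values below are made up purely for illustration:)

```python
total_epsilon = 1.0   # overall privacy budget for the whole campaign (made up)
sensitivity = 1.0     # how much one user can change each counting query

for num_queries in (1, 10, 100, 1000):
    # Basic composition: k queries at eps/k each cost eps in total, so the
    # Laplace noise needed per answer grows linearly with k.
    eps_per_query = total_epsilon / num_queries
    noise_scale = sensitivity / eps_per_query
    print(f"{num_queries:4d} queries -> Laplace noise scale {noise_scale:.0f} per answer")
```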

------
JoachimSchipper
I like [https://blog.cryptographyengineering.com/2016/06/15/what-is-differential-privacy/](https://blog.cryptographyengineering.com/2016/06/15/what-is-differential-privacy/)
as an introduction.

Differential privacy is cool. However, I looked at Google's RAPPOR algorithm
(deployed in Chrome, and clearly designed with real-world considerations in
mind) in some depth, and I found that RAPPOR needs millions to billions of
measurements to become useful, even while exposing users to potentially
serious security risks (epsilon = ln(3), so "bad things become at most 3x more
likely"). Much better than doing nothing, but we'll continue to need non-
cryptographic solutions (NDA's etc.) for many cases.
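
(For a sense of why the measurement counts get so large: here is a minimal
randomized-response sketch -- the core trick behind RAPPOR's reports, though
RAPPOR itself adds Bloom filters and a two-stage randomization, so this is
only an illustration. With epsilon = ln(3), each user tells the truth only
3/4 of the time, so any single report is nearly worthless and only very large
populations give stable estimates.)

```python
import numpy as np

EPSILON = np.log(3)  # "bad things become at most 3x more likely"

def randomized_response(true_bits, epsilon=EPSILON):
    """Each user reports truthfully with probability e^eps / (1 + e^eps)
    (0.75 for eps = ln 3) and reports the opposite bit otherwise."""
    p_truth = np.exp(epsilon) / (1 + np.exp(epsilon))
    flip = np.random.rand(len(true_bits)) > p_truth
    return np.where(flip, 1 - true_bits, true_bits)

def estimate_rate(reports, epsilon=EPSILON):
    """Debias the observed report rate to estimate the true population rate."""
    p_truth = np.exp(epsilon) / (1 + np.exp(epsilon))
    return (reports.mean() - (1 - p_truth)) / (2 * p_truth - 1)

# 1,000,000 users, 10% of whom truly have the sensitive property.
true_bits = (np.random.rand(1_000_000) < 0.10).astype(int)
reports = randomized_response(true_bits)
print(estimate_rate(reports))  # ~0.10, yet any single report reveals almost nothing
```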

------
BucketSort
The coolest part about differential privacy is its guarantees about
overfitting.

~~~
AdamSC1
Oh I hadn't considered the statistical advantage here.

You do lose out on a lot of human bias in the research process, but you also
create blind errors that are hard to validate.

I know in my own work there are plenty of times I run an analysis and then go
back and manually check some entries as a sanity check - pros and cons here!

~~~
samscully
The thresholdout method [0] for preventing overfitting on a test set is an
interesting application of this.

Here's a talk on differential privacy applied to the overfitting problem [1].

[0]
[http://andyljones.tumblr.com/post/127547085623/holdout-reuse](http://andyljones.tumblr.com/post/127547085623/holdout-reuse)

[1]
[https://www.youtube.com/watch?v=9mqXjdnZA18](https://www.youtube.com/watch?v=9mqXjdnZA18)
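
(Roughly, and as a sketch only -- the threshold and noise constants below are
illustrative, not the paper's exact parameters -- Thresholdout answers
adaptive queries from the training set and only pays a noisy look at the
holdout when the two visibly disagree:)

```python
import numpy as np

def make_thresholdout(train, holdout, threshold=0.04, sigma=0.01):
    """Sketch of the Thresholdout / reusable-holdout idea from [0]."""
    def query(phi):
        train_val = np.mean([phi(x) for x in train])
        holdout_val = np.mean([phi(x) for x in holdout])
        # Only consult the holdout when the training estimate has drifted
        # noticeably (beyond a noisy threshold) away from the holdout estimate.
        if abs(train_val - holdout_val) > threshold + np.random.laplace(0, sigma):
            return holdout_val + np.random.laplace(0, sigma)
        return train_val
    return query

# Usage: q = make_thresholdout(train_rows, holdout_rows); q(lambda row: row[0] > 0)
```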

------
jey
I think this is the canonical review article:
[https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf](https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf)

(No, I haven't read it...)

~~~
habosa
Aaron Roth was my professor at Penn. He's definitely the expert on
differential privacy. Fun fact: his dad won the Nobel Prize in Economics a few
years ago.

------
cjbprime
I don't like differential privacy very much.

Take GPS data, for example: NYC has released a taxicab dataset showing the
"anonymized" location of every pickup and dropoff.

This is bad for privacy. One attack is that now if you know when and where
someone got in a cab (perhaps because you were with them when they got in),
you can find out if they were telling the truth to you about where they were
going -- if there are no hits in the dataset showing a trip from the starting
location that you know to the ending location that they claimed, then they
didn't go where they said they did.

Differential privacy researchers claim to help fix these problems by making
the data less granular, so that you can't unmask specific riders: blurring the
datapoints so that each location is at a city block's resolution, say. But
that doesn't help in this case -- if no-one near the starting location you
know went to the claimed destination, blurring doesn't help to fix the
information leak. You didn't _need_ to unmask a specific rider to disprove a
claim about the destination of a trip.

I think that flaws like these mean that we should just say that GPS trip data
is "un-de-identifiable". I suspect the same is true for all sorts of other
data. For example, Y chromosomes are inherited the same way that surnames
often are, meaning that you can make a good guess at the surname of a given
"deidentified" DNA sequence, and thus unmask its owner from a candidate pool,
given a genetic ancestry database of the type that companies are rapidly
building.

~~~
obastani
The attack you suggest is ruled out by differential privacy. The precise
guarantee is a bit complicated. The first thing to note is that the output of
a differentially private mechanism must be random. Then, the guarantee is that
Pr[output] changes by at most a small multiplicative factor (e^epsilon)
whether or not you are included in the dataset. In other words, even if you
were omitted from the dataset, the chance that the algorithm produces any
given result is almost unchanged.

This definition rules out the attack you suggest. In particular, if you are
removed from the dataset, then the probability of the output (i.e., a ride
starts in the region) goes from very large to very small. Therefore, the
algorithm you describe (i.e., adding noise to the start location) is not
actually differentially private.

The confusion arises because oftentimes adding noise is sufficient. For
example, the average of n real numbers in [0,1] is affected by at most 1/(n-1)
if you delete one point from the dataset. Therefore, you can just add a little
bit of noise and the released average becomes differentially private.
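
(A minimal sketch of that case, with an arbitrary epsilon: since removing one
record moves the average by at most 1/(n-1), Laplace noise with scale
(1/(n-1))/epsilon suffices.)

```python
import numpy as np

def dp_average(values, epsilon):
    """Laplace mechanism for the mean of values in [0, 1].
    Removing one record changes the mean by at most 1/(n-1), so noise
    scaled to that sensitivity hides any individual's contribution."""
    n = len(values)
    sensitivity = 1.0 / (n - 1)
    return float(np.mean(values)) + np.random.laplace(0, sensitivity / epsilon)

print(dp_average(np.random.rand(10_000), epsilon=0.1))
```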

For the dataset you describe, a sibling comment proposed the correct mechanism
-- you have to add noise to the count returned by the query, not the start
location. (Technically I think you could just add noise to the start location
like you propose, but the amount of noise would have to be large enough that
all the start locations overlap by a sufficient amount.)

~~~
cjbprime
Thank you! Makes sense.

------
projectramo
At one point, I knew someone who wanted to give money to a large medical
organization so that it could show its patients the tradeoffs between various
interventions (efficacy vs. side effects).

The money was going to be donated to build an app that belonged to the
institution.

The institution would not let its own researchers publish the data in the app
even though it was anonymous. They didn't want to take the risk.

It would be great if this led to accepted protocols that made it so that
people didn't have to think about it: "Oh yeah, we'll share it using DP," and
then people could move ahead using the data.

------
rectang
Shades of the AOL search data leak:

[https://en.wikipedia.org/wiki/AOL_search_data_leak](https://en.wikipedia.org/wiki/AOL_search_data_leak)

 _Of course_ personally identifiable information will be extracted despite
this model. "Differential Privacy" is cynical academic malpractice -- selling
a reputation so that when individuals are harmed in the course of commercial
exploitation of the purportedly anonymized data, the organizations that
profited can avoid being held responsible.

We never learn, because there is money to be made if we pretend that
anonymization works.

~~~
AdamSC1
To be clear, I think you're right that no tracking is better than trying to
protect data.

However, it's important to understand that 'anonymization' is very different
from the practice of "Differential Privacy."

I'm no expert but here is how I understand it as a simplified example:

Imagine your information is stored in a spreadsheet. It is storing your
weight, height, zipcode, age and name.

The 'anonymized' spreadsheet would still have a unique row dedicated to you
(just like the original), and it may replace your name with an ID number or an
encrypted string. Now, just like in the AOL data leak, that information being
stored as a single line item is still easy to backtrack, as there is likely no
one else with your weight, height, and age combination in your zipcode. So a
hacker can identify a single person.

Differential Privacy would store information differently, perhaps in separate
spreadsheets: one that is a list of heights, one that is a list of weights,
and so on. No two spreadsheets would store the information in the same order
(#3 on the height list would not be #3 on the weight list), and they may even
contain some incorrect dummy information.

There would, however, be some algorithmic relation that allows a system to
produce outputs in which the data carries meaningful information (trends,
means, standard deviations, etc.) but cannot be back-tracked to identify any
single unique row.

Differential Privacy allows us to see the trend "Males age 45 are taller on
average than Females age 45" but not say "User #155083 is age 45, weighs
195lbs, and lives in zipcode 10001"

That's a big difference in privacy, and while it isn't perfect, it is a step
in the right direction. While I wish more companies would adopt a no-data
policy, it is at least better that they are as responsible as they can be with
the data they do have.
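
(For what it's worth, the standard way to get exactly that kind of output in
practice is not shuffled per-column spreadsheets but noisy aggregates. A
hedged sketch -- the clipping bound, epsilon split, and example data below are
all made up for illustration:)

```python
import numpy as np

def dp_group_mean(heights_cm, epsilon, upper=250.0):
    """Release an average height for one group (e.g. males age 45) as a
    noisy sum divided by a noisy count. Values are clipped to [0, upper],
    so one person moves the sum by at most `upper` and the count by 1;
    half the budget goes to each released quantity."""
    clipped = np.clip(heights_cm, 0.0, upper)
    noisy_sum = clipped.sum() + np.random.laplace(0, upper / (epsilon / 2))
    noisy_count = len(clipped) + np.random.laplace(0, 1.0 / (epsilon / 2))
    return noisy_sum / max(noisy_count, 1.0)

# Made-up example data: the group-level trend is visible, no single row is released.
males_45 = np.random.normal(177, 7, size=5000)
females_45 = np.random.normal(163, 7, size=5000)
print(dp_group_mean(males_45, 0.5), dp_group_mean(females_45, 0.5))
```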

~~~
rectang
Such obscuring is vulnerable to side-channel attacks that re-link the record
fragments. For example, misspellings, incidental geographic information,
topics -- any pattern that is unusual and that the modeler did not anticipate
and deliberately obliterate.

What links the AOL fiasco and this one is that both believe they have thought
of everything important. They're wrong -- and there will always be a future
attacker to prove it. You can't fight information theory.

Differential Privacy is an excuse to get around sensible no-data policies --
by making irresponsible promises, it will result in _more_ privacy violations,
not fewer.

~~~
frankmcsherry
> You can't fight information theory.

I totally agree with this statement, but I think you are confused about
differential privacy. Its guarantees _are_ information theoretic
(specifically, a bound on the relative Bayes factors of _any_ conclusion, with
and without _any_ individual record).
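
(For concreteness, that bound in the usual epsilon-DP notation:)

```latex
% epsilon-DP: for neighboring datasets D, D' (differing in one record)
% and every output o,  Pr[M(D) = o] <= e^eps * Pr[M(D') = o].
% So for any observer and any conclusion, the posterior odds between
% "the record was included" and "it was not" move by at most e^eps:
\[
  \frac{\Pr[D \mid o]}{\Pr[D' \mid o]}
  \;=\; \frac{\Pr[o \mid D]}{\Pr[o \mid D']}\cdot\frac{\Pr[D]}{\Pr[D']}
  \;\le\; e^{\varepsilon}\,\frac{\Pr[D]}{\Pr[D']}.
\]
```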

You are obviously welcome to be skeptical, but much of what you've posted so
far is not correct.

~~~
rectang
Of course I don't dispute the math. I maintain that these guarantees will not
be achieved in practice because they rely on impossibly airtight
implementation and impossibly omniscient modeling.

~~~
frankmcsherry
Interesting. Do you have similar concerns about cryptography?

Edit: to more strongly bind 'similar': would you also say of cryptography that
it is "cynical academic malpractice"?

~~~
rectang
I fully expect privacy disasters based on imperfect implementations of
Differential Privacy. Do you not? Do the researchers not?

The superior alternative is to _avoid sharing sensitive datasets_ and _avoid
keeping data whenever possible_. No such alternative exists for many
applications of cryptography.

But we live in an era where organizations find our private data impossibly
tempting and are content to sacrifice the rights of individuals so long as
they can't fight back. This research gives such entities the excuse to build
tools that should not be built and publish data that should not be published.
By saying "OK _now_ it's safe (if you did everything right)" rather than
"don't do that", it is the enabler of future privacy fiascos.

The answer, if there is one, is probably legislative: hold entities criminally
liable for data breaches. Should such legislation pass, I wonder how much
interest in this research would wane.

~~~
frankmcsherry
If you want to rip into Apple or Google or Uber for claiming they should have
a pass for using privacy tech, feel free. Understand that this is distinct
from most research on differential privacy.

The US Census collects demographic data about as much of the population as
they can manage, and releases summary data in a large part to support
enforcement of the Civil Rights Act. They have a privacy mandate, but also the
obligation to provide information in support of the rights of subpopulations
(e.g. Equal Protection). So what's your answer here? A large fraction of the
population gets disenfranchised if you go with "avoid sharing the datasets".

You end up with similar issues in preventative medicine, epidemiology, public
health, where there is a real social benefit to analyzing data, and where
withholding data has a cost that hasn't shown up yet in your analysis.
Understanding the trade-off is important, and one can come to different
conclusions when the subjects are civil rights versus cell phone statistics.
But you are wrong to be upset that math allows the trade-off to exist.

~~~
rectang
"Privacy tech" is a perverse description, since this tech's existence results
in a net _loss_ of privacy -- without it, the data-sharing applications it
powers would be more obviously irresponsible and more conservative decisions
would be forced. A less Orwellian name would be "Anonymization tech".

If it were possible to wish away this tech, I absolutely would -- just like I
would wish away advanced weapons technology if I could. In our networked era,
the private data of individuals is being captured and abused at an
unprecedented, accelerating rate, and whatever good this tech does cannot
begin to make up for its role in facilitating and excusing that abuse.

