
Differential Privacy for Dummies (2016) - Tomte
https://robertovitillo.com/2016/07/29/differential-privacy-for-dummies/
======
lalaland1125
One very interesting thing about differential privacy is its connection to
generalizability in machine learning. In particular, a model can only provide
differential privacy if that model is completely generalizable.
Generalizability is the ability of a model to do well on data outside its
training set. Models that fail to generalize violate differential privacy
because you can find out whether someone is in the training set simply by
measuring the error rate on that person: a low error rate is evidence that
that particular person was in the training set.

This connection is interesting because it implies that differential privacy
techniques applied to machine learning must in turn result in more
generalizable models, which is very useful in general. This can also provide a
simple "upper bound" on how differentially private your model is: simply
compare the performance on the training and test sets and see how much they
differ.

~~~
maksimum
> This can also provide a simple "upper bound" on how differentially private
> your model is.

This is a very interesting idea. Are you aware of any resources that go
through the mechanics?

~~~
frankmcsherry
I think what they are getting at is the following, which isn't a proof or
anything (if you are looking for rules to follow, the much simpler "your model
isn't differentially private" is probably even more accurate):

Let X be a set of iid samples drawn from a distribution D, let M be a model
trained with eps-DP on the set X, and let x' be a fresh sample from D.

Let's think about the random variable M(x'), which represents your test loss
or whatever. Differential privacy says that the density of this random
variable is pointwise within a factor of exp(eps) of the density of M'(x'),
where M' is the random model resulting from training on X \cup { x' }. But
M'(x') is distributed over a random x', a random X, and a model trained on
both, just like your training loss.

You can push linearity of expectations through all of this to show that your
expected loss on test and training should vary by at most a factor of
exp(eps), but I'm not aware of a pointwise test for a specific model (and it
may not make sense, as DP isn't a property of specific models, but rather
distributions of them).
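
As a toy numerical check of that pointwise exp(eps) claim, here is the Laplace
mechanism on a counting query instead of a trained model (the constants and
the scipy usage are illustrative assumptions on my part):

    import numpy as np
    from scipy.stats import laplace
    
    eps = 0.5
    zs = np.linspace(90, 110, 201)
    p_X = laplace.pdf(zs, loc=100, scale=1/eps)   # output density on X (count 100)
    p_Xp = laplace.pdf(zs, loc=101, scale=1/eps)  # output density on X u {x'} (count 101)
    
    # The two densities differ pointwise by at most a factor of exp(eps).
    print(np.max(p_X / p_Xp) <= np.exp(eps) + 1e-9)  # True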

Edit: you can read a bit more about a special case of this at
https://windowsontheory.org/2014/02/04/differential-privacy-for-measure-concentration/

~~~
maksimum
Thanks, using DP to bound deviations of arbitrary functions of a model is a
neat idea.

I wonder if it makes sense to go from generalization error to an estimate of
something similar in spirit to (but not as strong as) differential privacy, as
the top-level comment suggested?

For example I want to empirically argue that a particular function of my GMM,
let's say the log-likelihood of x_i, is "private." To do so I form C iid
train/test splits. For each split I estimate the density of the likelihood for
train and test samples, and estimate an upper bound on their ratio. As a
result I get C samples of epsilon, and I can use some kind of simple bound
(Chebyshev?) on the probability of epsilon being within an interval.
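
Concretely, something like the following sketch might work, assuming
scikit-learn's GaussianMixture, a 1-D dataset, and crude histogram density
estimates (all illustrative choices on my part, not part of the proposal
itself):

    import numpy as np
    from sklearn.mixture import GaussianMixture
    
    def empirical_epsilon_samples(data, C=20, n_components=3, bins=30, seed=0):
        """C rough estimates of max |log(p_train(ll) / p_test(ll))| for the GMM log-likelihood."""
        rng = np.random.default_rng(seed)
        eps_samples = []
        for _ in range(C):
            # Random half/half train/test split.
            idx = rng.permutation(len(data))
            half = len(data) // 2
            train, test = data[idx[:half]], data[idx[half:]]
    
            gmm = GaussianMixture(n_components=n_components).fit(train.reshape(-1, 1))
            ll_train = gmm.score_samples(train.reshape(-1, 1))
            ll_test = gmm.score_samples(test.reshape(-1, 1))
    
            # Histogram density estimates of the per-sample log-likelihood on a shared grid.
            edges = np.histogram_bin_edges(np.concatenate([ll_train, ll_test]), bins=bins)
            p_train, _ = np.histogram(ll_train, bins=edges, density=True)
            p_test, _ = np.histogram(ll_test, bins=edges, density=True)
    
            # Crude upper bound on the log density ratio where both estimates are nonzero.
            mask = (p_train > 0) & (p_test > 0)
            eps_samples.append(np.max(np.abs(np.log(p_train[mask] / p_test[mask]))))
        return np.array(eps_samples)

The C values it returns could then feed the Chebyshev-style interval argument.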

The idea is that we already have some "privacy" [1] from the data-sampling
distribution. So we don't necessarily need to add noise to our algorithm. And
it would be interesting to measure this privacy (at least for a particular
function of the model) empirically.

[1] http://ieeexplore.ieee.org/abstract/document/6686180/

------
chrispeel
An upcoming talk [1] entitled "From Differential Privacy to Generative
Adversarial Privacy" implies we can do better than DP:

 _...Our results also show that the strong privacy guarantees of differential
privacy often come at a significant loss in utility._

 _The second part of my talk is motivated by the following question: can we
exploit data statistics to achieve a better privacy-utility tradeoff? To
address this question, I will present a novel context-aware notion of privacy
called generative adversarial privacy (GAP). GAP leverages recent advancements
in generative adversarial networks (GANs) to arrive to a unified framework for
data-driven privacy that has deep game-theoretic and information-theoretic
roots. I will conclude my talk by showcasing the performance of GAP on real
life datasets._

[1] https://www.eventbrite.com/e/from-differential-privacy-to-generative-adversarial-privacy-tickets-42737831003#/

------
bo1024
Looks like a very nice introductory article. One warning is that implementing
DP algorithms naively can lead to privacy violations. For example, the python
code in the article:

    
    
        from numpy.random import laplace  # the numpy import the snippet relies on

        def laplace_mechanism(data, f, eps):
            return f(data) + laplace(0, 1.0/eps)
    

This won't actually be private, because floating-point numbers are different
from real numbers. Real DP projects deal with this more carefully.

~~~
maksimum
I've seen this limitation brought up before, but I don't have an intuitive
understanding of when it would apply. Is the essential issue spacing between
floats?

The scenario I'm thinking of is that each user's entry is a real number from
U[1e15, 1e15+1], and we want to output the sum. The sensitivity is still 1. So
the Laplace Mechanism outputs something like sum(DB) + laplace(0, 1/eps), which
will be exactly sum(DB) because laplace(0, 1/eps) is smaller than the gap
between floats >= 1e15.

For this to be an issue, the actual sum(DB) needs to be computed in full
precision, right? Otherwise it seems like the sensitivity isn't necessarily 1.
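
As a quick numerical check of that intuition, with a toy database of 1000 such
entries (so the sum lands near 1e18, where adjacent doubles are about 128
apart) and eps = 1, both of which are just illustrative values:

    import numpy as np
    
    eps = 1.0
    db = np.random.uniform(1e15, 1e15 + 1, size=1000)
    true_sum = float(np.sum(db))    # around 1e18
    
    print(np.spacing(true_sum))     # ~128: gap between adjacent doubles at this magnitude
    noisy = true_sum + np.random.laplace(0, 1.0 / eps)
    print(noisy == true_sum)        # almost always True: the noise is rounded away entirely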

~~~
frankmcsherry
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.366.5957

------
motohagiography
We do need a way to get sensitive data sets to researchers; there is a lot of
public good and discovery to be had when we do.

Cynically, when we were finding that de-identification of data was neither
cryptographically secure nor a viable solution for providing data sets to
researchers who couldn't be trusted to protect them, a nagging voice would
suggest that k-anonymity, and now DP, were ways to obfuscate the distribution
of risk in the solutions through layered niche abstractions, and to use the
ensuing confusion to get the data toothpaste out of the tube.

Today, I like the idea that DP provides a set of information-theoretic
criteria for candidate algorithms for protecting privacy and anonymity. It
also appears to provide practical tools for technologists to reason about
information theory problems in their day to day work.

It's hard to imagine it was acceptable to say, "sure, you can do
cancer/autism/health research, but first we need a viable generalized
homomorphic encryption solution before we share our data with you," but that
was the state of health information security until quite recently.

The need for a class of non-cryptological (i.e. not certified algos) and
non-zero-sum information privacy protection tools reflects how we actually use
data today.

I am still wary that project types will wave DP around like a talisman to ward
off security analysts threatening their deadlines, but that's not DP's fault.

There is a joke in here about how the response of security people to DP and
privacy is often snark, but I will leave that as an exercise for the reader.

~~~
maksimum
FHE and DP aren't concerned with the same definition of privacy. FHE aims to
control exactly what is revealed; DP aims to bound what can be inferred from
what is revealed.

~~~
motohagiography
Indeed, and I think the definition of privacy that security technologists
were applying to health data (in my example) implied FHE or other solutions
that were out of reach, which acted as a barrier to doing important research.

It was worth emphasizing that these are very different concepts. I associated
them because they are posed as potential solutions in the same business
problem domains.

------
ec109685
I wonder if Differential Privacy will be a solution to GDPR restrictions. If
the user’s data is perturbed enough that it can’t be used to identify the
user, are some of the restrictions the law places on its use relaxed?

~~~
FlyingLawnmower
Of all privacy-preserving approaches, I believe Differential Privacy has the
strongest chance of meeting GDPR standards.

------
amelius
> Differential privacy formalizes the idea that a query should not reveal
> whether any one person is present in a dataset, much less what their data
> are.

I see one problem here: you almost never know all the possible queries in
advance.

