

Cynthia Dwork and the Brilliant Idea of Differential Privacy - wglb
https://rjlipton.wordpress.com/2015/02/06/cynthia-dwork-and-a-brilliant-idea/

======
kordless
The primary implication is that your use of centralized services will
eventually lead to information leakage of data you didn't intend to share with
anyone else. Remember this?
[http://en.wikipedia.org/wiki/AOL_search_data_leak](http://en.wikipedia.org/wiki/AOL_search_data_leak)

As our digital footprints expand, so does the probability that we'll leak data
we didn't expect to. One solution to this problem would be self-hosted SaaS
services. Keep your data where it belongs: at home.

~~~
schoen
I think Dwork's account is that the continued use of services _without careful
statistical analysis and addition of noise_ will lead to information leakage.
She didn't seem to argue for or against decentralization, but rather that
there is no such thing as a non-personally-identifiable fact or a non-privacy-
sensitive fact (eventually _every_ fact might lead to some inference that a
data subject didn't want to be drawn, in some hypothetical situation).

------
otakucode
I read a paper on ArXiv.org several years ago, at least 5 or 6 I would guess,
that used these ideas to come up with a process by which you can modify a
dataset such that any statistics gathered could not possibly be used to de-
anonymize any participant. I have since frequently wondered why no one was
aware of or using these techniques. Is it just that there is a lack of a
usable implementation?

~~~
schoen
There are some implementations, like PINQ:

[http://research.microsoft.com/en-us/projects/PINQ/](http://research.microsoft.com/en-us/projects/PINQ/)

One difficulty is that the modification isn't necessarily a one-time thing
that then allows the dataset to be _published_. In many of Dwork's proposed
applications, the dataset is maintained by a trusted party and gets modified
on-the-fly by adding appropriate noise in response to particular database
queries.

So for most of these, you can't just imagine "do this to the database, and
then you can publish it on your web site"; it has to be interactive.
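
As a toy illustration of that interactive shape (just a sketch; the class and
method names are made up, this isn't PINQ's actual API):

    import numpy as np

    class TrustedCurator:
        # The raw records never leave this object; there's deliberately no
        # method that returns or publishes them, only per-query noisy answers.
        def __init__(self, records, epsilon=0.1):
            self._records = records
            self._epsilon = epsilon

        def noisy_count(self, predicate):
            # Answer a single counting query, adding Laplace noise on the fly.
            true_count = sum(1 for r in self._records if predicate(r))
            return true_count + np.random.laplace(0.0, 1.0 / self._epsilon)

An analyst can ask for noisy_count(lambda r: r["age"] > 65), say, but never
sees the rows themselves.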

In that case, there just may not be that many situations (yet?) where people
agree on who can and can't be trusted with the personal data, and want to
cooperate to set up this kind of relationship.

------
wglb
With a follow-up article: [https://rjlipton.wordpress.com/2015/02/07/still-a-brilliant-idea/](https://rjlipton.wordpress.com/2015/02/07/still-a-brilliant-idea/)

------
schoen
I went to Dwork's talk about differential privacy at the Berkeley School of
Information last week. It was a very good presentation.

The motivation was that we don't know how to anonymize real data sets
(particularly in the presence of potentially unanticipated auxiliary
information, such as a related database that we didn't realize existed, or
related statements by people about themselves or about populations) so that
they can't be deanonymized. For example, taking people's names, addresses,
etc., out of a database isn't good enough because the database still reflects
their demographic or biographical history, and there may be only one person in
a particular context who has that history, or you may just be able to match up
the records in the database with records in a different non-anonymized
database or one that contains other fields.

Therefore, Dwork defines a probabilistic notion of privacy loss, which bounds
how much an adversary can learn about the people in the database (I think the
basic case is that an individual's presence or absence in the database should
be nearly indistinguishable to the adversary, although that's not a precise
translation of the rule).
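
The precise statement (the standard definition of epsilon-differential
privacy, filling in the paraphrase above) is that a randomized mechanism M
satisfies the bound if, for any two databases D and D' differing in one
person's record, and any set S of possible outputs,

    \Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]

With small epsilon, the output distribution is nearly the same whether or not
any particular person's record is present, which is the "presence or absence
is hard to distinguish" idea.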

The actual implementations of this idea generally involve adding noise, for
example by deliberately introducing noise in the sampling or survey process
that created the original database, or by having the trusted party that
controls the database add noise in response to particular queries from
untrusted users. Then it's possible to make probabilistic arguments about how
much noise, of what kind, is sufficient to satisfy the probability bound.
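
Concretely, the best-known query-time mechanism adds Laplace noise scaled to
the query's sensitivity (how much one person's record can change the answer)
divided by epsilon. A minimal sketch, with made-up names:

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon):
        # Laplace noise with scale sensitivity/epsilon is enough to satisfy
        # the epsilon bound. A counting query has sensitivity 1, since adding
        # or removing one person changes the count by at most 1.
        return true_answer + np.random.laplace(0.0, sensitivity / epsilon)

    # e.g. laplace_mechanism(1234, sensitivity=1, epsilon=0.5)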

As a result, people can analyze the datasets to find population statistics
that are of interest to them, but with a low probability of learning things
about individuals in the database, whether deliberately or accidentally.

An interesting example Dwork gave of a mechanism proven to satisfy one kind
of differential privacy bound (although it was invented much earlier) was a
technique used by researchers who were studying sexual
behavior. They gave people a spinner and asked them to use it (privately) and
answer the question truthfully if the spinner came up one way, and just to say
"yes" if it came up the other way.

[https://en.wikipedia.org/wiki/Randomized_response](https://en.wikipedia.org/wiki/Randomized_response)

That gave people a certain level of deniability (you could actually choose the
level through the design of the spinner), so they had less reason to worry
about being personally associated with their answers. As a result, researchers
concluded that survey participants would be more willing to give honest
answers to embarrassing questions (when the spinner told them to). But the
resulting answers still had a clearly understood relationship to the truth, so
you could calculate some population statistics accurately (mostly the true
prevalence of "yes" answers in the population). Dwork explained
that this can meet the differential privacy bound, although most of her own
research has been on adding noise at query time rather than at collection
time.
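
A small sketch of that spinner mechanism and the de-biasing step, assuming the
spinner gives each branch probability 1/2 (a real spinner could weight it
differently):

    import random

    def spinner_answer(truthful_answer):
        # With probability 1/2, answer honestly; otherwise just say "yes".
        # Which branch the spinner landed on stays private to the respondent.
        if random.random() < 0.5:
            return truthful_answer
        return True

    def estimate_yes_rate(responses):
        # If p is the true fraction of "yes" people, the observed "yes" rate
        # is 0.5 * p + 0.5, so p = 2 * observed - 1.
        observed = sum(responses) / len(responses)
        return 2 * observed - 1

Each respondent can plausibly blame the spinner for a "yes", but over many
respondents the true rate is recoverable.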

In that case, you get a "privacy budget" which gets "spent" as researchers
query the database and examine it in various ways.
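
A minimal sketch of that accounting, assuming the simple composition rule
where the epsilons of answered queries just add up (the class here is
hypothetical):

    class PrivacyBudget:
        # Sequential composition: each answered query spends its epsilon, and
        # once the total budget is exhausted no further queries are answered.
        def __init__(self, total_epsilon):
            self.remaining = total_epsilon

        def spend(self, epsilon):
            if epsilon > self.remaining:
                raise RuntimeError("privacy budget exhausted")
            self.remaining -= epsilon

    # budget = PrivacyBudget(1.0)
    # budget.spend(0.25)  # one query answered with epsilon = 0.25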

