
Google's differential privacy library - simonpure
https://github.com/google/differential-privacy
======
mattb314
If you're new to differential privacy and looking for an introduction, I
highly recommend the Dwork and Roth book, especially the first three chapters:
[https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf](https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf)

Frank McSherry also has some good resources if you enjoy his writing style:
[https://github.com/frankmcsherry/blog/blob/master/posts/2016...](https://github.com/frankmcsherry/blog/blob/master/posts/2016-02-03.md)

In particular, I think it's important to keep in mind that differential
privacy is as much about establishing a framework for measuring information
leakage as it is about coming up with clever algorithms to preserve privacy
(although there are a lot of clever algorithms). I think of it as more
analogous to big-O notation (a way of measuring) than to dynamic programming
(an implementation technique).
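
To make the "measuring" part concrete, here's a minimal sketch in plain Python
(not this library) of what epsilon quantifies: how much adding or removing one
person's record can shift the distribution over any released answer.

    # Minimal sketch (not from the library): epsilon as a "measure" of leakage
    # for a noisy count over two neighboring databases.
    import math, random

    def noisy_count(values, epsilon):
        # Laplace mechanism: a count has sensitivity 1, so the noise scale is 1/epsilon.
        u = random.random() - 0.5
        noise = -(1.0 / epsilon) * math.copysign(1, u) * math.log(1 - 2 * abs(u))
        return sum(values) + noise

    # Two databases differing in exactly one record ("neighbors").
    db_with_me = [1, 0, 1, 1, 0, 1]
    db_without_me = [1, 0, 1, 1, 0]

    # The DP guarantee: for every set S of possible outputs,
    #   Pr[M(db_with_me) in S] <= exp(epsilon) * Pr[M(db_without_me) in S]
    # i.e. epsilon bounds how distinguishable the two worlds can ever be.
    print(noisy_count(db_with_me, epsilon=1.0))
    print(noisy_count(db_without_me, epsilon=1.0))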

~~~
m463
I think the primary focus of differential privacy is that "the spice must
flow". They need to keep collecting and using this data.

~~~
infogulch
I guess. It's more like: figure out how to measure how much spice is flowing.
The resulting knowledge will be a new tool, powerful and morally indifferent,
like all tools. You choose how to use it.

------
ekzhu
Just want to point it out in case people miss it: it has a Postgres extension
you can use directly in your databases.

------
herf
I keep trying to understand if these ideas will be useful for epidemiology.
Right now I think there is a long way to go for multivariate statistics.

It seems that differential privacy can handle one column of data, with some
categorical filters (like "smoker: yes/no"). But epi researchers _have to_ do
multivariate correction for many lifestyle factors. These kinds of corrections
seem very difficult to manage in such a datastore - but if you cannot do them,
you just find some correlation with age or location that isn't what you
intended to find.

In other words, these kinds of "lots of columns at once" queries are really
important to epidemiology, and my impression is that differential privacy is
not so strong here. Anyone have a better impression of what might be possible
in the future?

~~~
sneeuwpopsneeuw
You are missing the point of differential privacy. This is an oversimplified
explanation, but I see it like this (it's also how my professor and his PhD
assistant at my university explained it to us).

Differential privacy provides some simple mathematical foundations for sharing
any database data whatsoever without revealing the data of anyone specific. An
example could be when you store someone's name, birthday and illness.
Differential privacy, in a simplified way, says: a name is a direct link to a
person, so remove it. A birthday could potentially be used to link to a
person, but not directly, so replace it with a range, for example an age
between 20 and 30 instead of the specific birthday. The illness is the data
someone else wants, so that stays. Now someone else can get information from
your database without getting to any specific user or person. (There are a lot
of other things that can be done, such as adding random numbers to the result
when you ask for, say, an average age.)
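
A toy sketch of what's described here, in plain Python with hypothetical field
names (not this library). As the reply below notes, dropping names and
bucketing birthdays is really classic generalization; the differentially
private part is the calibrated noise on released aggregates like the average.

    # Toy sketch with made-up records (not this library).
    import math, random

    records = [
        {"name": "Alice", "age": 24, "illness": "flu"},
        {"name": "Bob", "age": 37, "illness": "asthma"},
        {"name": "Carol", "age": 29, "illness": "flu"},
    ]

    def generalize(record):
        # Drop the direct identifier, coarsen the quasi-identifier.
        lo = (record["age"] // 10) * 10
        return {"age_range": f"{lo}-{lo + 9}", "illness": record["illness"]}

    def noisy_average_age(rows, epsilon, age_cap=100):
        # Laplace noise on the released average; ages are clamped to [0, age_cap],
        # so one person can shift the mean by at most age_cap / n.
        n = len(rows)
        avg = sum(min(r["age"], age_cap) for r in rows) / n
        u = random.random() - 0.5
        noise = -(age_cap / (n * epsilon)) * math.copysign(1, u) * math.log(1 - 2 * abs(u))
        return avg + noise

    print([generalize(r) for r in records])
    print(noisy_average_age(records, epsilon=1.0))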

Where this whole thing starts to break down is when it is applied to real
situations. Sure, everything mathematically shows that you cannot get to a
specific user. But when you already have a large amount of data, or there are
multiple such databases, you can quite easily combine them to re-identify
specific users or people in the data. And these types of attacks are already
happening, with people combining large data breaches to find username, email
and password combinations, for example. This way they can, for example, find
out whether you have a pattern in your passwords, such as a base password plus
a specific extra bit at the end.

~~~
dllthomas
As the other poster mentioned, this sounds much more like non-DP
anonymization, which (as you note) is usually surprisingly vulnerable to
deanonymization through various approaches.

With Differential Privacy, you instead add randomness such that you can't tell
whether _the answer you got_ includes any individual person, for whatever
question you're asking.

IIUC, RAPPOR adds that randomness to the original data; Leap Year (where I
worked for a while) adds it to the answers to specific queries. There are huge
tradeoffs and they're suitable for very different settings. I am not sure
which approach is taken here.
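
For a flavor of the first approach, a back-of-the-envelope randomized-response
sketch in plain Python (the idea RAPPOR builds on, not its actual encoding):

    # Local-model noise: each user flips their own bit before it leaves the device.
    import random

    def randomized_response(true_bit, p_truth=0.75):
        # Report the truth with probability p_truth, otherwise report the opposite.
        # This gives epsilon = ln(p_truth / (1 - p_truth)), about 1.1 here.
        return true_bit if random.random() < p_truth else 1 - true_bit

    def debiased_count(reports, p_truth=0.75):
        # The aggregator can still estimate the population total, just not any individual.
        n = len(reports)
        return (sum(reports) - n * (1 - p_truth)) / (2 * p_truth - 1)

    true_bits = [int(random.random() < 0.3) for _ in range(10000)]  # 30% "yes"
    reports = [randomized_response(b) for b in true_bits]
    print(debiased_count(reports))  # lands near 3000, the true total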

Edited to add:

Skimming the docs, it seems to be the latter - ask questions of the exact
data, returning answers that are noisy. This requires ongoing trust of the
entity holding the data (so it's most applicable to circumstances where they'd
have that data regardless), but is much more flexible.
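
In spirit, the latter looks something like this (plain Python, not this
library's actual API): the curator holds the exact data and only ever hands
out noise-perturbed answers.

    # Central-model sketch: exact data stays with a trusted curator.
    import math, random

    def laplace(scale):
        u = random.random() - 0.5
        return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

    class NoisyCurator:
        def __init__(self, rows):
            self.rows = rows  # exact data lives here, hence the ongoing trust

        def query(self, f, sensitivity, epsilon):
            # f can be any aggregate over the rows; callers only see noisy answers.
            return f(self.rows) + laplace(sensitivity / epsilon)

    curator = NoisyCurator([3, 7, 2, 9, 4])
    print(curator.query(sum, sensitivity=10, epsilon=1.0))  # values assumed bounded by 10
    print(curator.query(len, sensitivity=1, epsilon=1.0))   # noisy count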

------
agseward
Related: IBM Research has a differential privacy Python library for machine
learning and analytics

\- [https://www.ibm.com/blogs/research/2020/06/ibm-
differential-...](https://www.ibm.com/blogs/research/2020/06/ibm-differential-
privacy-library-the-single-line-of-code-that-can-protect-your-data/)

\- [https://github.com/IBM/differential-privacy-
library](https://github.com/IBM/differential-privacy-library)

------
drewda
Related: the OpenDP project out of Harvard and sponsored by Microsoft:

\-
[https://projects.iq.harvard.edu/opendp](https://projects.iq.harvard.edu/opendp)

\-
[https://github.com/opendifferentialprivacy/](https://github.com/opendifferentialprivacy/)

------
scott31
How is this different than a VPN?

~~~
dllthomas
Say you have a big pile of data that you want to keep private _even from your
analysts_. Technology in this vein is for you!

------
1vuio0pswjnm7
"In this directory, we give a simple example of how to use the C++
Differential Privacy library.

Zoo Animals

There are around 200 animals at Farmer Fred's zoo. Every day, Farmer Fred
feeds the animals as many carrots as they desire. The animals record how many
carrots they have eaten per day. For this particular day, the number of
carrots eaten can be seen in animals_and_carrots.csv.

At the end of each day, Farmer Fred often asks aggregate questions about how
many carrots everyone ate. For example, he wants to know how many carrots are
eaten each day, so he knows how many to order the next day. The animals are
fearful that Fred will use the data against their best interest. For example,
Fred could get rid of the animals who eat the most carrots!

To protect themselves, the animals decide to use the C++ Differential Privacy
library to aggregate their data before reporting it to Fred. This way, the
animals can control the risk that Fred will identify individuals' data while
maintaining an adequate level of accuracy so that Fred can continue to run the
zoo effectively.

The animals have implemented a CarrotReporter tool in animals_and_carrots.h to
obtain DP aggregate data to report to Fred. We document one of these reports
in report_the_carrots.cc."

Tech companies love to use that line, "We take privacy seriously."

That seriousness is certainly reflected in this example, which appears to
compare users with zoo animals, tended to by a "farmer".

If Fred is anything like Google, he wants this per animal carrot consumption
data for some other reason(s) besides simply ordering more carrots.

This example makes privacy sound like some sort of resource-allocation
problem: what is the minimum number of carrots we must provide in exchange for
the animals' data?

What if the animals are not "fearful that Fred will use the data against their
best interest", but instead know that Fred is using the data for reasons other
than ordering more carrots, profiting from that use and not sharing any of the
profits?

~~~
londons_explore
> profiting from that use and not sharing any of the profits.

Walmart profits when you buy their products. Yet you don't hear anyone
demanding Walmart shares their profits with their customers.

Let's not mix up the user privacy debate with the 'are companies allowed to
make profits and not share them with me' debate.

~~~
1vuio0pswjnm7
Let's not compare Google with Walmart.

The differences far outnumber any similarities.

I was not making a general argument for profit-sharing; I was calling
attention to the idea presumed in the example, that users worry "they will use
my data against my best interest". Obviously they will not use your data
against you in a way that causes measurable injury (damages). If they did, you
could sue them and potentially win. They are not that stupid.

However they may use your data for purposes other than the reason you allowed
them to collect it. They will likely use the data to further _their_ best
interest; they will not tell you exactly how they use it nor will they cause
you any injury. The only claim you potentially have is to the value of your
data, which they utilise in their pursuit of profits.

You might not get a "share of profits", but you could claim the value of the
data they obtained from you. If many users make the same claim, in the
aggregate, that could be a substantial amount of data that carries a
substantial amount of value.

------
justicezyx
I was told by an engineer from Leapyear Technologies
([https://leapyear.io/](https://leapyear.io/)) that this library offers mostly
primitive functionality that lags behind current mainstream practice.

Disclaimer: I'm not an expert in the field.

~~~
ThePhysicist
Applying DP to a simple computation like an average or median isn't that hard;
what's trickier is ensuring reasonable privacy guarantees when allowing
unlimited interactive queries or large-scale sample generation from high-
dimensional data:

You can apply DP to individual datapoints or attributes, but the amount of
noise you then need to add to reach reasonable privacy guarantees is quite
high. Hence it makes more sense to add noise to the result of a computation:
the sensitivity of many practically relevant computations to individual
datapoint values is often small, so the amount of noise required to mask the
contribution of each individual datapoint is also low.

The problem is that a single datapoint can often contribute to the results of
many (sometimes nearly infinitely many) computations, and every DP computation
result you return to a user (or adversary) reduces your privacy budget. There
are some approaches to remedy this, like adding "sticky noise" or remembering
queries to ensure no averaging of noise is possible, but all of them have
their drawbacks. Therefore we still see quite limited use of DP in interactive
data analysis and machine learning, because it is quite hard to strictly
ensure reasonable privacy guarantees in those cases.
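
A toy illustration of the budget problem (plain Python, not any particular
library): repeat the same noisy query often enough and the noise averages
away, which is exactly what the per-query epsilon accounting has to charge for.

    # Independent noise averages out, so each answer spends privacy budget.
    import math, random

    def laplace(scale):
        u = random.random() - 0.5
        return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

    true_answer = 42
    per_query_epsilon = 0.1

    answers = [true_answer + laplace(1.0 / per_query_epsilon) for _ in range(1000)]
    print(sum(answers) / len(answers))  # creeps back toward 42
    print(1000 * per_query_epsilon)     # basic composition: total epsilon spent = 100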

Would be interesting to know if LeapYear has come up with something better,
but they don't seem to have any source code or datasets available for public
scrutiny.

------
ve55
Differential privacy is cool, but does Google actually _use_ any of this
themselves? I'm hard-pressed to remember a time when Google collected much
less than they were allowed to in order to respect my privacy.

Edit: thanks for providing examples, although I do personally note in my
response to them below that I don't believe they're evidence of Google
actually incorporating these algorithms at scale to help users.

~~~
bsimpson
Yes:

\- Gboard

\- Smart Compose

\- Chrome

\- Maps

[https://blog.google/technology/safety-security/privacy-
every...](https://blog.google/technology/safety-security/privacy-everyone-io/)

~~~
ve55
These seem to be cases where Google might be collecting less data in one area,
but where they can easily supplement or cross-reference the data from another
area so as not to actually lose any information.

For example, they mention RAPPOR for Chrome, but with almost all websites
having Google Analytics installed, they clearly have full data on what the
user is doing regardless.

For Google maps it may be used to show how busy a restaurant is, but they
still collect location history, query history, route history, etc from users,
so it's trivial for the data to still be used or reconstructed.

Even _mentioning_ usage of differential privacy for a single feature of Gmail
seems pointless to me when they obviously not only have everyone's full email
history, but also develop countless algorithms to scrape content from emails
(e.g. purchase history from common retailers).

Perhaps I'm too cynical, but at least to me it seems to be a very common
pattern. I'd personally bet that differential privacy techniques that actually
give users notable information-theoretic anonymity are very rarely used by
Google in general. A few usage examples of differential privacy are good, and
better than none, but with a company of their size I don't think it (yet)
makes any real statement.

~~~
throwawaygoog10
You seem to be missing what differential privacy is. It's not about the
collection of data, it's about the _use_ of that data. It's no secret that
Google has an incredible amount of logging data, but the ways we can use it
are very limited. Folks seem to be under the impression that we can
willy-nilly just go ahead and build products that harvest everything about you
and link up the dots across organizations. That's so funny, because it'd make
things so much easier sometimes. :P

Instead, we have very strict privacy rules and experts who review the designs
for the use of this data. If I even want to train an ML model over real data,
I have to have an approved privacy review that shows how privacy is
maintained.

Where I use differential privacy algorithms in my line of work is in ad-hoc
analysis over suggestions placed in front of users. I have dimensions to
aggregate across, but I want to ensure that no one bucket can deanonymize a
user. k-anonymity used to be the thing (e.g. if a bucket has <50 people in it,
that's too few), but even a large bucket can deanonymize users, which is where
differential privacy comes in. I sincerely don't care who the users are, I
just want to know how our features get used to try and save them more time.
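
To illustrate the difference, a hypothetical sketch in plain Python (not any
internal tooling): a k-anonymity threshold simply drops small buckets, while
DP reports every bucket with noise calibrated to one user's maximum
contribution (assumed to be a single event here).

    import math, random
    from collections import Counter

    events = [("search", "clicked"), ("search", "ignored"), ("maps", "clicked")] * 60
    events += [("gmail", "clicked")]  # a rare bucket that could identify someone
    counts = Counter(events)

    # k-anonymity style: drop any bucket with fewer than k contributors.
    k = 50
    thresholded = {bucket: n for bucket, n in counts.items() if n >= k}

    # DP style: report every bucket, but with Laplace noise on the count.
    def laplace(scale):
        u = random.random() - 0.5
        return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

    epsilon = 1.0
    noisy = {bucket: n + laplace(1.0 / epsilon) for bucket, n in counts.items()}

    print(thresholded)  # the rare "gmail" bucket is simply dropped
    print(noisy)        # every bucket reported, but no exact small counts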

Do I have access to the underlying logs? Yes. Can I use them to make
decisions? No. I can, however, use the anonymized data to make decisions, and
even store it longer than the underlying data exists (most logs exist for
<14d).

Differential privacy also makes it possible to train models like SmartCompose
by ensuring that the tokens it trains over are diffuse enough to not point
back to any one person.

> I'd personally bet that differential privacy techniques that actually give
> users notable information-theoretic anonymity are very rarely used by Google
> in general.

For existing things, sure. They did their best, but this is new, reified
research. As existing features are replaced, they're being replaced by
features which use differential privacy techniques.

~~~
ve55
I appreciate the quality response. A lot of the focus here seems to be
'prevent other consumers from finding things out about our users', which is
good and important. I usually think more about it from Google's perspective,
which is that they have the data, and perhaps they're not using it for X right
now, but they have the _potential_ to, and that potential is what creates this
significant power imbalance and centralization that I'm often concerned over.

Obviously Google employees cannot go around reading+using all of my personal
communications for whatever they want to, but just that Google _has_ all of
them, to me, is too much power given to a single actor, even if they are
generally not abusing this power.

With that said, differential privacy is still great tech, and it's good that
they're open-sourcing and encouraging things like this. But I'll likely remain
concerned about the centralization of the world's data at the same time.

