
Differential privacy provides a framework that allows information from a database to be shared without letting an external observer determine whether any particular individual was included.

If companies were required to aggregate information in this way and throw away their logs, perhaps leaks would be much less risky for their users.

Today this might seem far-fetched, but it could come to pass in the future, when people raised in this environment and able to understand the implications and technical aspects come to political power.

https://www.cis.upenn.edu/~aaroth/privacybook.html

https://en.wikipedia.org/wiki/Differential_privacy




Differential privacy provides a lot less protection than you would think (or want to believe). A few months ago I saw a talk by E. Kornaropoulos about his paper "Attacks on Encrypted Databases Beyond the Uniform Query Distribution"[0].

The main take-away from the talk - and in fact all the talks I saw on the same day - was that while DP is touted as a silver bullet and the new hotness, in reality it cannot protect against the battery of information-theoretic attacks advertisers have been aware of for a couple of decades, and that intelligence agencies must have been using for a lot longer. Hiding information is really hard. Cross-correlating data across different sets, even if each set in itself contains nothing but weak proxies, remains a powerful deanonymisation technique.

After all, if you have a huge pool of people and dozens or even hundreds of unique subgroups, the Venn-diagram-like intersection of just a handful will carve out a small and very specific population.
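
To put rough numbers on that (an illustrative back-of-the-envelope calculation with made-up prevalences, not figures from the talk): intersecting just a few independent group memberships already narrows a large population down to a handful of people.

    # Illustrative only: expected size of the intersection of a few
    # independent subgroups in a large population.
    population = 10_000_000
    group_prevalences = [0.10, 0.05, 0.10, 0.02, 0.03]  # assumed fractions in each group

    expected_overlap = population
    for p in group_prevalences:
        expected_overlap *= p

    print(f"Expected people in all {len(group_prevalences)} groups: {expected_overlap:.1f}")
    # ~3 people out of 10 million, assuming independence -- a handful of
    # weak proxies is enough to carve out a very specific population.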

0: https://eprint.iacr.org/2019/441


The Australian government released "anonymised" healthcare data to researchers. Within months a good chunk of it had been deanonymised, including records belonging to celebrities and some politicians.

There's a lot of privacy snake oil out there, and even large govt departments fall for it.

https://pursuit.unimelb.edu.au/articles/the-simple-process-o...


This has happened with NIH data in the US as well. There is a preprint available.


Personally I'm not super bullish on differential privacy outside a couple of specific use cases, but correlation attacks and cross-referencing against external data are exactly the vectors that differential privacy is intended to protect against: it requires that the distribution of results for any query or set of queries changes at most slightly whether or not a specific person is present in the dataset.
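
For reference, the formal guarantee from the Dwork-Roth book linked below (my paraphrase): a randomized mechanism M is ε-differentially private if, for any two datasets D and D' differing in one person's record and any set of outputs S,

    \Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S]

A small ε means the presence or absence of any one person barely changes what an observer can see.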

It's possible I'm misreading, but your paper seems to focus on the very anonymization techniques diff privacy was invented to improve on, specifically because these kinds of attacks exist. While I agree it's no silver bullet, the reason is that it's too strong (it's hard to get useful results while providing such powerful guarantees) rather than not strong enough.

I've found the introduction to this textbook on it to be useful and very approachable if others are interested: https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf


We're building an analytics system that is based on differential privacy / randomization of data. It's possible, but there are many limitations and caveats, at least if you really care about the privacy and aren't just applying differential privacy as a PR move. Most systems that implement differential privacy use it for simple aggregation queries, for which it works well. It doesn't work well for more complex queries or high-dimensional data though, at least not if you choose a reasonably secure epsilon: either the data will not be useful anymore, or the individual the data belongs to won't be reliably protected from identification or inference.
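
As a minimal sketch of the kind of simple aggregation query where this works well (my own illustration, not the commenter's system; the epsilon and data are assumptions):

    import numpy as np

    def dp_count(values, predicate, epsilon=0.5):
        """Differentially private count via the Laplace mechanism.

        A counting query has sensitivity 1 (adding or removing one person
        changes the count by at most 1), so Laplace noise with scale
        1/epsilon gives epsilon-differential privacy for this query.
        """
        true_count = sum(1 for v in values if predicate(v))
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
        return true_count + noise

    # Example: a noisy answer to "how many users are older than 40?"
    ages = [23, 45, 31, 67, 52, 38, 41, 29]
    print(dp_count(ages, lambda a: a > 40, epsilon=0.5))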

After spending three years working on privacy technologies I'm convinced that anonymization of high-dimensional datasets (say more than 1000 bits of information entropy per individual) is simply not possible for information-theoretic reasons; the best we can do for such data is pseudonymization or deletion.
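
A rough way to see the scale of the problem (my own arithmetic, assuming a world population of about 8 billion): roughly 33 bits are enough to single out one person, so a record carrying ~1000 bits of entropy stays unique unless almost all of its information content is destroyed.

    import math

    world_population = 8_000_000_000
    bits_to_single_out_one_person = math.log2(world_population)  # ~33 bits
    record_entropy_bits = 1000  # the figure used above for a high-dimensional record

    print(f"Bits needed for uniqueness: {bits_to_single_out_one_person:.1f}")
    print(f"Entropy that would have to be destroyed: "
          f"{record_entropy_bits - bits_to_single_out_one_person:.1f} bits")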


Database sharing has been (in theory) illegal inside the government for decades for this very reason. Why would private companies be allowed to do it?


You posted your reply while I was writing my own. Do you happen to have pointers to any really good research results and/or papers?

I want to be better equipped to respond to this slowly emerging "DP is a silver bullet" meme, and your response implies that you have actual research to back that position up.


We don't publish research papers, but here's a good article from another privacytech startup (not ours) that discusses some of the shortcomings of differential privacy:

https://medium.com/@francis_49362/dear-differential-privacy-...

Here's my simple take: imagine you want to protect individuals by using differential privacy when collecting their data, and you want to publish datapoints that each contain only 1 bit of information (i.e. each datapoint says "this individual is a member of this group"). To protect the individual, you introduce strong randomization: in 90 % of cases you return a random value (0 or 1 with 50 % probability), and only in 10 % of cases you return the true value. This is differentially private, and for a single datapoint it protects the individual very well, because he/she has very good plausible deniability. If you want a physical analogy, you can think of this as adding a 5 Volt signal on top of a 95 Volt noise background: for a single individual, no meaningful information can be extracted from such data, but if you combine the data of many individuals you can average out the noise and gain some real information.

However, averaging out the noise also works if you can combine multiple datapoints from the same individual, if those datapoints describe the same information or are strongly correlated. An adversary who knows the values of some of the datapoints as context information can therefore infer whether an individual is in the dataset (which might already be a breach of privacy). If the adversary knows which datapoints represent the same information or are correlated, he can also infer the value of some attributes of the individual (e.g. learn whether the individual is part of a given group). How many datapoints an adversary needs for such an attack varies based on the nature of the data.
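
A minimal sketch of that randomized-response scheme (my own code, matching the 90 % noise / 10 % truth split above):

    import random

    def randomized_response(true_bit, p_truth=0.10):
        """Report the true bit with probability p_truth, else a fair coin flip.

        With p_truth = 0.10, any single report gives the individual strong
        plausible deniability about their true group membership.
        """
        if random.random() < p_truth:
            return true_bit
        return random.randint(0, 1)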

Example: let's assume you randomize a bit by only publishing the real value in 10 % of cases and a random (50/50) value in the other cases. If the true value of the bit is 1, the probability of publishing a 1 is 55 %. This is a small difference, but if you publish this value 100 times (say you publish the data once per day for each individual), the standard deviation of the averaged value of the randomized bits is just under 5 %, so an adversary who observes the individual randomized bits can already infer the true value of the bit with high probability.

You can defend against this by increasing the randomization (a value of 99 % would require 10,000 bits for the standard deviation to equal the probability difference), but this of course reduces the utility of the data for you as well. You can also use techniques like "sticky noise" (i.e. always produce the same noise value for a given individual and bit); in that case the anonymity depends on the secrecy of the seed information used to generate that noise, though. Or you can try to avoid publishing the same information multiple times, though this can be surprisingly difficult, as individual bits tend to be highly correlated in many analytics use cases (e.g. due to repeating patterns or behaviors).
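
A quick simulation of that averaging attack (my own illustration; with 100 reports the observed frequency concentrates around 55 %, so the attacker recovers the true bit in roughly 80 % of trials):

    import random

    def randomized_response(true_bit, p_truth=0.10):
        if random.random() < p_truth:
            return true_bit
        return random.randint(0, 1)

    def guess_true_bit(reports):
        # If the true bit is 1, each report is 1 with probability 0.55;
        # if it is 0, with probability 0.45. The mean of 100 reports has a
        # standard deviation just under 5 %, so which side of 0.5 it falls
        # on usually reveals the true value.
        return 1 if sum(reports) / len(reports) > 0.5 else 0

    true_bit = 1
    trials = 1000
    correct = sum(
        guess_true_bit([randomized_response(true_bit) for _ in range(100)]) == true_bit
        for _ in range(trials)
    )
    print(f"Attacker guesses the true bit correctly in {correct / trials:.0%} of trials")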

That said, differential privacy and randomization are still much more secure than other, naive anonymization techniques like pure aggregation using k-anonymity.

We have a simple Jupyter notebook that shows how randomization works for the one-bit example btw:

https://github.com/KIProtect/data-privacy-for-data-scientist...


Thank you. That was a good read, and gave a few things to think about.


It's not far-fetched. Differential privacy is going to be used for the US census this year. Here's a report on it: https://arxiv.org/abs/1809.02201

Also, it's not a magical solution. Here's one of the issues from the linked paper (edited for clarity):

"The proponents of differential privacy have always maintained that the setting of the [trade-off between privacy loss (ε) and accuracy] is a policy question, not a technical one. [...] To date, the Census committee has set the values of ε far higher than those envisioned by the creators of differential privacy. (In their contemporaneous writings, differential privacy’s creators clearly imply that they expected values of ε that were “much less than one.”)


“In addition, the historical reasons for having invariants may no longer be consistent with the Census Bureau’s confidentiality mandate.”

US census, RIP


> If companies were required to aggregate information in this way and throw away their logs, perhaps leaks would be much less risky for their users.

One of the leaks they talk about was from Experian, a credit reporting agency. Not only would this approach work poorly for them, it wouldn't be legal (they need to be able to back up any claims they make about people, which requires going back to the source data).



