The Algorithmic Foundations of Differential Privacy (2014) [pdf] (upenn.edu)
85 points by sonabinu on Mar 25, 2018 | 13 comments

Differential privacy appears regularly on Hacker News, either in theoretical articles or in projects that aim to implement it. Yet there is often a huge gap between the two.

For example, Apple has touted its use of differential privacy, but researchers [1] have shown that the privacy budget is reset every day and that the parameters, buried inside the code, lack a proper derivation. Similarly, Uber seems to use DP for internal analytics, but the proposed model does not appear robust and does not provide accurate results [2]. One should always carefully review claims associated with implementations of differential privacy.

[1] https://arxiv.org/abs/1709.02753

[2] https://github.com/frankmcsherry/blog/blob/master/posts/2018...

One problem I've found with differential privacy is that no one talks about how to set ε. I've read this book, and it's well written and complete, but as the title says, it focuses on the algorithmic foundations.

This paper [1] is much better for practitioners, and actually gives very reasonable values for the privacy guarantee (e.g., (1.2, 1e-9)), and builds on this great paper: [2]. Worth a read if you train neural networks.

[1]: https://arxiv.org/pdf/1710.06963.pdf

[2]: https://arxiv.org/pdf/1607.00133.pdf
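To make the ε in guarantees like (1.2, 1e-9) a bit more concrete, here is a minimal sketch of the Laplace mechanism (function and parameter names are mine, not from the book): ε directly scales the noise added to a count, so choosing ε is literally choosing your noise floor.

```python
import math
import random

def laplace_mechanism(true_count, epsilon, sensitivity=1.0):
    """Release a count plus Laplace(0, sensitivity/epsilon) noise.
    Smaller epsilon means stronger privacy and more noise."""
    scale = sensitivity / epsilon
    # Sample Laplace noise via the inverse-CDF method.
    u = random.random() - 0.5            # uniform on [-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    noise = -scale * sign * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise
```

With ε = 0.1 the noise on a simple count has standard deviation of about 14, while at ε = 10 it is about 0.14, which is why unprincipled parameter choices matter so much.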

It's like the early days of cryptography. Everybody was rolling their own algorithms because no one realized how hard it is to do that properly. Eventually we all wised up. I'm hopeful that DP will follow a similar path.

Based on my limited understanding* of differential privacy, it falls short on exactness (of aggregate values) and robustness (against malicious clients). I've lately been studying the literature on function secret sharing and I think it is a better alternative to DP.

Take this paper: https://www.henrycg.com/files/academic/pres/nsdi17prio-slide...

Prio: Private, Robust and Scalable Computation of Aggregate Statistics

Data collection and aggregation is performed by multiple servers. Every user splits up her response into multiple shares and sends one share to each server. I've understood how private sums can be computed. Let me explain it with a straw-man scheme.

Example (slide 26):

x_1 (user 1 is on Bay Bridge): true == 1 == 15 + (-12) + (-2)

x_2 (user 2 is on Bay Bridge): false == 0 == (-10) + 7 + 3 ...

If all users send shares of their data to the servers in this manner, AND as long as at least one server keeps the shares it received (and who sent them) private, the servers can exchange the sums of the shares they've received. Adding the three partial sums lets the servers learn how many users are on Bay Bridge without revealing anyone's identity.
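The straw-man scheme above can be sketched in a few lines. This is only an illustration of additive secret sharing, not Prio itself; the function names and the choice of modulus are mine (Prio works over a finite field, so I reduce everything mod a prime to keep shares uniformly distributed).

```python
import random

PRIME = 2**31 - 1  # arithmetic mod a prime keeps each share uniformly random

def share(value, n_servers):
    """Split `value` into n additive shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Three users report whether they are on the Bay Bridge (1 = yes).
users = [1, 0, 1]
per_server = [[], [], []]               # three non-colluding servers
for x in users:
    for srv, s in zip(per_server, share(x, 3)):
        srv.append(s)

# Each server sums only the shares it holds...
partial = [sum(s) % PRIME for s in per_server]
# ...and publishing the partial sums reveals only the total count.
total = sum(partial) % PRIME
print(total)  # prints 2: two users on the bridge, no individual answer leaked
```

Each server's view is a list of uniformly random field elements, so no single server learns anything about any individual response.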

This system can be made robust by using Secret-shared non-interactive proofs (SNIPs). This allows servers to test if Valid(X) holds without leaking X.

The authors also bring up the literature on computing interesting aggregates using private sums: average, variance, most popular (approx.), min and max (approx.), quality of regression model R^2, least-squares regression, stochastic gradient descent.
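The reduction of richer statistics to private sums is worth spelling out. A sketch under my own naming (not code from the paper): if each user submits shares of both x and x², the servers reconstruct only the two aggregate totals, which suffice for the mean and variance.

```python
# Each user secret-shares both x and x*x; the servers only ever
# reconstruct the two aggregate sums, never individual values.
xs = [3.0, 5.0, 7.0]               # the users' private inputs
n = len(xs)

sum_x = sum(xs)                    # would be recovered from shares of x
sum_x2 = sum(x * x for x in xs)    # would be recovered from shares of x^2

mean = sum_x / n
variance = sum_x2 / n - mean ** 2  # E[x^2] - E[x]^2
```

The same trick generalizes: any statistic expressible as a function of a few sums of per-user values can be computed this way.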

Bottom line: I found the discussion on deployment scenarios very interesting. Data servers with jurisdictional/geographical diversity, app store-app developer collaborations for eliminating risk in telemetry data analysis, enterprises contracting with external auditors for analyzing customer data, etc.

* - I understand the randomized response and, to some extent, the RAPPOR technique (used for collecting Chrome telemetry data) but the other literature in that community goes over my head.

* * - This technique is a black box to me at the moment.
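For reference, the randomized response technique mentioned in the first footnote fits in a few lines (the estimator derivation is standard; the names are mine):

```python
import random

def randomized_response(truth: bool) -> bool:
    """Flip a coin: heads, answer truthfully; tails, flip again and
    report that second coin. Each answer satisfies eps = ln(3) local DP."""
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

# A "yes" is reported with probability 0.5*p + 0.25 when the true rate
# of "yes" is p, so the analyst can unbias the aggregate estimate:
answers = [randomized_response(True) for _ in range(100_000)]
p_hat = 2 * (sum(answers) / len(answers)) - 0.5
```

Any single answer is deniable, yet the population rate is recovered accurately; RAPPOR builds on the same idea with Bloom filters and memoization.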

Apple's epsilon reset problem is real, but it's worth pointing out that they use additional heuristics based on hashing that plausibly add another layer of privacy [1]. Plausibly, not provably, but it's a bit more than just resetting epsilon. I believe Google and Microsoft use similar tweaked forms of differential privacy. In particular, note that all of these companies -- again, going off public papers -- use the "local" variant of differential privacy, which requires less trust on the user's part.

The question of "lifetime" differential privacy, for a single user across different computations and datasets, is still fairly open as far as I know.

[1] https://machinelearning.apple.com/docs/learning-with-privacy...

I don't know anything about DP, so my question might be unrelated. But I think perhaps someone can answer it. Almost 20 years ago, I was told of the following problem at the university:

You have two identical databases (sets of n bits) that do not communicate with each other. You want to learn a single bit from the database. How many bits do you have to retrieve from each database so that neither database learns which bit you were looking for?

The simplest answer is n: retrieve all the bits. But we were also given a better answer, square root of n: arrange the bits into a √n × √n square, then ask each database for the per-row XOR of a random subset of columns, where the subset sent to the second database differs from the first only in the column you're looking for. XORing the two answers isolates that column.
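The √n scheme is short enough to demonstrate directly. A sketch with my own variable names, using a 4×4 square (n = 16, so each server returns 4 bits):

```python
import secrets
from functools import reduce

def column_xor(db, cols):
    """A server's answer: for every row, the XOR of that row's bits
    in the requested columns (sqrt(n) bits in total)."""
    return [reduce(lambda a, b: a ^ b, (row[c] for c in cols), 0)
            for row in db]

# n = 16 bits arranged in a 4x4 square, replicated on both servers.
db = [[secrets.randbelow(2) for _ in range(4)] for _ in range(4)]
row, col = 2, 1                  # the bit we secretly want

S1 = {c for c in range(4) if secrets.randbelow(2)}  # uniformly random columns
S2 = S1 ^ {col}                  # same set with the target column toggled

a1 = column_xor(db, S1)          # 4 bits from server 1
a2 = column_xor(db, S2)          # 4 bits from server 2

# The two answers differ exactly in column `col`'s contribution,
# so XORing the target row's entries recovers the secret bit.
assert a1[row] ^ a2[row] == db[row][col]
```

Each server sees a uniformly random column set (toggling one element of a uniform set keeps it uniform), so neither learns anything about which bit was requested.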

And here is my question, we were also told that this can be done even better, in cube root of n bits. But I never learned the answer, and since I wonder, was that claim correct? Does anyone know this problem and the better solution?

What you are looking for is "Private Information Retrieval". For the cube root result, check out: http://www.tau.ac.il/~bchor/PIR.pdf

Some interesting concepts to consider: K-anonymity, ℓ-diversity, t-closeness.

From the paper (https://www.liebertpub.com/doi/full/10.1089/bio.2014.0069) I wrote a while ago:

While the risk of re-identification (of a record or individual participant) might be virtually non-existent with synthetic data, one could predict unknown attributes of a known individual, given an ideal model of synthesis. In other words, an attacker could find unknown attributes of some individual with a certain probability by looking for the closest match in the synthetic data. This is known as attribute disclosure.

There are several methods for quantifying attribute disclosure, most notably t-closeness, which is defined as: An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold. A table is said to have t-closeness if all equivalence classes have t-closeness.

In short: the distribution of a particular sensitive value should not be further away than a distance t from the overall distribution.
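That check is easy to mechanize. A sketch (names mine; the t-closeness paper defines the distance via the Earth Mover's Distance, which for an unordered categorical attribute with a 0/1 ground distance reduces to the total variation distance used here):

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a list of categorical values."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}

def variation_distance(p, q):
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys) / 2

def has_t_closeness(sensitive, classes, t):
    """sensitive: one sensitive value per record; classes: lists of
    record indices forming the equivalence classes."""
    overall = distribution(sensitive)
    return all(
        variation_distance(distribution([sensitive[i] for i in cls]),
                           overall) <= t
        for cls in classes)
```

For ordered or numeric sensitive attributes the full Earth Mover's Distance matters, since "close but wrong" values leak less than distant ones.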

Using the t-closeness metric circumvents issues associated with k-anonymity and ℓ-diversity. Briefly, k-anonymity states that a given quasi-identifier combination should be present in at least k records, which introduces ambiguity into the data set. However, if all k records in an equivalence class share the same sensitive value, that value is still disclosed. The ℓ-diversity metric circumvents this problem by adding a further requirement: in addition to the class being seen in k records, these records must contain at least ℓ 'well represented' sensitive values. But if an attacker knows the real-world distribution of values, attributes could still be disclosed with a certain probability, simply by combining different data sources.
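Both of those properties are simple to verify mechanically. A minimal checker (my own naming, using the "distinct values" reading of ℓ-diversity; the paper's entropy and recursive variants are stricter):

```python
from collections import defaultdict

def check_k_and_l(records, quasi_ids, sensitive, k, l):
    """Group records by their quasi-identifier tuple; every equivalence
    class must contain >= k records (k-anonymity) and >= l distinct
    sensitive values (distinct l-diversity)."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_ids)].append(r[sensitive])
    return all(len(g) >= k and len(set(g)) >= l
               for g in groups.values())
```

A class such as two records that are both "flu" passes k = 2 but fails ℓ = 2, which is exactly the homogeneity problem described above.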

Way back in 1978, DeMillo, Lipton and Dobkin published a note in the IEEE Transactions on Software Engineering (SE-4(1):73-75, February 1978) called "Even databases that lie can be compromised". The basic idea was to give slightly wrong answers to median-type queries in order to protect individual results. They showed that, even when the query system deliberately lied, it was possible to compromise the database. I am surprised that this note is not listed in the bibliography of the differential privacy article.

Peter Kairouz [1] gave a talk [2] recently describing mechanisms to get around the limitations of DP described in other comments.

[1] https://web.stanford.edu/~kairouzp/

[2] https://youtu.be/6Uur2_TnwYE

I love the concept of differential privacy, but it seems hard to incentivize the "data hoarders" to actually use it, even if you ignore the challenges of building real-world differentially private systems. Google and Apple use it for some things, but in general it doesn't seem like something the market will use by itself.

Also doesn't help that differential privacy itself is maybe too arcane and subtle for the public to talk about and demand, unlike for example encryption which people probably generally at least understand to mean something along the lines of hiding their data in some sense.

I’d just love to debate a fascist Justice Department over the meaning of “You will not be affected...” as I am led off into detention. Real rights are often negative rights. My home ownership has little to do with entitlement to activities within. It has everything to do with keeping you out of my house if that’s what I choose. Get it?
