What they show is that as you use Apple's implementation, the differential privacy parameter grows (providing weaker guarantees as time passes). They don't show that they can bypass the mechanism and its guarantees, just that Apple has rigged the implementation to decay the guarantees as you continue to use it (note: the decay stops if you stop using Apple's services).
The hype machine surrounding Apple did not want to hear this and was just caught up in the idea that "Apple could data-mine you while maintaining your privacy".
Differential privacy does statistically disclose information about you, and whether the quantitative bound used is up to your standards is a decision you should make before it happens. Given that user-level understanding of DP is so low, I don't think defaulting people into levels chosen by others is a good idea. I 100% guarantee Mozilla doesn't have the background to make this choice responsibly, and doesn't have anywhere near the DP expertise the RAPPOR team has/had (who I also wouldn't trust to choose things for me).
Ideally, the use of differential privacy would make you more willing to opt in (when you have a choice) rather than being a smokescreen for organizations that simply want to harvest more data (which was Mozilla's stated motivation).
Edit: fwiw, there are some cool "recent" versions of differential privacy that let each user control their own amount of privacy loss. So, you could start at 0 and dial it up as you feel more comfortable with the tech. This incentivizes organizations to be more transparent about what they do, as it (in principle) increases turnout.
Edit2: For context, Apple's "default" values appear to be (from this paper) epsilon = 16 * days. That means that each day you are active, the odds someone assigns to any fact about you can increase by a factor of exp(16) ~= 8.9 million. So, numbers matter and I am (i) glad Apple made it opt-in, (ii) super disappointed they aren't at all transparent about how it works, and (iii) thankful that the paper authors are doing this work.
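If you want to see what that factor does in practice, here's the arithmetic in Python (the 1-in-a-million prior is just an illustrative number, not anything from the paper):

    import math

    # Under epsilon-DP, one day of reports at epsilon = 16 can shift the
    # odds an observer assigns to any fact about you by up to exp(16).
    eps_per_day = 16.0
    for days in (1, 2, 7):
        print(days, "day(s): odds shift up to", f"{math.exp(eps_per_day * days):.3g}x")

    # Illustrative: a 1-in-a-million prior belief after one day of reports.
    prior = 1e-6
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * math.exp(eps_per_day)
    print("posterior can reach", post_odds / (1 + post_odds))  # ~0.90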
So 16 per day sounds like a lot more than 1 or 2 per day, but what do these numbers mean? Presumably 16 per day is a theoretical maximum, reached only if you generated every kind of privacy-related data every day. But is 16 really a lot? How high would it have to go, cumulatively, to be useful for extracting reliable info on an individual? Wouldn't the info collected on an individual still have to be associated with them? Frankly, I'm not really able to determine any of that from the paper.
Source: Google's RAPPOR project
I pointed to some open source repos on my blog post from 2015 https://www.quantisan.com/a-magical-promise-of-releasing-you...
Then you infer an estimate using Bayes' theorem.
Otherwise it is not private, as a reply has pointed out.
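For the curious, here's a minimal sketch of that inference step, assuming the standard coin-flip scheme described below (answer truthfully on heads; on tails, flip again and report that) and made-up counts:

    import numpy as np

    # Under the scheme, P(report "yes") = 0.25 + 0.5 * p, where p is the
    # true population proportion. Counts are made up for illustration.
    n, yes = 10_000, 3_400
    p_grid = np.linspace(0, 1, 1001)   # uniform prior over p
    q = 0.25 + 0.5 * p_grid            # per-report "yes" probability

    # Binomial log-likelihood of the observed count, then normalize.
    log_lik = yes * np.log(q) + (n - yes) * np.log(1 - q)
    post = np.exp(log_lik - log_lik.max())
    post /= post.sum()

    print("posterior mean of p:", (p_grid * post).sum())  # ~0.18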
Saying "whatever you want" will incur a very large sampling error, especially as the population of those saying whatever increases.
What is needed is a notion of scalable privacy, where, as the population of those saying "whatever" increases, the privacy strength also increases, yet the absolute error remains at worst constant.
Honest question, just too lazy to read the manuscript you link...
" the estimation error quickly increases with the population size due to the underlying truthful distribution distortion. For example, say we are interested in how many vehicles are at a popular stretch of the highway. Say we configure flip1 = 0.85 and flip2 = 0.3. We query 10,000 vehicles asking for their current location and only 100 vehicles are at the particular area we are interested in (i.e., 1% of the population truthfully responds “Yes"). The standard deviation due to the privacy noise will be 21 which is slightly tolerable. However, a query over one million vehicles (now only 0.01% of the population truthfully responds “Yes") will incur a standard deviation of 212. The estimate of the ground truth (100) will incur a large absolute error when the aggregated privatized responses are two or even three standard deviations (i.e., 95% or 99% of the time) away from the expected value, as the mechanism subtracts only the expected value of the noise."
"In this paper, our goal is to achieve the notion of scalable privacy. That is, as the population increases the privacy should strengthen. Additionally, the absolute error should remain at worst constant. For example, suppose we are interested in understanding a link between eating red meat and heart disease. We start by querying a small population of say 100 and ask “Do you eat red meat and have heart disease?". Suppose 85 truthfully respond “Yes". If we know that someone participated in this particular study, we can reasonably infer they eat red meat and have heart disease regardless the answer. Thus, it is difficult to maintain privacy when the majority of the population truthfully responds “Yes".
Querying a larger and diverse population would protect the data owners that eat red meat and have heart disease. Let’s say we query a population of 100,000 and it turns out that 99.9% of the population is vegetarian. In this case, the vegetarians blend with and provide privacy protection of the red meat eaters. However, we must be careful when performing estimation of a minority population to ensure the sampling error does not destroy the underlying estimate."
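You can sanity-check the quoted numbers yourself, assuming flip1 is the probability of answering truthfully and flip2 the probability of saying "Yes" when lying:

    import math

    flip1, flip2 = 0.85, 0.3
    p_yes_true = flip1 + (1 - flip1) * flip2   # "Yes" person reports "Yes"
    p_yes_false = (1 - flip1) * flip2          # "No" person reports "Yes"

    for n in (10_000, 1_000_000):
        true_yes = 100
        var = (true_yes * p_yes_true * (1 - p_yes_true)
               + (n - true_yes) * p_yes_false * (1 - p_yes_false))
        print(n, round(math.sqrt(var), 1))  # ~20.9 and ~207, roughly the quoted 21 and 212

The noise floor grows like sqrt(n) while the signal (100) stays fixed, which is exactly the scaling problem the paper is pointing at.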
If the coin comes up heads, you answer truthfully. If it comes up tails, you flip the coin again and answer "yes" if it comes up heads and "no" if it comes up tails. You can then no longer know whether anybody is (or is not) a dog.
The probabilities can be adjusted to provide more or less privacy (while making the data less or more useful). For example, if you only answer truthfully 0.1% of the time it would be hard to know anything about anyone, at the cost of knowing the total number of dogs less precisely.
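If you want to play with the tradeoff, here's a quick simulation (names and the 20% dog rate are made up):

    import random

    def randomized_response(is_dog, p_truth=0.5):
        if random.random() < p_truth:
            return is_dog                # answer truthfully
        return random.random() < 0.5     # otherwise: random yes/no

    def estimate_rate(answers, p_truth=0.5):
        # Invert E[yes rate] = p_truth * p + (1 - p_truth) * 0.5
        yes_rate = sum(answers) / len(answers)
        return (yes_rate - (1 - p_truth) * 0.5) / p_truth

    population = [random.random() < 0.2 for _ in range(100_000)]  # 20% dogs
    answers = [randomized_response(d) for d in population]
    print(estimate_rate(answers))        # ~0.2

Dropping p_truth to 0.001 makes individual answers nearly meaningless, but the same estimator then needs a far larger sample to stay accurate, which is the utility cost described above.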
I am not sure how any of that helps in a mass collection of data like OS telemetry.
Which text were you reading that led you to this conclusion?
> It is likely that some attackers will aim to target specific users by isolating and analyzing reports from that user, or a small group of users that includes them. Even so, some randomly-chosen users need not fear such attacks at all...
As for the limitation, the whole of section 6.1 explains that this only protects a single question. If you collect more than a single question, you must rely on other techniques to protect privacy.
The text you've quoted is about how a random subset of the population is already immune to the issue of repeated queries, not that subsampling the population helps in any way. If you don't interrupt the quotation mid-sentence, it reads:
> Even so, some randomly-chosen users need not fear such attacks at all: with probability (f/2)^h, clients will generate a Permanent randomized response B with all 0s at the positions of set Bloom filter bits. Since these clients are not contributing any useful information to the collection process, targeting them individually by an attacker is counter-productive.
The whole of section 6.1 is not about how it only protects a single question, it is about how one ensures that the single-question protections generalize to larger surveys, concluding that
> This issue, however, can be mostly handled with careful collection design.
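For a sense of scale on the (f/2)^h probability quoted above, with illustrative parameter values (not necessarily what any deployment uses):

    # Fraction of clients whose Permanent randomized response zeroes out
    # all h of their set Bloom filter bits, i.e. the "immune" clients.
    for f in (0.25, 0.5):
        for h in (2, 4):
            print(f"f={f}, h={h}: immune fraction = {(f / 2) ** h:.4f}")

So the immune fraction is small (a few percent at best); everyone else has to rely on the collection-design mitigations the section describes.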
But my point is precisely that this technique helps with a single question. As soon as you are doing continuous mass collection you don't really get any privacy protection from this technique, and you have to rely on other techniques (encryption, etc).
But ... [please see reply to omarforgotpwd].
But, as someone mentioned, if the coin comes up tails you should answer with another coin flip, not "yes".
Accordingly, you need to analyse only aggregate statistics, add random noise, apply data binning, and anonymise reports. And you need to calibrate the noise, which is what the paper seems to be mainly focused on.
> We call for Apple to make its implementation of privacy-preserving algorithms public and to make the rate of privacy loss fully transparent and tunable by the user.
"Calibrating Noise to Sensitivity in Private Data Analysis" is about the matter.
This is a minor, pedantic point, but what you really mean to say is "the closer the changes get to 50%, the more privacy the user has". If you change all results, then it is easy to flip them all back.
This distinction trips up a lot of folks: the research world initially believed (and some in official stats still believe) that part of privacy is literally not publishing the true answer (e.g., above: literally flipping every output).
What you actually want is

Pr[output | input] ~= Pr[output | input']

for any two inputs that differ in one person's data; formally, the ratio of those two probabilities is bounded by e^epsilon.
Noise addition is a common way to obscure real-valued data, and some official stats bureaus have the ridiculous rule that "you always add noise, and you never add less than X in absolute value", leading to releases where you can be 100% confident that the true value is not in a range around the published number.
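To put numbers on the flip-probability point above: if each binary answer is flipped with probability q, the strongest inference an observer can make is the likelihood ratio (1 - q) / q, i.e. epsilon = |ln((1 - q) / q)|:

    import math

    for q in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
        if q in (0.0, 1.0):
            print(f"q={q}: epsilon = infinity (deterministic, no privacy)")
        else:
            print(f"q={q}: epsilon = {abs(math.log((1 - q) / q)):.2f}")

q = 0.5 gives epsilon = 0 (maximal privacy, zero utility), and q = 1.0 is exactly as revealing as q = 0.0, since you can flip everything back.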
But still, there are some questions you'd arguably never want to say "yes" to. Such as: did you visit some verboten site (terrorist, child porn, etc.) today?
So how can an algorithm "know" which questions it's safe to use differential privacy with, and which it isn't?
Or would you argue that it's safe enough to use differential privacy with even such questions?
Or, in other words, the question is not "Did you visit this site"; the question is "Did your coins come up tails-then-heads, or did you visit this site," which is a perfectly safe question to say "yes" to.
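A back-of-envelope Bayes update shows why, under the heads-truth / tails-reflip scheme (the 1-in-1000 prior is made up): P(yes | visited) = 0.75 and P(yes | not visited) = 0.25, so a "yes" only triples the odds.

    def posterior_visited(prior):
        # Bayes' theorem with P(yes|visited)=0.75, P(yes|not)=0.25.
        return 0.75 * prior / (0.75 * prior + 0.25 * (1 - prior))

    for prior in (0.001, 0.1, 0.5):
        print(prior, "->", round(posterior_visited(prior), 3))

A 0.1% prior only becomes ~0.3% after a single "yes", which is the sense in which the answer is safe, at least for a one-shot question.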
It may help with categories of sites, like "did you visit a news site today" or "did you watch porn".
For people who don't really need to care about privacy, I guess differential privacy is good enough. But there's a gotcha for people who ought to care but are clueless. I'm reminded of that ex-cop in Philadelphia who believed Freenet's claims about plausible deniability.
Edit: A prudent standard for calling something "[foo] privacy" is arguably PGP.