
What is Differential Privacy? - sohkamyung
http://blog.cryptographyengineering.com/2016/06/what-is-differential-privacy.html
======
ianmiers
This is something Apple really needs to release all the details of. Even if
they got the crypto exactly right, they could have picked a privacy budget and
security parameters that just leak everything.

And there is every reason to be skeptical about Apple's ability to design even
mildly complex crypto given iMessage's flaws. Although the break in iMessage
wasn't practically exploitable, that was down to luck and to the fact that the
only way to detect whether a mauled ciphertext decrypted correctly required
attachment messages. The cryptographic mistakes were bad. Given any way to
detect decryption of mauled ciphertexts for standard messages (e.g. sequence
numbers, timing, actively syncing messages between devices, delivery receipts
from iMessage instead of APSD), Apple's crypto design bugs would have
eliminated nearly all of the E2E security of iMessage.

Remember, this isn't a boon for user privacy. Apple is now collecting far more
invasive data about users under the claim that they have protections in place.
At best it preserves the status quo and does so only if Apple both picked the
parameters correctly and implemented it correctly.

At this point Apple's position is best summed up as: we have drastically
reduced your privacy, except not, because of magic that we (i.e. Apple) do not
fully understand.

~~~
mnem
Apple has designed/implemented several quite successful crypto and security
systems too.

~~~
ianmiers
What have they actually designed from scratch? Most of what comes to mind
(FaceTime, FileVault) is off-the-shelf stuff, or at least there were well-known
designs to ape which had been subject to analysis. This, well, if they just
copied what Google did, then maybe.

But at some level, my comment is pretty harsh on Apple given they have a
better track record than most for privacy and encryption. But remember, this
is Apple making you less secure by grabbing data and then saying they took
care of the issue.

With iMessage, FaceTime, and FileVault, if it failed, you were no worse off
than if you used something else. In this case, you actually are. That means
there is a higher burden to getting it right than just name-dropping magic
crypto pixie dust.

~~~
czr80
> With iMessage, FaceTime, and FileVault, if it failed, you were no worse off
> than if you used something else. In this case, you actually are.

What would someone be better off using?

~~~
ianmiers
Something that doesn't report these statistics. An old version of iOS, for
one. Does Android do this type of reporting?

------
bo1024
I've read a couple articles and haven't seen any details about how they're
going to apply DP (differential privacy).

It's important to clearly distinguish what DP can and cannot do. DP is just a
technique for taking a database and outputting some statistic or fact about
it. The output has some noise added to it.

The guarantee of DP is (roughly) that anyone looking at the output alone won't
learn much about anyone in the database. This also holds for anything you do
with that statistic.

Think about this carefully when thinking about what DP does and doesn't
promise. Also think about the difference between "privacy" and "security".

Example of what DP does protect against: If Apple is recommending products to
people based on others' download habits, and this recommendation is based on
differentially private statistics, then no other user or group of users can
infer anything about my downloads. In fact, even engineers at Apple, if they
can only see the statistics and not the original database, cannot infer
anything about my downloads.

Example of what DP does not protect against: government accessing the data.
The database still has to exist on Apple's servers. The government can get to
it just as easily as before via warrants or so on. DP is not cryptography.

My assessment: On one hand it is awesome that Apple is taking a lead in using
differential privacy and thinking about mathematical approaches to privacy. On
the other, there are many facets of privacy and right now I think people are
more concerned about _security_ of their data and privacy from the government,
or else privacy from companies like Apple itself. DP doesn't address these; it
only addresses the case where Apple has a bunch of data and wants the
algorithms it runs not to leak much info about that data to the world at
large.
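
To make the "statistic with noise added" idea concrete, here is a minimal
sketch of the classic Laplace mechanism for a counting query. This is purely
illustrative: the function names and toy data are mine, and nothing here is
Apple's actual design.

```python
import random

def laplace_noise(scale):
    # The difference of two i.i.d. Exponential(1/scale) draws
    # is Laplace(0, scale)-distributed.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_count(database, predicate, epsilon):
    # A counting query has sensitivity 1: adding or removing one
    # person changes the true count by at most 1, so Laplace noise
    # of scale 1/epsilon gives epsilon-differential privacy.
    true_count = sum(1 for row in database if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon)

# Toy database of per-user downloads (hypothetical data).
downloads = ["maps", "mail", "maps", "notes", "maps"]
noisy = dp_count(downloads, lambda app: app == "maps", epsilon=1.0)
```

Anyone seeing only `noisy` (and anything computed from it) learns little about
whether any one user downloaded "maps". A smaller `epsilon` means more noise
and stronger privacy, which is exactly the budget trade-off discussed above.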

~~~
frankmcsherry
> The database still has to exist on Apple's servers.

It doesn't, which is part of the reason Apple wants to do this. You can still
do differential privacy without collecting all the data; you just get less
accurate results. See page 232 of [1], re "The Local Model".

[1]:
[http://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf](http://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf)

Edit: the article even says this further down, when it talks about RAPPOR.
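
The local model can be sketched with classic randomized response, the simplest
ancestor of RAPPOR: each device randomizes its own bit before anything leaves
the phone, so the server never holds raw data. This is an illustration under
my own assumptions (flip probability, sample sizes), not Apple's mechanism.

```python
import random

def randomize(bit, p=0.25):
    # Runs on the device: flip the true bit with probability p.
    # Flipping with p = 0.25 gives local DP with epsilon = ln(3).
    return bit if random.random() > p else 1 - bit

def estimate_fraction(reports, p=0.25):
    # Server-side debiasing: E[report] = (1 - p) * f + p * (1 - f),
    # so f = (observed - p) / (1 - 2p).
    observed = sum(reports) / len(reports)
    return (observed - p) / (1 - 2 * p)

random.seed(42)
# 10,000 simulated users, 30% of whom truly have some attribute.
truth = [1 if random.random() < 0.3 else 0 for _ in range(10_000)]
reports = [randomize(b) for b in truth]
estimate = estimate_fraction(reports)  # close to 0.30
```

Each individual report is deniable (it is the true bit only 75% of the time),
yet the population-level fraction is recoverable to within sampling noise:
less accuracy than central collection, as the book notes, but no raw database.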

~~~
conradev
Even though the technique doesn't require collecting all of the data, Apple
still does collect a huge well of data.

They store all iCloud sync material (backups, photos, contacts, calendars,
mail, documents, etc.) without end to end encryption, and have all of the
iMessage metadata.

~~~
Twisell
This is so wrong. The only thing that isn't user-encrypted so far is the
iPhone backup on their servers (of course it is encrypted, but Apple holds the
key and can decrypt it as needed).

The official explanation so far is that if the user forgot the password, a
user-encrypted backup would just become useless junk.

This is (officially) the sole remaining non-user-encrypted personal data on
Apple's servers that the authorities can obtain using a warrant.

However, after the San Bernardino FBI mess, Apple started considering
encrypting iCloud backups as well.

So if you think you are right, proof please...

~~~
corv
Are you saying Notes, Safari Bookmarks, Photos, etc are encrypted on iCloud?

How come they are accessible from iCloud.com? Decrypted by the browser on the
fly?

~~~
Twisell
It seems so: [https://support.apple.com/en-us/HT202303](https://support.apple.com/en-us/HT202303)

However, I reckon that technically Apple could access data stored on iCloud,
or hand it to the NSA/FBI, because they actually still hold the keys for that
part too (not only backups, as I thought). Only the password/credit-card
Keychain is now claimed to be fully user-encrypted and unrecoverable by Apple
by any means.

For anything other than a warrant, they would "just" have to breach every
commitment made in their contract, which would, as far as I know, constitute a
pretty solid legal case and could only lead to a public walk of shame that
could compromise the whole company's future.

If you don't trust them, don't use their cloud; I totally respect that. In the
end it always comes down to some degree of trust; even GitHub could be spying
on paid private repositories under the hood if they really wanted to. But for
what gain?

~~~
Natanael_L
AFAICT only storage is encrypted. They decrypt server-side.

~~~
Twisell
Do you suggest that Apple is blatantly lying in the article I just cited?

~~~
Natanael_L
[https://www.apple.com/privacy/approach-to-privacy/](https://www.apple.com/privacy/approach-to-privacy/)

> All your iCloud content like your photos, contacts, and reminders is
> encrypted when sent and, in most cases, when stored on our servers. All
> traffic between any email app you use and our iCloud mail servers is
> encrypted. And our iCloud servers support encryption in transit with other
> email providers that support it.

> If we use third-party vendors to store your information, we encrypt it and
> never give them the keys. _Apple retains the encryption keys in our own data
> centers_ , so you can back up, sync, and share your iCloud data. iCloud
> Keychain stores your _passwords and credit card information_ in such a way
> that Apple cannot read or access them.

The End.

I always find it amusing when people downvote me for telling them Apple is
doing _what they admit to be doing_.

------
cromwellian
Some dissenting views on the utility of differential privacy:
[https://medium.com/@Practical/differential-privacy-considered-harmful-f420ce717846](https://medium.com/@Practical/differential-privacy-considered-harmful-f420ce717846)

Also, Apple is woefully low on details; theoretical privacy claims should be
accompanied by openly published, peer-reviewed research papers. I understand
they won't release the source, but would you trust Apple if they said they had
invented a new encryption algorithm yet refused to publish an academic paper
on it? I'd be interested in precisely what they're doing. Are they claiming
federated learning: gathering anonymous image data from photos, uploading it
to their cloud, training DNNs on it, and then shipping the results back down
to clients for local recognition? Surely they're not training on-device, as
that is very RAM- and CPU-intensive.

------
guelo
Apple backed themselves into a corner by marketing themselves as the super-
privacy company in contrast to Google. The problem is that all the data
collection lets you do some really useful stuff that benefits the user. So now
they're spreading FUD while trying to pretend that they're not collecting the
same type of data that Google does. Google has been using differential privacy
for a while in different projects.

~~~
Twisell
But I can't forget that Google is the company that somehow managed to suggest
ads on my personal phone based on browsing done on my work PC.

These devices are never on the same network; the only shared parameter is an
Exchange account. As a rule I always log out of the only Google service I
rarely use, so this must be some cookie/tracker dark magic.

Sadly I have no proof, but I have used Ghostery to block trackers ever since.
(Side note: Ghostery also claims to use DP, btw.)

~~~
techie64
It's fairly straightforward: cookies + Google's ad network + Analytics on
your phone allow them to track you across devices.

~~~
bla2
How would the cookie from the work PC get to the phone?

~~~
mtbcoder
You don't need to rely on cookies anymore to identify someone online:
[https://en.wikipedia.org/wiki/Device_fingerprint](https://en.wikipedia.org/wiki/Device_fingerprint)

~~~
bla2
Sure, but that's not cross device.

------
ekianjo
> On the other hand, when the budget was reduced to a level that achieved
> meaningful privacy, the "noise-ridden" model had a tendency to kill its
> "patients".

Uh, the graph just shows an increased 25% estimated risk of mortality from
warfarin, nothing close to "killing patients". Complete exaggeration, since
the mortality baseline is probably very low in the first place.

~~~
aub3bhat
Good luck selling that to doctors. People who are trained to cut open a
patient aren't receptive to hypothetical harms and imaginary "budgets".
Consider the fact that cause-of-death records are public. The usual perception
of privacy among theory and security researchers and the norms practiced in
healthcare differ enormously.

~~~
cyphar
The point of differential privacy is to allow aggregate analysis without
destroying the privacy of outliers. Researchers deal with noise all the time,
so is it so odd that a field of researchers believes that adding enough noise
to data released with studies will allow conclusive analysis without ruining
privacy for individuals?

~~~
Create
Outliers are the reason databases exist. Any "average" is simply readily
apparent, and therefore irrelevant for serious in-depth analysis.

Adding noise and fuzzing has a long history in statistics since the '70s [1],
and while it does work on large numbers, it almost always messes up the
details, i.e. the error bars.

C.D. DP is essentially a cheap ripoff of the ideas implemented in ARGUS [2].

[1] 1977 Dalenius, see Do Not Fold, Spindle or Mutilate movement and earlier:

[http://tpscongress.indiana.edu/impact-of-congress/gallery/first-census.html](http://tpscongress.indiana.edu/impact-of-congress/gallery/first-census.html)

[2] [http://neon.vb.cbs.nl/casc/](http://neon.vb.cbs.nl/casc/)

~~~
cyphar
> Outliers are the reason databases exist.

Disagree. Data is why databases exist.

> Any "average" is simply readily apparent, therefore irrelevant for serious
> in depth analysis.

I said "aggregate", not "average". There are many kinds of useful aggregate
analysis (in astrophysics, you can take many samples from different stars and
use the aggregate to find commonalities in the sample that you wouldn't detect
with a single measurement). There is more to aggregate analysis than averaging
data.

As for the rest of your points, I'm not a statistician so I can't comment.
Also, I didn't downvote you (HN rules).

~~~
Create
Sorry -- please substitute "average" (the original was also in quotes) with
"category" or "factor", and you still have the bin I am talking about. You can
put any label on it you like, such as "commonality", as long as you remove the
details, i.e. the other bins.

But as you say: your "aggregate analysis" NEEDS "many different samples from
different stars". Commonality is the result of your analysis based on
different samples. But since they are common, you can go and sample and have
the result without doing mass surveillance on every star.

PS: I am fully aware of photo stacking, but also note that stars are not
humans; see the context of privacy. Please look at ARGUS or sdcMicroGUI from
CRAN to get a feeling for data utility vs. reidentification risk.

~~~
cyphar
> But since they are common, you can go and sample and have the result without
> doing mass surveillance on every star.

"Mass surveillance" reduces noise and lets you get more data in a shorter
period of time (telescopes have large fields of view, but they can't make time
pass faster). Stacking (which is what the technique is called in astrophysics)
is very useful in this case. Not to mention that you can also do individual
analysis as well.

Actually, most interesting of all is that you can do this type of analysis _on
objects like neutron stars that we can't observe directly because they're too
faint_. Because noise in telescopes can be modelled as a Poisson process,
stacking actually increases S/N in a way you can't achieve without building
much bigger telescopes.

PS. I'm not a statistician, so I can only speak to what I know. But my whole
point is that researchers do know how to deal with noisy data, regardless of
whether that noise is man-made. Interestingly, I found out recently that the
NASA pipeline actually breaks certain data sets they have released (which have
_papers_ written about them), so man-made noise is a problem regardless of
whether it's intentional.

~~~
Create
"Not to mention that you _can also do individual analysis as well_."

This is the key point to argue against in the context of people, privacy and
mass surveillance.

It is the touchstone of privacy, anonymity and crowd protection.

Regarding noise suppression: yes, the more queries (available data, whether
raw or extracted), the more you can filter (ask a Kalman student) to reduce
your error bars and margins. This is a reason why DP is overhyped. Also, if
there are no differences between queries, then the data is redundant; see
deduplication (databases) or scaling (measurement).

About the analysis pipeline: this is why the mantra is "know your detector".
Coincidentally, this is why releasing only recorded datasets is next to
useless for people outside the given research group. You would need to capture
detailed knowledge of your data-taking operations and instruments, which
happens rarely, if ever. Please cite a thing such as "the NASA pipeline";
perhaps you mean a given mission/experiment? In any case, detector
recalibration is a usual, almost daily activity...

~~~
cyphar
> Please cite a thing such as "the NASA pipeline", perhaps you mean a given
> mission/experiment?

The specific pipeline I was referring to is the Kepler pipeline that NASA uses
to take their raw pixel data and produce photon counts that everyone uses for
their research (this wasn't a detector issue, it was a software bug at the
final stage of the data publishing process). The point was not the pipeline
issue, it was that noise is everywhere.

But as to your point, yeah okay. Maybe I shouldn't talk about statistics when
that's not my field. :D

------
EGreg
If you collect values of a random variable Y from phones, where Y = X + N (N
being normally distributed with mean 0 and Var(N) = Var(X), say), then many
statistics can be calculated from that.

The law of large numbers says that after gathering statistics from many values
of Y, they will converge (for continuously differentiable functions of X) to
the values for X.

Yes?

Meanwhile each individual user will not send _so_ many samples as to identify
the true values of X with any useful accuracy.
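
A quick simulation of the setup above (with my own assumed distributions:
Gaussian X and Gaussian noise of equal variance). It shows the sample mean of
Y converging to the mean of X, while second moments need explicit debiasing:

```python
import random
import statistics

random.seed(1)

# True per-user values X (e.g. hours of daily usage, hypothetical).
xs = [random.gauss(3.0, 1.0) for _ in range(100_000)]

# Each phone reports Y = X + N, with N ~ Normal(0, Var(X)).
ys = [x + random.gauss(0.0, 1.0) for x in xs]

# The noise has mean zero, so the sample mean of Y tracks the mean of X...
mean_err = abs(statistics.mean(ys) - statistics.mean(xs))

# ...but Var(Y) = Var(X) + Var(N), so second moments (and most nonlinear
# statistics of X) need debiasing using the known noise distribution.
var_x_estimate = statistics.variance(ys) - 1.0  # subtract known Var(N)
```

So the convergence claim holds directly for the mean, but in general the
statistics of Y converge to statistics of X + N, not of X; recovering
properties of X itself requires knowing the noise distribution and
deconvolving, as the variance line shows.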

------
hfsbtnye
I have to admit, I'm really starting to like the direction that Apple is
heading despite being previously disenchanted. I only wish that they would go
ahead and put everything under a free software license, since they're in the
business of selling hardware that's coincidentally bundled with their
software.

~~~
applester
To paint another viewpoint, Apple initially went all gung ho about privacy and
wanted to make not collecting data a big play (and fairly so, full respect to
them).

The recent WWDC obviously shows a big shift towards AI and ML applications
within the company. Some things are possible on the device, but many neural
nets just cannot be served from an iPhone reasonably. Hence, the move towards
more data collection. I really wish they would give out more information
here. Until then, I'm not sure how much they are actually collecting after
realizing that they do need the data to do AI well.

~~~
tjl
They talked a bit more about differential privacy in the State of the Union.
Basically, they hash the data and add noise; by collecting data from a bunch
of people, that noise gets averaged out. They also limit the number of samples
(over a relatively short period of time) they can collect from a single
person, so they won't be able to identify anyone.
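
The "hash, add noise, average out" pipeline described above can be sketched
roughly as follows. This is a RAPPOR-style toy with made-up bucket counts,
flip probabilities, and emoji names; Apple has not published its actual
mechanism.

```python
import hashlib
import random

BUCKETS = 64
P_FLIP = 0.25

def encode(word):
    # Hash the word to one of BUCKETS buckets, one-hot encoded.
    h = int(hashlib.sha256(word.encode()).hexdigest(), 16) % BUCKETS
    return [1 if i == h else 0 for i in range(BUCKETS)]

def privatize(onehot):
    # Runs on the device: flip each bit independently with prob P_FLIP.
    return [b if random.random() > P_FLIP else 1 - b for b in onehot]

def aggregate(reports):
    # Server side: E[sum_i] = count_i * (1 - 2p) + p * n, so invert that
    # per bucket; popular items stand out above the noise floor.
    n = len(reports)
    sums = [sum(r[i] for r in reports) for i in range(BUCKETS)]
    return [(s - P_FLIP * n) / (1 - 2 * P_FLIP) for s in sums]

random.seed(7)
# 1,000 simulated devices, each reporting one (hypothetical) emoji use.
words = ["emoji_joy"] * 700 + ["emoji_cat"] * 300
reports = [privatize(encode(w)) for w in words]
estimates = aggregate(reports)  # debiased per-bucket counts
```

The per-person sample limit mentioned above matters too: repeated reports from
one device would let the server average away that one person's noise, so
capping reports per time window protects the individual, not the aggregate.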

~~~
xufi
Interesting. That's a smart way to collect data without having too much noise
flood the dataset. I need to watch this State of the Union.

~~~
jon-wood
State of the Union is what the WWDC keynote used to be before it started
being watched by the press and the public: much more technical detail, and
information on the underlying frameworks rather than user-visible features.

------
nxzero
Seems to be a huge amount of speculative commentary, which is acknowledged,
but to me it doesn't show the potential variation in implementing DP.

For example, Apple could easily collect all the data, run a DP analysis of the
impact of adding it to the existing aggregate data, clean out the identifiers,
and add it to the database.

Key here is that Apple would hold all the data and then purge the identifiers
from it, which is completely different from removing the identifiers before
sending anything to Apple.

_______

(Apple:) "Hi, I'm Apple. Trust me! Don't mind the black bag, I just like
being mysterious. It's cool, right?"

(Me:) "Umm, no, no thanks!"

_________

Apple needs to let go of the whole security-through-secrecy ploy, since it
looks more and more shady.

Imagine if the security modules of devices were public, and the non-secure
sections of the devices had to be encapsulated for EmSec and made
tamper-proof. If that were the case, security literally wouldn't be an issue:
either everyone is impacted, or no one is impacted.

------
chmike
Does it mean that Apple will randomly insert turds in my messages so that it
looks like the average user?

------
kordless
It's time for an Open Communications initiative. Time for companies to stop
owning the platform. Time for all of us to stand up for our right to
communicate with who we want, when we want, without being monitored,
inspected, blamed, or advertised to. Enough is enough. It's time for a change.

