Is this a typical Harvard thing?
Even referring to someone as "Dr. Whatever" in the third person is pretty unusual, if you've ever met them. If I were to speak of my college professors right now, I certainly wouldn't use "Doctor". Maybe if they were 70 or had a Nobel or something.
(Related: In the movie Avatar, one scientist introduces himself to another scientist as "Doctor Norm Whatever". It absolutely clangs, at least to my ears.)
It's an introduction to differential privacy for an academic audience (i.e. not necessarily computer scientists). It sweeps across a range of surprising real-life privacy attacks that are possible against anonymization approaches that feel good enough. Really gives you a sense of the sort of problem privacy protection faces in today's world of greatly increased data collection and computational power.
"A simple example, especially developed in the social sciences, is to ask a person to answer the question "Do you own the attribute A?", according to the following procedure:
1. Toss a coin.
2. If heads, then toss the coin again (ignoring the outcome), and answer the question honestly.
3. If tails, then toss the coin again and answer "Yes" if heads, "No" if tails.
(The seemingly redundant extra toss in the first case is needed in situations where just the act of tossing a coin may be observed by others, even if the actual result stays hidden.) The confidentiality then arises from the refutability of the individual responses.
But, in aggregate, these responses are still informative, since positive answers are given with probability one quarter by people who do not have attribute A and three quarters by people who actually possess it. Thus, if p is the true proportion of people with A, then we expect to obtain (1/4)(1-p) + (3/4)p = (1/4) + p/2 positive responses. Hence it is possible to estimate p.
In particular, if the attribute A is synonymous with illegal behavior, then answering "Yes" is not incriminating, insofar as the person has a probability of a "Yes" response, whatever it may be."
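To make the aggregate estimate concrete, here is a small simulation of the quoted protocol. This is only an illustrative sketch: the 30% prevalence and the sample size are made-up numbers, not from the article.

    import random

    def randomized_response(has_attribute: bool) -> bool:
        """One respondent follows the coin-toss protocol quoted above."""
        if random.random() < 0.5:          # first toss: heads
            return has_attribute           # answer honestly (second toss ignored)
        else:                              # first toss: tails
            return random.random() < 0.5   # second toss decides the answer

    def estimate_p(answers) -> float:
        """Invert E[fraction of Yes] = 1/4 + p/2 to recover an estimate of p."""
        yes_rate = sum(answers) / len(answers)
        return 2 * (yes_rate - 0.25)

    # Hypothetical population in which 30% truly have attribute A.
    population = [random.random() < 0.3 for _ in range(100_000)]
    answers = [randomized_response(x) for x in population]
    print(f"estimated p = {estimate_p(answers):.3f}")   # should come out close to 0.3

No individual answer reveals whether that person has A, yet the estimate of p converges as the number of respondents grows.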
This kind of example also predates the definition of differential privacy by about 40 years, although the motivation is pretty much the same.
Cynthia Dwork and Moni Naor. Pricing via Processing or Combatting Junk Mail. In Proceedings of CRYPTO, 1992. Also available as http://www.wisdom.weizmann.ac.il:81/Dienst/UI/2.0/Describe/
So the main question I have is: let's say I'm working with sensitive data like emails or doctors' notes. How can I train an ML model that still learns something useful without leaking private data?
When I say "leak", an example would be I train an RNN on some company data email data and when I feed the RNN "$AMZN" the network would say SELL.
How can I quantify how much the model has learned and how much privacy has been leaked?
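For the training side, the standard technique is DP-SGD (Abadi et al., 2016): clip each individual example's gradient so that no single record can dominate an update, add Gaussian noise scaled to that clip norm, and track the cumulative (epsilon, delta) spent with a privacy accountant. For the "how much got memorized" side, canary/exposure tests (Carlini et al., "The Secret Sharer") insert known secrets into the training data and measure how much more likely the model is to emit them than chance. Below is a rough numpy sketch of just the DP-SGD update on a toy logistic regression; the data and hyperparameters are made up, and no privacy accounting is done here.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stand-in for "sensitive records": 1000 examples, 10 features.
    X = rng.normal(size=(1000, 10))
    y = (X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000) > 0).astype(float)

    def dp_sgd_logistic(X, y, epochs=5, lr=0.1, clip=1.0, noise_multiplier=1.0, batch_size=50):
        """DP-SGD sketch: per-example gradient clipping + Gaussian noise, then average."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            for idx in np.array_split(rng.permutation(n), n // batch_size):
                xb, yb = X[idx], y[idx]
                preds = 1.0 / (1.0 + np.exp(-(xb @ w)))
                per_example_grads = (preds - yb)[:, None] * xb       # one gradient per record
                norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
                clipped = per_example_grads / np.maximum(1.0, norms / clip)
                noise = rng.normal(scale=noise_multiplier * clip, size=d)
                w -= lr * (clipped.sum(axis=0) + noise) / len(idx)
        return w

    w = dp_sgd_logistic(X, y)
    print("trained weights:", np.round(w, 2))

The clipping bounds each record's influence on the update, and the noise is what lets you state a formal (epsilon, delta) guarantee once you add an accountant over the training steps.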
For those left wondering -- it's the first double-quotation mark leaning the wrong way in TFA.