
The point of differential privacy is to allow for aggregate analysis without destroying the privacy of outliers. Researchers deal with noise all the time, so is it so odd that a field of researchers believes that adding enough noise to data released with studies will allow for conclusive analysis without ruining privacy for individuals?



As someone with access to data on 50 million patients, and having studied aggregation of medical data for the last 5 years, I can assure you that it's not as easy as it sounds.

The amount of noise that "theoretically guarantees" privacy protection in terms of epsilon renders any reasonable analysis impossible. E.g., how about the CDC telling you that there are 0 - 2000 cases of Ebola in Massachusetts?

There are the theoretical guarantees provided by differential privacy, and then there are the actual requirements of conducting public health or biostatistical research with a certain evidence value. The gap between the noise added by the former and tolerated by the latter is enormous.
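To make that gap concrete, here is a rough, purely illustrative sketch (the textbook Laplace mechanism on a single counting query, not anything the CDC actually does; the case count and epsilon values are made up) of how fast the released numbers blow up as epsilon shrinks:

  import numpy as np

  # Textbook Laplace mechanism for a counting query (sensitivity = 1).
  # Smaller epsilon means a stronger privacy guarantee and wider noise.
  def noisy_count(true_count, epsilon, rng):
      scale = 1.0 / epsilon  # Laplace scale b = sensitivity / epsilon
      return true_count + rng.laplace(0.0, scale)

  rng = np.random.default_rng(0)
  true_cases = 3  # hypothetical true case count
  for eps in [1.0, 0.1, 0.01]:
      draws = [noisy_count(true_cases, eps, rng) for _ in range(10000)]
      lo, hi = np.percentile(draws, [2.5, 97.5])
      print(f"epsilon={eps}: 95% of released counts fall in [{lo:.0f}, {hi:.0f}]")

At epsilon = 0.01 the released count is routinely off by a few hundred in either direction, which is how you end up with a "0 - 2000 cases" style answer.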


0-2000 in a population of 3.6 million does not seem unreasonable to me.


Sure, but you are neither a doctor nor an infectious disease specialist.

It's trained doctors and researchers who get to decide the quality of medical evidence. There is a reason why we have a detailed set of protocols and levels used for the assessment of evidence.

http://www.cebm.net/oxford-centre-evidence-based-medicine-le...


Outliers are the reason databases exist. Any "average" is readily apparent anyway, and therefore irrelevant for serious in-depth analysis.

Adding noise and fuzzing has a long history in statistics going back to the '70s [1], and while it does work on large numbers, it almost always messes up the details, i.e. the error bars.

C.D. DP is essentially a cheap ripoff of the ideas implemented in ARGUS [2].

[1] Dalenius (1977); see also the "Do Not Fold, Spindle or Mutilate" movement and earlier:

http://tpscongress.indiana.edu/impact-of-congress/gallery/fi...

[2] http://neon.vb.cbs.nl/casc/


> Outliers are the reason databases exist.

Disagree. Data is why databases exist.

> Any "average" is simply readily apparent, therefore irrelevant for serious in depth analysis.

I said "aggregate", not "average". There are many kinds of aggregate analysis useful (in Astrophysics, you can take many different samples from different stars and use the aggregate to compute commonalities in the sample that you would've detect with a single measurement). There is more to aggregate analysis than averaging data.

As for the rest of your points, I'm not a statistician so I can't comment. Also, I didn't downvote you (HN rules).


Sorry -- please substitute "average" (the original was also in quotes) with "category" or "factor", and you still have the bin I am talking about. You can put any label on it you like, such as "commonality", as long as you remove details, i.e. the other bins.

But as you say: your "aggregate analysis" NEEDS "many different samples from different stars". Commonality is the result of your analysis based on different samples. But since they are common, you can go and sample and have the result without doing mass surveillance on every star.

PS: I am fully aware of photo stacking, but also note that stars are not humans; see the context of privacy. Please look at ARGUS or sdcMicroGUI from CRAN to get a feeling for data utility vs. re-identification risk.


> But since they are common, you can go and sample and have the result without doing mass surveillance on every star.

"Mass surveilance" reduces noise and lets you get more data in a shorter period of time (telescopes have large fields of view, but they can't make time pass faster). Stacking (which is what the technique is called in Astrophysics) is very useful in this case. Not to mention that you can also do individual analysis as well.

Actually, most interesting of all is that you can do this type of analysis on objects like neutron stars that we can't observe directly because they're too faint. Because noise in telescopes can be modelled as a Poisson process, stacking actually increases S/N in a way you can't do without making much bigger telescopes.
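A toy illustration of that point (a pure simulation with made-up numbers, not Kepler data): with Poisson noise, stacking N exposures grows the signal like N but the noise only like sqrt(N), so S/N improves by roughly sqrt(N).

  import numpy as np

  rng = np.random.default_rng(42)
  flux_per_exposure = 5.0  # assumed mean photon count from a faint source
  n_exposures = 100
  n_trials = 20000

  # One exposure: Poisson noise, S/N ~ sqrt(flux)
  single = rng.poisson(flux_per_exposure, n_trials)
  # A stack of N exposures: the signal adds linearly, the noise in quadrature
  stacked = rng.poisson(flux_per_exposure * n_exposures, n_trials)

  print("single exposure  S/N ~", single.mean() / single.std())
  print("stacked (N=100)  S/N ~", stacked.mean() / stacked.std())
  print("expected improvement ~ sqrt(N) =", np.sqrt(n_exposures))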

PS: I'm not a statistician, so I can only speak to what I know. But my whole point is that researchers do know how to deal with noisy data, regardless of whether or not that noise is man-made. Interestingly enough, I found out recently that the NASA pipeline actually breaks certain data sets they have released (which have papers written about them), so man-made noise is a problem regardless of whether or not it's intentional.


"Not to mention that you can also do individual analysis as well."

This is the key point to argue against in the context of people, privacy and mass surveillance.

It is the touchstone of privacy, anonymity and crowd protection.

Regarding noise suppression: yes, the more queries (available data, whether raw or extracted), the more you can filter (ask a Kalman student) to reduce your error bars and margins. This is one reason why DP is overhyped. Also, if there are no differences between queries, then the data is redundant. See deduplication (databases) or scaling (measurement).
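A toy sketch of that filtering point, assuming a naive release that answers the same counting query k times with fresh, independent Laplace noise (this is exactly why real DP deployments track a cumulative privacy budget): averaging the answers shrinks the error roughly as 1/sqrt(k). The query value and epsilon here are made up for illustration.

  import numpy as np

  rng = np.random.default_rng(1)
  true_value = 42.0  # made-up true answer to the repeated query
  epsilon_per_query = 0.1
  scale = 1.0 / epsilon_per_query  # Laplace scale for a sensitivity-1 query

  # Averaging k independently-noised answers filters the noise back out.
  for k in [1, 10, 100, 1000]:
      answers = true_value + rng.laplace(0.0, scale, size=k)
      est = answers.mean()
      print(f"k={k:4d}  estimate={est:7.2f}  error={abs(est - true_value):6.2f}")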

About the analysis pipeline: this is why the mantra is "know your detector". Coincidentally, this is why releasing only the recorded datasets is next to useless for people outside the given research group. You would need to capture detailed knowledge of your data-taking operations and instruments, which happens rarely, if ever. Please cite the specific thing you mean by "the NASA pipeline"; perhaps you mean a given mission/experiment? In any case, detector recalibration is a usual, almost daily activity...


> Please cite a thing such as "the NASA pipeline", perhaps you mean a given mission/experiment?

The specific pipeline I was referring to is the Kepler pipeline that NASA uses to take their raw pixel data and produce the photon counts that everyone uses for their research (this wasn't a detector issue; it was a software bug at the final stage of the data publishing process). The point was not the pipeline issue; it was that noise is everywhere.

But as to your point, yeah okay. Maybe I shouldn't talk about statistics when that's not my field. :D


Downvoter: care to elaborate on the usefulness of a database of almost identical entries with MEANingless values?


Outliers are not the only useful thing in a set of data. If you remove outliers from most data sets, they don't suddenly become "almost identical" -- unless your outlier rejection criterion is "is it equal to 1".



