

What to do about re-identification? A precautionary approach to big data privacy - randomwalker
https://freedom-to-tinker.com/blog/randomwalker/what-should-we-do-about-re-identification-a-precautionary-approach-to-big-data-privacy/

======
beat
A basic problem with de-identification of data is useful uniqueness. The
sensitive data that needs to be de-identified tends to be the very data that
provides the uniqueness needed to make the dataset useful. Sufficiently
scrubbed data often doesn't work well for actual testing.
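A minimal sketch of that tradeoff (the records and field choices below are made up, not from the article): the quasi-identifying fields that make records analytically useful are often exactly the fields that make them unique, which is what k-anonymity measures.

```python
from collections import Counter

# Hypothetical de-identified health records: (zip, birth_year, sex, diagnosis)
records = [
    ("08540", 1975, "F", "diabetes"),
    ("08540", 1975, "F", "asthma"),
    ("08540", 1982, "M", "asthma"),
    ("08544", 1975, "F", "diabetes"),
    ("08544", 1990, "M", "flu"),
]

def k_anonymity(rows, quasi_identifier_cols):
    """Smallest group size over the chosen quasi-identifier columns.
    k == 1 means at least one record is unique, hence re-identifiable."""
    groups = Counter(tuple(row[i] for i in quasi_identifier_cols)
                     for row in rows)
    return min(groups.values())

# Keep all quasi-identifiers: some records are unique (k = 1).
print(k_anonymity(records, [0, 1, 2]))  # -> 1
# Scrub birth year and sex: k rises, but the data is far less useful.
print(k_anonymity(records, [0]))        # -> 2
```

Scrubbing until k is acceptable is exactly what drains the "useful uniqueness" out of the data.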

After years in banking and health care, I've seen businesses tilting at that
windmill too often. Hire outside consultants to help if you need to do it.

~~~
bcg1
The problem that I've noticed is that we DO hire outside consultants... but
too often that provides a security blanket for the decision makers: they
think that because they outsourced the work, they have nothing to worry
about anymore.

I think the author's approach is pragmatic. I think more decision makers need
to realize that data sets can just as easily be a liability as they can be an
asset. I work in healthcare and I see slow motion trainwrecks waiting to
happen all around me.

------
hackuser
Narayanan's article is valuable, but I think he's far out in front of public
policy. AFAICT the default in policy is that privacy doesn't matter; that's
the issue on the table right now. How to implement privacy protections or
deal with challenging technical issues seems to be far down the road.

My impression and limited experience is that policy makers don't care about
privacy; they don't understand it or its implications, they haven't the
slightest understanding of technical issues, and they have no motivation to
learn because their constituents don't understand or care.

The way it changes is if people who do understand it, such as those reading
HN, patiently spread the word: to people you know socially, in business, and
to your elected representatives. They need your help; they lack the resources
to study every public policy issue out there, especially those requiring
technical expertise.

~~~
nowarninglabel
>AFAICT the default in policy is that privacy doesn't matter

I think that may be true for some, but not for others. It depends a lot on
previous experience I think. My uncle was a county clerk who digitized a lot
of public records, for which he was rewarded by unhappy citizens filing a FOIA
request for his e-mail account (a great reason never to e-mail relatives at
their work address if you don't want it all potentially public someday). After
that, his office was a lot more cautious about data releases, even though the
records were public.

For some counter-examples on the business side, Netflix got bitten by the
re-identification bug and is now very careful about data releases. At Kiva, I
face this problem all the time, turning down research requests for what
would be interesting and useful analysis. We release an anonymized dataset of
our loan activity (which has taken a lot of work to properly sanitize). However,
we get requests nearly every day to release more information, which we can't
do, due to the real possibility of re-identification.

~~~
hackuser
Thanks; that's very informative.

> we get requests nearly every day to release more information, which we can't
> do, due to the real possibility of re-identification.

Is there a legal liability for re-identification? Either way, it's good to
know that you are being careful and thinking through the risks.

~~~
mukyu
Netflix got sued and was investigated by the FTC over the data released for
the Netflix Prize contest. One of the plaintiffs could be linked to their IMDb
account through similar reviews, and additional reviews in the Netflix dataset
implied that she was a lesbian while still in the closet.
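A rough sketch of how that kind of linkage works (all usernames, movies, and ratings below are invented, not the actual Netflix or IMDb data): a handful of ratings posted publicly under a real name can be enough to single out one pseudonymous subscriber in the "anonymized" set.

```python
# Pseudonymous id -> {movie: stars}, as in an anonymized release.
anonymized = {
    "user_17": {"Brokeback Mountain": 5, "Bound": 4, "The Matrix": 3},
    "user_42": {"The Matrix": 5, "Speed": 4, "Heat": 2},
}

# Ratings the same person posted publicly under their real name.
public_imdb = {"The Matrix": 3, "Bound": 4}

def best_match(public, candidates):
    """Return the pseudonymous user whose ratings agree with the
    public ratings on the most movies."""
    def overlap(ratings):
        return sum(1 for movie, stars in public.items()
                   if ratings.get(movie) == stars)
    return max(candidates, key=lambda u: overlap(candidates[u]))

print(best_match(public_imdb, anonymized))  # -> user_17
```

Once the pseudonym is pinned down, every other rating under it (the "additional reviews") is exposed too.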

AOL had to pay a settlement after the search data they released identified
people.

