

Anonymouse - ssclafani
http://blog.rapleaf.com/dev/2010/07/20/anonymouse/

======
cypherpunks01
I'm a bit confused about how this article fits in with the previously posted
article about the company (<http://news.ycombinator.com/item?id=1817631>).
There, they are harvesting PII data (e-mail) and re-selling access to related
information. In this article, they explain a machine-learning-type algorithm
to anonymize data to contain no PII. I don't really see how the two fit
together. What's their deal?

~~~
bpodgursky
Personally Identifying Information (PII) is defined as data which can uniquely
identify an individual--name, social security number, facebook ID. Rapleaf's
personalization service--which is designed to let websites personalize based
on who is viewing the website--does not serve PII. Instead, it serves
targeting data like "Age 40-50" and "Interests Basketball".

The idea behind the Anonymouse project is that a person should not be able to
be personally identified based on the targeting information served about them.
Only sets of targeting data which cannot be traced back to a unique individual
are stored in the personalization cookie.

For example, it would be okay to serve a targeting cookie about a person which
contained "Male, Wealthy, looking to buy a Ferrari", because thousands of
people fit that description, and the ad network or website cannot identify the
person with any reasonable specificity.

It would not be okay, however, to target a person as "52 year old Male, makes
$256,000, lives in Sometown OH, and was born on April 12th", because in all
likelihood, only one or two people fit that description, and the data would
effectively serve as PII, and this would be no better than dropping a facebook
ID in the cookie.
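
To make the two examples above concrete, here is a minimal Python sketch of the check being described: count how many people match a set of targeting attributes, and only serve the set if at least k people match. The function names, the toy population, and the threshold k are assumptions for illustration, not Rapleaf's actual code.

```python
from typing import Dict, List

Person = Dict[str, str]


def matching_count(population: List[Person], targeting: Dict[str, str]) -> int:
    """Count the people who match every attribute in the targeting set."""
    return sum(
        all(person.get(key) == value for key, value in targeting.items())
        for person in population
    )


def safe_to_serve(population: List[Person], targeting: Dict[str, str], k: int) -> bool:
    """Hypothetical rule: serve the set only if at least k people fit it."""
    return matching_count(population, targeting) >= k


# Toy population (made up for illustration).
population = [
    {"gender": "Male", "wealth": "Wealthy", "town": "Sometown OH", "age": "52"},
    {"gender": "Male", "wealth": "Wealthy", "town": "Othertown OH", "age": "47"},
    {"gender": "Male", "wealth": "Wealthy", "town": "Othertown OH", "age": "38"},
    {"gender": "Female", "wealth": "Wealthy", "town": "Sometown OH", "age": "52"},
]

broad = {"gender": "Male", "wealth": "Wealthy"}
narrow = {"gender": "Male", "town": "Sometown OH", "age": "52"}

print(safe_to_serve(population, broad, k=2))   # several people fit the broad set
print(safe_to_serve(population, narrow, k=2))  # only one person fits the narrow set
```

The more attributes a set contains, the fewer people match it, which is why the long, specific description ends up behaving like PII.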

Let me know if anything's unclear, or if you want more details. From a CS
perspective, it's a really cool/hard problem, and something we've spent a lot
of time on. We're planning on writing an update blog post on where we are...
as soon as we're done coding : )

~~~
robertk
I still marvel every time at how Hacker News seems to be home to the creators
of pretty much everything.

~~~
ynniv
... that gets extensive coverage on Hacker News.

------
seldo
This is pretty fascinating. I was expecting another cover-your-ass blog post,
but this is a pile of interesting data work which is new to me.

------
roel_v
A central concept in the article is k-anonymity, which I only learned about
maybe a year or two ago. On researching it, though, it appears that it was
introduced in a conference article in 1998 and published in 2002
([http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.163...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.163.9182&rep=rep1&type=pdf)),
which I found rather surprising. It's a shame that knowledge of this concept
seems to be mostly restricted to CS circles. Maybe the rather steep technical
barrier is preventing the uptake of better database privacy protection:
research seems to be published mostly in technical contexts (non-scientific
metric: half of the URLs in the results of googling for 'k-anonymity' have
'cs' and 'edu' in the domain name), and no literature on anonymizing datasets
seems to be available for people with only basic schooling in statistics and
database theory.

Of course there's also the issue that incentives in collecting and storing
sufficiently anonymous data are not aligned (and sometimes in direct
competition) with the goals for which the data is collected. This company
seems to be making great strides in this field with this research; I hope they
keep pushing the edge and publishing their results.

------
gyardley
Technically, this is really interesting, but what's the business justification
for the engineering work?

I'm skeptical that these efforts will protect Rapleaf or any other company
from public relations disasters or class-action lawsuits or harmful
regulation, because the solution is too complex for reporters, lawyers, and
politicians to understand. (The cynic in me thinks that even if they _did_
understand it, that's not going to get in the way of a good story / lawsuit /
feel-good cause for the public.)

While I'm sympathetic, declaring that your dataset is 16-anonymous due to
cluster-based suppression isn't going to persuade anyone who has already
decided that behavioral targeting is the devil.
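
For anyone wondering what "k-anonymous" means at the dataset level, a rough Python sketch: a dataset is k-anonymous if every combination of attribute values that appears in it is shared by at least k records. The code below uses plain column suppression as a stand-in, which is much simpler than the cluster-based suppression the article describes; the function names and sample records are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]


def anonymity_level(records: List[Record], columns: List[str]) -> int:
    """Return the size of the smallest group of records sharing the same
    values in the given columns (the k for which the data is k-anonymous)."""
    groups = Counter(tuple(record[c] for c in columns) for record in records)
    return min(groups.values()) if groups else 0


def suppress(records: List[Record], column: str) -> List[Record]:
    """Blank out one column in every record (simple column suppression)."""
    return [{**record, column: "*"} for record in records]


# Made-up sample records.
records = [
    {"age": "40-50", "interest": "Basketball", "zip": "44101"},
    {"age": "40-50", "interest": "Basketball", "zip": "44102"},
    {"age": "20-30", "interest": "Chess", "zip": "44103"},
    {"age": "20-30", "interest": "Chess", "zip": "44104"},
]
columns = ["age", "interest", "zip"]

print(anonymity_level(records, columns))                   # every record is unique
print(anonymity_level(suppress(records, "zip"), columns))  # groups of two remain
```

Suppressing the zip column trades detail for larger groups, which is the basic tension any suppression scheme has to manage.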

------
CGamesPlay
I assume that this was just a feel-good article written in response to their
name appearing in WSJ recently. The project itself has been around a few
months, at least publicly. The engineering described isn't particularly novel,
but it is a high-level overview of a large-scale project.

~~~
bpodgursky
We wrote this blog post back at the end of July about the project, not in
response to the WSJ article. There have been a few posts since then discussing
other aspects of the project, and there will be an update coming quite soon.

