
OkCupid Study Reveals the Perils of Big-Data Science - sonabinu
https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/
======
chatmasta
My first instinct in this case is to agree that the data is public, or at the
very least, protected by account login and therefore only legally protected by
the OkCupid terms of use. I see no possible legal argument against the methods
of data collection, unless OkCupid wants to sue the researcher for breaching
terms of use. Or, perhaps, users could sue OkCupid for failing to diligently
protect their data.

That said, this was an academic study, presumably funded at least in part by
the university, and therefore subject to approval from an Institutional Review
Board (IRB) [0]. Granted it was not an American university, so perhaps no IRB
exists. And it was a small study by a group of graduate students, which seems
unlikely to be subject to as much scrutiny as a study officially funded and
endorsed by the university. However, this is no excuse for circumventing
research ethics. To quote the wiki article, "a key goal of IRBs is to protect
human subjects from physical or psychological harm," which implies that moral
arguments should hold at least as much weight as legal arguments when
considering the potential impact of a study.

It's entirely possible for a study to be legal, but immoral. Since this study
came from a university, any moral argument should be as valid as a legal one.
The trouble seems to be that nobody heard any argument, moral or legal, before
approving the study.

In this case, the failure seems to lie with the Danish university, which
allowed its graduate students to conduct a study without consultation from an
IRB or any kind of ethical oversight committee.

Interestingly, the data _is_ "public," in the sense that anyone could
replicate this study with some Python and OkCupid accounts. So even if
researchers are stopped by an IRB from collecting the data, that doesn't mean
another entity can't do the exact same study, free from any procedural
impediments. So it makes you wonder about the premise of questioning the study
in the first place. Perhaps a better use of this soapbox, instead of
lambasting the researchers, would be to raise awareness of the fact that _what
you put onto the Internet is public._

[0]
[https://en.wikipedia.org/wiki/Institutional_review_board](https://en.wikipedia.org/wiki/Institutional_review_board)

~~~
AnthonyMouse
> Perhaps a better use of this soapbox, instead of lambasting the researchers,
> would be to raise awareness of the fact that _what you put onto the Internet
> is public._

That is maybe not the best lesson to take from this.

The problem seems to be that people want to create a binary distinction
between "public" (meaning public domain, can be used for anything by anyone)
and "private" (meaning something known solely to a single individual), with no
intermediary steps between them.

But profile data like this is something that lives in the space in between.
You're clearly intending to share it with people in a particular category
(other similarly-situated users) but not _everyone_ and not _forever_. It's
the sort of data you expect not to be archived; if you delete your account
then it should go away.

That is not how the laws of physics work. If you allow the data to be seen by
other users and anyone can sign up to be another user then anyone can see the
data and anyone can copy it. But that is why we have social norms. That is why
finding the online dating profile of everyone who works at the same company as
you and then mass mailing it to everyone in the office is not considered
friendly behavior.

And probably what companies that run sites like this should be doing is
adopting a robots.txt that disallows crawling of user profiles and then rate
limiting to some maximum number of profile views (e.g. 10,000/month) that no
real user would ever hit but would provide a good hint to data hoovering
operations that they aren't wanted here.
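
A rough sketch of the rate-limiting half, assuming a sliding 30-day window per account. The class name, the account ids, and the in-memory storage are all made up for illustration; a real site would presumably keep these counters in a shared store rather than a dict.

```python
from collections import defaultdict, deque


class ProfileViewLimiter:
    """Sliding-window cap on profile views per account (hypothetical sketch)."""

    def __init__(self, max_views=10_000, window_seconds=30 * 24 * 3600):
        self.max_views = max_views
        self.window_seconds = window_seconds
        # account id -> timestamps (in seconds) of views still inside the window
        self._views = defaultdict(deque)

    def allow(self, account_id, now):
        """Return True and record the view if the account is under its cap."""
        views = self._views[account_id]
        # Forget views that have aged out of the window.
        while views and now - views[0] >= self.window_seconds:
            views.popleft()
        if len(views) >= self.max_views:
            return False  # over the cap: refuse, and do not record the attempt
        views.append(now)
        return True
```

An ordinary user never comes close to the cap, so only bulk collection should ever see a refusal.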

~~~
ghaff
>a robots.txt that disallows crawling of user profiles

It's a voluntary mechanism and convention that requests that pages not be
crawled, which is somewhat different from what you wrote. (Which isn't to say
that it shouldn't be used but AFAIK it has no legal significance.)
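
To make the "voluntary" part concrete, here is a small sketch using Python's standard `urllib.robotparser`. The `/profile/` path, the `ExampleBot` user agent, and the example.com URLs are assumptions for illustration, not OkCupid's actual layout. The point is that the check runs on the crawler's side: nothing stops a client from skipping it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all crawlers to stay out of user profiles.
ROBOTS_TXT = """\
User-agent: *
Disallow: /profile/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return what a well-behaved crawler would conclude from robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler calls this before each request; an impolite one just doesn't.
profile_ok = may_fetch(ROBOTS_TXT, "ExampleBot", "https://example.com/profile/alice")
about_ok = may_fetch(ROBOTS_TXT, "ExampleBot", "https://example.com/about")
```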

~~~
EdHominem
It's intended as a way to help robots crawl a site. If you have a robot that
doesn't need help it _should_ ignore the file.

Because so many people misunderstand it, robots.txt is often a list of what
you should crawl to find the good stuff.

------
curried_haskell
I really do not understand this argument or belief people have, that they can
send out their information into the world, via a network of unknown
intermediaries to a private third party in another country, and somehow their
data is still "private". That your data is posted for everybody to see on a
publicly accessible website, and somehow your data is still "private".

Here's something incredibly simple: if you don't want your data to be public,
then don't share it! Don't transmit it! Don't want the other users on OkCupid
storing your information? Why did you put it there in the first place!?!? The
internet is not now, nor was it ever, a private place.

These guys did nothing wrong as far as I'm concerned, besides maybe breaching
OkCupid's TOS. OkCupid can sue if it pleases them, but that's a matter for
civil law.

~~~
taberiand
If I'm out chatting with friends about my life and I'm in public then I may
have no real expectation of privacy, but that doesn't mean I won't be upset if
someone decides to follow me around with a spotlight shouting about everything
I'm saying.

------
lucb1e
Meta: did something change in Wired's policy? Because I can read the article
with an ad blocker turned on. Googling for recent articles about it, I don't
see anything.

Edit: Never mind, I can scroll all the way through and I can see the whole
article. But if I read from the top and then scroll, it blocks me. So they
give me a chance to see "oh it's all there," start reading, and then block it
for me. Talking about dark patterns... I mean, I get it from a financial
standpoint, and I'm totally going to circumvent this blocking now (private
window, inspect element, I don't know what'll work yet), but it's just shitty.

~~~
sdfjkl
Firefox's Reader mode seems to bypass this nonsense in one click.

~~~
brudgers
And bypasses Flash and page jumps as JavaScript and advertisements load.
Particularly late loading hero ads.

------
FordPrefectAO
If you follow the links to get to the dataset (for instance to see if you're
in it), you arrive here[1], which says

    
    
      Unavailable For Legal Reasons
      This record has been suspended
    

[1] [https://osf.io/p9ixw/](https://osf.io/p9ixw/)

~~~
oneloop
Yes, because the guy has already started feeling the heat.

~~~
minimaxir
Another article said OKC did file a legitimate takedown with the data
repository host, and they complied.

~~~
gohrt
"legitimate"? Is UGC copyrighted by the platform host?

~~~
toomuchtodo
It's most likely licensed by the user to the host (through a ToS), and the
host has the ability to use that license to submit takedown notices of their
collective database (which is how US copyright law works).

Good!

~~~
pbhjpbhj
> use that license to submit takedown notices of their collective database
> (which is how US copyright law works) //

It would be contract law based on the click-wrap/T&Cs of the website wouldn't
it? Copyright doesn't cover facts, only their arrangement; ergo as long as he
didn't copy the way the content had been presented he would apparently be
clear of copyright infringement.

In Europe we have database rights law too, is there an equivalent in the USA?

~~~
toomuchtodo
I don't believe user profiles are going to be considered facts, but creative
works. OKCupid's corpus as a whole would be protected as a database, but the
individual profiles would not be considered facts.

~~~
pbhjpbhj
The individual fields describing the character of the users are surely factual
type data, even if fictional they don't amount to creative works. A mugshot
(face image) is not likely to be considered sufficiently creative to amount to
a work of itself, and that's the most creative component of a person's profile.

~~~
toomuchtodo
Your opinion and a court's opinion are substantially different.

------
MasterScrat
In the grand scheme of things this is good news. People need to realise they
can't trust random websites with their deepest secrets.

There is no reason to think no one is scraping dating websites, social media
and other "public" personal data for personal profit. For blackmailing, social
engineering...

The sooner people get this chilling feeling that "wait, what I post online
may actually come back to bite me" the better.

------
dimino
This article begs the question. _Why_ does "Public not equal consent"?

~~~
peterwwillis
"Consent" is an agreement by two or more parties on some specific subject,
preferably when all parties are properly informed about the subject.

Putting your dating profile on a dating site, with preferences for matches,
means explicitly "I want these specific people to look at my dating profile -
FOR DATING". That is the obvious purpose of the site and the profile, and the
"preferred matches" only reinforce that, not to mention the internal filters
OkC allows you to use to block people who don't match a certain %.

There is not any explicit purpose or intent that says "I want this personal
information to be used by researchers in a large group study". There was no
explicit agreement between the user and these researchers. Hence, no consent.

However. The question then becomes, "Do they _need_ consent?"

The answer is: Check the Terms Of Service on the website. If the website
explicitly forbids using the data for research purposes, they can probably be
sued by the website for breach of TOS. If the TOS does not explicitly prohibit
research use, then no, they don't need consent.

Oh - and if it wasn't already obvious, the users don't own the data, so they
have no say in how it is used.

~~~
rue
> the users don't own the data, so they have no say in how it is used.

That statement requires proof, preferably in form of recent case law.

In either case, the TOS states: “You further agree that you will not use
personal information about other users of this Website for any reason without
the express prior consent of the user that has provided such information to
you.”

~~~
peterwwillis
> That statement requires proof, preferably in form of recent case law.

Uh, I'm pretty sure the company collected and possesses the data. It's on
their hard drives and they allow you to view it by connecting to their servers
using their software. A user, as an outside party, has no claim to someone
else's property. (But you're right, IANAL, so maybe there's some freaky
justification for being able to tell someone else what they can do with their
own data on their own servers?)

Well the TOS seems to spell it out! Unfortunately, as it is [probably] the
company's property, it is probably the company that would have to issue any
lawsuit. Then it's up to the courts to judge whatever legal argument is made.

~~~
rue
If the data is generated by or concerns the users, then they have ownership
(degree varying on jurisdiction).

------
aab0
So. I read the whole thing. And they cite several precedents. But what _are_
the perils? They never say.

------
vidarh
This guy better seek legal assistance quickly. EU data protection legislation
is strict, and data about sexual orientation etc. falls under particularly
strict protections. The combination of lack of consent and publishing
personally identifiable information in a different context, by a different
entity from the one that collected it, has plenty of potential for legal
challenges that could get rather expensive for him, and I think his glib
comment about how the data is already public won't get him very far in court -
EU data protection law takes context and what the subject of the information
has given consent to very seriously.

------
AdmiralAsshat
_When asked whether the researchers attempted to anonymize the dataset, Aarhus
University graduate student Emil O. W. Kirkegaard, who was lead on the work,
replied bluntly: “No. Data is already public.”_

The guy seems completely unrepentant to even the _idea_ that he might be
violating people's privacy.

~~~
oarsinsync
Worse still, it seems likely that they were using data that wasn't actually
public:

 _Their paper reveals that initially they designed a bot to scrape profile
data, but that this first method was dropped because it was “a decidedly
non-random approach to find users to scrape because it selected users that were
suggested to the profile the bot was using.” This implies that the researchers
created an OkCupid profile from which to access the data and run the scraping
bot. Since OkCupid users have the option to restrict the visibility of their
profiles to logged-in users only, it is likely the researchers collected—and
subsequently released—profiles that were intended to not be publicly
viewable._

~~~
AdmiralAsshat
Oh, I'm _quite sure_ it wasn't public. I had an OkCupid profile at one time
(thankfully deleted, and even then it had no identifying details and was
linked to a dummy email), and the only reason I agreed to sign up was the
site's assurance that absolutely NOTHING was publicly visible unless you had a
profile.

So, yeah, at the risk of breaking my attempted objectivity, I'd be personally
thrilled if the guy got sued.

~~~
sinxoveretothex
It seems to me that the problem here is that you were assured that things
weren't public. In that sense, I would say that OKCupid should be sued.

I mean, sure, academics might get stopped by lawsuits. Russian social
engineering hackers won't though.

~~~
MasterScrat
From
[https://www.okcupid.com/legal/terms](https://www.okcupid.com/legal/terms):

"You should appreciate that all information submitted on the Website might
potentially be publicly accessible. Important and private information should
be protected by you. We are not responsible for protecting, nor are we liable
for failing to protect, the privacy of electronic mail or other information
transferred through the Internet or any other network that you may utilize."

------
kevincox
I get that this isn't "socially acceptable" and that it requires an account to
access (accounts which can be created by the general public). But every time I
think about it I fall back to the conclusion that if they didn't release it it
would be fairly easy to collect anyways. So yes, this release made it easier,
but at some level I see that as a good thing to make people aware of.

I see why this got so much discussion; it's a very grey area from my point of
view.

------
gohrt
Article doesn't mention OKC ToS one way or the other. What are the ToS?

Most sites ban content scraping by users (And good sites have technical
countermeasures)

------
collyw
70,000 users is now big data....?

------
apecat
Here's a thing I wrote on my personal FB wall as I linked to this article.

####

Some creepy Danish assholes went ahead and released a full dataset on 70.000
random users of the OkCupid dating service "because data science".

After being confronted about the questionable ethics of this stunt, lead
researcher Emil O. W. Kirkegaard refuses interviews, stating that the data was
already public. Kirkegaard also implies that opponents of his methods are
"social justice warriors", an oftentimes euphemistic term for people
interested in things like gender and minority rights issues. Interestingly,
this term is repeated and specified as an area explicitly excluded from
Kirkegaard's interest in "civil liberties" on his personal website.

In addition to "civil liberties", the personal website of this "26 year old
polymath wannabe" (read: fuckwit neckbeard) lists interests such as online
privacy and the Pirate Party. That's right, you can't make this shit up:
[http://emilkirkegaard.dk](http://emilkirkegaard.dk)

These people simply don't care that the "public" information posted on a
dating site may cost certain people their jobs, families or even their lives,
at any point in the future. Open source intelligence like this may be used to
extort people at a later date or lower the threshold for invisible
discrimination everywhere from the job market to insurance claims.

Perhaps all this seems irrelevant to Kirkegaard, as his self-defined interest
in liberty explicitly excludes those most vulnerable to such structures.

The rest of us could do worse than to remember what is going on and question
how informed our consent really is: By default, everything you do on the
mainstream internet is recorded. Private browser modes don't really help to
protect you and neither do VPNs. Practically everything you typically do leaks
information that can be connected back to you. For a demo, see
[https://panopticlick.eff.org](https://panopticlick.eff.org).

And oh, dating sites don't exclude sharing your dating related personal info
with interested third parties (
[https://www.abine.com/blog/2012/privacy-on-okcupid/](https://www.abine.com/blog/2012/privacy-on-okcupid/) ). But hey, you
probably don't live with one foot in the closet concerning any aspect of your
life you may have hinted at in a dating profile or its chat service. Right?

This time, professional stalkers in imaginary data science lab coats just
happened to unmask a bunch of information. The wilful cluelessness apparent in
this case is even worse than in the decade-old case where AOL released a bunch
of poorly anonymised data on search habits:
[https://en.wikipedia.org/wiki/AOL_search_data_leak](https://en.wikipedia.org/wiki/AOL_search_data_leak)

Normally, interested parties would have to pay a bunch of money for datasets
from data brokers or ad/analytics companies and perform a varying degree of
computation to unmask and correlate some of the personally identifiable
information from different sources. That sounds like work, but it's of course increasingly
possible to automate.

So, take a step back and realize that the only thing protecting you from an
existence in something way worse than the oversight of Stasi, the KGB or
Securitate, is the thin veil of aging and increasingly inefficient legal
protections we happen to have in many western democracies.

Parties interested in exploring the utter and complete exploitation of your
most intimate information sometimes do so with a holier-than-thou aura of
privacy mindedness and selective personal liberty politics. All while
personally getting their filthy little hands ever dirtier when the power is at
their fingertips.

####

~~~
peatmoss
> an oftentimes euphemistic term for people interested in things like gender
> and minority rights issues.

I wouldn't call it euphemistic. It's willful dismissiveness. And I'd posit
that the Venn diagram intersection of users of the term and Ayn Rand super
fans is large.

~~~
TDL
I don't understand why it's necessary to make this political. What these guys
did is wrong; I see no need to paint those who have similar politics with the
same brush as this person.

~~~
peatmoss
The social justice warrior part was brought up by the grad student in
question. I'd argue that it was he who attempted to turn an isolated act of
bad judgement into a political statement. I am, in response, stating that
people who think in terms of labeling others as social justice warriors tend
to have shitty politics.

