
The OKCupid data release fiasco - miraj
https://socialmediacollective.org/2016/05/18/the-okcupid-data-release-fiasco-its-time-to-rethink-ethics-education/
======
yk
I think there are two important points, first that there is no longer a clear
distinction between public and private spaces. To pick the immediate example,
I am currently sitting in my bedroom, but on the other hand I am writing in a
public forum. Plus there are conceptually less clear examples, for example
chat rooms which are basically public, but I still have the expectation that
everybody present belongs to a small community.

The other important point is that there is a shift to questions instead of
answers. Technology moves just too fast to give good guidelines in terms of
rules. It would perhaps help to define a questionnaire that researchers have to
fill out, just to demonstrate that they thought about the question. For
example, the question "Which groups are especially vulnerable to a leak of the
data?" may prompt filtering of said groups and prevent some damage. (That
obviously does not help against black hats, but would probably have helped in
the current OKCupid case.)

And lastly I think that there is a mismatch in what "reasonable expectation of
privacy" means. Most people do not have any idea what "big data" means; they
believe that Facebook reviews flagged comments, but do not understand that
they can (and I would guess do) analyze each and every interaction with
Facebook. (I don't mean to single out FB, it is just they are the most obvious
example due to their size and the sort of private nature of the network.)

~~~
yeukhon
Sorry, but neither point is convincing to me.

First, I don't get why sitting in one's bedroom and writing in a public forum
illustrates a blurred distinction between public and private spaces. You
already noted the forum is public. IT IS PUBLIC. If you put up your web page
on the Internet and only search engines ever come to your page, well,
your page is still in the public domain. Being in a small community, take a cult,
well, if you show up in public and do your activity in public, I
would expect that to be a public demonstration. In fact, if you try to run
naked with your cult in the middle of the street, you are going to get
arrested and charged for public misconduct. If you are running naked in a cult
house, well, that's private. The distinction is very clear in these examples.
I am not arguing there aren't examples that would illustrate a blur, but your
example is too weak.

My response to the second suggestion: well, a leak of data is a big deal. So if
my research somehow harms people with AIDS, am I allowed to collect data on the
non-AIDS patients? Another way to put it, filtering doesn't do any good and
provides no meaning. From an ethics standpoint, people aren't concerned about
what groups will be harmed from the outcome of the research - because the
whole point of research is to either validate or invalidate some hypothesis.
Research ethics should be about the treatment of data and the process of
conducting experiments.

However, I do agree that some users aren't aware of what online services do
with their data.

------
minimaxir
I wrote and posted a statistical analysis of the OKCupid dataset, but took it
down immediately once the information about how the data was obtained came to
light.

The problem here is that the data included in the release (answers to
questions) can _only_ be accessed by logged-in users, and the scraper they
used utilizes a login, in which case the researchers blatantly violated the
ToS and a DMCA takedown of the dataset was a valid response by OKC.

However, almost everything else in an OKC profile, including username, city,
and sexual orientation, is _public to logged out users by default_. (see this
analysis of mass-scraped OKC data:
[http://yakamo.org/?p=112](http://yakamo.org/?p=112)). That makes things a
grayer area than usual, and serves as a reminder that nothing online is
private.

~~~
us0r
> [http://yakamo.org/?p=112](http://yakamo.org/?p=112)

"I setup 7 servers with mpi4py and it took 48 hours to scrape 3.2million
profiles. Now i was ready to get the answer and find out how many profiles
where public."

I think that works out to around 20 requests/sec. At a certain point it
becomes borderline DoS?
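
A quick sanity check of that figure, as a rough sketch. It assumes exactly one request per profile, and that all 7 servers ran at the same time for the full 48 hours with the load spread evenly; none of that is confirmed by the quoted post.

```python
# Back-of-the-envelope scrape rate, using the numbers from the quoted post.
PROFILES = 3_200_000  # profiles scraped
HOURS = 48            # wall-clock duration of the scrape
SERVERS = 7           # machines used

def requests_per_second(total_requests: int, hours: float) -> float:
    """Average request rate, assuming one request per profile (an assumption)."""
    return total_requests / (hours * 3600)

# Aggregate rate seen by the target site.
aggregate_rps = requests_per_second(PROFILES, HOURS)

# Rate per scraping server, assuming all servers ran concurrently
# for the whole period and shared the work evenly (also an assumption).
per_server_rps = aggregate_rps / SERVERS

print(f"aggregate:  {aggregate_rps:.1f} requests/sec")
print(f"per server: {per_server_rps:.2f} requests/sec")
```

If the scraper needed several requests per profile, the real rate would be proportionally higher.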

~~~
curiousgal
20 requests per second per 7 servers so it comes out to 3 requests per second.
Not even close to a DoS attack.

~~~
anilgulecha
It's 20 rps -- not sure why you divided by 7. A DoS attack is measured by
total, not per server.

~~~
curiousgal
Well if you are going to total might as well add in the traffic from all the
other users. And in that grand scheme 20 requests per second won't matter.

I only divided by 7 because I thought he wasn't running all the 7 servers
simultaneously, come to think of it, I was wrong.

------
tuna-piano
I did some work for an organization, who, after the Ashley Madison leak,
downloaded the entire leak and searched through it to find out if any of their
employees were in the dump.

I wondered, wasn't the data still technically private data that was just
stolen and released? And wouldn't it be against the law to then download a
copy of that data yourself? But it was actually the legal department that
downloaded the dump, so I didn't have much to say.

~~~
dreamcompiler
Wouldn't it have been against the law for the company to have _acted_ on that
information? What were they planning to do, fire people who were having an
affair?

~~~
tuna-piano
Well, it was only for people using their work email addresses in the leak. And
I'm not sure, but I assume since most employees are 'at will' - they can be
fired at any time for any reason?

------
dvhh
I understand the issue there, but if it has been done by researchers, hasn't
this been done, or won't it be done, by some entity with a more defined agenda?

It was certainly unethical for the researcher to release the data, but the
public really really really needs to be educated about putting information on a
publicly available and searchable database.

------
jh021093
It makes sense that people are upset that this data is collected and released
in this manner - yes, it is public, but serving it via the API (not allowing a
user to scrape) lets its access remain under the control of the service (and
user, to some extent).

It seems, though, that this discussion is overlooking the fact that there are
companies out there which are gathering this exact same data (and then some) to
sell to other organizations who want to identify their users.

------
chrismcb
Is it time to rethink ethics education? Maybe we just need to spend a little more
time and effort on ethics in education. The author is right: just because it
is legal doesn't mean it is ethical. That is why we have ethics. As far as
data being public, I would claim it has to do with expectations of privacy.
Phone numbers in the white pages are public and really accessible. Releasing
that information doesn't change anything. On the other hand, the license plate
of a car parked in the parking lot is also public information, in the sense
that anyone can go down there and record it. But the owner has a certain
expectation of privacy. Plus, that data isn't really accessible (well, before
license plate cameras), so releasing that info changes its accessibility. The
issues with big data aren't really new. People have to consider that the data
they have involves real people.

~~~
meshko
No, people don't need to understand anything about "data they have", because
it is not people that have data, it is corporations, and they don't
"understand" things, they make profit. What people need to understand is that
they are data and that data is a product that can be used against them.

~~~
jonathankoren
Corporations are people. And I mean that not in the legal sense, but in that
they are made up of people. Real people that make real decisions, both ethical
and unethical, about real people, not "data".

------
kevin_thibedeau
I don't see how site scraping is a "data release".

~~~
fredophile
Because a lot of this information was previously only available to logged in
users of a particular service. Would you be okay with someone placing
recording devices under the tables of restaurants and then putting the
recordings and transcriptions online? Any conversations they pick up happened
in a public place where anyone could have overheard them.

~~~
meshko
Another bad analogy. 1) In public places I usually speak at the volume level
at which others can't hear me; if I start yelling in a public place I should
expect things that I yell to appear on youtube, in the local newspaper, in the
police log 2) I come to the restaurant to eat. If after eating I fill out a
survey which includes lots of personal possibly identifying questions, would I
have the right to be surprised if next day I find my answers on the website of
the restaurant? Have I read the fine print on that survey?

~~~
ubernostrum
How do you feel about sites which build "shadow profiles"?

~~~
meshko
I dislike them as I dislike all human activity which doesn't produce useful
goods or services; probably more so as these are probably harmful and
parasitic. But I'm not sure how this is related to the subject at hand.

~~~
ubernostrum
Just pointing out that there are more ways to get data about you into a
service than you giving that service the data. Would you still accept it as a
"painful lesson" if it happened to you?

------
syewpo
I don't understand the problem. It should be fucking apparent by now that no
site on the internet can / will / wants to protect your identity. That's on
you.

Anyone accessing websites with their names, emails attached to their names,
emails attached to their phone numbers attached to their names, birthdays in
their handles, handles that aren't random strings, identifying passwords,
avatars in use on other accounts, images also hosted on flickr, g+, insta,
etc. should take it for fucking granted that there is now zero reasonable
expectation of privacy.

We had the chance for the internet to be a beautiful, anonymous playground,
but a few too many people showed up dead on the chan and now the crybabies
with body image issues have managed to make all of the above borderline
required in the name of stopping "cyber-bullying" and "terrorism."

(try to sign up for an account, say a meetup account -- requires an email.
Burn emails are blocked for "spam." So you need a gmail. Gmail requires phone
number. SIM requires passport identification in europe... > meetup.com
requires passport identification... a few alternatives still exist, mail.com
for example doesn't require phone verification... yet. Facebook requires an
actual scan of the passport if you put in estimably fake information.)

And with that thought paradigm, that your ID is a requirement of internetting,
comes stupid people getting butthurt by shit like this "oh, but I put forward
all this identifying information with the expectation that I would be
anonymous, oh, oh, oh".

All this because some bitch on youtube can't handle the fact that obesity is
not in fact a disease, and because something that kills less people than
accidentally suffocating in bed has become an excuse to usher in the police
states of america.

There was a better way, where this would never happen. But we as a people
opted to go this new direction. Wear a fucking jacket if it's cold and
remember _you_ wanted to vacation on the baltic beaches in march, not me.

~~~
gohrt
Please don't post like this on HN.

------
reustle
IIRC, this data was all simply scraped from profiles, not socially engineered
out of people. While it might be morally grey, you need to half expect stuff
like this will happen when you post it. If you don't want things like your
sexual preferences to be public, then don't put them on a free public dating
website profile.

~~~
ubernostrum
If you didn't want the bank to let someone else have all your money, you
shouldn't have put your money in a publicly-accessible bank anyone could
visit. And you _certainly_ shouldn't have trusted the bank's promises that it
would impose rules on who can access money and how.

~~~
meshko
Terrible analogy. You are putting your money into the bank in order for the
bank to preserve the money and perhaps pay you interest on it. When you use a
website like OKCupid, you are getting a service from them in exchange for
the ability to mine and sell personal information about yourself. And, amazingly,
this will lead to... people getting your personal information. The lack of
public understanding of this is precisely what makes a dump like this
ethically gray. Sure, it's bad that people's information leaked, but at least
it wasn't for profit, and overall the public is better off because more people
understand how data works.

------
meshko
OK can someone ELI5 what is the fallacy of the "it's already public so it's
ok" argument? I definitely understand that legal != ethical; but i suspect
this is not clear cut at all in this case. I think it is similar to the
responsible disclosure vs immediate disclosure. Sure, people get hurt by this
and it is real and is a problem, but on the other hand could the educational
benefits outweigh that impact? People need to learn that they are being
watched and understand how stuff works, and maybe that lesson needs to be
painful. Is it perhaps better if the data is just accessible to everyone and
everyone knows about it than if it is only accessible to the highest bidder
and you don't even know about it?

~~~
jakevoytko
It's unlikely that I'll convince someone who feels a "lesson needs to be
painful," but here goes.

This is all about what tools enable.

This issue is critically important to people who have a much higher privacy
need than the average person. Consider an abusive ex tracking down their
victim in an unknown new city.

If the profile is indexable, the abuser is trying to either use OKCupid's
search (which returns results with a high degree of randomness), or using
things like "site:okcupid.com $interest1 $interest2", which is noisy, to say
the least. This task looks hopeless quickly.

But if someone provides a database of profiles, the job gets much more
tractable. Suddenly, you've enabled the attacker to filter by people who have
1,2,n interests of their ex, search with photographs of their ex, hell, even
use facial recognition. Some of these things were kind of possible before, but
with a database they are likely. If someone builds a "find your ex" web
frontend on this database that uses facial recognition technology, it becomes
accessible to everyone, and maybe even popular.

Saying that this lesson should be educational is theoretically nice, but
doesn't match how users interact with products in the real world. Anyone who
has worked on a product with millions of users knows that the average user (a)
doesn't know that there are settings, (b) doesn't know which things are
controlled by settings, (c) doesn't know how to find the settings, and (d)
doesn't know how to change them. This is so common that I no longer believe
that users are responsible for the implications of default settings (and
OKCupid's defaults are, inconveniently enough, Public).

Tools usually are capable of both good and bad things, but a database of
profiles has limited upside, and a huge downside. Like, this thing had better
be curing a minor fatal disease for the amount of trouble that it will cause.

~~~
meshko
So i read your response and everything you say reinforces my feeling that it
is better to publish this than not to. At least this was done not for profit.
If the data is collected and the data is for sale (and it is, because i
understand it was purchased legally), there _will_ be someone who builds the
service for aggregation and searching like you describe. And it might be very
expensive, but that crazy ex will pay that $1000 to find the person she is
after. In other words, saying "it's unethical to make access to public info
easy" is useless; if information is public, someone will access it and someone
will make it easy to access. It's better if this is done in a public way and
for free, than if someone sells it quietly.

~~~
ubernostrum
The problem still comes down to your "lessons should be painful" approach, and
how disproportionate it is relative to the perceived error here.

To take an analogy: if a child plays with matches, you teach the child not to
do that, explain how fire is dangerous, etc. You don't light a fire and hold
the child's face down on it to burn them and make the lesson "painful". Yet
when it comes to deliberately making sensitive/intimate data about people
public that basically _is_ what you're advocating for: the potential emotional
harm that can be done to the people involved is off the charts.

~~~
meshko
Frankly I don't understand what are these terrible secrets that people are
comfortable sharing on a public web site for everyone to see, but not
comfortable with everyone seeing them. I just fail to see the pain.
Discomfort, sure, but pain and suffering?

