
Ethical issues in research using datasets of illicit origin - rbanffy
https://www.lightbluetouchpaper.org/2017/11/07/ethical-issues-in-research-using-datasets-of-illicit-origin/
======
tempthrowaway23
(self-promotion)

To anyone interested in this topic: a few years back I published a paper on
'accidentally illicit' datasets. It's not my best work, but someone might find
it interesting.

[http://firstmonday.org/ojs/index.php/fm/article/view/2739/24...](http://firstmonday.org/ojs/index.php/fm/article/view/2739/2456)

Amusing anecdote: I wrote this paper (2) after a reviewer insisted they would
reject my other paper (1) unless I tested the paper's algorithm on _10% of all
images on the Internet_.

Talk about a hostile review!

(They also demanded I remove the performance comparison which showed the new
technique to be some 1000x faster than existing techniques and more
reliable... Hmm).

The reviewer then dragged out the review/response process for so long that I
had time to write/review/publish the ethics paper above, in between one round
of reviewer comments (!)

I then took the freshly published ethics paper to the editor for (1), and
asked them to disqualify the hostile reviewer for making unethical demands and
refusing to withdraw them even when this was pointed out.

The editor agreed. The reviewer was then replaced by someone else, who
replicated the entire work of (1) _completely from scratch using only the
description in the paper_, confirmed the result using their own datasets they
gathered privately, and who approved publication.

'Reviewer 1', they're always either the hero or the villain. It was an
interesting feeling to see the very worst type of reviewer being replaced by
the very best.

Anyway, that's the strange story behind this paper :-)

~~~
ghostbrainalpha
I love that story. But I have two questions.

1) What would be the possible motivation for such hostility from the first
reviewer?

2) Why did you create a temporary throwaway account but then promote a paper
with your real name and information?

~~~
rtkwe
1) The reviewer may have had similar work in progress and wanted to hold up
the publishing of OP’s paper while finishing their own. Or there’s some
personal animus. Or OP’s paper may have threatened to supplant the reviewer’s
work. There’s a whole plethora of reasons for a particularly hostile review.
It’s a big enough problem that my partner, who’s going through grad school
now, had the option of requesting that specific people NOT be among the
reviewers for her paper (her first publication in grad school as first
author!) because of scooping, animus towards her PI, etc. that might have
resulted in an overly hostile review.

------
dsacco
There is a very fine line between authorized data, technically public but
implicitly unauthorized data, and illegally obtained, unauthorized data.
Here’s an example of each in the financial sector, from my personal
experience:

1\. Financial account aggregators and “budget apps” like Mint monetize their
business, in part, by selling huge amounts of data to the financial sector.
Sometimes companies like Second Measure take raw data from companies like
Yodlee and clean it, then resell it. Nowadays there is an entire industry of
alternative market research that has had all sorts of participants, from
Foursquare (locations) to Spark (email enhancement). This is technically
authorized, because it’s in the TOS. The users effectively contribute their
own data.

2\. I developed an extremely accurate, reasonably generalizable method of
forecasting vehicle production at several companies that relies on
implementing a VIN searching algorithm in conjunction with legally required
NHTSA recall lookup portals hosted by each manufacturer. This data is what
you’d call unauthorized, because no entity explicitly endorses your use of it.
For example, several colleagues and I knew well ahead of time that Tesla would
miss on production of the Model 3s because they were utterly unrepresented in
our data. But this data is public, so it’s fine to use from a legal and
compliance standpoint. It was lucrative data specifically because it had a
high signal for revenue, yet was hitherto unused and unidentified.
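
The idea described above can be sketched roughly. Assuming the recall-lookup portal distinguishes "valid VIN" from "not found", and that the serial portion of the VIN is assigned roughly sequentially, a binary search over the serial range estimates the highest serial produced so far. This is a hypothetical illustration, not the author's actual method; `query_portal` is a stand-in for a real HTTP lookup against a manufacturer's portal.

```python
def highest_valid_serial(query_portal, lo=0, hi=999_999):
    """Binary-search the largest serial number the portal recognizes.

    query_portal(serial) -> bool: True if a VIN with this serial exists.
    Returns None if even the lowest serial is unknown.
    """
    if not query_portal(lo):
        return None  # nothing produced in this range
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward to find the last True
        if query_portal(mid):
            lo = mid  # mid exists; answer is mid or higher
        else:
            hi = mid - 1  # mid doesn't exist; answer is below mid
    return lo

# Fake portal that "knows" serials up to 48,213, standing in for a real lookup:
fake_portal = lambda serial: serial <= 48_213
print(highest_valid_serial(fake_portal))  # 48213
```

Sampling this estimate on two dates and differencing would give a rough production-rate signal, which is presumably how "utterly unrepresented in our data" translates into a miss forecast.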

3\. I once found, in the course of looking for legally usable data, an actual
security vulnerability disclosing all users of a publicly traded QSR’s online
delivery service, along with their phone numbers, email addresses and last
four digits of credit cards. This is both unauthorized and illegal, because
the data is contaminated with personally identifiable information and it
clearly requires a vulnerability (not just scraping) to acquire.

I’ve seen overzealous data vendors accidentally slip from #2 into #3, which is
really bad for all concerned. It’s not a great look for the vendor, who will
likely be fired, and it represents a breach for the company who owns the data
and its users. Any firm that has purchased the data will likely be contaminated
and be forced into a trading lockdown of that security for a period of time by
compliance.

My real concern is that illicit data like this is used in machine learning
research. Machine learning is already pretty frustrating - it’s common for me
to find research from a conference that I’m simply unable to replicate because
the training or experiment data is not available (this is annoyingly the case
with A/B experiment optimization research put out by giant companies in
particular). I worry that this trend of accepting machine learning research
without any requirement for total data transparency will incentivize
researchers to conduct their experiments using illicit data that doesn’t need
to be sourced.

~~~
RepressedEmu
Your second example is fascinating. Is finding unique datasets like this
part of your job? How lucrative is something like that?

~~~
bllguo
I believe dsacco's posted about doing similar work before. Apparently there
are teams of people at hedge funds that comb the web to find these kinds of
datasets - non-obvious signals for financial metrics. Very eye-opening stuff.
I always find his/her anecdotes interesting.

------
ncw33
Good work, Daniel. A laboriously-gathered overview of current practice, and
discussion of how to determine whether uses of illegally-obtained data are
justified.

~~~
barry-cotter
It may be a wonderful paper and discussion but the author seems ridiculously
positive about IRBs.

[http://slatestarcodex.com/2017/08/29/my-irb-
nightmare/](http://slatestarcodex.com/2017/08/29/my-irb-nightmare/)

HN discussion of the above

[https://news.ycombinator.com/item?id=15127271](https://news.ycombinator.com/item?id=15127271)

~~~
ncw33
That's a different field entirely. Just because some American hospitals have
trouble organising medical ethics reviews, does not mean that European or even
American CompSci researchers will run into the same problems. Indeed, Daniel
is (I think) positive based on good interactions with the ethics committee
here (my wife is certainly happy with the ethics reviews she's had).

Even in that SSC discussion linked, many SSC commenters agree that their own
ethics system was far easier, even in medicine.

------
sitkack
We gave the Japanese amnesty for
[https://en.wikipedia.org/wiki/Unit_731](https://en.wikipedia.org/wiki/Unit_731)
in exchange for their biological warfare data. We had the option to do the
right thing (burn it, put it on display), but we validated it.

~~~
TheAdamAndChe
> We had the option to do the right thing (burn it, put it on display), but we
> validated it.

I'm not sure complete amnesty was the best thing to do, but I don't think
destroying the data would be the best either. If we had burned the data that
was already gathered, it would have been a complete waste of human life. Those
people experimented on were murdered, and burning the data would have made
those murders pointless. By using that data, we could prevent human death in
the future.

~~~
crankylinuxuser
If I remember correctly, there was also a similar set of unethical decisions
in Nazi Germany that generated a body of knowledge about how humans survive
under extreme conditions: lack of oxygen, rapid (de)pressurization and
decompression, percentage of body burned and survival. Just a whole lot of
horrible things.

This was done scientifically, with controls and all that. The problem was
that they were intentionally treating their subjects (Jewish prisoners) like
lab rats.

But when WWII ended, we realized some of their data could be used for the
space race. And the US used it.

The ethical issues here are all-encompassing. Do you get rid of it and let
them die in vain? Do you use it, and benefit from people murdered under
genocidal conditions? There's no good way to answer this, as the converse has
a valid argument as well.

I look back and say, well, I was born in the early 80's. It's about 60 years
before my time (would have to be born in the 20's to serve in the 40's).

------
m3kw9
Is it the demand for it that is creating the rise of illicit data collections?

