
Twenty-nine teams use same dataset, find contradicting results [pdf] - alexleavitt
https://osf.io/j5v8f/
======
entee
This paper is awesome because it transparently folds the analytical approach
into the experiment being conducted.

There are two kinds of scientific study: those where you can run another
experiment (ideally one approaching the same question orthogonally) along with
rigorous controls, and those where you can't.

The first type is much less likely to have results vary with the analytical
technique (effectively, the second experiment acts as a new analytical
technique). Of course it does happen sometimes, and sometimes the studies are
wrong, but more controls and more experiments are always better.

However, studies where you're limited by ethical or practical constraints (i.e.
most experiments involving humans) don't have that luxury and are therefore
far more contingent on decisions made at the analysis stage. What's awesome
about this paper is that it partly gets around this limitation by trying
different analytical methods, each one effectively a new "experiment", and
seeing whether they all reach the same consensus.

Interestingly, very few features in the analysis were shared among a large
fraction of the teams (only two features were used by more than 50% of teams),
which suggests that whatever consensus emerges holds regardless of method. A
similar approach of open data and distributed analysis would be a really great
way to address some of the recent trouble with reproducibility in the broader
scientific literature.

------
dang
A blog post giving background is at
http://www.nature.com/news/crowdsourced-research-many-hands-make-tight-work-1.18508

~~~
jdp23
The blog post is a great overview as well as useful context, thanks for
sharing it.

TL:DR summary: Scientific results are highly contingent on subjective
decisions at the analysis stage. Different (well-founded) data analysis
techniques on a fairly simple and well-defined problem can give radically
different results.

It's very interesting research -- a great real-life example supporting the
models Scott Page et al. use for the value of cognitive diversity. The thrust
of the blog post is about where crowdsourcing analysis can be helpful (as well
as reasonable caveats about where it might not apply), which is certainly an
interesting question. Obviously, there are a lot of other implications to this
as well.

~~~
nkurz
Off-topic, but you seem well positioned to answer: Why do you say "TL:DR" here
when summarizing a short blog post that you enjoyed? Clearly the meaning has
diverged from the original abbreviated insult of "Too long; didn't read", but
I don't understand what people mean when they use it today. Why did you phrase
it this way? Are you a native English speaker? If not intended to be
derogatory, does the dissonance bother you?

~~~
jdnier
How do you see it as derogatory? I'm a native English speaker and have never
thought of it that way. I didn't click the link, but did appreciate his short
summary – and upvoted him for it. ;)

~~~
taneq
'tl;dr' is often a troll response to a long post that someone has obviously
spent a lot of time on. Bonus troll-points if the long post was in response to
another troll.

Example:

poster1: only retards use vi, notepad rules

poster2: _huge list of reasons why vi is better than notepad_

poster1: lol tldr

~~~
dalke
I was quite peeved when I first saw a "tl;dr" comment concerning one of my
blog posts. My thought was, and still is somewhat, "if you didn't read it, how
can you say it was too long for what it needed to cover? Why do you feel the
need to tell others that you have the attention span of a fly?"

We already have terms like "summary", "digest", and even "précis"; why create
a new term imbued with snark?

------
SilasX
Reminds me of the idea (Robin Hanson's, I think?) to add an extra layer of
blindness to studies: during peer review, take the original data, and write a
separate paper with the opposite conclusion. Randomize which reviewers get
which version. Your original paper is then only accepted if they reject the
inverted version.

~~~
gwern
I think you misremembered it:
http://www.overcomingbias.com/2007/01/conclusionblind.html

http://www.overcomingbias.com/2010/11/results-blind-peer-review.html
Nothing about being accepted only if they rejected the reversed version; just
that pro and con versions be supplied (first post), or a paper sans
conclusions/results (second post).

------
sndean
FiveThirtyEight did a write up of this paper (part 2):

http://fivethirtyeight.com/features/science-isnt-broken/

On the bright side, if you look at the 95% confidence intervals for the 29
analyses, almost all of them overlap.
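
If you want to eyeball that kind of overlap yourself, a minimal sketch of a
pairwise check is below; the odds ratios and bounds are made up for
illustration, not the paper's actual estimates.

    # Minimal sketch: do the teams' 95% confidence intervals overlap?
    # All numbers below are invented for illustration, not the paper's estimates.
    intervals = {
        "team_a": (0.89, 1.56),  # (lower, upper) bounds of a 95% CI for an odds ratio
        "team_b": (1.02, 1.48),
        "team_c": (0.96, 2.93),
    }

    def overlaps(a, b):
        # Two intervals overlap if each one starts before the other ends.
        return a[0] <= b[1] and b[0] <= a[1]

    names = list(intervals)
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            status = "overlap" if overlaps(intervals[x], intervals[y]) else "disjoint"
            print(x, y, status)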

------
joe_the_user
_" The primary research question tested in the crowdsourced project was
whether soccer players with dark skin tone are more likely than light skin
toned players to receive red cards from referees."_

This seems like a topic where one indeed typically winds up with a multitude
of competing conclusions.

Among other factors, we have:

* Pre-existing beliefs on the part of researchers.

* Lack of sufficient data.

* Difficulty in defining hypotheses (is there a skin-tone cut-off, or should one look for degrees of skin tone and degrees of prejudice? Should one look at all referees or only some referees?).

Given this, I'd say it's a mistake to expect _numeric data alone_, at the level
of complex social interactions, to be anything like clear or unambiguous. If
studies on topics such as this have value, they have to involve careful
arguments concerning data collection, data normalization/massaging, and _only
then_ data analysis and conclusions.

But a lot of the context comes from the prevalence of shoddy studies that
assume you can throw data in a bucket and draw conclusions, further facilitated
by having those conclusions echoed by the mainstream media or by the media of
one's chosen ideology.

------
krick
I understand how tempting it is, in our age of big data and all that stuff, to
perceive this as some curious new phenomenon, but it really is not. This is
precisely the reason we came up with criteria for "science" quite a while ago.
And in fact, this whole experiment is pretty meaningless.

So, for starters: 29 students get the same question on a math/physics/chemistry
exam and give 29 different answers. Breaking news? Obviously not. Either the
question was outrageously badly worded (not such a rare thing, sadly), or the
students didn't do very well and we've got at most one _correct_ answer.

Basically, we've got the very same situation here, except our "students" were
doing statistics, which is not really math and not really natural science.
Which is why it is somehow "acceptable" to end up with results like these.

If we are doing math, whatever result we get must be backed up with a formally
correct proof. Which doesn't mean, of course, that two good students cannot get
contradicting results, but at least one of their proofs must be faulty, which
can be shown. And this is how we decide what's "correct".

If we are doing science (e.g. physics), our question must be formulated in such
a way that it is verifiable by setting up an experiment. If the experiment
didn't get us what we expected, our theory is wrong. If it did, it _might_ be
correct.

Here, our original question was "whether players with dark skin tone are more
likely than light skin toned players to receive red cards from referees",
which is shit, and not a scientific hypothesis: we can define "more likely"
however we want. What we really want to know is whether, over the next N
matches happening in what we can consider "the same environment", black
athletes are going to get more red cards than white athletes. Which is quite
obviously a bad idea for a study, because the number of trials we need is too
big for such a loosely defined setting: not even one game will actually happen
in an isolated environment, players will be different, referees will be
different, and each game will change the "state" of our world. Somebody might
even say that the whole culture has changed since we started the experiment,
so obviously whatever the first dataset was, it's no longer relevant.
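
To make "more likely" concrete, one minimal way of operationalizing it is as a
rate comparison per player-match; the counts below are made up purely to show
the shape of the calculation, not real data.

    # Sketch of one way to operationalize "more likely": compare red-card rates
    # per player-match between two groups. All counts are invented for illustration.
    dark_cards, dark_player_matches = 30, 10_000
    light_cards, light_player_matches = 20, 10_000

    dark_rate = dark_cards / dark_player_matches
    light_rate = light_cards / light_player_matches
    rate_ratio = dark_rate / light_rate

    print(f"rate ratio: {rate_ratio:.2f}")  # 1.50 with these made-up counts

    # Even this tiny sketch already bakes in contestable choices: what counts as
    # "the same environment", which matches to include, whether to adjust for
    # referee or league, and so on -- which is exactly the point above.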

Statistics is only a tool, not a "science", as some people might (incorrectly)
assume. It is not the fault of the methods we apply that we get something like
this, but rather of the discipline we apply them to. And "results" like these
are why physics is accepted as a science, and sociology never really was.

~~~
borplefark
Your rant makes no sense. I flip a coin 100 times and it comes up tails 99
times. You are basically saying that asking "is the coin more likely to come
up tails?" isn't a real scientific question. That's just silly.

------
DiabloD3
So, does this mean every team used improper methodology? Or can we meta-review
the results and figure out what's really going on?

~~~
DanBC
It makes it really hard to work out what's happening, especially if you want
the result to match existing standards.

For a real-world example of this, see deworming schoolchildren.

People looking at the educational effects of deworming children reach
different conclusions because some of them use a medical model and some of
them use an economics model.

http://www.cochrane.org/news/educational-benefits-deworming-children-questioned-re-analysis-flagship-study

http://www.cochrane.org/CD000371/INFECTN_deworming-school-children-developing-countries

Talked about in this More or Less episode:

http://www.bbc.co.uk/programmes/b0659q1f

http://www.theguardian.com/society/2015/jul/23/research-global-deworming-programmes

------
LunaSea
Of course it's social "sciences".

~~~
DanBC
Also medical treatment:
https://news.ycombinator.com/item?id=10387375

------
hmate9
Lies, damned lies, and statistics

https://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics

Statistics can be manipulated surprisingly easily.

~~~
justinlardinois
There are three kinds of lies. There are also three kinds of comments I see in
this thread:

> "This is interesting, here's some thoughts and ideas that further contribute
> to this subject"

> "This is interesting, here's a link to some further writing on this subject"

> "The entire concept and discipline of statistics is bullshit."

Par for the course here at Hacker News.

~~~
duaneb
Hey, the null hypothesis is powerful and valuable. I, for one, am happy that
all three types are well represented; all three are healthy in moderation.

I also think the quote fits in quite nicely here; it's not a wholesale
rejection of statistics.

~~~
justinlardinois
The quote itself isn't, but just posting a link to the Wikipedia article
without explaining how they think it applies here is pretty much a wholesale
rejection of statistics.

