

Majority of published scientific data not recoverable 20 years later - bane
http://www.upi.com/Science_News/2013/12/19/Researchers-say-valuable-scientific-data-disappearing-at-alarming-rate/UPI-70701387491288/

======
tokenadult
Jelte Wicherts and his co-authors put forward a set of general suggestions for
more open data in science research in an article in Frontiers in Computational
Neuroscience (an open-access journal).[1]

"With the emergence of online publishing, opportunities to maximize
transparency of scientific research have grown considerably. However, these
possibilities are still only marginally used. We argue for the implementation
of (1) peer-reviewed peer review, (2) transparent editorial hierarchies, and
(3) online data publication. First, peer-reviewed peer review entails a
community-wide review system in which reviews are published online and rated
by peers. This ensures accountability of reviewers, thereby increasing
academic quality of reviews. Second, reviewers who write many highly regarded
reviews may move to higher editorial positions. Third, online publication of
data ensures the possibility of independent verification of inferential claims
in published papers. This counters statistical errors and overly positive
reporting of statistical results. We illustrate the benefits of these
strategies by discussing an example in which the classical publication system
has gone awry, namely controversial IQ research. We argue that this case would
have likely been avoided using more transparent publication practices. We
argue that the proposed system leads to better reviews, meritocratic editorial
hierarchies, and a higher degree of replicability of statistical analyses."

Wicherts has published another article, "Publish (Your Data) or (Let the Data)
Perish! Why Not Publish Your Data Too?"[2] on how important it is to make data
available to other researchers. Wicherts does a lot of research on this issue
to try to reduce the number of dubious publications in his main discipline,
the psychology of human intelligence. When I see a new publication of primary
research in that discipline, I don't take it seriously at all as a description
of the facts of the world until I have read that independent researchers have
examined the first author's data and found that they check out. Often the data
are unavailable, or were misanalyzed in the first place.

[1] Jelte M. Wicherts, Rogier A. Kievit, Marjan Bakker and Denny Borsboom.
Letting the daylight in: reviewing the reviewers and other ways to maximize
transparency in science. Front. Comput. Neurosci., 03 April 2012 doi:
10.3389/fncom.2012.00020

[http://www.frontiersin.org/Computational_Neuroscience/10.3389/fncom.2012.00020/full](http://www.frontiersin.org/Computational_Neuroscience/10.3389/fncom.2012.00020/full)

[2] Wicherts, J.M. & Bakker, M. (2012). Publish (your data) or (let the data)
perish! Why not publish your data too? Intelligence, 40, 73-76.

[http://wicherts.socsci.uva.nl/Wichertsbakker2012.pdf](http://wicherts.socsci.uva.nl/Wichertsbakker2012.pdf)

~~~
omegant
It's an amazing topic. My wife is a doctor, and I'm amazed at how difficult it
is to tell whether an article is sound or important, whether it's been checked
properly, or whether the tree of citations that gives it its importance still
holds up after newer studies. I've been thinking about a visual standard that
tells you the status of a given study: how many reviews it has, and whether it
cites studies that are solid, with solid peer reviews. Ideally it should be
possible to trace a new discovery down to its scientific roots and easily see
how good it is. But to achieve this you need, as you said, to change the whole
scientific publication method: create some kind of standard publication API
for the data, the citations, the peer reviews, and the number of times a
result has been replicated. Some kind of debate forum around these
publications would also help, where people could coordinate replications,
discuss methods, etc. It is all a big mess, and certainly ready for
disruption.
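
To make the API idea concrete, here is a hypothetical sketch - the field
names, weights, and scoring rule are all made up, and it assumes an acyclic
citation graph. The point is only that if publications exposed reviews,
replications, and citations as structured data, "how solid is this result?"
becomes a computable question:

```python
from dataclasses import dataclass, field

@dataclass
class Study:
    doi: str
    replications: int = 0       # independent replications reported
    public_reviews: int = 0     # published, peer-rated reviews
    cites: list["Study"] = field(default_factory=list)

def solidity(study: Study) -> float:
    """Toy 0-1 score: a study is only as strong as its own evidence
    and the weakest study it builds on. Weights are arbitrary."""
    own = min(1.0, 0.25 * study.replications + 0.1 * study.public_reviews)
    if not study.cites:
        return own
    return min(own, min(solidity(c) for c in study.cites))

# The salt example below: a reviewed, once-replicated claim still scores
# zero because it rests on a never-replicated root study.
rabbit_study = Study("10.0000/hypothetical-rabbits")
salt_claim = Study("10.0000/hypothetical-salt-bp", replications=1,
                   public_reviews=3, cites=[rabbit_study])
print(solidity(salt_claim))  # 0.0
```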

One clear use case: it was recently published in the New England Journal of
Medicine (a tier-one medical journal) that all the studies linking salt to
high blood pressure trace back to an old animal study with rabbits (if I
recall correctly). The rabbits were fed the human equivalent of hundreds of
grams of salt, and since their blood pressure rose, it was deduced that salt
causes high blood pressure. The authors also did a meta-study of the
relationship between high blood pressure and salt and didn't find a clear
correlation. Maybe this is correct, maybe it's not, but the fact that a
medical truth as established as salt = high blood pressure cannot be properly
traced just lets you see how broken it all is.

edit: typos and spelling

------
natejenkins
The problem with getting scientists to publish their data is that the
incentives are not aligned properly. Clearly it is better for science in
general, but the benefit to the scientist publishing the data is less clear.
For what it's worth, I am working on reshaping the academic article so that
the raw data and analysis are stored alongside the text and figures, and so
that the reader can play with the data and analysis directly inside the
article. The idea is that this makes for a more interesting article from the
reader's perspective, and thus results in more visibility and citations for
the author.

This doesn't solve the problem for research where the datasets are in the many
terabytes, but then again there are many papers where the datasets are well
under a gigabyte.
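
As a toy illustration of the idea (not the actual product; the data and names
are made up): the smallest version is an article whose figure is a function of
data embedded in the same file, so a reader can edit either and re-run:

```python
import matplotlib.pyplot as plt

# The "article" carries its raw data inline (made-up numbers), instead of
# pointing at a static image whose source data is long gone.
DATA = {
    "dose_mg":  [0, 5, 10, 20, 40],
    "response": [0.10, 0.42, 0.88, 1.55, 1.71],
}

def figure_1(data=DATA):
    # The figure *is* the analysis: edit DATA and re-run to re-derive it.
    fig, ax = plt.subplots()
    ax.plot(data["dose_mg"], data["response"], "o-")
    ax.set(xlabel="dose (mg)", ylabel="response",
           title="Figure 1: response vs. dose")
    return fig

if __name__ == "__main__":
    figure_1().savefig("figure_1.png")
```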

~~~
Fomite
Incentives really are the problem - there's absolutely no reason for me, the
researcher, to put much work into maintaining my data. If I do, it's because
of personal standards, or the good of the field, or what have you.

No one got tenure for a well-curated data set.

~~~
darkmighty
There may be a contrived way out of this conundrum, but I think the most
sensible way is simply regulation.

We've had a ton of scientists sacrificing their own reputation and their
relationships with publishers (and those of their groups) in the name of
something everyone agrees with but few really stand up for, because the
personal gains are almost exclusively negative. That's a textbook use case
for regulation.

I don't know specifically what should be done, but perhaps publishers could
be required to meet certain obligations (e.g. responsibility for maintaining
papers over the long term and making them public afterwards); or maybe simply
a universal obligation to open publications after, say, 5 years.

People fear this will compromise the quality or sustainability of publishers.
But the community needs publishers: they're a tag of credibility. So if
publishers get into trouble (and they really are needed), they'll find a way,
e.g. by demanding payment for publication from the wealthiest labs.

~~~
Fomite
The NIH, Howard Hughes Medical Institute, and a number of other major funders
of science both in the U.S. and U.K. already require papers funded with their
money to be open access after 12 months.

The problem is that whether or not a paper is Open Access, "Data is available
on request from the author" may mean an undocumented bit of spaghetti code, a
Zip disk that's "around here somewhere, I'm sure," or data that is simply
lost.

Regulating "You must make your data accessible, and maintain it well" is much
harder to implement, and much harder to check. Some grants now have sections
describing what will happen to the data etc., but right now there really is
very little reason beyond their own personal desire for researchers to
maintain good quality software and data repositories.

------
lutorm
As someone who's made sure to keep the data from my PhD thesis around for
almost a decade now, I know that most of it is totally obsolete and will never
be looked at by anyone. Keeping data available is a laudable goal, but it'll
be a cache with a very low hit rate, I think.

However, this was simulated data; it could be recreated by checking out some
old version of the code and rerunning it. These days, it wouldn't even take
that long. The situation is different where someone has made measurements of
the real world, since those are truly irreplaceable.
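
"Recreated" means roughly the workflow below; the repository, tag, and script
names are hypothetical. The quiet assumption is that compilers, libraries,
and floating point still cooperate a decade later - in practice you'd want to
pin the whole environment too:

```python
import subprocess

REPO = "https://example.org/phd-simulations.git"  # hypothetical repository
TAG = "thesis-v1.0"  # the exact code version the published runs used

def regenerate(workdir="phd-simulations"):
    # Fetch exactly the tagged code that produced the published data...
    subprocess.run(["git", "clone", "--branch", TAG, "--depth", "1",
                    REPO, workdir], check=True)
    # ...and rerun it; deterministic given the same seed and config.
    subprocess.run(["python", "run_simulation.py", "--seed", "42"],
                   cwd=workdir, check=True)

if __name__ == "__main__":
    regenerate()
```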

------
eor
There are lots of people working on this problem. The NSF Office of
Cyberinfrastructure has funded quite a few projects working on long-term data
storage and discoverability. The largest of those are probably DataOne
([http://www.dataone.org/](http://www.dataone.org/)) and Data Conservancy
([http://dataconservancy.org/](http://dataconservancy.org/)). The hard part is
convincing scientists to use the tools that are available. The NSF already
requires that all new proposals include a data management plan. I imagine it
won't be long before they start requiring projects to deposit their data in a
public or eventually-public repository.

------
qwerta
All scientific research should be verifiable and reproducible. It can be hard
to reproduce physical experiments, but we should at least be able to verify
the author's work.

A study without the original raw data or source code is just the author's
opinion, not science!

~~~
rprospero
I'm happy to submit my raw data. What concerns me is where we draw the line:
what counts as the raw data?

For my thesis, I measured the Patterson function for a series of colloids. I
can imagine other scientists finding this useful and I'd be happy to submit
it. However, it's not the raw data. What I actually measured is the
polarization of a neutron beam, which I then mathematically converted into the
Patterson function. So I should probably submit the neutron polarization I
measured, so that other scientists can check my transformation. Except that I
can't directly measure the polarization - all I really measure are neutron
counts versus wavelength for two different spin states, so that must be my raw
data. But those counts versus wavelengths are really a histogram of time coded
neutron events. And those time coded neutron events are really just voltage
spikes out of a signal amplifier and a high speed clock.
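
In code, that chain of abstraction layers looks something like the toy sketch
below; the constants are made up and the final transform is stubbed, because
that step is instrument-specific - and each function boundary is exactly
where the failures I describe next can hide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer 0 -> 1: discriminated voltage spikes become time-coded events.
# Here the event times are simply simulated.
def detect_events(n=50_000, window_s=0.005):
    return np.sort(rng.uniform(0.0, window_s, n))

# Layer 1 -> 2: time-coded events become counts vs. wavelength.
def histogram_events(times_s, bins=100):
    wavelengths = 3956.0 * times_s  # toy scaling, not a real instrument constant
    return np.histogram(wavelengths, bins=bins, range=(0.0, 20.0))

# Layer 2 -> 3: counts for the two spin states become polarization per bin.
def polarization(counts_up, counts_down):
    total = counts_up + counts_down
    return np.divide(counts_up - counts_down, total,
                     out=np.zeros(total.shape), where=total > 0)

# Layer 3 -> 4: polarization becomes the Patterson function. Stubbed, since
# this transform is instrument- and sample-specific.
def patterson(pol):
    raise NotImplementedError("instrument-specific transform")

up, _ = histogram_events(detect_events())
down, _ = histogram_events(detect_events())
print(polarization(up, down)[:5])
```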

If a colleague sent me her voltage spikes, I'd assume she was an idiot and
never talk to her again. Yet, I've also seen experiments fail because of
problems on each of these abstraction layers. The discriminator windows were
set improperly, so the voltage spikes didn't correspond to real neutron
events. The detector's position had changed, so the time coded neutron events
didn't correspond to the neutron wavelengths in the histogram. A magnetic
field was pointed in the wrong direction, so the neutron histograms didn't
give the real polarization. There was a flaw in the polarization analyzer, so
the neutron polarization didn't give the true Patterson function. And all of
this is assuming that my samples were prepared properly.

I've seen all of these problems occur and worked my way around them. However,
I could only work around them because I had enough context to know what was
going wrong. The deeper you head down the raw data chain, the more context
you lose and the easier it becomes to make the wrong assumptions.
I know that I have one data set that provides pretty damn clear evidence that
we violated the conservation of energy. Obviously we didn't, but looking at
the data won't tell you that unless you have information on the capacitance of
the electrical interconnects in our power supplies on that particular day.

Research should be verifiable and reproducible. However, an order-of-magnitude
increase in verifiability isn't as useful as an incremental increase in
reproducibility. I'd be happy to let every person on earth examine every layer
of my data procedure to see if I've made any mistakes, but even I won't fully
trust my results until someone repeats the experiment.

~~~
Fomite
One concern of mine is also that being able to "Click Run and Get The Same
Answer" seems to assuage people and convince them that all is well, when what
really needs to happen is to have the experiment _repeated_ independently.

------
gnewton77
I made a great diagram which illustrates this, "Research Data and Metadata at
Risk: Degradation over Time":
[http://zzzoot.blogspot.ca/2010/12/research-data-and-metadata-at-risk.html](http://zzzoot.blogspot.ca/2010/12/research-data-and-metadata-at-risk.html)
for a paper I co-authored:
[https://www.jstage.jst.go.jp/article/dsj/11/0/11_11-DS3/_pdf](https://www.jstage.jst.go.jp/article/dsj/11/0/11_11-DS3/_pdf)

The diagram is based on one from an earlier paper: 'Nongeospatial Metadata for
the Ecological Sciences', 1997, Michener et al.
[http://dx.doi.org/10.1890/1051-0761(1997)007%5B0330:NMFTES%5...](http://dx.doi.org/10.1890/1051-0761\(1997\)007%5B0330:NMFTES%5D2.0.CO;2)

------
gaius
This would be a worthwhile use for the NSA datacentres: a raw scientific data
repository.

~~~
jagger27
A GitHub for science provided by Amazon might be more realistic.

~~~
icelancer
Isn't AWS Glacier built for this type of use?
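
It seems like it, at least for cold archival. The upload side is about this
much boto3 (the vault and file names here are hypothetical):

```python
import boto3

glacier = boto3.client("glacier")  # region/credentials from standard AWS config
glacier.create_vault(vaultName="raw-scientific-data")  # idempotent

# Hypothetical data file; archives over 4 GB need the multipart API instead.
with open("run_2013_counts.h5", "rb") as f:
    resp = glacier.upload_archive(
        vaultName="raw-scientific-data",
        archiveDescription="raw detector counts, experiment 2013-11",
        body=f,
    )

# Glacier hands back an opaque ID, and retrievals take hours. Lose the ID
# and the archive is as good as gone - which is the curation problem again.
print(resp["archiveId"])
```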

------
beloch
I've written papers where the raw data was small enough that it could be put
into tables and included within the online supplements. This is pretty much
ideal. Unfortunately, a lot of experiments generate enough data that no
publisher will store it for you. Given what a racket the scientific journal
business is, and that scientists pay thousands for each publication, journals
really should be expected to store and curate any data pertinent to a paper
they are paid to publish.

------
D9u
This highlights the fallacy of the old "once it's on the internet it's there
forever" meme.

