
Dumped on by Data: Scientists Say a Deluge is Drowning Research - grellas
http://chronicle.com/article/Dumped-On-by-Data-Scientists/126324/
======
neutronicus
I suppose this is as good a place as any for this rant:

If you go out seeking a specific type of data, it is almost impossible to find
it. Multiple times in the past, I've wanted to sanity-check against public
results, and gone looking for experimental data. I find plenty of papers, some
of which even purport to measure the exact thing I'm interested in, but _every
single fucking one of them_ just has a plot. A god damn plot. Give me the
data, I can make a god damn plot. I have MATLAB, I have gnuplot, it's no
problem. But give me a plot, and I've got to fire up an image editor and count
pixels - and of course it's a log-log plot, so I have to convert "pixels over"
and "pixels up" to whatever units I actually need.
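
For what it's worth, the pixel-to-units step is mechanical once you can read two tick marks off each axis. Here is a minimal Python sketch, assuming both axes are base-10 logarithmic; the function name `make_log_axis_mapper` and every calibration number below are made up for illustration, not taken from any real figure:

```python
import math


def make_log_axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to data units on a log axis.

    pixel_a and pixel_b are the pixel positions of two axis ticks whose
    values (value_a and value_b, both > 0) can be read off the plot.
    """
    log_a = math.log10(value_a)
    log_b = math.log10(value_b)
    slope = (log_b - log_a) / (pixel_b - pixel_a)  # decades per pixel

    def to_value(pixel):
        # Interpolate linearly in log space, then undo the logarithm.
        return 10.0 ** (log_a + slope * (pixel - pixel_a))

    return to_value


# Hypothetical calibration ticks (illustrative numbers only).
x_from_pixel = make_log_axis_mapper(100, 1.0, 500, 100.0)
# Image rows grow downward, so the larger y value sits at the smaller pixel row.
y_from_pixel = make_log_axis_mapper(400, 1e-3, 100, 1.0)

# A point clicked at column 300, row 250 in the image editor.
point = (x_from_pixel(300), y_from_pixel(250))
```

The mapper interpolates in log space between the two ticks, so the same function works for either axis and handles the flipped vertical direction through the sign of the slope.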

I'm tired of this shit. I can't read half the papers I see because they're
behind a paywall, and when I _could_ read them, none of them included data. If
you're an experimental scientist, your prose is probably worthless. I am
wading through your entirely passive-voice description of your experimental
setup (separate rant...) _solely_ so that I can find some god damn data. Give
me a drawing of your experimental setup, a csv of your data, and whatever
MATLAB/LabView/Excel kludge you used for any derived quantities, and you can
avoid something you obviously don't like (writing - or else you wouldn't do it
so badly) and I can avoid something I don't like (reading your writing).

Now that the internet exists, there's no _reason_ not to communicate actual
data sets - in the past, you'd have to deliver a phone book to your
subscribers if you did that, so this all sort of makes sense in an
anachronistic way, but it'd be so much more useful for the scientific
community if the funding required "post your data set, drawings of your setup
with specs for any instruments, and whatever MATLAB/LabView/Excel you used for
any derived quantities" instead of "publish a short paper, it better have
plots".

(This doesn't really apply to multi-TB data sets like the author describes -
obviously you're going to have to contact the PI and ask him/her to ship you a
hard drive)

~~~
beoba
Is there currently much motivation for a given researcher to keep the data
online somewhere after the paper is out? Who would handle the hosting?

Have you had any luck emailing the author of the paper?

~~~
joe_the_user
Indeed,

Everything from incentives to the attitude seems to make academic research
groups act like sealed little islands of "brilliance".

Researchers should be strongly "incentivized" to make their data and source
code available and immediately usable (i.e., not in Matlab). And the only way
to do that is to gradually devalue data-less, source-code-less research in the
way non-peer-reviewed research is now devalued. (And yes, it will be messy,
embarrassing, fragile code sometimes, but we'll have to live with that.)

The amazing thing is that small, for-profit, private research organizations
can still operate with this "what we produce is a pdf with a picture of our
research on it" attitude.

Btw, one organization that can help, and I think is helping, is PLOS: <http://www.plos.org/>

~~~
neutronicus
> not in Matlab

I disagree with this one. Matlab is nice, and I'm going to use it if it's
available (and if I'm doing linear algebra - if the problem is more
graph/tree-ish, I might reach for SciPy).

~~~
ylem
The problem with Matlab is that while it is fast for prototyping, it's
difficult to write maintainable code in it (which is why I switched to
Python). Furthermore, you run into the issue of sharing code with people at
institutions that don't have Matlab, and Octave may or may not be compatible.
Even when an institution has Matlab, it might not have access to the same
toolkits you use. For example, I use a Savitzky-Golay filter as part of a
peak-finding algorithm; because it's in Matlab and we have a DSP toolkit, I
just used that one. Because my collaborators didn't have it, I ended up
rewriting the code in Python, and this was for a group at a national lab. Now
imagine people at small universities and colleges.
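
To give a flavor of what such a port can look like, here is a dependency-free Python sketch. It is not the code discussed above: it assumes the classic 5-point Savitzky-Golay smoothing weights and a simple height threshold for peak picking, and the names `savgol5` and `find_peaks_above` and the test signal are invented for illustration. In practice SciPy's `scipy.signal.savgol_filter` covers arbitrary window lengths and polynomial orders.

```python
import math

# Classic 5-point Savitzky-Golay smoothing weights (quadratic/cubic fit).
SG5_WEIGHTS = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol5(values):
    """Smooth with the 5-point kernel; two points at each end stay unsmoothed."""
    smoothed = list(values)
    for i in range(2, len(values) - 2):
        smoothed[i] = sum(w * values[i + k - 2] for k, w in enumerate(SG5_WEIGHTS))
    return smoothed


def find_peaks_above(values, threshold):
    """Indices of local maxima of the smoothed signal that exceed threshold."""
    smoothed = savgol5(values)
    return [
        i
        for i in range(1, len(smoothed) - 1)
        if smoothed[i] > threshold
        and smoothed[i] > smoothed[i - 1]
        and smoothed[i] >= smoothed[i + 1]
    ]


# Made-up test signal: two Gaussian bumps centered at x = 3 and x = 7.
xs = [i * 0.1 for i in range(101)]
ys = [
    math.exp(-((x - 3.0) ** 2) / 0.5) + 0.8 * math.exp(-((x - 7.0) ** 2) / 0.5)
    for x in xs
]
peak_xs = [xs[i] for i in find_peaks_above(ys, threshold=0.5)]
```

Keeping the whole thing in the standard library is the point of the sketch: a collaborator without any toolbox license can still run it.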

~~~
neutronicus
I use Matlab mostly as a souped-up Excel, for prototyping, and for the
parallel computing toolbox.

If I plan on _maintaining_ something, I want static typing, so the whole
SciPy/Matlab/Octave/R debate is a little moot for me there.

~~~
ylem
I don't want to start a flame war, but I think static typing is overrated.

~~~
neutronicus
I like being able to see function signatures in a header file.

In Python or Matlab, I have to read the implementation to see what types it
wants. To me, having collections of type signatures in a separate file is a
huge boon when I have to come back to code I haven't seen for a while - it's a
10,000 foot view of what the code does that helps jog my memory - and it's
guaranteed to be _accurate_ , unlike documentation.
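
As an aside, Python's optional type annotations can give part of that overview, with the caveat that nothing enforces them at runtime, so they are not guaranteed accurate the way a compiled header is. A small sketch, assuming annotated functions; the physics helpers `attenuation` and `mean_free_path` and the helper `signature_summary` are hypothetical names made up for this example:

```python
import inspect
import math


def attenuation(sigma_total: float, thickness_cm: float) -> float:
    """Fraction of an uncollided beam surviving a slab (illustrative only)."""
    return math.exp(-sigma_total * thickness_cm)


def mean_free_path(sigma_total: float) -> float:
    """Mean free path for a given total macroscopic cross section."""
    return 1.0 / sigma_total


def signature_summary(functions):
    """Return one 'name(signature)' line per function, header-file style."""
    return [f"{f.__name__}{inspect.signature(f)}" for f in functions]


# Collect the signatures without reading any implementation.
summary = signature_summary([attenuation, mean_free_path])
```

Printing `summary` line by line gives a header-like listing of names, argument types, and return types.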

To the extent that static typing supposedly prevents bugs, I agree that it is
overrated. My neutron transport solver I wrote in Haskell had plenty of god
damn bugs after I got it to compile, thank you very much (perhaps a very deep
Haskell-fu is required to attain the mythical "neutron transport solver that
works right the first time" - in any case I realized that compiler-aided bug
prevention is not a _low-hanging_ fruit in Haskell, so I lost interest).

That was a long-winded way to say that I feel lost and alone without function
signatures, and that this preference is possibly only quasi-rational.

------
tgflynn
Why are article titles on the web so disconnected from the contents of the
posts?

This title to me meant "Scientists are complaining that greater availability
of data is making it harder to do research."

Instead it turns out the post is about scientists losing data due to not
having places/formats for storing it.

I have trouble understanding how someone thought this was a good title for
this post. Are titles being chosen simply to maximize click rates?

~~~
Alex3917
"Are titles being chosen simply to maximize click rates?"

I was under the impression that the editors usually choose four or five
different titles, and then after the first few thousand impressions the
software chooses the one that does the best to become the permanent title.

Apropos of this, a cool hack is that the more days in a row you click on an
item in Amazon the cheaper it gets. So if you want to save an extra 10% on
something, just click it once a day for a week or so until the price goes
down. (After a certain point though it starts going back up again, so you have
to time it right.)

~~~
beoba
Straying a bit from the original topic: For Amazon price tracking I've used
this and it's pretty good: <http://camelcamelcamel.com/> (They apparently
cover some other retailers too.)

~~~
L1quid
Glad you like it (I'm the author.)

------
zipdog
It seems like there needs to be a real break between the collectors of data
and the analyzers. At the moment heaps of data are collected by the same teams
that analyze it, but just as an API opens up a codebase to new ideas, so would
a standard practice of placing the data into a library before starting
analysis.

Fortunately there's already a growing number of places (typically government
in my experience) making an effort just to keep data libraries up and open to
further research.

~~~
fauigerzigerk
I agree that this is a problem, but I fear it's not as easy to solve as with
traditional libraries that store books. For data to be useful you need a lot
of very formal high quality metadata, and that metadata has to be collected by
the original researchers as part of their process. So I think there has to be
an incentive or mandate that makes researchers collect and document data in a
more systematic way. Librarians cannot do it for them.
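
One way to picture "metadata collected as part of the process" is a data file that cannot be written unless every column is documented. A minimal Python sketch under that assumption; the function `dump_dataset`, the metadata layout, and the sample rows are all hypothetical placeholders, not a real schema or real measurements:

```python
import csv
import io
import json


def dump_dataset(rows, columns, metadata):
    """Serialize rows to CSV text and metadata to JSON text.

    Raises ValueError if any column has no entry in metadata["columns"],
    so undocumented data never gets written in the first place.
    """
    documented = {col["name"] for col in metadata["columns"]}
    missing = [name for name in columns if name not in documented]
    if missing:
        raise ValueError(f"columns lack metadata: {missing}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue(), json.dumps(metadata, indent=2, sort_keys=True)


# Placeholder metadata and rows, made up purely to exercise the function.
example_metadata = {
    "instrument": "hypothetical detector",
    "columns": [
        {"name": "energy_ev", "unit": "eV"},
        {"name": "cross_section_barn", "unit": "barn"},
    ],
}
csv_text, metadata_text = dump_dataset(
    rows=[(1.0, 0.5), (2.0, 0.25)],
    columns=["energy_ev", "cross_section_barn"],
    metadata=example_metadata,
)
```

The check lives in the writer rather than in a later curation step, which matches the point that librarians cannot reconstruct this information after the fact.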

------
robbles
sounds like a good application for Fluidinfo:
<http://www.fluidinfo.com/about/>

if they made it a bit more accessible, scientists could potentially share some
of their data, and new research could associate their related findings with
the original data.

