
Too much of the research process is now shrouded by the opaque use of computers - rtplasma
https://theconversation.com/how-computers-broke-science-and-what-we-can-do-to-fix-it-49938
======
grhmc
In my experience, scientists are like extremely junior developers with PhDs.

They name functions "wstok" instead of "whitespace_tokenizer" because it uses
fewer keystrokes.

They can't be bothered to commit code. And when you do manage to get them to
commit code, it isn't the same code. It is close to the same code, but
different. They think "close enough" is close enough, even though they
recognize that without the exact code, it isn't the same thing. "Just a few
tweaks will get you from that code to my code!"

This is a very difficult problem for a startup to solve.

 _EDIT 1_ I've read other commenters writing about containers or just
providing a github repository. The problem isn't the availability of the
tools; the problem is convincing them to use the tools.

 _EDIT 2_ Scientists don't give a damn about reading code, or what happens to
the code the moment it leaves their hands. To quote one, "Why would I use a
coding style? Why did you read the code? Code is not meant to be read!" This
was a serious conversation.

 _EDIT 3_ The most difficult thing is remembering they're not developers who
care about developer tools. They're scientists who want to get results from
their very exciting experiments.

 _EDIT 4_ (as commented in reply to woah): we ("a startup") want to solve this
problem because we employ scientists. See:
[https://news.ycombinator.com/item?id=10539433](https://news.ycombinator.com/item?id=10539433)

 _EDIT 5_ We've made great progress on working with scientists, and we're
hoping to open source some of our tooling and write about what we've learned.

~~~
woah
I don't think that a startup can solve their coding problems. Only scientists
can solve their own coding problems by having the diligence and attention to
detail that one would expect of a high school dropout with 2 weeks of a coding
bootcamp under their belt.

~~~
mbostleman
Exactly. The root problem is that they're just not scientists.

~~~
nmrm2
"The code is bad therefore it's not science" is a questionable and even
arrogant attitude.

More often than not, the software is doing something that was previously
done by hand by some grad student or support staff (1). If those people made a
mistake, then tough luck. Hopefully you'd catch it with a sanity check on the
output. Were those people who relied on manual labor for centuries before the
invention of modern computers not doing science? Absurd.

IMO, what's happening is that scientists are doing the same sort of validation
that they _always_ did -- placing no trust in the process (the code), and
instead looking at checkpoints in the computation process and sanity checking
the final output.

None of this is to say that there isn't room for improvement -- there is. But
it's possible to do great science while writing shitty code. Hell, it's
possible to do great Computer Science while writing shitty code!

(1) The vast majority of scientists who program are not, in fact, doing
anything revolutionary with computation. It's just a more efficient way of
doing what they used to do by hand. And those who are using code in an
essential way typically (not always) have higher quality code.

------
maliker
Amen!

I work in electric utility research. Pretty much every paper gives some
simulation results. Getting the source? The only option is emailing and hoping
the researchers respond, want to share, and have code that runs on something
other than the laptop of the student who built it. I haven't ever succeeded in
getting working source code.

There are many simple improvements that could be made. Just a github repo for
each paper would be a monumental step forward. Installation instructions that
have been tested and proven to work would be valuable. And then good coding
practices, smart reuse of existing packages, and integration with other
projects would be beyond awesome (and pretty hard).

~~~
swiley
I'd imagine one of the problems with that is that most researchers must sell
the rights to their research to journal publishers.

~~~
haihaibye
Sell? They usually pay money to assign away the rights.

------
dougmccune
A project just launched yesterday is Depsy [1], from Impact Story. Depsy tries
to give credit to the coders who write scientific software that's used in
research. One element at play is that it's not worth it from a reputational
standpoint to go through the hard work of getting your code ready to release
publicly. It's the same problem we all have with open sourcing hacked together
side projects (except even relatively shoddy side projects do help build a
programmer's reputation). Depsy wouldn't totally solve that problem, but if
there was a good way to boost your reputation by contributing public code then
you might see changes in the industry. But like all things in academia, you
have to start with reputation and work your way to a solution from there.

[1] [http://depsy.org/](http://depsy.org/) and the blog post announcing it:
[http://blog.impactstory.org/introducing-depsy/](http://blog.impactstory.org/introducing-depsy/)

------
aiahopeful
This is exactly the problem that the Center for Open Science
[[https://cos.io/](https://cos.io/)] is hoping to solve with the Open Science
Framework [[https://www.osf.io](https://www.osf.io)].

These are the same people behind the large Reproducibility Project that was on
the top of HN not too long ago.

------
ubasu
I don't disagree that badly-written scientific code exists (both in academia
and in the industry), but I want to point out a couple of things for people
who have only a software engineering background.

1\. Keep in mind that most scientific code for numerical analysis, being based
in mathematics, follows the convention for using algebraic symbols, e.g. even
in physics, we write

F=m*a

instead of the wordy version, which we moved past a couple of centuries ago.
So using shortened variable names [1] comes from that background, and using
longer names, as in the Java world, seems like a regression.

2\. Writing code like [2]

    
    
        Ax_1 += Bx_1 + Cx_1 # Add Bx_1 to Cx_1
        Ax_2 += Bx_2 + Cx_2 # Same thing, but for Ax_2
        Ax_3 += Bx_3 + Cx_3 # "
        Ay_1 += By_1 + Cy_1 # "
    

is known as loop unrolling [3,4] and is used for optimization.

[1]
[https://news.ycombinator.com/item?id=10539078](https://news.ycombinator.com/item?id=10539078)

[2]
[https://news.ycombinator.com/item?id=10540446](https://news.ycombinator.com/item?id=10540446)

[3]
[https://en.wikipedia.org/wiki/Loop_unrolling](https://en.wikipedia.org/wiki/Loop_unrolling)

[4] [http://stackoverflow.com/questions/2349211/when-if-ever-is-loop-unrolling-still-useful](http://stackoverflow.com/questions/2349211/when-if-ever-is-loop-unrolling-still-useful)
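
To make [2] and [3] concrete, here is a minimal sketch in plain Python (the
function names are made up): both versions compute the same thing; the
unrolled one just trades loop overhead for repetitive-looking code.

    def accumulate_rolled(a, b, c):
        # One loop body, executed once per element.
        for i in range(len(a)):
            a[i] += b[i] + c[i]

    def accumulate_unrolled(a, b, c):
        # Manually unrolled by a factor of 4: fewer loop-counter updates and
        # branch checks per element, at the cost of repetitive code.
        n = len(a)
        i = 0
        while i + 4 <= n:
            a[i]     += b[i]     + c[i]
            a[i + 1] += b[i + 1] + c[i + 1]
            a[i + 2] += b[i + 2] + c[i + 2]
            a[i + 3] += b[i + 3] + c[i + 3]
            i += 4
        while i < n:  # leftover elements when len(a) is not a multiple of 4
            a[i] += b[i] + c[i]
            i += 1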

~~~
chrisseaton
Why isn't your compiler unrolling your loops for you?

~~~
ubasu
Modern compilers should, but there is a lot of legacy code.

------
e12e
If ever there was a case for programming as part of computer literacy, surely
this must be it? Give children an introduction to some basic version control
(also useful for writing in general!) and basic programming and maintainence
(eg: give a programming assignment in 2nd grade, have them improve it in 4th
grade).

That should leave some scars that can be built on in the phd. program.

(And either make sure they have a full number tower (probably best for
"most"), or make sure they know that computers, despite their name, suffer
from fundamental dyscalculia -- an inability to manipulate decimal numbers
exactly).
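
A concrete example of what I mean, in Python -- the same surprise exists in
any language using binary floating point:

    # 0.1, 0.2 and 0.3 have no exact binary representation.
    print(0.1 + 0.2)           # 0.30000000000000004
    print(0.1 + 0.2 == 0.3)    # False

    # An exact rational type from the "number tower" avoids the surprise.
    from fractions import Fraction
    print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True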

------
joeclark77
When I do simulations in my research, I put the code on GitHub. As the article
points out, the solutions are there, some people just aren't interested in
using them. Practically speaking, replication doesn't happen very often,
because there's no profit in it. What we need is an incentive to replicate
existing studies. Maybe replicating an existing study should be a requirement
for a master's degree, or for certain PhD-level research methods courses.

------
bluenose69
_Anecdote._ I suggested that a bright young PhD science student switch from
point-click software to scripting software for his analysis. He replied that
he would never do that; he was "too old to learn something like that."

------
jarofgreen
Very interesting, and we were talking about the exact same process at an Open
Data event this weekend - documenting the data processing work people do so
others can reproduce and build on it.

One thing that is not mentioned at all in this, as far as I can see, is
personal data. When the experiment handles personal data - a medical trial is
the most likely candidate - sometimes the details shouldn't be published, only
anonymised aggregate tables. Are there any guidelines for handling that?

------
Houshalter
There's a great comment here on the issue:
[https://www.reddit.com/r/MachineLearning/comments/39yj8y/why...](https://www.reddit.com/r/MachineLearning/comments/39yj8y/why_isnt_code_submission_mandatory_for_top_ml/cs7oca4)

> - __The code is a mess__ - Most code is written by grad students; it's
often sloppy scripts that are duct-taped together to get the numbers and
graphs needed for the paper. I know some students who essentially have to
rebuild things from scratch if they need to expand on what they did for their
thesis. Releasing this type of code is embarrassing to the dept, advisor, and
student.

> - __Don't want to support it__ - Even with disclaimers, there'll
inevitably be someone who downloads the code and emails asking for
help/support. Depending on the quality of the code, some may even claim that
your results are false because _they_ couldn't get it running on their
application. I'm all for peer review and agree that code should be available,
and further I know that the scenario is probably a rare occurrence, but still
some would rather not publish it to avoid it entirely.

> - __Simply don't know where or how__ - Granted, I'm in the engineering
school (not CS), so mileage may vary, but you'd be surprised how many students
I've talked to who just don't know much about GitHub or repositories (even
SVN).

> - __There's no incentive__ - Students are already overworked; what
benefit do they gain from publishing their code? It helps academic research,
yes. But most just want to finish their PhD and move on to the next thing;
anything else just takes time away from them accomplishing that goal.

> - __Funding conflicts__ - Part of the code I use was developed by another
student who was funded by a particular grant where a deliverable was required
at the end. Thus, anything I release cannot include that code, which
inevitably breaks the rest. I do not know the particulars of the grant or why
it is not allowed to be released; this is just the response I receive when I
ask to put my code online.

------
awch
Nature had a good article [0] about the use of IPython notebooks in science
that was previously discussed here [1].

[0] [http://www.nature.com/news/interactive-notebooks-sharing-the-code-1.16261](http://www.nature.com/news/interactive-notebooks-sharing-the-code-1.16261)

[1] [https://news.ycombinator.com/item?id=8563028](https://news.ycombinator.com/item?id=8563028)

------
jpolitz
In some software-focused conferences, artifact evaluation has become a new,
separate phase:

[http://www.artifact-eval.org/](http://www.artifact-eval.org/)

There is a separate committee charged with evaluating the software and other
artifacts (like data sets) that come along with a paper.

This is a nice start to a process that could be adopted in non-CS fields for
the evaluation of statistical results or software that analyzes data.

------
tejtm
If this issue is interesting to you from either the researcher or software
engineering side please consider working with, taking classes from, or just
supporting Greg Wilson's Software Carpentry
[http://software-carpentry.org](http://software-carpentry.org) or the
spin-off Data Carpentry.

Full disclosure: I am not affiliated in any way shape or form. Just a fan of
the intent.

------
Xcelerate
It's bizarre to me how byzantine the scientific community's grasp of
programming is. Variables are named things like "namxaqqs" and "stoggs1_r3".
4,000 line monster functions are rampant. If scientific code were a garden,
stuff like this would be the weeds:

    
    
        Ax_1 += Bx_1 + Cx_1 # Add Bx_1 to Cx_1
        Ax_2 += Bx_2 + Cx_2 # Same thing, but for Ax_2
        Ax_3 += Bx_3 + Cx_3 # "
        Ay_1 += By_1 + Cy_1 # "
        ...
    

Code with comparable functionality is _copied and pasted_ instead of being
factored into a single function. Everything is tightly-coupled: changing one
line of code is like pulling the keystone out of a bridge designed by an
oyster chef. If there _are_ any functions, then calling one mutates at least 7
global variables and induces 13 side effects that are more unpredictable than
eigenstate selection. File formats are non-standard and consist largely of one
giant concatenation of every variable in the program (all converted to strings
of course).
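
For contrast, a minimal sketch of what that copy-pasted snippet could look
like once factored -- NumPy assumed; the shapes and names here are made up:

    import numpy as np

    # Hold the components in arrays instead of dozens of scalar variables.
    A = np.zeros((3, 2))        # rows 1..3, columns x and y
    B = np.random.rand(3, 2)
    C = np.random.rand(3, 2)

    A += B + C                  # one line replaces every Ax_i / Ay_i statement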

Not-invented-here-syndrome is a badge of honor (LAPACK? Bah! I'll write my own
Gaussian elimination routine for this matrix with a million entries).
Libraries are embraced with the exuberance of a picky eater encountering
durian (as a rule of thumb, anything that is open source and has been vetted
by thousands of users is probably untrustworthy).
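
To be concrete about the LAPACK point: a couple of lines with NumPy's
LAPACK-backed solver handle a matrix with a million entries correctly and
fast (the sizes here are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((1000, 1000))     # one million entries
    b = rng.random(1000)

    # Vetted, LAPACK-backed solve instead of a hand-rolled elimination loop.
    x = np.linalg.solve(A, b)
    print(np.allclose(A @ x, b))     # True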

It is considered a waste of time to learn basic CS algorithms — efficiency is
merely an implementation detail, so problems that could have been solved with
a clever algorithm and an iPhone are instead brute-forced using millions of
hours of supercomputer time. Complexity classes are the abstract nonsense of
computer science — it's much easier to throw more hours at the problem (so
what if it's NP-hard? My algorithm probably converges to the global minimum.
Why wouldn't it?)

When garbage-collected languages are used, programs spend 99% of their time
allocating and deallocating small quantities of memory in tightly nested inner
loops (16 of them, no less). "Inlining" means putting comments inside of the
code instead of above it (if there even are any comments). "Cache locality"
has something to do with GPS systems. "Hashing", "recursion", and "quicksort"
are the names of recently-announced smartphones. And doesn't "SIMD" stand for
the Society for Inherited Metabolic Disorders?

/rant

It's a mess. Granted, there are researchers who write very high quality code,
but they are few and far between. I think the main problem is that a lot of
students/professors who get involved in computational research never had a
good CS background. Perhaps they had one "Computing for Engineers" class that
taught Matlab or Python, but that isn't nearly sufficient for research-quality
code. While I had the advantage of taking up programming as a hobby during
childhood, most graduate students have never programmed before in their life.
They're expected to learn something like C++ in a week (that's not a
hypothetical example).

Many researchers are afraid to publish their code alongside their paper
because they _know_ the code is low-quality. And that awareness causes a lot
of insecurities. If someone finds a bug in their 2-3 year research project,
then their entire conclusion might be invalidated. But I don't think that's
the biggest worry (most scientists ultimately want to know the truth about
their subject of study). I think the biggest fear is of losing prestige,
losing a chance for tenure, or having funding revoked.

To fix this problem, there is a _crucial_ and _urgent_ need for the academic
community to reduce the penalty associated with making honest mistakes.

Mistakes are simply part of the research process. Humans are fallible —
_everyone_ is going to mess up at some point. And instead of propagating this
academic "chilling effect", it would be much better for the whole scientific
community if everyone quit worrying about messing up and instead published
their code in a highly visible location, subjecting it to the highly critical
(yet extremely beneficial) scrutiny that it deserves.

------
elchief
[https://github.com/Factual/drake](https://github.com/Factual/drake) from the
data science toolbox is useful. "Drake is similar to GNU Make, but designed
especially for data workflow management"

------
sklogic
I remember some time ago somebody proposed that scientists should supply
entire virtual machines with all the relevant setup and data for better
reproducibility.

It was before the dawn of Docker and the like, so now this idea is probably a
bit less insane.

------
davexunit
GNU Guix is starting to be used for reproducible science in the bioinformatics
industry. Much better than bundling opaque binary VM/container images or just
having some scripts that bitrot.

~~~
sparkie
It's funny how the reaction of most people in this thread is to bash on 'the
other sciences' for bad coding practice, while completely ignoring how
inadequate their own practices are in creating reproducible programs.

The heart of the problem is described in the article: it's point-and-click
interfaces (and yes, this includes regurgitating commands into the terminal),
which are expected to be followed to the dot, when they could be automated by
the machine if the program were ever completed to the point of being
reproducible.

A big problem is that a non-computer scientist is probably working in a lab on
some ancient machine running ancient software - he has no control over the
machine, and the sysadmin is so far behind because his main job is to
reproduce the (unreproducible) software written by so-called 'computer
scientists'. He struggles so much that he has to share his work with thousands
of others as a 'package maintainer', and is grateful that so many other
package maintainers exist, because without each other they would all have
absolutely zero chance of reproducing anything.

It's time to stop bashing the coding practices of other people, folks, and
look in the mirror. We are the friction that causes code in scientific research to
be unreproducible - it's not the code itself. If we're to educate non-computer
scientists in how to create reproducible research, surely the absolute minimum
is that we do so ourselves.

And so far, Nix and Guix are the only two projects (AFAIK) that are seriously
attempting to tackle this. If you call yourself a 'computer scientist' and you
regularly produce research (which all code is), then start living up to the
name and make it reproducible. This means you should be using Nix or Guix and
packaging your software for it. Without such a tool to reproduce the software,
you're suffering from the same reproducibility problems this article is
highlighting about the other sciences.

~~~
davexunit
Amen!

------
aaron695
This is an interesting topic but the article is very pie in the sky.

Talk of members of the public using scientific research (in high school,
even), code reuse to save money, etc. is a bit silly.

For starters, I'd imagine the code, like an arts student's thesis, is
copyright of the researcher?

As mentioned, 10% is getting code working; 90% is getting it working for
others. Researchers just don't have the time or skills. And it loses most of
its 'possible' value without that 90%.

But as a start, a simplified Git framework would be nice. Git is way too hard
for researchers, and other, easier version control systems won't be suggested
because everyone loves Git. Not allowing branching, for instance, in the GUI
of some sort of easyGit program.

~~~
nmrm2
Copyright the researcher / university, and usually the latter.

------
guard-of-terra
The solution is obvious - provide a github repository with all the code for
the research. Provide clear ways to reproduce all the results of the research
from raw data by running a build.

Bonus points: Hirschware as software license.
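
A minimal sketch of what "reproduce the results by running a build" could look
like -- the script and file names here are hypothetical:

    #!/usr/bin/env python3
    """Rebuild every result in the paper from the raw data with one command."""
    import subprocess

    STEPS = [
        ["python", "scripts/clean_raw_data.py", "data/raw/", "data/clean/"],
        ["python", "scripts/run_analysis.py", "data/clean/", "results/"],
        ["python", "scripts/make_figures.py", "results/", "figures/"],
    ]

    for cmd in STEPS:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)   # fail loudly if any step breaks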

~~~
Natanael_L
Maybe even a container image / VM to ensure there are no subtle environment
differences that alter the results. Even better with reproducible builds,
specifically so you can confirm it was created from the published source.

~~~
guard-of-terra
IMO that's overkill. It's sufficient that the result is reproduced once. If a
slightly different environment produced enormously different results, that
would be interesting in itself.

The goal is that you can continue research based on previous results, not that
you can reproduce exact results in an obtuse VM.

