Hacker News
Secret Computer Code Threatens Science (scientificamerican.com)
105 points by emcl on Apr 15, 2012 | 42 comments

Back in my academic days, I became a proponent for open-notebook science.

Far too often, researchers never release code because it's never "polished" enough.

So I began publishing my code on github from the moment I started the project. e.g. https://github.com/turian/neural-language-model

However, some of my more conservative colleagues were averse to this approach. I constantly debated with my office-mate, who was of the opinion that there are many reasons not to present half-finished work.

So there is also a large cultural barrier to more open science. If there were publishing pressure on researchers to open their code, then it might effect a cultural shift.

Every piece of academic code I have ever seen has been an unmitigated nightmare. The sciences are the worst, but even computer science produces some pretty mind-crushing codebases.

So, I don't blame them for being embarrassed to release their code. However, to some degree it's all false modesty since all of their colleagues are just as bad.

Add into this the fact that no one in academia understands version control systems, and it's a hard hill to climb.

Perhaps this attitude contributes to the problem (I am not excepted from it). If you were given an earful about how your code is an unsalvageable flaming pile of rubbish by everyone you showed it to, would you want to release it alongside a paper to which you've given years of your life?

Yeah no kidding. There's a huge difference between the disposable one-off code produced by a scientist trying to test a hypothesis, and production code produced by an engineer to serve in a commercial capacity.

The original transistor, produced by Shockley and team at Bell Labs, "worked" only in a nominal sense. It didn't do anything other than prove a concept. To turn it into something usable in real equipment took years of effort by other scientists, and engineers. Thank god they published the details of it rather than saying "we made it and it worked, here are the results" because they were afraid of releasing something that was "a pile of rubbish."

Additionally, every piece of academic code I have ever written has been an unmitigated nightmare. Something about academia drags down the quality of my code. My industry production code is much better (or so I tell myself).

There is a very simple explanation. You are not paid to produce code or a finished product. While I can't speak for the entire academic CS community, we are constantly under pressure to write less code and write more papers. I really hope that publishing the source code alongside the article becomes the standard, if only because it would give us an 'excuse' to polish it up.

If one of the key ideas behind Science is thorough peer review, and the code that your paper relies on is essentially unreviewable and untestable, how is it any different from just not releasing it?

"Even a fool is thought wise if he keeps silent, and discerning if he holds his tongue." -- Proverbs 17:28

There's more to it than "lack of polish".

A bio researcher I know is afraid of releasing any code because of the way it might tarnish their reputation.

They're not expert programmers and are afraid to be perceived as less competent, in their field, because of the low quality of their code.

This seems wrong. If they're using code to arrive at their findings, it should be high-quality, no less so than their lab technique. One can lead to bogus results just as easily as the other.

This actually made me laugh a bit. This just isn't how it works...

Researchers are not judged by the quality of their code -- they're judged on ideas (and more specifically, papers). And to be fair... have you ever written a quick, hacked-together script to prove some point and then move on? That's the same thing that researchers are doing. If they want "high-quality" code, that will probably only happen as the research systems are hardened and/or commercialized.

I should say, I'm still a big proponent of open-sourcing it all anyway -- perhaps just a few months later to maintain competitive advantages (or file for IP protections). All my dissertation code, hardware designs, etc. are online and documented for posterity. And I find that some other researchers genuinely find it useful (which kinda scares me). But I try to be a good citizen and support 'em anyway.

hardware designs

Where? Thanks!

The long-range (passive) RFID hardware is documented here:



Probably not too interesting, but a start. It looks like the old robot power supply boards and force-torque sensor boards reside on my old lab's "internal" wiki. That's no good! I'll have to ask 'em about moving the files over to the public one. The latest designs (FPGA software defined radio) are being tested, so they've got a while before they'll be released. ;-)

It may seem wrong, but in practice it isn't. My experience with researcher-written code is that it does things in a roundabout, ugly, inefficient, duplicated-library-function but ultimately correct way.

The programs are not complicated. They are usually just some implementation of an equation or some other method for transforming input into output. Researchers don't have hundreds of hours to invest in learning the nuances of the const keyword in C++ or whatever, so they hack it. It works.

"It works."

How would they know? Because it produces output that looks like what they're expecting? That might work...until it doesn't. :-)
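One cheap way to go beyond "the output looks right" is to run the routine on an input whose answer is known analytically. A minimal sketch in Python, assuming the research code boils down to a numerical integrator; `trapezoid` and `check_against_known_answer` are hypothetical stand-ins, not anyone's actual code from this thread:

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoid rule on n equal subintervals of [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def check_against_known_answer():
    # The integral of sin(x) over [0, pi] is exactly 2, so the
    # approximation can be compared with a known value instead of
    # being judged by eye.
    approx = trapezoid(math.sin, 0.0, math.pi, 1000)
    return abs(approx - 2.0) < 1e-4
```

A check like this does not prove the code correct, but it does catch the case where plausible-looking output is simply wrong.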

Checking extensional equality of programs is undecidable in general. If code is not well written, there ARE bugs lurking in the source files that simply go unnoticed. Only computer scientists and mathematicians seem to understand this and try to prove the correctness of their programs/results.

Publishing a version controlled source repository is actually more stringent than publishing the results of a lab assay, because it contains a record of everything you tried, older versions, and errors (that have hopefully been fixed).

When publishing on wet-lab data, you only publish the assays that worked (i.e., you didn't contaminate the samples, etc.). The wet-lab equivalent of a source repo would be like publishing a video recording of your lab.

I am still in my "academic days" and just like you I try to push code into my public repository from the start of a project and onwards (https://github.com/ninjin/). The only limitation I make is that I generally want either a test-suite to run or a set of experiments to run before I make a push (not breaking the main tree). I also try to have a Wiki page where I write daily updates and results, sometimes correcting claims I made on the very same page earlier.

I think it is good practice and it makes the paper writing much easier since you can back-track the process. The only experience I have had contrary to yours is that all my colleagues have some sort of envy of my idea, but they just can't do it themselves because of stage-fright, lack of time, etc.

I do everything research-related in the open! Every piece of code, and, hell, even my phd thesis draft is on github: https://github.com/SnippyHolloW/

Nice to see. Back in February I had a paper in Nature (with two co-authors) arguing for the same thing (http://www.nature.com/nature/journal/v482/n7386/full/nature1...). With this paper in Science (http://www.sciencemag.org/content/336/6078/159.summary) it means that the top two journals in the world have now published papers arguing for source code openness.

Probably time for an international cooperation on defining open code policies: http://blog.jgc.org/2012/04/more-support-for-open-software-i...

Hm, shouldn't the articles contain the information necessary to rewrite the code? Then rewriting the code could be seen as replicating the experiment.

Both sharing and not sharing seem to have pros and cons. For example, if the code is buggy and shared, odds might be higher that the bugs will never be found because nobody will bother trying to write the code again.

I completely agree. If this is publishable science, then a strong and reproducible description of the science and algorithms used is all that is necessary.

I would be very interested if the author had actually given even a single instance where the lack of software code that merely implements the experiment has completely impeded progress on the science in a paper. Even if this were the case, would that not imply simply more algorithmic detail is required?

Of course, for all of the above, I am referring to non-computer science. There may be special circumstances in computer science where the code itself is the published algorithm or an intended description of the underlying science.

Also agree.

For an example of a specific circumstance, consider theorem provers, because the proof is usually too large for a paper publication. The Archive of Formal Proofs (AFP) [0] is a repository for Isabelle proofs, which my colleagues use. They submit a proof to AFP and write a paper about the results, where they cite the AFP publication.

[0] http://afp.sourceforge.net/about.shtml

> For example if the code is buggy and shared, odds might be higher that the bugs will never be found because nobody will bother trying to write the code again.

Your reasoning is a bit odd. It's a good thing that nobody will bother trying to write the code again. The world has enough people who can write code. We need more people who can read and dissect code, refactor it, and add tests where applicable.

But what if the results depend on some buggy function? Then it would all be bunk, and nobody would notice.

All researchers know that they should release their code. The problem is that they are just bad programmers. Programming has turned into a required skill for many scientists, but the school system is lagging behind. So right now we have all these scientists lacking fundamental skills. Sooner or later this will be recognized and schools will then hopefully accept programming or computer science as just another subject in the curriculum.

If the programs are actually important in the production or verification of the research, then don't we want peers who try to independently reproduce the experiment to also independently develop their own programs, so that their reproduction is truly independent?
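The two options can be combined: an independent reimplementation is most useful when it can be run against the original on the same inputs. A rough sketch in Python, assuming the shared computation is a sample variance; both functions and the helper `implementations_agree` are hypothetical illustrations, not code from any paper discussed here:

```python
import math
import random


def variance_two_pass(xs):
    """Sample variance: compute the mean first, then squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_welford(xs):
    """Sample variance via Welford's one-pass update, written independently."""
    mean = 0.0
    m2 = 0.0
    for k, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / k
        m2 += delta * (x - mean)
    return m2 / (len(xs) - 1)


def implementations_agree(xs, rel_tol=1e-9):
    # Disagreement beyond floating-point noise means at least one
    # implementation has a bug; agreement is evidence, not proof.
    return math.isclose(variance_two_pass(xs), variance_welford(xs), rel_tol=rel_tol)


# Seeded so that the comparison itself is reproducible.
rng = random.Random(0)
data = [rng.gauss(10.0, 2.0) for _ in range(1000)]
```

Calling `implementations_agree(data)` compares the two on identical input, which is only possible when both code bases are available.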

Possibly, but it would also be good if the source code were available for scrutiny by independent scientists.

My biggest problem with releasing code is that people will expect support.

I released some code that only compiled on Visual Studio 6, with a specific version of a fairly expensive library. I got several emails asking for a Mac or Linux version, or an update for more modern compilers.

Personally I would have preferred people just reimplement the code from the paper. I suspect for them it would be less work.

I've often thought there should be an open code/data license that restricts usage and dissemination only to those who agree to make their code and data equally available.

Why? Because in many fields there is a negative incentive to provide code and data. It not only takes time, but it opens you up to criticism by people who wouldn't be willing to make their own code/data available. Perhaps something like this would raise the bar and encourage more people to share their code/data. Just a thought.

I am experiencing this first hand as I implement a machine learning algorithm described in a paper. Questions keep arising about what experiments they ran and how, and about details of the algorithm, which I can't deduce from the paper. Hence, I'm guessing but still unable to reproduce their results, leaving me to wonder if I have a bug or if I misinterpreted something....

Which just means that the paper was incomplete.

The more interesting question is how we can check a paper for completeness. I fear the answer is to try and implement it, which is too costly to do during peer review.

The title is somewhat misleading. It made me think of a "secret code" hidden behind a bush waiting to cut the throat of Science...

To some extent it is culture and the current incentive model, and to some extent it's just a need to be pragmatic. If you're a grad student who wants to defend in a certain amount of time and you have various deadlines (conferences, concerns about being scooped), you end up hacking up some code that gets your work done and allows you to analyze your data and publish. That's what gets you recognition, helps you defend, etc. In some cases, the code is your work, and those groups tend to spend more time on making sure the code is robust, re-usable, and sustainable.

In general though the system doesn't encourage you to follow good practices at all. Having said that I've definitely seen a change over the last few years towards more awareness.

Pedantic, I know, but: Source code. Not source codes.

Seeing this made the author lose credibility on the subject.

Pedantry isn't so bad, if it's correct... As was stated, the term "source codes" is very common in many communities, particularly scientific computing and associated academic circles. The only credibility lost here is... well let's just move on here, shall we?

The guys at Los Alamos were talking about "codes" (programs) before you touched a computer (and yes, I looked at your profile, I know you've been doing software for 25 years, but there's more to the world of software than mass-market applications)

"Codes" is common in some sub-disciplines (e.g. HPC). Still sounds weird to my ears though.

Codes is the normal word in Mech Eng, at least, and in other disciplines that were early adopters of computing.

Various people, e.g. Titus Brown, have been trying to take a different approach. Titus doesn't practice open notebook science, but he does try and practice "replication". More on replication: http://ivory.idyll.org/blog/apr-12/replication-i.html

The paper gets its own website: http://ged.msu.edu/papers/2012-diginorm/; which includes an arXiv preprint, data and code repositories and even an AMI with everything loaded. Basically everything you need to replicate the work in the paper.

Seems to me that most finished academic papers in the field of computer science, including dissertations etc., also lack source code. A lot of institutions even discourage, by policy, submitting source to examiners unless it is illustrative of the text. This isn't based on any serious survey, only on what I have read, so I may be wrong in this belief.

Reproducible research is a really interesting topic. One of my good friends in academia showed me how he was using Babel (an Emacs mode) to do literate programming and reproducible research. I think it's a fantastic idea: the data, conclusions and code used to arrive there should all be part of the peer review process; open source research.
