

The real reason (climate) scientists don't want to release their code - jgrahamc
http://blog.jgc.org/2010/11/real-reason-climate-scientists-dont.html

======
patio11
They fear what would happen to their reputations if it were obvious how the
sausage was being made. (Not unique to climate change: my academic AI research
produced decent results on some data sets with code of truly abominable
quality. Then again, I wasn't asking anybody to bet the economy on my
results.)

Plus, if people actually provided code, someone might actually get it into
their heads to _run_ it. That can't happen. No, literally, it can't happen.
The level of professionalism with regard to source control, documentation,
distribution, etc. in most academic labs is insufficient to allow the code to
be executed outside of the environment it actually executes on. If you
put a tarball up somewhere, somebody who tries to run it is going to get a
compile error because they're missing library foo or running on an
architecture which doesn't match the byte lengths hardcoded into the assembly
file, and then they're going to email you, and that is going to suck your time
doing "customer support" when you should be doing what academics actually get
paid to do: write grant proposals.
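
To make that concrete, here is a minimal, hypothetical sketch of the kind of script involved; the path, file format, and constants below are invented for illustration, not taken from any real project. It runs fine on the machine it was written on and nowhere else:

    # hypothetical analysis script illustrating the portability problems above
    import struct

    # hardcoded absolute path that only exists on the original workstation
    DATA_FILE = "/home/grad_student/thesis/run42/output.dat"

    # native byte order: reading the same file on a different architecture
    # silently produces garbage instead of an error
    RECORD = struct.Struct("4f")

    def load_records(path=DATA_FILE):
        with open(path, "rb") as fh:
            blob = fh.read()
        # no version check, no length check: a truncated file just raises
        # struct.error somewhere deep in the loop
        return [RECORD.unpack_from(blob, i) for i in range(0, len(blob), RECORD.size)]

    if __name__ == "__main__":
        for record in load_records():
            print(record)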

This, by the way, means that peer review by necessity consists of checking
that you cited the right people, scratched the right backs, and wrote your
paper in the style currently in fashion in your discipline, because
reproducing calculations or data sets is virtually impossible.

~~~
philk
I think pretty much everything looks more impressive before you see how it
actually works.

~~~
patio11
Seriously. Relatedly, it is seriously impressive that systems this
comprehensively screwed up still seem to converge on producing acceptable work
much of the time. Big companies manage to get us all fed and fly us around the
world. That scares the _bejeesus_ out of me. I have put my life in the hands
of someone who was selected by _an HR department_ (assisted by, even worse, _a
union_).

~~~
DevX101
Convergence of scientific results could be validation of the theory. But it
could also be because people are anxious to publish contradictory results.

 _From Richard Feynman's 1974 Caltech commencement address:_

 _"We have learned a lot from experience about how to handle some of the ways
we fool ourselves. One example: Millikan measured the charge on an electron by
an experiment with falling oil drops, and got an answer which we now know not
to be quite right. It's a little bit off because he had the incorrect value
for the viscosity of air. It's interesting to look at the history of
measurements of the charge of an electron, after Millikan. If you plot them as
a function of time, you find that one is a little bit bigger than Millikan's,
and the next one's a little bit bigger than that, and the next one's a little
bit bigger than that, until finally they settle down to a number which is
higher.

Why didn't they discover the new number was higher right away? It's a thing
that scientists are ashamed of--this history--because it's apparent that
people did things like this: When they got a number that was too high above
Millikan's, they thought something must be wrong--and they would look for and
find a reason why something might be wrong. When they got a number close to
Millikan's value they didn't look so hard. And so they eliminated the numbers
that were too far off, and did other things like that. We've learned those
tricks nowadays, and now we don't have that kind of a disease.

But this long history of learning how to not fool ourselves--of having utter
scientific integrity--is, I'm sorry to say, something that we haven't
specifically included in any particular course that I know of. We just hope
you've caught on by osmosis."_

~~~
randallsquared
_anxious to publish_

Should be "anxious not to publish", or "anxious about publishing", I think.

------
syllogism
It's a simple matter of incentives.

We all want scientists to share their code because that's the positive sum
action. But individual scientists aren't paid based on how well the scientific
community is doing. They're awarded positions, grants and prestige on their
individual performance against other scientists. So scientists worried about
their careers think in zero-sum terms: "If I publish this source code, will I
be pipped to the next paper? Well, I'll publish this other piece to make
myself look good, since I'm not following it up; and then I'll collect the
citations too."

We can wring our hands about scientists acting in bad faith all we like, but
it's obvious we just have to change the incentives. Funding agencies need to
give greater weight to journals that demand source releases, and eventually
weight only those journals.

~~~
CWuestefeld
In concept you're right, but I want to clear up some terminology. This isn't a
"zero-sum game" issue: it's a "prisoner's dilemma". [1]

In the prisoner's dilemma, the parties can work together to yield a common,
greater result. But they might not do so because the common solution requires
trust; any individual might go for the easy answer that brings himself a
return while screwing the others.

[1]
<https://secure.wikimedia.org/wikipedia/en/wiki/Prisoner%27s_dilemma>

~~~
syllogism
I considered this, but if you assume that the scientists don't benefit from
science functioning properly, then they're playing a zero sum game with other
scientists --- they're simply competing with each other for a fixed pool of
resources. That was what I meant.

You could argue that a prisoner's dilemma view of it is more realistic. In
this view, the scientists all desperately do want to get it right, but they
know they can't because they'll be punished for doing it by their peers,
who'll seize the opportunity to get ahead at their expense. This model is
plausible too, but it's different from the one I suggested. So it's not
actually a terminological difference.

------
danieldk
This can only end if peer-reviewed journals require source code (and if
possible datasets) to be made available as well. High-impact journals have the
weight to enforce such policies.

It's true that third parties can then easily apply the methods to new data.
But that is a testament to the method, and the references will help build the
reputation of the original inventor.

Another concern only addressed in the comments on this blog post is that most
scientists do not produce beautiful programs. The reasons are twofold:

\- Programs are hacked together as quickly as possible to produce results.
Scientists are mostly concerned with testing their theories, and not so much
with producing software for public consumption.

\- Most scientists are not great programmers.

Consequently, scientists usually do not want to make their source code
available.

This situation sucks, given that in many countries taxpayers fund science.

~~~
RBerenguel
Yes, taxpayers fund science, but commenting, beautifying and documenting code
is not what a physicist/mathematician/climate researcher wants to spend his
time on. Usually you want to be doing research, whether that means coding
directly or doing something else related. Doing this kind of stuff is far
worse than filling out grant proposals or doing other bureaucratic work.

On the other hand, I disagree with "most scientists are not great
programmers". What is a "great programmer"? By my definition, it is someone
who can write a program to solve a problem without too much hassle. And a lot
of scientists I know meet that bar remarkably well. Of course, they don't
bother with orthogonality or source control, don't do extreme programming, and
usually don't write test cases. They just do what is asked as quickly as
possible so they can get back to what actually needs doing.

~~~
Maro
I disagree strongly with your definition of "great programmer". Your
definition describes "somebody who can program".

~~~
kenjackson
"Somebody who can program" have written probably 99% of used applications in
the world today. From Bill Joy to Donald Knuth to David Cutler to Linus
Torvalds to Guy Steele to Jamie Zawinsky to Guido van Rossum to John Carmack,
and almost everyone in between.

Great programmers somehow only appear to write books and give pristine
examples of how you build infinitely extensible architectures.

~~~
Maro
John Carmack produces the most maintainable and readable codebases. I happen
to have the Quake3 source code up on my github, so you can see for
yourself:

<https://github.com/mtrencseni/quake3>

Donald Knuth is the author of literate programming, which is a framework for
writing human-readable programs:

<http://www-cs-faculty.stanford.edu/~uno/lp.html>

~~~
kenjackson
Have you read how each of them actually writes their code? Both write
precisely the way the other poster described.

And I've read the Quake 3 code extensively. Great code, but certainly not
pristine, and I'm sure if you handed it for code review to virtually anyone
you know, they'd find a whole bunch of stylistic and architectural issues with
it. Like, take a look at the playerDie code. You're telling me you wouldn't
have said, "Rewrite this?" if a colleague handed it to you?

And yes, Knuth is the author of literate programming, but that's not how the
code started out. Read his letters on computer science.

Tarjan wrote a similar thing, I think in his ACM Turing Award lecture.

I picked those names because they are the best our industry has. But even with
that, they all pretty much write code the way the previous post noted.

------
j4mie
Actually, I have heard one good argument against open-sourcing scientific code
[1]. It's not bullet-proof, and it won't apply in all situations, but I think
there's a nugget of truth in there.

If I write a paper describing an algorithm (or process, or simulation, etc)
_and_ open source my code, someone attempting to reproduce and confirm my work
is likely to take my code, run it, and (obviously) find that it agrees with my
published results. No confirmation has actually taken place - errors in the
code will only confirm errors in the results. Further work may then be based
on this code, which will result in compound errors.

If, however, I carefully describe my algorithm and its purpose in my paper,
but _don't_ open source the code, anyone who wishes to reproduce my results
will have to re-implement my code, based on my description. This is vastly
more likely to highlight any bugs in my implementation and will therefore be
more effective in confirming or disconfirming my findings.

I'm not sure yet what I think about this argument. It seems to only apply in
certain domains and within a limited scope (what if the bug exists in my
operating system? Or my math library?) but in relatively simple simulation
models, it may have some validity.

What do you think?

[1] From Adrian Thompson, if you're interested:
<http://www.informatics.sussex.ac.uk/users/adrianth/ade.html>

~~~
stygianguest
Your argument doesn't seem to apply to most code in the natural sciences. The
difference is that their theoretical models are, in most cases, only
approximated by the code. Yet, in papers, they claim to prove or test the
theoretical model by experiments on the approximation.

Note that this is nothing terribly new. Sometimes experiments testing a
hypothesis give false positives because something went wrong in the
experiment. In that sense there is no real difference between a faulty
thermometer and a bug in your code.

Having written this, I think your argument does apply after all. In the
natural sciences one would argue that if your buggy code confirms an invalid
hypothesis, someone redoing your experiment with the same code would not
uncover the problem. Publishing your code invites people to use your faulty
thermometer.

Of course, I'm assuming that the published paper contains all the details
necessary to reproduce the experiment. In the case of, e.g., climate models or
models of galaxy formation, that might be a problem, because the code _is_ the
experiment. Describing what code does is very hard, and it would be easier to
just publish it instead.

~~~
jonhendry
" Describing what code does is very hard, and it would be easier to just
publish it in stead.'

Depends on what level you're describing it at. "It does an FFT" or "it sorts"
are pretty clear. It gets hairy if you describe the specific details of the
implementation. But the implementation is likely irrelevant, because other
scientists can choose any implementation they like. Even with complex models
you ought to be able to piece much of them together from descriptions at that
higher level.

------
jfager
Why are we singling out climate scientists here? Of the three articles linked,
the only one solely about climate scientists was the one from RealClimate; the
other two make it more than clear that these issues span the full scientific
spectrum.

And why dismiss so casually the argument that running the code used to
generate a paper's result provides no actual independent verification of that
result? How does running the same buggy code and getting the same buggy result
help anyone? As long as a paper describes its methods in enough detail that
someone else can write their own verification code, I would actually argue
that it's better for science for the accompanying code to _not_ be released,
lest a single codebase's bugs propagate through a field.

The real problem, if there is one here, is the idea that a scientist's career
could go anywhere if their results aren't being independently validated. A
person with a result that only they (or their code) can produce just isn't a
scientist, and their results should never get paraded around until they're
independently verified.

~~~
jgrahamc
_Why are we singling out climate scientists here?_

Because this recent rash of articles is a result of "ClimateGate". Clearly the
issues raised are more general.

 _And why dismiss so casually the argument that running the code used to
generate a paper's result provides no actual independent verification of that
result? How does running the same buggy code and getting the same buggy result
help anyone?_

I think it's a bogus argument because it's one scientist deciding to protect
another scientist from doing something silly. I like your argument about the
code base's bugs propagating, but I don't buy it. If you look at CRUTEM3
you'll see that hidden, buggy code from the Met Office has resulted in
erroneous _data_ propagating through the field even though there was a
detailed description of the algorithm available
(<http://blog.jgc.org/2010/04/met-office-confirms-that-station-errors.html>).
It would have been far easier to fix that problem had the source code been
available. It was only when an enthusiastic amateur (myself) reproduced the
algorithm in the paper that the bug was discovered.

~~~
jfager
_It was only when an enthusiastic amateur (myself) reproduced the algorithm in
the paper that the bug was discovered._

But _that's_ the actual problem, that nobody else tried to verify the data
themselves before accepting it into the field. If you could reproduce the
algorithm in the paper without the source code, why couldn't they?

And while it may have meant that the Met Office's code would itself have been
_fixed_ faster, I don't buy the idea that having the code available
necessarily would have meant the errors in the resulting data would have been
_discovered_ faster. That would imply that people would have actually dived
into the code looking for bugs, but we've already established that the people
in the field are bad programmers who feel they have more interesting things to
do. Why isn't it just as plausible that they would have run the code, seen the
same buggy result, and labored under the impression they had verified
something?

------
paufernandez
What I have seen so far is that very bright and capable scientists
(physicists, for instance) who are non-programmers[1] are usually extremely
ashamed of their code. I'm talking even about CS professors, who spend most of
their time proving theorems. Structuring code well and making sure it's
correct _is_ hard, and they know it.

[1] Programmer = somebody who spends 8 hours a day at it.

~~~
RBerenguel
If "Programmer = somebody who spends 8 hours a day at it." then my students of
Numerical Analysis are programmers, and not mathematicians. They are currently
coding an assignment on continuation of zeros and (at least looks like) they
are spending a ton of hours each day on it (and making me loose a lot of time
answering email questions, by the way)

~~~
gjm11
I can readily believe that they're (currently working as) programmers, but
where do you get "not mathematicians" from? If you're spending 8 hours a day
writing mathematical code and understand the mathematics, then in my book
you're being both a programmer and a mathematician.

Incidentally, my experience is that plenty of people who _are_ programmers are
ashamed of a lot of their code too, at least in the sense that they wouldn't
want anyone else reading it and judging them. Writing code that looks good as
well as getting the job done is hard, whoever's doing it, and it's by no means
always worth the effort.

~~~
RBerenguel
Because they have serious trouble understanding the mathematics, yet they
devote all their time to coding something they don't understand. I have tried
my best to get them to understand it, or to convince them to understand first
and code later, to no avail.

------
DanielBMarkham
Taxpayers and scientists have a deal: we provide _some_ support for your
education and research, and in return you show us how to do stuff.

If you don't like that deal, governments have an even better one: we give you
patent rights on what you invent -- as long as you show us how it is done.

These deals aren't altruism on the part of the public. Nobody thinks science
is a charity. It's vital to the interests of particular nations and of the
species as a whole.

In my opinion, no institution of higher learning that is supported by
taxpayers should be giving out credentials to people who are so insecure and
unprofessional as to be unable or unwilling to completely describe how they
reached whatever conclusions they have. And that's not even getting into the
issue of taking research and making political arguments out of it. That raises
the bar even higher.

It's a scandal. And the only reason it's coming out is because some people --
for whatever reason -- have a bug in their shorts about climate science.

It's time to set some ethical standards for all scientific research. Open
data, open programming on standardized platforms, and elimination of
scientist-as-activist. There's just too much dirt and conflict of interest in
certain areas of science. Not all, by any means. But enough to leave a bad
taste in the average citizen's mouth. I love science. We deserve better than
this. Something needs fixing.

~~~
nickpinkston
Very well said - I find the current state of publishing in academia appalling.
Don't forget that most research is behind a pay wall just for the damn PDF! I
think you nailed it with "insecure". The little PhDs need to know their work
isn't designed to just get them tenure...

~~~
RBerenguel
As if we chose to have our PDFs behind paywalls. I don't even have printed
copies of my paper, because the publisher does not want to make the
expenditure. And if I were to lose my password, I would have no way to
download my own paper (and I don't have access to the rest of the papers in
the same issue, of course).

~~~
nickpinkston
So you're saying that you can't release due to some licensing / copyright
issue, or that it's just something that takes extra time? I definitely
understand the former - not that I like it, but the latter is inexcusable.

~~~
RBerenguel
It depends heavily on the journal. Most journals have a "final draft" policy:
what they print is only theirs to publish, but you can self-post whatever
previous versions you have. In my case, I think there are one or two minor
spelling mistakes in the versions I have posted on ArXiV and my homepage. It
does not take much extra time to self-post or publish on ArXiV (just a little
hassle with image conversion problems, YMMV).

~~~
nickpinkston
Would you say that you're the exception in posting them for the public? If so,
would it be worthwhile for someone to try to get at these non-final but still
perfectly useful papers?

~~~
RBerenguel
I really can't tell. As far as I know, everyone in my department publishes
their documents freely, either on ArXiV or on the department page for
submitted papers. Also, ArXiV has a huge number of articles in mathematics,
and the growing trend is to submit there. I guess that most mathematicians (or
at least the young ones) provide at least some draft version of their
published manuscripts online, freely available.

~~~
nickpinkston
Yea, ArXiV is a great resource, and I actually hadn't seen that it's grown
this much. Your department is one of the good ones - I salute you! Here's to
more doing the same.

~~~
RBerenguel
I also hope everyone starts doing it. There is no point in making research
unavailable to the public just for the sake of keeping the journal's "level".
The future is open content, but most publishers are still blind to it.

------
jderick
In computer science academia, I have not heard of someone refusing to release
their code. This seems quite bizarre to me. Of course it is not usually very
polished code, but still there is no justification for hiding it.

~~~
lutorm
It happens all the time in astrophysics. Codes are competitive edges, and the
support burden from people asking questions about your code that you did
release is also a very real issue.

------
drallison
This article seems to have three goals.

1\. Spread FUD (Fear, Uncertainty, and Doubt) about the scientific results
used to create evidence for global warming.

2\. Observe that the training and skills of scientists processing data,
building models, and drawing conclusions from data need to be improved.

3\. Promote a very limited view of the scientific method where "replicating a
result" means "accessing another scientist's data and computer programs and
duplicating the processing that was performed". Independent verification
usually means that a totally independent experiment is run to test the same
hypothesis, new data is gathered and processed and a result produced which is
compared with previous results (and those predicted by current theories).
Verification means that the same phenomenon is observed at the same level
modulo the statistics of measurement.

~~~
jgrahamc
1\. That's not right. I'm not interested in FUD, I am interested in the debate
about releasing source code that's come about because of the so-called
"ClimateGate" thing.

3\. Also not correct. I simply don't believe that not releasing source code is
the right answer. It's one group of scientists claiming to save another group
from themselves. The argument appears to be that if they released the code
others would run it and be satisfied with the result. So? That's just bad
science and tells you something about the people who run the code. The
solution isn't to protect idiots from themselves.

------
bigiain
Other people have thought about, and at least started to solve, the problem of
academic source code being extremely proof-of-concept rather than
production-ready or resume-ready pieces of software engineering art:

<http://matt.might.net/articles/crapl/>

------
ced
I've worked/studied computational physics for a few years. My experience has
not been good.

First, I don't think that we've learned how to make complex models yet. But in
fairness, it's a _really hard problem_. If my numerical code is wrong, I won't
get a segfault. Rather, I may notice "unusual" patterns in my model output,
which could be:

\- A genuine physical effect

\- An artifact of the assumptions we used (because models are simplifications)

\- A numerical method that hasn't converged, or whose accuracy is insufficient

\- A bug

Untangling this is nigh impossible, unless you rely on very, very careful
testing of independent parts. That's how NASA does it [1], but it's simply not
within the realm of what the typical physicist can/will do (and understandably
so, numerics is hard).
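
As a rough illustration of what "testing independent parts" means in practice, here is a minimal, hypothetical sketch in Python; the routine and test cases are invented for this comment, not taken from any real model. The idea is to check each numerical building block against a case with a known analytic answer, and to check that it converges at the rate theory predicts, before trusting it inside the full model:

    # minimal sketch: verify one numerical routine in isolation
    import math

    def trapezoid(f, a, b, n):
        """Composite trapezoidal rule with n subintervals."""
        h = (b - a) / n
        total = 0.5 * (f(a) + f(b))
        for i in range(1, n):
            total += f(a + i * h)
        return total * h

    def test_against_analytic_result():
        # the integral of sin(x) over [0, pi] is exactly 2
        approx = trapezoid(math.sin, 0.0, math.pi, 1000)
        assert abs(approx - 2.0) < 1e-5

    def test_second_order_convergence():
        # doubling n should shrink the error by roughly a factor of 4
        err_coarse = abs(trapezoid(math.sin, 0.0, math.pi, 100) - 2.0)
        err_fine = abs(trapezoid(math.sin, 0.0, math.pi, 200) - 2.0)
        assert 3.5 < err_coarse / err_fine < 4.5

None of this catches a wrong equation, but it does separate "the numerics are broken" from the other three possibilities above.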

The solution would be to have tried and tested _libraries_, built by
numerical specialists, so that physicists would only have to specify the
equations to solve. That's what Mathematica does, and it's the only sane way I
know of to make complex models.
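
The same idea exists in the open-source world: hand the equations to a library written by numerical specialists and let it choose step sizes and control the error. A minimal, hypothetical sketch using SciPy's solve_ivp (the toy equation and parameter values are made up for illustration):

    # minimal sketch: the scientist specifies only the equations to solve
    from scipy.integrate import solve_ivp

    def damped_oscillator(t, y, gamma=0.1, omega=2.0):
        """Right-hand side of x'' + gamma*x' + omega**2*x = 0 as a first-order system."""
        x, v = y
        return [v, -gamma * v - omega**2 * x]

    solution = solve_ivp(
        damped_oscillator,
        t_span=(0.0, 20.0),
        y0=[1.0, 0.0],            # initial position and velocity
        rtol=1e-8, atol=1e-10,    # error control handled by the library
    )

    print(solution.y[0, -1])      # position at t = 20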

But Mathematica is slow, so physicists use Fortran instead and code their own
numerical routines in the name of efficiency. Tragedy ensues. Fortran's
abstraction capabilities are below C's [2]. Modularity goes out the window.

I spent a summer working on one particularly huge model that had been
developed and tweaked over twenty years. At some point I encountered a strange
1/2 factor in a variable assignment and asked my advisor about it.

"Oh, is that still in there? That's a fudge factor, we should remove it."

A fudge factor. No comment, no variable name, just 1/2.

Another scientist told me: "No one really knows anymore what equations are
solved in there," to which my advisor replied, "Ha, if we gathered all the
scientists for an afternoon, we could probably figure it out."

But I agree with the other posters and jgrahamc: the incentives for producing
quality code and models are just not there. And sadly, I don't see them
changing anytime soon.

[1] <http://www.fastcompany.com/node/28121/print>

[2] (At least, the subset of Fortran used by the physicists I've met. Modern
Fortran is a bit different.)

------
kvs
One of the big problems is that there is no incentive for repeating or
verifying previous results/findings. So even if someone is doubtful of an
assertion, there is generally no incentive to follow up and verify it. I don't
think sharing code or secret data-cleaning methods is going to bring much
change unless someone is rewarded for repeating the results.

------
RBerenguel
I have a question, after so much reading and commenting in this thread (and
the original post). How many of the people here (programmers and
non-programmers) have peer-reviewed a paper, or written a paper (mind you, not
in CS) that has been peer-reviewed?

------
diego_moita
For a non-American, this is one of the most typical patterns in the HN
worldview. I call it the "libertarian-style conspiracy theory".

~~~
jgrahamc
What conspiracy theory?

------
roadnottaken
I don't think you can call yourself an academic if you're unwilling to share
and describe your methodology in sufficient detail that others can follow it.
That's the major difference between academia and industry. Also it should
obviously be mandatory for taxpayer-funded research.

------
nice1
This sounds plausible, but it is really quite naive. We are talking about huge
amounts of money that are at risk if the cat gets out of the bag. Please
remember that these sleazebags also do everything they can to prevent raw data
from being available. They just want us to accept their "findings" and pocket
the next multi-million dollar check.

~~~
jfager
Everything they can to prevent raw data being available, up to and including
posting it freely on their own websites:

<http://www.realclimate.org/index.php/data-sources/>

