
Why scientific programming does not compute - szany
http://www.nature.com/news/2010/101013/full/467775a.html
======
dasil003
For all the talk of "best practices" and "training" the depressing truth is
that guaranteeing correct software is incredibly difficult and expensive.
Professional software engineering practices aren't nearly sufficient to
guarantee correctness with heavy math. The closest thing we have is NASA where
the entire development process is designed and constantly refined in response
to individual issues to create the checks and balances with the lofty goal of
approaching bug impossibility at an organizational level. Unfortunately this
type of evolutionary process is only viable for multi-year projects with
9-figure budgets. It's not going to work for the vast majority of research
scientists with limited organizational support.

On the positive side, such difficulty is also in the nature of science itself.
Scientists already understand that rigorous peer review is the only way to
come to reliable scientific conclusions over time. The only thing they need
help with understanding is that the software used to come to these conclusions
is as suspect as—if not more so than—the scientific data collection and
reasoning itself, and therefore all software must be peer-reviewed as well.
This needs to be ingrained culturally into the scientific establishment. In
doing so, the scientists can begin to attack the problem from the correct
perspective, rather than industry software experts coming in and feeding them
a bunch of cargo cult "unit tests" and "best practices" that are no substitute
for the deep reasoning in the specific domain in question.

~~~
schleyfox
I've spent a bit of time on the inside at NASA, specifically working on earth
observing systems. There is a huge difference between the code quality of
things that go into control systems for spacecraft (even then, meters vs.
feet, really?) and the sort of analysis/theoretical code the article talks
about. Spacecraft code gets real programmers and disciplined practices, while
scientific code is generally spaghetti IDL/Matlab/Fortran.

There is a huge problem with even getting existing code to run on different
machines. My team's work was primarily dealing with taking lots of project
code (always emailed around, with versions in the file name) and rewriting it
to produce data products that other people could even just view. Generally
we'd just pull things like color coding out of the existing code and then
write our processors from some combination of specifications and
experimentation.

I'd agree that "unit tests" and trendy best practices are probably not the
full answer, but the article is correct in emphasizing documentation,
modularity, and source control. Source control alone would protect against
bugs produced by simply running the wrong version of code.

~~~
dasil003
Definitely. Obviously the software industry has a lot of know-how that would
be invaluable to the science community. The critical point I was trying to
make is that scientists need to understand the fundamental difficulty of
software correctness before they can be expected to apply best practices
effectively.

------
gallamine
I'm a PhD student in Electrical Engineering. I'm currently working on a Monte
Carlo-type simulation for looking at the underwater light field for underwater
optical communication (no sharks!). I'm doing the development in MATLAB and I
recently put all my code up on Github
(<https://github.com/gallamine/Photonator>) to help avoid some of these
problems (lack of transparency). Even if nobody ever looks at or uses the
code, I know every time I make a commit there's a chance someone MIGHT, and I
think it helps me write better code.

The problem with doing science via models/simulation is that there just isn't
a good way of knowing when it's "right" (well, at least in a lot of cases), so
testing and verification are imperative. I can't tell you how many times I've
laid awake at night wondering if my code has a bug in it that I can't find and
will taint my research results.
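
One way I've found to sleep a little better is to verify the Monte Carlo
machinery against a case with a known analytic answer. A toy Python sketch
(purely illustrative, nothing like the actual Photonator code): photons in a
purely absorbing medium should obey the Beer-Lambert law, exp(-c*L).

```python
import math
import random

def transmitted_fraction(attenuation, depth, n_photons, seed=0):
    """Toy Monte Carlo: launch photons into a purely absorbing medium
    and count how many travel past `depth` before being absorbed.
    Free path lengths are exponential with mean 1/attenuation."""
    rng = random.Random(seed)
    survived = sum(
        1 for _ in range(n_photons)
        if rng.expovariate(attenuation) > depth
    )
    return survived / n_photons

# Verify against the analytic Beer-Lambert result exp(-c*L).
c, depth = 0.5, 2.0
estimate = transmitted_fraction(c, depth, n_photons=200_000)
analytic = math.exp(-c * depth)
assert abs(estimate - analytic) < 0.01
```

If the estimator can't reproduce the cases you *can* solve on paper, there's
no reason to trust it on the cases you can't.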

I suspect another big problem is that one student writes the code, graduates,
then leaves it to future students, or worse, their professor, to figure out
what they wrote. Passing on the knowledge takes a heck of a lot of time,
especially when you're pressed to graduate and get a paycheck.

There's got to be a market in this somewhere. Even if it were just a volunteer
service of "real" programmers who would help scientists out. I spent weeks
trying to get my code running on AWS, which probably would have taken a few
hours from someone who knew what they were doing. I also suspect that someone
with practice could make my simulations run at twice the speed, which really
adds up when you're doing hundreds of them and they take hours each.

~~~
john_b
I'm an M.S. student in mechanical engineering facing a similar situation,
except I haven't put any code on Github (my advisor wants to keep it
proprietary, but I probably would not bother putting it up even if he were ok
with it).

I've written around 15000 lines of MATLAB for my research and only a handful
of people will ever need to see it. Some is well-structured and nicely
commented, but other parts are incomprehensible and were written under severe
time constraints. My advisor is not much of a programmer and will not be able
to figure it out, and I feel bad for leaving a pile of crappy code to the
person who inevitably follows in my footsteps, but I ultimately have a choice
between writing fully commented, well-tested, and well-structured code and
graduating a semester late (at the cost of several thousand dollars to
myself), or writing code that's "just good enough" to get results on time.
This is a solo project (there is no money for a CS student to intern) and I'm
not getting paid to write code unlike a professional programmer, so every
second I spend improving my code beyond the bare minimum _costs me_ time and
money.

Even if I were able to tidy up and publish all of my code, most mechanical
engineers would not be able to understand it because most can't write code.
Those who can mostly use FORTRAN, although C is becoming more common.
Nonetheless, even those who could understand my code would have little
incentive to read through 15000+ lines of code.

Unfortunately, as far as research code is concerned, a lot of trust is still
required on the part of the reader of the publication. I agree that the
transfer of knowledge _should_ be handled differently, but until there is a
strong incentive for researchers to write good code it will continue to be
bad. Especially when many research projects only require the code to
demonstrate something, after which it can be put in the closet.

~~~
eru
I wonder whether our priorities for research are misguided. Isn't research
about extending the knowledge of humanity? Writing and passing on readable
code would probably advance us further in total, than everyone starting
basically from scratch.

(I'm not faulting you, you just react to the incentives.)

~~~
reginaldo
Well, I went to a lecture by one of the most prominent scientists here in
Brazil, where he explicitly said that the answer to your question is NO.
Research as it stands today exists to feed the system. According to him, you:

* Publish, so you can get grants

* Use those grants so you can publish more

* Get more grants

* Get tenure somewhere in the middle

I have to confess I was very disgusted by him saying that in front of such a
large audience of scientists and graduate students.

EDIT: Formatting

~~~
_delirium
I agree it's a problem, but I think you have to fix the incentives to make
meaningful change. When people are thrown into a cut-throat competitive
environment, with tenure clocks, multiple junior professors per tenure slot,
requirement to bring in grants to fund your research or you get shut down,
etc., it doesn't encourage people to be altruistic and sharing.

~~~
cdavoren
I think the problem is fundamentally one of economics. Research is good, but
you have to decide how much money to allocate to it. In order to decide, you
need a metric for performance. Really, only scientists are qualified to judge
whether the results of other scientists are worth anything, so currently the
only metric we really have is publishing in peer-reviewed journals.
Ultimately, therefore, that's where the incentives end up.

When a more appropriate way of quantifying research output and its benefits is
found, hopefully a beneficial change in culture will trickle down into the
academic trenches.

~~~
eru
How about trying to fix the current system by making somebody else using your
software count as a "super citation"? (It could even arguably count as much as
co-authorship.)

~~~
john_b
I think this is an excellent idea. If published software could be tagged via a
unique identifier (like the DOI of a paper), then it could be cited by that
tag just like a paper. Well written software might even get cited more than
the paper it was published in.

------
gte910h
I want to write a "software style guide" for journalists and their editors.

Software and Code are both mass nouns in technical language.

"Code" can be in programs (aka, things that run), libraries (things that other
programmers can use to make programs), or in samples to show people how to do
things in their programs or libraries. Some people call short programs
scripts.

When you feel you should pluralize "software", you're doing something wrong.
You might want to use the word "programs", you might want to use the word
"products", or you might want to just use it like a mass noun, the way you
would with "water" ("It turns out, thieves broke into the facility and stole
some of the water"): when talking about a theft of software, "It turns out,
thieves broke into the facility and stole some of the software".

~~~
szany
Actually it's a science thing. In a scientific context "code" is understood to
mean "program". For example:
[http://scholar.google.com/scholar?q=%22population+synthesis+...](http://scholar.google.com/scholar?q=%22population+synthesis+codes%22)

I'm not sure why this is though.

~~~
gte910h
What an excellent idea: generate new jargon that's incompatible with the
jargon used by the field you suck at.

Perhaps concerned scientists and editors should reject the bifurcation here
and take on the lingo of the field that creates the tool they have to use and
need to learn better as a first step in learning to program in a more
responsible manner?

~~~
_delirium
I think you have the chronology backwards. The use of "code" as a mass noun
dates to the 1960s at the earliest (actually, I can't find a good example
before the 1970s in brief searching), while the use of "code" as a singular
noun to mean "implemented algorithm", and "codes" as the plural, dates back at
least to the 1950s.

As one of many examples, here's a 1955 survey article compiling a list of
"available digital computer codes for nuclear reactor problems":
[http://www.osti.gov/energycitations/product.biblio.jsp?osti_...](http://www.osti.gov/energycitations/product.biblio.jsp?osti_id=4364015)

~~~
gte910h
My chronology may still be backwards, but they should still swap over to the
language of the mature field of software development, to better allow
themselves to integrate good practices.

The bifurcation is still harmful to them even if the software usage originated
later than the science term.

------
jzila
My girlfriend is a PhD student in a pharmacology lab. I'm a software engineer
working for an industry leader.

Once, she and the lab tech were having issues with their analysis program for
a set of data. It was producing errors randomly for certain inputs, and the
data "looked wrong" when it didn't throw an error. I came with her to the lab
on a Saturday and looked through the spaghetti code for about 20 minutes. Once
I understood what they were trying to do, I noticed that they had forgotten to
transpose a matrix at one spot. A simple call to a transposition function
fixed everything.
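
That failure mode, errors for some inputs and silently wrong numbers for
others, is easy to reproduce. A contrived numpy sketch (not their actual
analysis code; the function and names are made up):

```python
import numpy as np

def project(data, basis_rows):
    # Intended math: project each observation (a row of `data`) onto
    # component vectors stored as ROWS of `basis_rows`, i.e.
    # data @ basis_rows.T. The line below forgets the transpose.
    return data @ basis_rows  # BUG: should be data @ basis_rows.T

rng = np.random.default_rng(42)
data = rng.normal(size=(10, 3))

# With a non-square basis the shapes clash, so the bug at least errors out:
try:
    project(data, rng.normal(size=(2, 3)))  # (10,3) @ (2,3) fails
except ValueError:
    pass  # numpy reports the dimension mismatch

# With a square basis it runs without complaint and is silently wrong:
basis_sq = rng.normal(size=(3, 3))
wrong = project(data, basis_sq)   # (10,3) @ (3,3): no error raised
right = data @ basis_sq.T
assert wrong.shape == right.shape
assert not np.allclose(wrong, right)  # same shape, different numbers
```

The square case is the scary one: nothing crashes, the output has the right
shape, and only someone who knows what the numbers should look like will
notice.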

If this had been an issue that wasn't throwing errors, I don't know whether
they would have even found the bug. I've been trying to teach my gf a basic
understanding of software development from the ground up, and she's getting a
lot better. But this does appear to be a systemic problem within the
scientific community. As the article notes, more and more complicated programs
are needed to perform more detailed analysis than ever before. This problem
isn't going to go away, so it's important that scientists realize the
shortcoming and take steps to curb it.

~~~
reinhardt
Not sure how your anecdote relates to the conclusion. Forgetting to transpose
a matrix, or even knowing why you should, is not an example of a problem that
can be solved by "a basic understanding of software
development". Hell, I'm sure
there are many decent hackers that don't know what a matrix is, let alone spot
such errors within a long sequence of computations.

~~~
Dove
I disagree. Well written and abstracted code makes mathematical formulas easy
to scan and proofread.

Your code should look like this:

        angle = acos( ( a . b ) /
                      (|a|*|b|) )

If it looks like this . . .

        angle2 = Math.acos((a.x*b.x + a.y*b.y +
          a.z*b.z)/Math.sqrt(a.x*a.x + a.y*a.y +
          a.z*a.z)*Math.sqrt(b.x*b.x + b.y*b.y +
          b.z*b.z))

. . . you're likely to miss the error*

Bad code compiles. Good code works right. Great code is so obviously right you
don't have to wonder.

*Those are the same formula, though the second one is missing some critical parentheses. I use the example because I have done exactly this and been bitten by exactly this, and now am fanatical about keeping my mathematical formulas clean and obvious.
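
The clean form is just as easy to write in an ordinary language. A numpy
sketch of the first version, with the parentheses where they belong (the clip
against rounding error is an extra precaution, not part of the formula
above):

```python
import math
import numpy as np

def angle_between(a, b):
    # Mirrors the formula acos((a . b) / (|a| * |b|)) term for term,
    # so the parenthesization can be checked at a glance.
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # guard rounding

# Perpendicular unit vectors: the angle should be pi/2.
assert math.isclose(angle_between(np.array([1.0, 0.0, 0.0]),
                                  np.array([0.0, 1.0, 0.0])),
                    math.pi / 2)
```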

------
GoogleMeElmo
Yes, this is a huge problem. I am a software engineer working at a research
institute for bioinformatics. The biggest problem I encounter in my struggle
for clean maintainable code is that management deprioritizes this task quite
heavily.

The researchers produce code of questionable quality that needs to go into the
main branch asap. The few researchers who know how to code (we do a lot of
image analysis) don't know anything about keeping it maintainable. There is
almost a hostile stance against doing things right, when it comes to best
practices.

The "Works on my computer" seal of approval has taken on a whole new meaning
for me. Things go from prototype to production by a single correct run on a
single data set. Sometimes it's so bad I don't know if I should laugh or cry.

Since we don't have a single test, or ever take the time to set up a proper
build system, my job description becomes mostly droning through repetitive
tasks and bug hunting. It sucks the life right out of any self-respecting
developer.

There, I needed that. Feel free to flame my little rant down into the abyss.
:)

------
bh42222
_As a general rule, researchers do not test or document their programs
rigorously, and they rarely release their codes, making it almost impossible
to reproduce and verify published results generated by scientific software,
say computer scientists._

Just stop doing that!

Seriously, testing is not wasted effort, and for any project that's large
enough it's not slowing you down. For a very small and simple project testing
might slow you down; for bigger things, testing makes you faster! And the
same goes for documentation. And full source code should be part of every
paper.

 _Many programmers in industry are also trained to annotate their code
clearly, so that others can understand its function and easily build on it._

No, you document code primarily so YOU can understand it yourself. Debugging
is twice as hard as coding, so if you're just smart enough to code it, you
have no hope of debugging it.
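
To make "testing makes you faster" concrete: even one regression test that
pins a routine to an analytically known case catches whole classes of silent
breakage. A hypothetical Python example (the function and the physical case
are illustrative, not from any particular codebase):

```python
import math

def planck_peak_wavelength(temperature_k):
    """Wien's displacement law: peak wavelength (metres) of blackbody
    emission at the given temperature (kelvin)."""
    WIEN_B = 2.897771955e-3  # m*K, CODATA value
    return WIEN_B / temperature_k

def test_solar_peak():
    # Regression test against a known case: the Sun (~5778 K)
    # peaks near 501.5 nm. If a refactor breaks the formula,
    # this fails immediately instead of tainting results later.
    assert math.isclose(planck_peak_wavelength(5778.0), 501.5e-9,
                        rel_tol=1e-3)

test_solar_peak()
```

One assert like this costs minutes to write and then runs for free on every
change.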

~~~
pygy_
You won't see any clean code written by scientists until (major) journals make
it mandatory to submit code for peer review and publication.

When it happens, I hope that they'll manage to agree on a sensible license
(even though I won't set my hopes too high).

~~~
rflrob
I have all of my code on github under a CRAPL license [1]. It assumes a
certain amount of good faith from others, but I feel that if you're worrying
about getting scooped, your problem isn't ambitious enough. Luckily, my
adviser agrees, and is very in favor of open releases of data [2].

[1]<http://matt.might.net/articles/crapl/>
[2]<http://www.michaeleisen.org/blog/?p=440>

~~~
pygy_
The terms of the license are good, but its name is literally crappy :-/

------
notarealname
[New account for anonymity]

An often neglected force in this argument is that many practitioners of
"scientific coding" take rapid iteration to its illogical and deleterious
conclusion.

I'm often lightly chastised for my tendencies to write maintainable,
documented, reusable code. People laugh guiltily when I ask them to try
checking out an svn repository, let alone cloning a git repo. It's certain
that in my field (ECE and CS) some people are very adamant about clean coding
conventions, and we're definitely able to make an impact bringing people to
use more high level languages and better documentation practices.

But that doesn't mean an hour goes by without seeing results reversed due to
a bug buried deep in 10k lines of undocumented C or Perl or MATLAB full of
single-letter variables and negligible modularity.

~~~
gte910h
What I'm hearing from you is that we need to lobby the US government into an
open-code requirement for grant work done under its purview.

Would some sort of git front end that unwilling people could use also make
things better?

------
gwern
An interesting citation <http://portal.acm.org/citation.cfm?id=188228> :

> This paper describes some results of what, to the authors' knowledge, is the
> largest N-version programming experiment ever performed. The object of this
> ongoing four-year study is to attempt to determine just how consistent the
> results of scientific computation really are, and, from this, to estimate
> accuracy. The experiment is being carried out in a branch of the earth
> sciences known as seismic data processing, where 15 or so independently
> developed large commercial packages that implement mathematical algorithms
> from the same or similar published specifications in the same programming
> language (Fortran) have been developed over the last 20 years. The results
> of processing the same input dataset, using the same user-specified
> parameters, for nine of these packages is reported in this paper. Finally,
> feedback of obvious flaws was attempted to reduce the overall disagreement.
> The results are deeply disturbing. Whereas scientists like to think that
> their code is accurate to the precision of the arithmetic used, in this
> study, numerical disagreement grows at around the rate of 1% in average
> absolute difference per 4000 lines of implemented code, and, even worse, the
> nature of the disagreement is nonrandom. Furthermore, the seismic data
> processing industry has better than average quality standards for its
> software development with both identifiable quality assurance functions and
> substantial test datasets.

~~~
gwern
There's a later paper available for reading:
<http://www.leshatton.org/Documents/Texp_ICSE297.pdf>

------
saulrh
Something I heard from one of my professors once: "A programmer alone has a
good chance of getting a good job. A scientist alone has a good chance of
getting a good job. A scientist that can program, or a programmer that can do
science, is the most valuable person in the building."

~~~
earl
That has not been my experience. I spent years working on DNA aligners then in
a wetlab building software for confocal laser microscopy. In both locations,
the best paid and most highly valued people were the scientists. If you, say,
were a good developer with masters in stats and a strong understanding
(somewhere between an undergrad and an MS) of the relevant science... you were
paid 1/2 as much as you would be paid if you did computational advertising.

And yet there aren't many good developers doing science. Weird, huh?

~~~
gammarator
Agreed that low salaries are a disincentive for good engineers to work in
science. It's worth noting that even the scientists are paid far less than
people with commensurate training and experience in industry.

As for why scientists are more highly valued: they bring in the grants that
keep the wheels turning (in academic circles; industry & national labs
obviously differ).

------
brohee
Next they'll discover that when those scientists leave academia and become
quants, they don't magically become any better at coding (but at least they
now have access to professionals, if they recognize the need).

------
ajdecon
_(Disclaimer: my background is in materials physics, and it may be different
in other fields. But I doubt it.)_

Unfortunately there is very little _direct_ incentive for research scientists
to write or publish clean, readable code:

\- There are no direct rewards, in the tenure process or otherwise, for
publishing code and having it used by other scientists. Occasionally code
which is widely used will add a little to the prestige of an already-eminent
scientist, but even then it rarely matters much.

\- Time spent on anything other than direct research or publication is seen as
wasted time, and actively selected against. Especially for young scientists
trying to make tenure, also the group most likely to write good code. Many
departments actually discourage time spent on _teaching_, and they're paid to
do that. Why would they maintain a codebase?

\- Most scientific code is written in response to specific problems, usually a
body of data or a particular system to be simulated. Because of this, code is
often written to the specific problem with little regard for generality, and
only rarely re-used. (This leads to lots of wheel re-invention, but it's still
done this way.) If you aren't going to re-use your code, why would others?

\- If by some miracle a researcher produces code which is high-quality and
general enough to be used by others, the competitive atmosphere may cause them
to want to keep it to themselves. Not as bad a problem in some fields, but I
hear biology can be especially bad here.

\- Most importantly, _the software is not the goal_. The goal is a better
understanding of some natural phenomenon, and a publication. (Or in reverse
order...) Why spend more time than absolutely necessary on a single part of
the process, especially one that's not in your expertise? And why spend 3x-5x
the cost of a research student or postdoc to hire a software developer at
competitive rates?

I went to grad school in materials science at an R1 institution which was
always ranked at 2 or 3 in my field. I wrote a _lot_ of code, mostly image-
processing routines for analyzing microscope images. Despite it being
essential to understanding my data, the software component of my work was
always regarded by my advisor and peers as the least important, most annoying
part of the process. Time spent on writing code was seen as wasted, or at best
a necessary evil. And it would never be published, so why spend even more time
to "make it pretty"?

I'm honestly not sure what could be done to improve this. Journals could
require that code be submitted with the paper, but I really doubt they'd be
motivated to directly enforce any standards, and I have no faith in scientists
being embarrassed by bad code. Anything not in the paper itself is usually of
secondary importance. (Seriously, if you can, check out how bad the
"Supplementary Information" on some papers is.) But even making bad code
available could help... I guess. And institutions could try to more directly
reward time put into publishing good code, but without the journals on board
it may be seen as just another form of "outreach"--i.e., time you should have
been in lab.

I did publish some code, and exactly two people have contacted me about it.
That does make me happy. But many, many more people have contacted me to ask
about how I solved some problem in lab, or what I'm working on now that they
could connect with. (And are always disappointed when I tell them I left the
field, and now work in high-performance computing.) Based on the feedback of
my peers... well, on what do you think I should've spent my time?

~~~
JohnnyBrown
I'm working in biology now, and a good example of researchers who produce
quality, documented code that other people find useful is the Knight group at
UC Boulder. They write python with good docs and support, publish the
algorithms they come up with in bioinformatics journals, and people cite them
all the time.

Might be worth thinking about why there are incentives there and not
elsewhere.

~~~
roadnottaken
That is an excellent example and a good point. But, for what it's worth, the
Knight lab doesn't really do any biology. Most of their biology is done by
collaboration with other labs, and the people in the lab are almost entirely
programmers or database people. There's nothing wrong with that, but it's more
an example of programmers getting into biology than the other way around.

------
arctangent
I think it is unreasonable to expect that a person will be a good programmer
just because (a) they are a scientist and (b) their current project can be
assisted by computers.

Is it not sensible, perhaps, to have a dedicated group of programmers (with
various specialities) available as a central resource to assist the scientists
with their modelling? (I am imagining a central pool whose budget would be
spread over several areas.)

I personally love working on toy projects related to science. Maybe we hackers
with time for that kind of thing should volunteer in some way to assist with
the technical aspects of research that is directed by a scientist? I'm not
sure I'd even care about getting a credit on a research paper so long as I
could post pretty pictures and graphs on my blog...

------
ANH
From personal experience, I attest that it can be more difficult than pulling
teeth to get a scientist to commit code to a version control system.

~~~
pwang
Greg Wilson once commented that the subversive way to get scientists to use
source control was not to pitch it as a code history tool, but rather as a
nifty way to sync up code between their work machines, home machines, etc. He
said he had a lot more traction with that than trying to lecture them about
having code history.

~~~
pama
Dropbox has invalidated this pitch.

~~~
eru
Not completely. Git is better at merging.

~~~
gte910h
Git's also a bit hard for them to handle by themselves with how rarely they
need to use it (I'm a huge git proponent, but also a realist).

------
scott_s
One of the main sources in the article is a study from the 2009 Workshop on
Software Engineering for Computational Science and Engineering. One of the
workshop's organizers has an interesting report on the overall conference:
<http://cs.ua.edu/~carver/Papers/Journal/2009/2009_CiSE.pdf>

------
mclin
Rather than building these data analysis/visualization programs from scratch
each time, my thought is that scientists should instead be writing them as
modules for a data workflow application like RapidMiner.

If you haven't heard of RapidMiner, you basically edit a flowchart where each
step takes inputs and outputs, eg take some data and make a histogram, or
perform a clustering analysis.

Video of someone demoing it: <http://www.youtube.com/watch?v=TNESlvXp47E>

This way, the scientists can focus on the algorithms and not have to worry
about all the other details of creating useable, maintainable software.

~~~
droz
Do you know of any other good data analysis applications similar to
RapidMiner?

~~~
mclin
I don't know of any first hand, but others I've heard of: Taverna:
<http://taverna.org.uk/> and Trident:
<http://www.microsoft.com/mscorp/tc/trident.mspx>

------
gwern
There are a lot of suggestions that the code and data be required to publish.

Sorry guys, but that hasn't worked so far: the economics journal _Journal of
Money, Credit and Banking_, which required researchers to provide the data &
software that could replicate their statistical analyses, discovered that
<10% of the submitted materials were adequate for repeating the paper (see
"Lessons from the JMCB Archive", Volume 38, Number 4, June 2006).

Oops.

------
sliverstorm
Why not just hire comp scientists or programmers permanently? Adjust the
company model, permanently segregate the work?

~~~
prospero
I did some work on data visualization for the astrophysics department when I
was in college. I started to work with the simulation code, but found that the
math was sprinkled everywhere, which made it really difficult for me to make
structural changes without risking the integrity of the program.

One of the most elusive skills for self-taught programmers is how to structure
code properly. A good architecture would allow domain experts and non-expert
programmers to coexist, but that would require throwing away a lot of existing
spaghetti code written by domain experts, which is not going to be a popular
decision.

~~~
walrus
I'm a programmer who is studying physics in college, and a couple years back I
had a similar experience with simulation code as you did. I didn't have any
issues with the math—the program I was working on didn't have anything more
conceptually advanced than multivariable calculus—but I did struggle
significantly to understand the physics behind the simulation.

It didn't help that most programs use, for example, the variable 'rho' for
density instead of just writing out 'density'.
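
A made-up Python illustration of the difference (neither version is from any
real simulation):

```python
R_GAS = 8.314  # J / (mol K), universal gas constant

# The style physics code often has: correct, but opaque without the paper.
def p(rho, t, m):
    return rho * R_GAS * t / m

# The same ideal-gas relation, p = rho * R * T / M, spelled out:
def ideal_gas_pressure(density, temperature, molar_mass):
    """Pressure (Pa) of an ideal gas from density (kg/m^3),
    temperature (K), and molar mass (kg/mol)."""
    return density * R_GAS * temperature / molar_mass

# Roughly air at sea level: ~1.2 kg/m^3, 288 K, 0.029 kg/mol.
assert p(1.2, 288.0, 0.029) == ideal_gas_pressure(1.2, 288.0, 0.029)
```

Both compute the same thing; only one can be sanity-checked by someone who
wasn't in the room when it was written.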

On the other hand, reading game physics libraries (written by programmers, not
physicists) can be just as bad. There are physics hacks all over ("it's not
stable, so let's throw in an arbitrary constant") and there's code repetition
where the programmer doesn't understand that two concepts are closely related.

------
snissn
There aren't nearly enough open source academic projects, nor is there any
sort of pervasive culture that encourages them. Despite the litany of
examples that could be put together to show that open source + academia does
exist and does work, I've read way too many computational physics or
computational chemistry or computational anything academic papers that simply
do not publish source code, and imo there's no good excuse for it, other than
the usual: funding, or copyright / university IP.

~~~
neworder
There is an important factor discouraging publishing source code - fear that
there indeed are bugs and they will be exposed. This is blatantly "security
through obscurity", but I fear it's a common attitude. If there are bugs and
code is secret, even if someone else later points out that the results
contradict their own findings, it's (presumably) not difficult to sweep the
thing under the rug and cool it down. On the other hand, if the paper is
published, code is public and someone spots serious bugs, it's instantly a big
shame... (code review as a part of peer review would help, but it's very
unrealistic - already, reviewing is very time consuming).

In addition, there are really no structural/institutional incentives to
produce and share good quality scientific code. Maintaining good code costs
much effort and, currently, gives few short-term benefits. It's often easier
to produce crappy code, get the results, publish and move ahead.

~~~
eru
Won't you get a citation, whenever somebody uses your code?

~~~
starwed
Typically a review paper describing the software is what is actually cited,
but yes!

For instance, in my department there is a guy who maintains an astrophysical
software package called Cloudy. The faq[1] describes how to cite it. (Unlike a
lot of the software mentioned here, that project actually is open source, uses
version control, and was migrated from the original Fortran to C++.)

[1]<http://www.nublado.org/wiki/FaqPage>

------
rflrob
Where do most programmers get this exposure to best practices like version
control, unit testing, etc.? I took a few early- to mid-level CS classes, and
while there was a relatively cursory emphasis on readable code, there was
barely any on the sorts of things that lead to well-maintained projects. If
these are
the sorts of things that one learns at your first internship, then it's no
wonder that academics in other disciplines don't have any exposure to it.

~~~
bendmorris
The vast majority of students are never exposed to these concepts and those
who are usually teach themselves. We've been teaching a class called
"programming for biologists" that teaches practical skills that are needed in
scientific computing - simple database use, version control, etc. - and we've
seen huge demand from students and faculty members in many departments.

------
jleyank
This is a difficult situation. Is it easier to train the domain experts to be
competent programmers or train the competent programmers to be domain experts?
In a research environment, I worry there's little time or interest in
developing specs that can change in an instant or can't be written until the
physics is understood.

We find it quite difficult to get useful programming out of people who don't
know why carbon has 4 bonds while nitrogen has 3, for example.

~~~
gammarator
My feeling is that a one-semester required course for students in "software
carpentry" [1] (as developed by Greg Wilson and discussed in the article)
would cure many of the most serious ills in scientific software development.
Students can't know they should be using version control, debuggers, and
testing if they don't even know such things exist.

[1] <http://software-carpentry.org/>
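
For readers who have never seen one, a unit test for a scientific routine can
be very small. A minimal sketch in Python (the integrator here is a toy
written for illustration, not from any particular package; the test runs with
plain python or with pytest):

```python
def midpoint_rule(f, a, b, n=1000):
    """Approximate the integral of f over [a, b] with n midpoint slices."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

def test_midpoint_rule_on_known_integral():
    # Check the code against an integral we know exactly:
    # the integral of x^2 over [0, 1] is 1/3.
    approx = midpoint_rule(lambda x: x * x, 0.0, 1.0)
    assert abs(approx - 1.0 / 3.0) < 1e-5

test_midpoint_rule_on_known_integral()
```

The key habit is testing numerical code against cases with known answers, so
that a refactor (or a units mix-up) fails loudly instead of silently skewing
results.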

------
radarsat1
I think there are multiple reasons for this problem, and only one of them is a
lack of training in software management. Another problem is that science is an
inherently exploratory procedure. You design an experiment, gather some data,
and then go about analyzing it. You have an idea of what you'll find, but
depending on what you get, you might need to then reformat/restructure the
data, transform it, cut it up, etc.

The problem is that this represents one of the worst problem cases in software
design: evolving requirements. By itself this is bad enough. Recently I have
been analysing data from a study. You start off with a data structure
that you think represents things, but then you notice for example you need to
synchronize several recordings; now you have to track time. You realize some
recordings need to be split down the middle to aid in synchronization; now you
need to add a 'part' field. You derive some value from several data points
that takes a long time to compute, so you need to create a file to hold it.
This needs to be kept in sync with the original data. Eventually you realize
that text files aren't going to cut it; you start moving things to a database.
Now you need to reconfigure your visualization program to read from the
database. Then you realize that you want to add another similar derivative
value, but this time it's a 3x3 matrix for each data point; time to extend the
database again. etc.. etc.. Eventually you decide it would be best to really
rewrite the codebase because it's becoming impossible to work with.
Unfortunately the paper is due soon and you just need to generate a few more
graphs..

And I didn't even mention the growing directory of scripts that aren't
properly organized into modules, that end up with copy-pasted code because
it's not very clear how to cleanly put this into a function, or which module
it should belong to.

Now, this is bad enough when you have a CS degree and have designed several
software frameworks in your life. Combine this with someone who knows nothing
about software architecture and you have a really big problem on your hands.
My point is this: it happens to the best of us, no matter how hard you try to
organize things, when you don't have the requirements available ahead of
time.

The best approach I've found is to force myself to simply write functions as
small as possible, that do one simple thing at a time. I try to break up
functions as much as possible for reuse, and avoid copy-pasting code at all
costs. Admittedly it's not always easy; sometimes a function that generates a
particular graph just needs a certain number of lines of logic, and it's very
difficult to modularize. Then you find that you want a similar graph but with
a slightly different transformation on the Y axis... etc.. etc..
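
One way to tame the "similar graph, slightly different Y transformation"
problem is to pass the transformation in as a parameter instead of
copy-pasting the plotting code. A rough sketch of the idea (function and data
names are hypothetical; a real version would hand the pairs to a plotting
library rather than return them):

```python
import math

def plot_series(points, y_transform=lambda y: y):
    """Apply y_transform to each y value and return the (x, y) pairs
    to be drawn; the transform is the only thing that varies per graph."""
    return [(x, y_transform(y)) for x, y in points]

data = [(0, 1.0), (1, 10.0), (2, 100.0)]
linear = plot_series(data)              # raw values
logged = plot_series(data, math.log10)  # same code, log-scaled y axis
```

This doesn't solve the evolving-requirements problem, but it keeps each new
variant a one-line call instead of another near-duplicate function to keep in
sync.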

------
cool-RR
My approach is to have the scientist write as little code as possible. That's
why I'm working on GarlicSim:

<http://garlicsim.org>

GarlicSim's goal is to do all the technical, tedious work involved in writing
a simulation while letting the scientist write only the code that's relevant
to his field of study.

------
salva_xf
Could it be that they are not using the right language? If they had a domain
specific language on top of Common Lisp, for example, they would get much
better code with less work, I think

------
JonnieCache
Maybe we can do an outreach program? Hackers adopting scientists?

~~~
draven
I wish this kind of mentoring program would be implemented here (I work in a
big research center). Often we programmers end up having to integrate code
written by scientists in our apps, and it's a pain. Even a quick glance over
the code is often enough to see some problems.

------
sc68cal
This is why partnering is key. I partnered up with a geneticist who
understands his subject matter, while I focus on mine.

The end result was a grant funded by NIH.

------
jostmey
Build in support for units and require their use! Problem partially
solved :-)
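
A minimal sketch of the idea in Python (a toy class, not a real units
library; real projects would reach for something like Pint): values carry
their unit, so a meters-vs-feet mix-up raises an error instead of silently
corrupting the result.

```python
class Quantity:
    """A number tagged with a unit string; arithmetic checks the tags."""

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def __add__(self, other):
        if self.unit != other.unit:
            raise ValueError(f"unit mismatch: {self.unit} + {other.unit}")
        return Quantity(self.value + other.value, self.unit)

    def __repr__(self):
        return f"{self.value} {self.unit}"

altitude = Quantity(120.0, "m") + Quantity(30.0, "m")  # fine: 150.0 m
# Quantity(120.0, "m") + Quantity(400.0, "ft")         # raises ValueError
```

Only "partially solved" indeed: this catches mismatches but doesn't convert
between units or track derived dimensions, which is where a real library
earns its keep.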

------
gte910h
Is there a git client for the unwilling?

I could see that solving some of the issues.

------
tedjdziuba
I know of a company, made up of scientists from academia, that develops
software by writing the code (or "codes" as they call it) in Microsoft Word
documents and e-mailing them to each other.

Somehow, they are still in business.

True story.

~~~
xxd
This wouldn't be a healthcare startup in SF, would it?

~~~
reinhardt
I shudder to think that such a "company" could be developing health-related
software.

~~~
madaxe
You should see the shit they design pharmaceuticals with.

~~~
TWAndrews
Oh my god yes. As someone who cut their teeth developing software for
pharmaceutical research, I can testify that much of it is absolute crap, by
a variety of metrics.

