On the positive side, such difficulty is also in the nature of science itself. Scientists already understand that rigorous peer review is the only way to come to reliable scientific conclusions over time. The only thing they need help with understanding is that the software used to come to these conclusions is as suspect as—if not more so than—the scientific data collection and reasoning itself, and therefore all software must be peer-reviewed as well. This needs to be ingrained culturally into the scientific establishment. In doing so, the scientists can begin to attack the problem from the correct perspective, rather than industry software experts coming in and feeding them a bunch of cargo cult "unit tests" and "best practices" that are no substitute for the deep reasoning in the specific domain in question.
There is a huge problem with even getting existing code to run on different machines. My team's work was primarily dealing with taking lots of project code (always emailed around, with versions in the file name) and rewriting it to produce data products that other people could even just view. Generally we'd just pull things like color coding out of the existing code and then write our processors from some combination of specifications and experimentation.
I'd agree that "unit tests" and trendy best practices are probably not the full answer, but the article is correct in emphasizing documentation, modularity, and source control. Source control alone would protect against bugs produced by simply running the wrong version of code.
the depressing truth is that guaranteeing correct software is incredibly difficult and expensive
One of the main problems is convincing, especially young, scientists that their code sucks. Young programmers, you can coach. You review their code, teach them what works and what doesn't and they get better. Scientists that happen to write programs, they don't learn to become better programmers: they've got other things to worry about. There's nobody to help them and since they're usually highly intelligent and overestimate their capabilities in things they don't want to spend time on (which is a way of justifying to yourself not spending time on it), they need all the more guidance to become good.
BUT isn't it better to use "cargo cult best practices", as you call them, than code-and-fix without any kind of formal test or documentation?
The whole point of these software programming practices is to improve overall quality with limited resources, not to craft perfect code.
The problem with doing science via models/simulation is that there just isn't a good way of knowing when it's "right" (well, at least in a lot of cases), so testing and verification are imperative. I can't tell you how many times I've laid awake at night wondering if my code has a bug in it that I can't find and will taint my research results.
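One partial remedy for the "is it right?" worry, for what it's worth: before trusting a simulation on the real problem, run it on a simplified case where the answer is known in closed form. Below is a minimal sketch of that idea. The toy problem (exponential decay), the RK4 integrator, and all the function names are my own assumptions for illustration, not anything from the article.

```python
import math


def simulate_decay(y0, k, t_end, steps):
    """Integrate dy/dt = -k * y from t = 0 to t_end with classical RK4."""
    dt = t_end / steps
    y = y0
    for _ in range(steps):
        k1 = -k * y
        k2 = -k * (y + 0.5 * dt * k1)
        k3 = -k * (y + 0.5 * dt * k2)
        k4 = -k * (y + dt * k3)
        y += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return y


def analytic_decay(y0, k, t):
    """Closed-form solution y(t) = y0 * exp(-k * t) of the same equation."""
    return y0 * math.exp(-k * t)


def decay_error(steps, y0=1.0, k=0.5, t_end=2.0):
    """Absolute error of the simulation against the closed-form answer."""
    return abs(simulate_decay(y0, k, t_end, steps) - analytic_decay(y0, k, t_end))


if __name__ == "__main__":
    # The error should shrink as the step size shrinks; if it does not,
    # something in the integrator is probably wrong.
    for steps in (10, 20, 200):
        print(steps, decay_error(steps))
```

This doesn't prove the full model is right, but a check like this catches a surprising number of sign errors and off-by-one-step mistakes, and it can be rerun every time the code changes.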
I suspect another big problem is that one student writes the code, graduates, then leaves it to future students, or worse, their professor, to figure out what they wrote. Passing on the knowledge takes a heck of a lot of time, especially when you're pressed to graduate and get a paycheck.
There's got to be a market in this somewhere. Even if it was just a volunteer service of "real" programmers who would help scientists out. I spent weeks trying to get my code running on AWS, which probably would have taken a few hours from someone who knew what they were doing. I also suspect that someone with practice could make my simulations run at twice the speed, which really adds up when you're doing hundreds of them and they take hours each.
I've written around 15000 lines of MATLAB for my research and only a handful of people will ever need to see it. Some is well-structured and nicely commented, but other parts are incomprehensible and were written under severe time constraints. My advisor is not much of a programmer and will not be able to figure it out, and I feel bad for leaving a pile of crappy code to the person who inevitably follows in my footsteps, but I ultimately have a choice between writing fully commented, well-tested, and well-structured code and graduating a semester late (at the cost of several thousand dollars to myself), or writing code that's "just good enough" to get results on time. This is a solo project (there is no money for a CS student to intern) and I'm not getting paid to write code unlike a professional programmer, so every second I spend improving my code beyond the bare minimum costs me time and money.
Even if I were able to tidy up and publish all of my code, most mechanical engineers would not be able to understand it because most can't write code. Those who can mostly use FORTRAN, although C is becoming more common. Nonetheless, even those who could understand my code would have little incentive to read through 15000+ lines of code.
Unfortunately, as far as research code is concerned, a lot of trust is still required on the part of the reader of the publication. I agree that the transfer of knowledge should be handled differently, but until there is a strong incentive for researchers to write good code it will continue to be bad. Especially when many research projects only require the code to demonstrate something, after which it can be put in the closet.
This concerns me. Is this kind of thinking pervasive in public academic institutions? Avoiding the copyright ownership issues that tend to accompany such discussions, would it not be better to be more open about the code in an attempt to gain peer review?
I understand your personal motivations about not publishing, but the statement about your advisor is what I'm worried about.
Of course, the problem is that it can sometimes take years to get large datasets published and this means that the code gathers dust and gets forgotten in the meantime. By contrast, the papers and results aren't, because those are the things by which academic careers are measured.
I would personally support a wholesale change in culture in this area. Code and data/results/conclusions are not as separable as most scientists would like to believe, and often should be published as a unit. There has been a push in this direction for a while in the engineering sciences, but other informatics disciplines like biology lag badly in this respect.
As to the last point, perhaps it's time the scientific community took software into consideration along with the data and its resulting papers. At the least, acknowledge the problem. At best, decide where (alongside the data? with the paper in progress?) the software should be stored.
(I'm not faulting you, you just react to the incentives.)
* Publish, so you can get grants
* Use that grant so you can publish more
* Get more grants
* Get tenure somewhere in the middle.
I have to confess I was very disgusted by him saying that in front of such a large audience of scientists and graduate students.
When a more appropriate way of quantifying research output and its benefits is found, hopefully a beneficial change in culture will trickle down into the academic trenches.
Until there is that job security, knowing that as long as you keep working you're not going to be randomly turfed out, this phenomenon will be a fundamental part of the academic career.
Really? Your funding isn't guaranteed?
When I did my Master's I was funded as an R.A. without my advisor/lab having to tap her particular grants. Grants and fellowships were usually seen as something "extra" for master's and Ph.D pre-quals students, not their main source of funding. I find it surprising that your school or department seems to (or is forced to) think differently.
It might have something to do with recent state budget cuts (it's a state-funded university). My department has also grown dramatically over the past few years, both in terms of faculty and students, so the graduate student funding will probably lag behind for a few years more.
This is starting to shape up on the Python side of things, but it has stagnated a little bit. People who can and do write the foundational code are oftentimes too focused on making the code work, and not at all focused on improving the quality of the ecosystem that their code is part of. Open Source is a great mechanism for many things, but polishing up the last 20% is not one of them.
Well, to some extent products like MATLAB solve this problem. For better or worse, I trust Matlab's ability to generate a (pseudo) random number, parallel process my functions, invert matrices, etc., etc.
On a broader level, thanks to the specialization of academia, chances are that the code I want to write isn't duplicated by others. Even if it is, I still have to trust them to have written it well - which is the whole problem here.
I guess I don't have as much hope as you do.
I've read about it in the context of speeding up global-illumination path-tracing for computer graphics.
I think it's based on work that was originally done for neutron scattering.
[ http://en.wikiquote.org/wiki/George_E._P._Box ]
Software and Code are both mass nouns in technical language.
"Code" can be in programs (aka, things that run), libraries (things that other programmers can use to make programs), or in samples to show people how to do things in their programs or libraries. Some people call short programs scripts.
When you feel you should pluralize "software", you're doing something wrong. You might want to use the word programs, you might want to use the word products, or you might want to just use it like a mass noun: "It turns out, thieves broke into the facility and stole some of the water" becomes, when talking about a theft of software, "It turns out, thieves broke into the facility and stole some of the software".
This annoys me, and it is everywhere. It indicates the writer has no idea what they're writing about and presumes that it's not a process but a matter of getting the right answer. "Hold on a sec, let me get out my Little Orphan Annie's Secret Decoder Ring."
(sibling deleted and moved here)
That said, at least (some kinds of) EEs seem to have it better - the basic Spice simulator was released under a permissive license a really long time ago, and there are people like Fabio Somenzi who make available things like CUDD (it's also used commercially.) Mind you, these have a significant overlap with CS, where the culture is different. I would be very happy to see a good open-source EM field solver, for example.
I'm not sure why this is though.
Perhaps concerned scientists and editors should reject the bifurcation here and take on the lingo of the field that creates the tool they have to use and need to learn better as a first step in learning to program in a more responsible manner?
"Code" also has connotations (self-contained, numerical, etc.) that make it distinct from "program" or even "library". A routine in ATLAS is a code, but Microsoft Word is not.
As one of many examples, here's a 1955 survey article compiling a list of "available digital computer codes for nuclear reactor problems": http://www.osti.gov/energycitations/product.biblio.jsp?osti_...
The bifurcation is still harmful to them even if the software usage originated later than the science term.
John Tukey is widely credited with coining the term "software" in print in 1958, but I'll wager that "codes" actually predates that.
When the manual is titled "Theoria combinationis observationum erroribus minimis obnoxiae", you know you're dealing with legacy code. In that case it's a pretty cool legacy, though.
Also, FORTRAN 77 is most definitely not an interpreted language.
"As a general rule, researchers do not test or document their programs rigorously, and they rarely release their codes, making it almost impossible to reproduce and verify published results generated by scientific software, say computer scientists."
"As recognition of these issues has grown, software experts and scientists have started exploring ways to improve the codes used in science."
Once, she and the lab tech were having issues with their analysis program for a set of data. It was producing errors randomly for certain inputs, and the data "looked wrong" when it didn't throw an error. I came with her to the lab on a Saturday and looked through the spaghetti code for about 20 minutes. Once I understood what they were trying to do, I noticed that they had forgotten to transpose a matrix at one spot. A simple call to a transposition function fixed everything.
If this had been an issue that wasn't throwing errors, I don't know whether they would have even found the bug. I've been trying to teach my gf a basic understanding of software development from the ground up, and she's getting a lot better. But this does appear to be a systemic problem within the scientific community. As the article notes, more and more complicated programs are needed to perform more detailed analysis than ever before. This problem isn't going to go away, so it's important that scientists realize the shortcoming and take steps to curb it.
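On the missing transpose: I obviously don't know what the lab's code looked like, but here is one guess at why a bug like that would throw errors only for some inputs and quietly give wrong numbers for others. In this sketch (plain Python lists, made-up function names), a forgotten transpose fails loudly on non-square input and silently on square input.

```python
def transpose(matrix):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Multiply two matrices, refusing to proceed if the shapes do not line up."""
    if len(a[0]) != len(b):
        raise ValueError(
            f"shape mismatch: {len(a)}x{len(a[0])} times {len(b)}x{len(b[0])}"
        )
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gram_matrix(samples):
    """Intended computation: samples times its own transpose."""
    return matmul(samples, transpose(samples))


def gram_matrix_buggy(samples):
    """Same computation with the transpose forgotten (deliberately wrong)."""
    return matmul(samples, samples)


if __name__ == "__main__":
    square = [[1.0, 2.0], [3.0, 4.0]]
    print(gram_matrix(square))        # intended result
    print(gram_matrix_buggy(square))  # runs without error, different numbers
```

The square case is the scary one: no exception, just different numbers. A tiny test against a hand-computed 2x2 example would catch it, which is roughly what "the data looked wrong" was doing informally.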
anyway, i disagree slightly with your analysis. in my experience academics know that they suck at the "engineering" part and, to make up for it, are very diligent in making sure that the results "feel right". so i don't think what you described was luck - that's how they work.
in comparison, what drives me crazy, is that if they learnt to use a few basic tools (scm, libraries, an ide, simple test framework) they could save so much time and frustration.
[related anecdote: last year i rewrote some c code written by a grad student that was taking about 24 hours to run. my python translation finished in 15 minutes and gave the same answer each time it was run (something of a novelty, apparently)].
Your code should look like this:
angle = acos( ( a . b ) / ( |a| * |b| ) )
Not like this:
angle2 = Math.acos((a.x*b.x + a.y*b.y +
a.z*b.z)/Math.sqrt(a.x*a.x + a.y*a.y +
a.z*a.z)*Math.sqrt(b.x*b.x + b.y*b.y +
b.z*b.z));
Bad code compiles. Good code works right. Great code is so obviously right you don't have to wonder.
*Those are the same formula, though the second one is missing some critical parentheses. I use the example because I have done exactly this and been bitten by exactly this, and now am fanatical about keeping my mathematical formulas clean and obvious.
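To make the parentheses point concrete, here is a sketch in Python. The helper names are mine, and the "missing parentheses" version is my reading of the slip described above: dividing by |a| and then multiplying by |b| instead of dividing by the product.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_clean(a, b):
    """cos(theta) = (a . b) / (|a| * |b|), written to mirror the formula."""
    return dot(a, b) / (norm(a) * norm(b))


def cosine_missing_parens(a, b):
    """Deliberately wrong: without parentheses this is (a . b) / |a| * |b|."""
    return dot(a, b) / norm(a) * norm(b)


def angle_between(a, b):
    """Angle in radians, clamping the cosine to guard against rounding error."""
    return math.acos(max(-1.0, min(1.0, cosine_clean(a, b))))


if __name__ == "__main__":
    a, b = (1.0, 0.0, 0.0), (2.0, 2.0, 0.0)
    print(angle_between(a, b))
    print(cosine_missing_parens(a, b))  # greater than 1, so not a valid cosine
```

Note that the two versions agree whenever |b| is 1 (or the vectors are perpendicular), so a test that only uses unit vectors would pass the broken formula. That is exactly the kind of thing that is easy to miss in the long one-liner.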
In the simulation sub-field I am in, there is this "research development process" which includes "verification" and "validation" after the model is implemented.
Part of the verification is done by "third party code reviews" in which a party unrelated to the program/project reviews the model description (word document) and does a line-by-line analysis of the code to see that the program matches the description.
I did that during my PhD (a Professor at INSEAD paid me to do a code review of a model).
In the case of your girlfriend's lab, they caught the error via "face validation" (the results looked wrong).
The researchers produce code of questionable quality that needs to go into the main branch asap. Those few of the researchers that know how to code (we do a lot of image analysis), don't know anything about keeping it maintainable. There is almost a hostile stance against doing things right, when it comes to best practices.
The "Works on my computer" seal of approval has taken on a whole new meaning for me. Things go from prototype to production by a single correct run on a single data set. Sometimes it's so bad I don't know if I should laugh or cry.
Since we don't have a single test, and never take the time to set up a proper build system, my job description becomes mostly droning through repetitive tasks and bug hunting. It sucks the life right out of any self-respecting developer.
There, I needed that. Feel free to flame my little rant down into the abyss. :)
Just stop doing that!
Seriously, testing is not wasted effort and for any project that's large enough it's not slowing you down. For a very small and simple project testing might slow you down, for bigger things - testing makes you faster! And the same goes for documentation. And full source code should be part of every paper.
Many programmers in industry are also trained to annotate their code clearly, so that others can understand its function and easily build on it.
No, you document code primarily so YOU can understand it yourself. Debugging is twice as hard as coding, so if you're just smart enough to code it, you have no hope of debugging it.
That they should is basically a given in the article. The question is how to make it happen.
Now, all of this is different if your research actually is building the model. But I'm speaking from experience on the rest. I've built plenty of software tools that I needed "right now" to get a set of data.
Writing good engineering software is not the scientist's goal so much as demonstrating that someone else with a greater tolerance for tedium (also someone better-paid) could write good engineering software.
In practice, of course, the quality of the software often does affect the quality of the output---but time spent on software quality creates less immediate value than it does in industry.
When it happens, I hope that they'll manage to agree on a sensible license (even though I won't set my hopes too high).
An often neglected force in this argument is that many practitioners of "scientific coding" take rapid iteration to its illogical and deleterious conclusion.
I'm often lightly chastised for my tendencies to write maintainable, documented, reusable code. People laugh guiltily when I ask them to try checking out an svn repository, let alone cloning a git repo. It's certain that in my field (ECE and CS) some people are very adamant about clean coding conventions, and we're definitely able to make an impact bringing people to use more high level languages and better documentation practices.
But that doesn't mean an hour goes by without seeing results reverse due to a bug buried deep into 10k lines of undocumented C or Perl or MATLAB full of single letter variables and negligible modularity.
Also some sort of git front end that unwilling people could use would make things better?
> This paper describes some results of what, to the authors' knowledge, is the largest N-version programming experiment ever performed. The object of this ongoing four-year study is to attempt to determine just how consistent the results of scientific computation really are, and, from this, to estimate accuracy. The experiment is being carried out in a branch of the earth sciences known as seismic data processing, where 15 or so independently developed large commercial packages that implement mathematical algorithms from the same or similar published specifications in the same programming language (Fortran) have been developed over the last 20 years. The results of processing the same input dataset, using the same user-specified parameters, for nine of these packages is reported in this paper. Finally, feedback of obvious flaws was attempted to reduce the overall disagreement. The results are deeply disturbing. Whereas scientists like to think that their code is accurate to the precision of the arithmetic used, in this study, numerical disagreement grows at around the rate of 1% in average absolute difference per 4000 lines of implemented code, and, even worse, the nature of the disagreement is nonrandom. Furthermore, the seismic data processing industry has better than average quality standards for its software development with both identifiable quality assurance functions and substantial test datasets.
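If anyone wants to try a small-scale version of that comparison on their own codes, here is a rough sketch. It assumes each implementation's output can be flattened into a list of floats of the same length. The metric (mean absolute difference divided by mean absolute magnitude) is only my guess at something in the spirit of "average absolute difference", not necessarily what the authors computed, and the function names are made up.

```python
from itertools import combinations


def mean_abs_difference(xs, ys):
    """Mean absolute difference between two outputs, relative to their mean magnitude."""
    if len(xs) != len(ys):
        raise ValueError("outputs must have the same length")
    if not xs:
        raise ValueError("outputs must not be empty")
    diff = sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)
    scale = sum(abs(x) + abs(y) for x, y in zip(xs, ys)) / (2 * len(xs))
    return diff / scale if scale else 0.0


def pairwise_disagreement(outputs):
    """Map each pair of implementation names to its relative disagreement.

    `outputs` is a dict from implementation name to a list of floats.
    """
    return {
        (a, b): mean_abs_difference(outputs[a], outputs[b])
        for a, b in combinations(sorted(outputs), 2)
    }


if __name__ == "__main__":
    results = {
        "pkg_a": [1.0, 2.0, 3.0],
        "pkg_b": [1.0, 2.0, 3.0],
        "pkg_c": [1.1, 2.0, 2.9],
    }
    for pair, value in pairwise_disagreement(results).items():
        print(pair, value)
```

Even two independent implementations (say, yours and a labmate's) compared this way will tell you more than one implementation checked by eye.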
I don't think it's the science that adds value, I think it's the programming. The thing is, programming allows you to automate, simulate, measure and visualize complex processes. Science is all about complex processes, so if you have more powerful tools available to understand them, you will be much more valuable. Add to that, many of the physical sciences are hitting limits of physical experimentation and require simulations for further understanding.
I don't think the power of programming has truly shown itself, it should revolutionize every industry. It brings with it a different attitude towards solving problems and opens up new realms of possibilities. Social sciences are finally starting to look like real science thanks to big data and we have new knowledge industries. I'm personally most interested in how much art and education will change thanks to new powers of interactivity.
I personally think games and science need to get much closer together, interactive learning is so powerful, and video games can make anything fun. It's definitely something to explore, but there is still a huge divide between science and entertainment and the understanding of the people in each field.
Georgia Tech has a fun Mobile Robotics Lab, and there are several other places that you could study further. (You'll gain training in the actuators, sensors, etc you'd need for your artistic work).
"People who can code in the world of technology companies are a dime a dozen and get no respect. People who can code in biology, medicine, government, sociology, physics, history, and mathematics are respected and can do amazing things to advance those disciplines."
"There needs to be a real shift in mindset away from worrying about how to get published in Nature and towards thinking about how to reward work that will be useful to the wider community."
EDIT: What the heck is wrong with this? We have two opinions on the perceived value of programmers in scientific enterprises, one from someone who works in the field (David Gavaghan) and another from Zed Shaw. I'm highlighting that Zed's perception is not universally agreed upon.
useful to the wider community = warm fuzzy feeling
Ideally you'd get both, but if it's one or the other, for most people a full belly wins.
1) Remove the paywall
2) Require publishing the code for computational papers (and the data for experimental papers)
Nature Group only cares about maintaining its status as a high impact factor journal, and scientists sheepishly submit to them. They actually love it that scientists worry about getting published in Nature.
Furthermore, at least at my institution, being the "programmer who knows some science" means that your position is entirely funded with "soft money", which means that your level of job security can be pretty low.
"Personally, I liked the university. They gave us money and facilities, we didn't have to produce anything! You've never been out of college! You don't know what it's like out there! I've worked in the private sector. They expect results."
And yet there aren't many good developers doing science. Weird, huh?
As for why scientists are more highly valued: they bring in the grants that keep the wheels turning (in academic circles; industry & national labs obviously differ).
I'd also cite myself, but I don't count since I'm in a robotics lab.
Sequence analysis companies and labs which don't value software engineering get what they pay for: serious or crippling inefficiencies and inability to do analysis on their data or maintain continuity. Unfortunately, many of them don't even realize what they need or how bad their inefficiencies are.
Unfortunately there is very little direct incentive for research scientists to write or publish clean, readable code:
- There are no direct rewards, in the tenure process or otherwise, for publishing code and having it used by other scientists. Occasionally code which is widely used will add a little to the prestige of an already-eminent scientist, but even then it rarely matters much.
- Time spent on anything other than direct research or publication is seen as wasted time, and actively selected against. Especially for young scientists trying to make tenure, also the group most likely to write good code. Many departments actually discourage time spent on teaching, and they're paid to do that. Why would they maintain a codebase?
- Most scientific code is written in response to specific problems, usually a body of data or a particular system to be simulated. Because of this, code is often written to the specific problem with little regard for generality, and only rarely re-used. (This leads to lots of wheel re-invention, but it's still done this way.) If you aren't going to re-use your code, why would others?
- If by some miracle a researcher produces code which is high-quality and general enough to be used by others, the competitive atmosphere may cause them to want to keep it to themselves. Not as bad a problem in some fields, but I hear biology can be especially bad here.
- Most importantly, the software is not the goal. The goal is a better understanding of some natural phenomenon, and a publication. (Or in reverse order...) Why spend more time than absolutely necessary on a single part of the process, especially one that's not in your expertise? And why spend 3x-5x the cost of a research student or postdoc to hire a software developer at competitive rates?
I went to grad school in materials science at an R1 institution which was always ranked at 2 or 3 in my field. I wrote a lot of code, mostly image-processing routines for analyzing microscope images. Despite it being essential to understanding my data, the software component of my work was always regarded by my advisor and peers as the least important, most annoying part of the process. Time spent on writing code was seen as wasted, or at best a necessary evil. And it would never be published, so why spend even more time to "make it pretty"?
I'm honestly not sure what could be done to improve this. Journals could require that code be submitted with the paper, but I really doubt they'd be motivated to directly enforce any standards, and I have no faith in scientists being embarrassed by bad code. Anything not in the paper itself is usually of secondary importance. (Seriously, if you can, check out how bad the "Supplementary Information" on some papers is.) But even making bad code available could help... I guess. And institutions could try to more directly reward time put into publishing good code, but without the journals on board it may be seen as just another form of "outreach"--i.e., time you should have been in lab.
I did publish some code, and exactly two people have contacted me about it. That does make me happy. But many, many more people have contacted me to ask about how I solved some problem in lab, or what I'm working on now that they could connect with. (And are always disappointed when I tell them I left the field, and now work in high-performance computing.) Based on the feedback of my peers... well, on what do you think I should've spent my time?
Might be worth thinking about why there are incentives there and not elsewhere.
In the past, the model of having many small labs in universities was a great idea. Today things are looking a bit different in some fields because larger labs can afford to do more automation (by hiring programmers instead of graduate students).
I wrote probably around 3000 lines of code on 4 separate projects (mostly MATLAB, C and Java). This code was never shared with anyone, my advisors were not interested in the code, all they cared about were the results. To be honest it wasn't very good code, I would have a hard time understanding it now (although I could probably figure it out eventually).
And after I graduated I took the code with me and I am the only person who ever verified the working of the code.
This bothers me on some level, since no one can really verify and inspect the results of my publications (unless they tracked me down to ask me for the code some of which has been lost) - but it is pretty much the norm in my field.
There was an interesting discussion about this on the Theoretical Computer Science Stackoverflow a while back:
Bottom line: Yes, we should probably do it (especially in areas where the research is simulation and the code encapsulates all the results) but we probably won't unless we're pushed.
Regarding your code, you could have just uploaded it to SourceForge or any other OpenSource repository. I know a guy (Steve Phelps) who did exactly that ( http://sourceforge.net/projects/jasa/ ) with his PhD code.
On a related note, the institute where I am working now has this "great" simulation program (written in-house in C++) on which a lot of publications have been based. However, the code is closed source and thus cannot be third-party verified.
This is wrong, and actually, a colleague of mine who just started doing her PhD found an error in the simulation program, bad enough that it makes me question the previous research.
In my opinion it must be a requirement that all software related to a publication be made open-source before (or at the same time as) the paper is published.
In the traditional research method, computer programs are part of the methods of the research. It is amazing that nowadays researchers can publish research without clearly showing the process they used to arrive at their results.
I'm left a little cynical after a Master's in computational science, and I still can't believe that open code is not part of the repeatability doctrine. I suppose my goals are not aligned with most grad students since I have no interest in an academic career (at least not after many years in industry) but I got much more satisfaction from feedback on my blog posts than publication.
Hell, each blog post is its own little publication, and it may not be peer reviewed before it's published, but the number of links to them and google searches prove that I have more than a few peers who appreciate my contributions.
1) Peer review is done anonymously and errors are discussed privately.
2) More people "wrote" it than read it.
I don't believe in the private model, so I release code when it's ready, regardless of where it fits in the publication cycle. It's pretty neat from a reproducibility perspective to submit a paper based on code that is runnable as a tutorial example shipped with a library that the reviewers stand a good chance of already having installed.
In my opinion, publishing all code that was used for the paper should be mandatory. Everything else is an obvious violation of the confirmation requirement in the scientific method.
But just like with Open Access, I have little hope that this will be adopted soon on a wide scale. If you are a student, I believe, all you can do is get permission to publish your code and do so. Maybe this will hurt tenure but it increases your karma!
Careers could be destroyed if people were held to account.
People in software see this as ludicrous. Of course there's bugs, just update the conclusions, and move on! But that's not how a lot of scientists think.
Of course, the problem with this is that it's a large amount of work and in most cases probably doesn't have a good ROI.
I have recently started to try this approach of better documenting everything, mostly because I have found it hard to go back to work I did 6 or 7 years ago and understand it (e.g. a bunch of one-off, poorly documented data processing scripts that could, if properly done, save me some time today). I haven't published anything like this yet, but it looks promising.
CS papers without code are even worse.
Is it not sensible, perhaps, to have a dedicated group of programmers (with various specialities) available as a central resource to assist the scientists with their modelling? (I am imagining a central pool whose budget would be spread over several areas.)
I personally love working on toy projects related to science. Maybe we hackers with time for that kind of thing should volunteer in some way to assist with the technical aspects of research that is directed by a scientist? I'm not sure I'd even care about getting a credit on a research paper so long as I could post pretty pictures and graphs on my blog...
If you haven't heard of RapidMiner, you basically edit a flowchart where each step takes inputs and outputs, eg take some data and make a histogram, or perform a clustering analysis.
Video of someone demoing it:
This way, the scientists can focus on the algorithms and not have to worry about all the other details of creating useable, maintainable software.
To a lesser (but free) extent, Orange http://orange.biolab.si/
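To make the "each step takes inputs and produces outputs" idea concrete, here is a rough Python sketch of the same pattern. To be clear, this is not RapidMiner's or Orange's API; the step names (`drop_missing`, `bin_values`, `histogram`) and `run_pipeline` are made up purely for illustration.

```python
from collections import Counter
from functools import reduce


def drop_missing(values):
    """Step 1: discard missing measurements (represented here as None)."""
    return [v for v in values if v is not None]


def bin_values(values, width=10):
    """Step 2: map each value to the lower edge of its bin."""
    return [int(v // width) * width for v in values]


def histogram(binned):
    """Step 3: count how many values fell into each bin."""
    return dict(sorted(Counter(binned).items()))


def run_pipeline(data, steps):
    """Feed the output of each step into the next one, like boxes in a flowchart."""
    return reduce(lambda current, step: step(current), steps, data)


counts = run_pipeline([3, 12, None, 17, 25], [drop_missing, bin_values, histogram])
print(counts)
```

Each step only knows about its own input and output, so steps can be reordered or swapped without touching the others.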
Sorry guys, but that hasn't worked so far: the economics journal _Journal of Money, Credit and Banking_, which required researchers to provide the data and software needed to replicate their statistical analyses, discovered that <10% of the submitted materials were adequate for replicating the paper (see "Lessons from the JMCB Archive", Volume 38, Number 4, June 2006).
One of the most elusive skills for self-taught programmers is how to structure code properly. A good architecture would allow domain experts and non-expert programmers to coexist, but that would require throwing away a lot of existing spaghetti code written by domain experts, which is not going to be a popular decision.
It didn't help that most programs use, for example, the variable 'rho' for density instead of just writing out 'density'.
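A small illustration of the naming point, in Python. The hydrostatic pressure formula is only an example I picked; the function and parameter names are hypothetical and not taken from any of the programs mentioned.

```python
STANDARD_GRAVITY_M_PER_S2 = 9.80665  # standard acceleration due to gravity


# Terse version, mirroring the notation of the formula p = rho * g * h.
def p(rho, h, g=STANDARD_GRAVITY_M_PER_S2):
    return rho * g * h


# Same calculation with the quantities and their units written out.
def hydrostatic_pressure_pa(density_kg_per_m3, depth_m,
                            gravity_m_per_s2=STANDARD_GRAVITY_M_PER_S2):
    """Pressure in pascals exerted by a fluid column of the given depth."""
    return density_kg_per_m3 * gravity_m_per_s2 * depth_m


print(hydrostatic_pressure_pa(density_kg_per_m3=1000.0, depth_m=10.0))
```

Both functions compute the same thing; the second one just doesn't require the reader to already know the notation.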
On the other hand, reading game physics libraries (written by programmers, not physicists) can be just as bad. There are physics hacks all over ("it's not stable, so let's throw in an arbitrary constant") and there's code repetition where the programmer doesn't understand that two concepts are closely related.
Grad students probably wrote the majority of the tools used in my lab, and it shows: when they leave, their knowledge of the software's issues and bugs goes with them. Years later those bugs resurface and no one has any idea of the original author's thought process.
It's quite annoying. We have such a project right now that is a basic piece of software we use for all our research. There is no current funding for someone to sit there and clean up the code. Most funding agencies want new work, not maintenance work, to be done with their money. There just aren't any incentives anywhere for this.
Someone will just reinvent the wheel probably.
I could see a role for staff computer scientists in areas of research where computation plays a particularly large role, but for typical grad-student data munging, the cost/benefit ratio is likely far too high.
In many cases, projects with sufficient technical depth require detailed domain knowledge that can be acquired quicker and cheaper by hiring scientists or engineers with that knowledge and then teaching them to code (or code better), rather than segregating the work and dealing with the problems that result from the communication gap.
In addition, there are really no structural/institutional incentives to produce and share good quality scientific code. Maintaining good code costs much effort and, currently, gives few short-term benefits. It's often easier to produce crappy code, get the results, publish and move ahead.
For instance, in my department there is a guy who maintains an astrophysical software package called Cloudy. The FAQ describes how to cite it. (Unlike a lot of the software mentioned here, that project actually is open source, uses version control, and was migrated from the original Fortran to C++.)
We find it quite difficult trying to get programming out of people who don't know why Carbon has 4 bonds while Nitrogen has 3, for example.
The problem is that this represents one of the worst problem cases in software design: evolving requirements. By itself this is bad enough. Lately I have been analysing data from a recent study. You start off with a data structure that you think represents things, but then you notice, for example, that you need to synchronize several recordings; now you have to track time. You realize some recordings need to be split down the middle to aid in synchronization; now you need to add a 'part' field. You derive some value from several data points that takes a long time to compute, so you need to create a file to hold it. This needs to be kept in sync with the original data. Eventually you realize that text files aren't going to cut it; you start moving things to a database. Now you need to reconfigure your visualization program to read from the database. Then you realize that you want to add another similar derivative value, but this time it's a 3x3 matrix for each data point; time to extend the database again. And so on. Eventually you decide it would be best to really rewrite the codebase because it's becoming impossible to work with. Unfortunately the paper is due soon and you just need to generate a few more graphs...
And I didn't even mention the growing directory of scripts that aren't properly organized into modules, that end up with copy-pasted code because it's not very clear how to cleanly put this into a function, or which module it should belong to.
Now, this is bad enough when you have a CS degree and have designed several software frameworks in your life. Combine this with someone who knows nothing about software architecture and you have a really big problem on your hands. My point is this: it happens to the best of us, no matter how hard you try to organize things, when you don't have the requirements available ahead of time.
The best approach I've found is to force myself to simply write functions as small as possible, that do one simple thing at a time. I try to break up functions as much as possible for reuse, and avoid copy-pasting code at all costs. Admittedly it's not always easy: sometimes a function that generates a particular graph just needs a certain number of lines of logic, and it's very difficult to modularize. Then you find that you want a similar graph but with a slightly different transformation on the Y axis... and so on.
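For what it's worth, here is a minimal Python sketch of the small-functions idea, with the Y-axis transformation passed in as a parameter so that a "similar graph with a different transform" doesn't mean copy-pasting the whole thing. The function names (`sort_by_x`, `transform_y`, `prepare_series`) and the sample data are made up for illustration, and the actual plotting call is deliberately left out.

```python
import math


def identity(y):
    """Default transform: leave the Y value unchanged."""
    return y


def sort_by_x(points):
    """Return the (x, y) points ordered by their X value."""
    return sorted(points, key=lambda point: point[0])


def transform_y(points, transform=identity):
    """Apply one transformation to the Y value of every point."""
    return [(x, transform(y)) for x, y in points]


def prepare_series(points, y_transform=identity):
    """Everything the graph needs; the plotting code only sees the result."""
    return transform_y(sort_by_x(points), y_transform)


raw = [(2.0, 100.0), (1.0, 10.0)]
linear_series = prepare_series(raw)             # graph 1: plain Y axis
log_series = prepare_series(raw, math.log10)    # graph 2: log-scaled Y axis
print(linear_series, log_series)
```

The second graph reuses every function from the first; only the transform changes. It won't save you from every awkward plotting function, but it keeps the variations in one place.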
GarlicSim's goal is to do all the technical, tedious work involved in writing a simulation while letting the scientist write only the code that's relevant to his field of study.
The end result was a grant funded by NIH.
I could see that solving some of the issues.
Somehow, they are still in business.
It's funny how this exact same mistake is made in the linked paper. For some reason, people outside of IT can't get it into their minds that "code" is an uncountable noun in this context.