
Simple rules for documenting scientific software - pplonski86
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006561
======
acidburnNSA
I manage a team of power plant design engineers writing complex scientific HPC
software, mostly in Python (which drives Fortran 77 codes behind the scenes,
among other things). It's been a long haul but through the years we've learned
a lot of good lessons and are pretty productive at what I consider at least
moderately good code. Everyone starts off by reading Clean Code and taking
basic proficiency training. We know to try to make code speak for itself
because, as we've all seen many times, comments lie. We've slowly learned from
the software engineering community how to set up things like Jenkins for CI,
black for Python code auto-formatting, Phabricator for revision control policy
and code review. We're writing docs in rst with sphinx, watching coverage, and
getting pylint 10/10. We're waking up to the wonders of static types.

Mandatory code review has turned out to be a great way to train newbies, write
better code, and make sure code is understandable to at least one other
person. It's been an investment (it's slowww) but I still think it pays
dividends. That's what I'd add to this advice. And I'd tone down the comments
focus. Comment only when you fail to express yourself clearly in the code.

~~~
l0b0
One thing I've found even better than code review is pair programming. At the
recommendation of the tech lead we all did it for the four years at my last
job, and it was a fantastic way to spread knowledge around. For thorny
problems we'd even "mob" on them, four or five of us sitting in front of one
big monitor, rotating who used the keyboard and mouse, discussing and giving
high-level instructions to the driver. We collectively figured we produced
features about as fast as with individuals programming (because of the
intensity and avoiding many dead ends) but that the quality was far superior.
Reviewing the process honestly (without getting personal) is key, though, to
address inefficiencies and interpersonal issues.

~~~
acidburnNSA
I believe that. We pair up on super complex things but have been promising to
do more of it, having heard of successes like yours.

------
hprotagonist
A really great idea: put DOIs in your function documentation.

Writing things like:

 _Implements equation 3.2 of Foo et al. (2005), doi:10.2.3/baz_

has saved me no end of pain in the past.
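In Python, that can live right in the docstring. A sketch of the style (function name, formula, and DOI are all invented here purely for illustration):

```python
import math

def saturation_pressure(temperature_c):
    """Saturation vapour pressure of water in Pa.

    Implements equation 3.2 of Foo et al. (2005), doi:10.2.3/baz.
    (Function, formula, and DOI are made up to show the documentation
    style, not taken from a real paper.)
    """
    # Magnus-style approximation; the DOI above tells a reader exactly
    # which paper and equation to check the coefficients against.
    return 610.94 * math.exp(17.625 * temperature_c / (temperature_c + 243.04))

print(saturation_pressure(0.0))  # 610.94 at 0 degrees C, since exp(0) = 1
```

Years later, a single `doi:` string turns "where did these magic numbers come from?" into a five-minute lookup.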

~~~
petschge
This is indeed a helpful thing. And leave a note if you had to rename the
variable called "s" in that paper to "x" to work with the 5 equations you
pulled from that other paper.
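For instance (authors, equation numbers, and names invented for illustration), the rename note can sit next to the citation in the docstring:

```python
def growth_rate(x, k):
    """Linear growth rate, eq. (4) of a hypothetical Bar et al. (2010).

    NOTE: the paper writes this variable as "s", but "s" already denotes
    entropy in the five equations taken from the other reference, so it
    is renamed "x" throughout this module for one consistent notation.
    """
    return k * x
```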

------
timeu
Some good basic suggestions that should be followed when doing software
engineering. But they forgot one aspect:

Don't be too smart for your own sake when packaging your software. As an HPC
administrator I stopped counting how many hours I spent installing scientific
software with broken Makefiles, interactive installers, simple github dumps,
and unresolvable dependencies. Sadly this is quite prevalent in the life
sciences (I am looking at you, Bioconductor).

------
gravypod
Would be nice to point the scientific community in the direction of basic
software engineering practices. Encouraging engineers to read Code Smells and
Refactoring.

Most of the scientific software I've worked with is completely unmaintainable
by anyone but the original makers. Bus factors of 1 are not sustainable.

~~~
danieltillett
I write scientific code for a living and my code is very difficult for anyone
else to maintain. It is not because my code is badly documented or written, it
is because what it does is very complex. Every module documents why it
exists and what it does, and the code is straightforward to read, yet the
interaction of all the modules is very complex because it reflects the
underlying complexity of the problem the code is solving. Some code is hard
to maintain
because it is solving a complex problem that few people understand.

~~~
perfmode
Managing complexity is one of the chief concerns of software engineering.

~~~
danieltillett
Complexity can be managed, but the required domain knowledge is hard to design
around. The problem my colleagues struggle with is the domain knowledge I have
that they don't. The positive from this is there are few people in the world
that can do my job :)

~~~
analog31
In my view, there's only a certain extent to which code can teach domain
knowledge. People tend to freeze up when they see any kind of math, physics,
or quantitative engineering. And managers want to believe that domain
knowledge is worthless because it refutes treating people as interchangeable
cogs.

The only solution is to make sure you keep someone on staff who has a hope of
understanding the underlying technology, and failing to do so is a business
risk like any other.

------
fredosega
I'm a 'scientific programmer' who learned how to code basically trial by fire.

My background was mechanical engineering which didn't focus on coding at all
other than needing to use/learn MATLAB for several classes. Went to grad
school for HPC/CFD and there I was given access to our group's simulation code
and was let loose to implement whatever routines I needed to simulate my
problems. The shared components were the input/output system and the primary
routine drivers (time-stepping and fluid dynamics algorithms and the like),
and what I mostly worked with were different constitutive models which hooked
into the system. Parallelization was implemented via MPI and was mostly
complete, so my only job with respect to parallelization was to make sure that
my algorithms would work in parallel.

I ended up taking several programming courses, but these were 100% focused on
topics like parallelization with MPI, shared memory parallelization, and
optimizing code, and a short stint in GPU programming. I learned nothing about
code management or best practices. Oh, and this was all using Fortran and C,
though now I'm working with C++ and Python, but that's because the newer
libraries seem to be C++ and Python is just easy to glue everything together
with.

My general programming knowledge isn't super great, but sometime this year I
managed to download an open source iOS app and without any prior knowledge of
Xcode or iOS programming/Swift was able to figure out how to implement
something that was missing.

This is already a lot of words, but lately I've been thinking about getting
out of academia and getting into actual software development cause I figure I
_kinda_ do that anyway. Obviously the easiest connections I could make would
be to maybe work for companies like ANSYS that develop CFD software, but I
feel like my programming knowledge is seriously lacking for that. You can give
me a scientific paper that describes an algorithm to do a thing and I'd have
no issue implementing it into some existing codebase but I read words written
in this discussion like "code smell" and "CI" and I have no idea what these
things are.

Anyway, can anyone recommend me some books to read and/or provide some advice
and/or anecdotes on jumping ship from HPC/CFD/scientific programming into a
general programming developer career?

~~~
sischoel
Have you thought about contributing to some open source software? It depends
on the project, but often after you submit your code, you will get a code
review. And you probably will also learn something about testing.

------
a-dub
11) Invariant: Someone should be able to go from a referenced paper or book to
your code without scratching their head too much. This can be approached by
either fortifying the documentation in the code or by specificity in a methods
paper.

------
amelius
Also don't forget to document all dependencies and their versions!
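One low-effort way to do that in Python, using only the standard library, is to dump the environment alongside your results (a sketch; the function name is mine, and real projects would wire this into their packaging or logging setup):

```python
import importlib.metadata
import sys

def environment_report():
    """Return "name==version" lines for Python itself and every installed
    distribution, suitable for pasting into a README or a results log."""
    lines = [f"python=={sys.version.split()[0]}"]
    lines += sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in importlib.metadata.distributions()
    )
    return lines

if __name__ == "__main__":
    print("\n".join(environment_report()))
```

Writing this next to each batch of results means a reader can reconstruct the environment instead of guessing which versions you had.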

------
petschge
If there is one thing I hate about HN it is how smug it is about software
engineering. The leading reason why scientific software stinks (and a fair
fraction does) is not that scientists and software engineers in that field
suck, but that there are very strong incentives AGAINST writing better
software.

Remember: This is not the 17th javascript framework, but software for problems
that we don't understand going in. And often enough we have not understood the
problem all that well even after a decade, when we write the third version
of the code. These
codes are research codes. Ongoing experiments. The main goal is NOT to produce
long-term maintainable software, but to produce scientific understanding and
build intuition about the systems that are modeled. The code is just another
tool among experiments, analytic calculations and back-of-the-envelope
discussions on a white board.

Rewriting the code every 5 years is an insane proposition in software
engineering, but completely ok in some fields of science.

Would I like better language support to check SI units for me? Sure. Would I
like highly performant libraries for vector fields that work with gcc 4.6 on
a top 500 machine? Sure. Would I like to be allowed to spend time on fixing
yeah-I-guess-it-works code? You bet.
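The SI-unit checking wished for above can at least be approximated at runtime in plain Python. A toy sketch of the idea (my own names throughout; not a substitute for a real units library such as pint):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantity:
    """A value tagged with SI dimension exponents (metres, kilograms,
    seconds). Adding incompatible dimensions raises immediately instead
    of silently producing nonsense."""
    value: float
    m: int = 0   # exponent of metres
    kg: int = 0  # exponent of kilograms
    s: int = 0   # exponent of seconds

    def _dims(self):
        return (self.m, self.kg, self.s)

    def __add__(self, other):
        if self._dims() != other._dims():
            raise TypeError(f"cannot add {self._dims()} to {other._dims()}")
        return Quantity(self.value + other.value, *self._dims())

    def __mul__(self, other):
        # Multiplication adds dimension exponents: m * m -> m^2, m / s etc.
        return Quantity(self.value * other.value,
                        self.m + other.m, self.kg + other.kg, self.s + other.s)

distance = Quantity(10.0, m=1)
duration = Quantity(2.0, s=1)
area = distance * distance        # fine: m^2
# distance + duration             # would raise TypeError: mixed dimensions
```

The point is not the thirty lines themselves but that the check fires the first time the code runs, not three papers later.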

But would I like HN to just shut up about "scientists just need to learn to
code"? Oh hell yes! Because -- believe it or not -- we often DO know better.
But fixing code is not what the taxpayers, what YOU, pay us for. We are paid
to understand nature. And until you are willing to pay higher taxes, spend
more money on science, and invest more in fixing long-term infrastructure,
you really do not get to be so damn condescending.

~~~
jasonpeacock
By this argument, chemists should never wash glassware. Keep re-using it until
it's too dirty, then throw it out and make new beakers.

Your tools affect the quality of your work. Work with shitty tools, get shitty
results.

I'll start believing your argument when scientists actually start publishing
the code to go with their papers and make it reproducible.

~~~
petschge
Two thirds of the codes in my field are on github. And for most others you can
get a copy if you ask politely by email. That said I would appreciate it if
journals not only had author, title, date and affiliation in the metadata,
but also a git url and commit ID.
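Recording that commit ID from inside the code is cheap. A sketch using only the standard library (assumes `git` is on the PATH and degrades to `None` otherwise; the function name is mine):

```python
import subprocess

def current_commit(repo_path="."):
    """Full SHA of HEAD for the repository at repo_path, or None if
    git is unavailable or the path is not inside a repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_path, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()
```

Stamping this SHA into every output file makes "which version of the code produced figure 3?" answerable years later, which is exactly what a git URL plus commit ID in the journal metadata would give readers.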

------
hyperpallium
> Prior work has focused on various aspects of open software development
> [1–7], but documenting software has been underemphasized.

This article - like all software engineering - is completely unscientific.
Scientific method:

> systematic observation, measurement, and experiment, and the formulation,
> testing, and modification of hypotheses

~~~
geoalchimista
It is an editorial, not a peer-reviewed research article. Mind the difference.

