
Good Enough Practices in Scientific Computing - privong
https://arxiv.org/abs/1609.00037
======
mbreese
One thing they (intentionally) left out was including a Makefile. This is one
of the "Good Enough" practices that I also include. It doesn't matter if it is
a Makefile or another master script or tool. What is important is that you
have a manifest for how each derived data file was generated. And it should be
able to be executed w/o arguments to have the entire analysis run from the
raw/primary data.

This has saved me many times in the past.
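
A minimal sketch of the "manifest plus master script" idea in Python rather than Make. The file names (`data/raw.txt`, `data/clean.txt`, `results/summary.txt`) and the two step functions are hypothetical placeholders, not anything from the paper; the point is only that every derived file is listed next to the step that produces it, and one call rebuilds everything from the raw data.

```python
from pathlib import Path


def clean(raw: Path, out: Path) -> None:
    """Hypothetical step: drop blank lines from the raw file."""
    lines = [ln for ln in raw.read_text().splitlines() if ln.strip()]
    out.write_text("\n".join(lines) + "\n")


def summarize(cleaned: Path, out: Path) -> None:
    """Hypothetical step: count the rows that survived cleaning."""
    n = len(cleaned.read_text().splitlines())
    out.write_text(f"rows={n}\n")


# The manifest: one (derived file, source file, step) entry per output,
# listed in the order the files must be built. All paths are assumptions.
MANIFEST = [
    ("data/clean.txt", "data/raw.txt", clean),
    ("results/summary.txt", "data/clean.txt", summarize),
]


def run_all(root: Path = Path(".")) -> list[str]:
    """Regenerate every derived file from the raw data; takes no arguments."""
    built = []
    for target, source, step in MANIFEST:
        out = root / target
        out.parent.mkdir(parents=True, exist_ok=True)
        step(root / source, out)
        built.append(target)
    return built
```

Putting `run_all()` as the last line of the script gives the "no arguments, whole analysis" behaviour described above. Unlike Make, this sketch always reruns every step.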

~~~
pps43
Unless you use R markdown, in which case there is no need for a separate
script or makefile.

~~~
mbreese
I still use a Makefile for R markdown, but that's largely because I don't use
only R for any analysis. There's normally some amount of calculation or
pre-processing that happens before I flip into R. Makefiles are particularly
good at describing the relationships between datafiles when you mix different
tools.
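
To illustrate what "describing the relationships between datafiles" buys you, here is a rough Python sketch of the rule a Makefile applies: a derived file is rebuilt when it is missing, older than one of its inputs, or downstream of something that is being rebuilt. The file names and commands in `RULES` (`reads.fastq`, `count_reads.py`, `report.Rmd`) are made up for the example; only `graphlib.TopologicalSorter` and `rmarkdown::render` are real APIs.

```python
from graphlib import TopologicalSorter
from pathlib import Path

# Hypothetical rules: derived file -> (files it depends on, command that makes it).
# One step is Python, the next is R, as in a mixed-tool analysis.
RULES = {
    "counts.tsv": (["reads.fastq"], "python count_reads.py reads.fastq counts.tsv"),
    "report.html": (["counts.tsv", "report.Rmd"],
                    "Rscript -e 'rmarkdown::render(\"report.Rmd\")'"),
}


def build_order(rules):
    """Return derived files so that each comes after the files it depends on."""
    graph = {target: set(deps) for target, (deps, _cmd) in rules.items()}
    # static_order() yields dependencies first; raw inputs are filtered out.
    return [name for name in TopologicalSorter(graph).static_order() if name in rules]


def stale_targets(rules, root):
    """Return, in build order, the derived files that need to be regenerated."""
    stale = []
    for target in build_order(rules):
        deps, _cmd = rules[target]
        out = root / target
        newest_dep = max(
            ((root / d).stat().st_mtime for d in deps if (root / d).exists()),
            default=0.0,
        )
        if (not out.exists()
                or any(d in stale for d in deps)
                or out.stat().st_mtime < newest_dep):
            stale.append(target)
    return stale
```

A driver would then loop over `stale_targets(RULES, Path("."))` and hand each `RULES[target][1]` to `subprocess.run`. Make does all of this for you, which is the argument for just writing the Makefile.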

------
suchsciencewow
Having worked with a large open-source scientific project, this is extremely
reminiscent of the type of talk we would give to every new student or post-doc
joining the project. Unfortunately, the outcome is entirely binary - either
they understand both what to do and why it's important, and also happen to do
it already, or they don't understand it and they don't already do it. I can't
really think of a single person who jumped from one category to the other over
the years.

The reason for that is there was rarely any visible benefit from putting in
the work to follow good computing practices. It certainly created a lot of
pain when jumbles of broken and unreadable code were passed on to new people,
who basically ended up reimplementing everything from scratch (I was one such
person), but there is exactly zero punishment for doing that, and very rarely
any reward for cleaning up code and data. So why bother?

As with the open-source publishing debate, there has to be an incentive system
in place, and then people will do it. There are standardized (and required!)
practices for things like bio protocols or reporting PV-performance data -
only if computing had something similar do I see anything improving.

------
dibanez
Seems reasonable. An interesting bit for me was the CITATION file. The lack of
citability for software is a recurring annoyance when writing up work that
uses that software.

~~~
cossatot
I've started putting DOIs on my relevant github repos:
[https://guides.github.com/activities/citable-code/](https://guides.github.com/activities/citable-code/)

There are clearly still issues with versioning and so forth, but it's a start
for software that isn't suitable for writing an article about.

------
RA_Fisher
I'd like to see ongoing discussion of the extent to which HN-posted research
conforms to being "Good Enough."

~~~
TeMPOraL
Here's a task for you: whenever you see a piece of research posted to HN, run
it through the checklist from this article and post the result as a comment.
I'll happily upvote. It will surely keep the discussion going on HN
constantly :).

~~~
grzm
Why stop there? Let's build an ML system that does this for us! How long before
the Show HN submission?

~~~
TeMPOraL
Sure! But until that Show HN, 'RA_Fisher seems like a perfect protein bot for
the task ;).

------
pjbull
We've been working on a data science template that bakes in a number of
these practices. It currently leans Python, but could definitely be useful for
scientific computing generally:
[https://github.com/drivendata/cookiecutter-data-science](https://github.com/drivendata/cookiecutter-data-science)

------
andrepd
I disagree with the part that recommends BSD instead of GPL. It's a myth that
GPL doesn't allow software to be used for commercial ends. It only forces
whoever modifies the code to contribute their changes back, letting others
profit from that work just as they profited from the original.

~~~
biggerfisch
As someone who's worked with OSS while employed at a very large company, using
BSD licensed software is infinitely easier than GPL. GPLv3 was likely to be
flat out banned and even GPLv2 required many approvals and a long process from
legal - all this to even use it internally where it would never directly
impact a customer. GPL is simply too dangerous for many companies to touch,
and those that do have a disturbingly high rate of failure to comply with all
terms.

If the goal of making your software open source is to allow anyone to use it,
GPL makes that _harder_.

------
amelius
What fields of scientific computing does this cover? E.g., biology and physics
would probably have some overlap, but also some differences.

~~~
privong
It's fairly general. The document is discussing management of scientific
software, which I would think is fairly robust across disciplines. They
separate the discussion into the following topics:

* data management

* software

* collaboration

* project organization

* tracking changes

* [writing] manuscripts

