Good Enough Practices in Scientific Computing (arxiv.org)
143 points by privong on Oct 30, 2016 | 20 comments

One thing they (intentionally) left out was including a Makefile. This is one of the "Good Enough" practices that I also include. It doesn't matter if it is a Makefile or another master script or tool. What is important is that you have a manifest for how each derived data file was generated. And it should be able to be executed w/o arguments to have the entire analysis run from the raw/primary data.

This has saved me many times in the past.
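A minimal sketch of such a master script, for anyone who would rather not use Make. Everything here is an assumption for illustration: the file paths, the `clean.py` and `summarize.R` scripts, and the `MANIFEST`, `is_stale` and `run_all` names are hypothetical, and the staleness check only approximates what Make does with timestamps.

```python
"""Hypothetical run_all.py: rebuild every derived file from the raw data."""
import subprocess
import sys
from pathlib import Path

# Manifest: each derived file, the inputs it depends on, and the command
# that builds it. All paths and commands are placeholders, listed in
# dependency order.
MANIFEST = [
    {
        "target": "derived/clean.csv",
        "inputs": ["raw/measurements.csv", "scripts/clean.py"],
        "command": [sys.executable, "scripts/clean.py",
                    "raw/measurements.csv", "derived/clean.csv"],
    },
    {
        "target": "derived/summary.csv",
        "inputs": ["derived/clean.csv", "scripts/summarize.R"],
        "command": ["Rscript", "scripts/summarize.R",
                    "derived/clean.csv", "derived/summary.csv"],
    },
]


def is_stale(target, inputs):
    """True if the target is missing or older than any of its inputs."""
    target = Path(target)
    if not target.exists():
        return True
    built = target.stat().st_mtime
    return any(Path(p).stat().st_mtime > built for p in inputs)


def run_all(manifest=MANIFEST, runner=subprocess.run):
    """Run each stale step in order; return the list of targets rebuilt."""
    rebuilt = []
    for step in manifest:
        if is_stale(step["target"], step["inputs"]):
            Path(step["target"]).parent.mkdir(parents=True, exist_ok=True)
            runner(step["command"], check=True)
            rebuilt.append(step["target"])
    return rebuilt
```

Calling `run_all()` with no arguments walks the whole manifest from the raw data, and the manifest itself doubles as the record of how each derived file was produced.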

They do mention this, though it's somewhat buried in #4 on page 8 and in their paragraph saying they left out "Build Tools". They do suggest having a "controller script" (or set of shell scripts) to run the complete analysis.

Unless you use R markdown, in which case there is no need for a separate script or makefile.

I still use a Makefile for R markdown, but that's largely because I don't use only R for any analysis. There's normally some amount of calculation or pre-processing that happens before I flip into R. Makefiles are particularly good at describing the relationships between datafiles when you mix different tools.

Having worked with a large open-source scientific project, this is extremely reminiscent of the type of talk we would give to every new student or post-doc joining the project. Unfortunately, the outcome is entirely binary: either they understand both what to do and why it's important, and also happen to do it already, or they don't understand it and they don't already do it. I can't really think of a single person who jumped from one category to the other over the years.

The reason for that is there was rarely any visible benefit from putting in the work to follow good computing practices. It certainly created a lot of pain when jumbles of broken and unreadable code were passed on to new people, who basically ended up reimplementing everything from scratch (I was one such person), but there is exactly zero punishment for doing that, and very rarely any reward for cleaning up code and data. So why bother?

As with the open-source publishing debate, there has to be an incentive system in place, and then people will do it. There are standardized (and required!) practices for things like bio protocols or reporting PV-performance data - only if computing had something similar do I see anything improving.

Seems reasonable. An interesting bit for me was the CITATION file. The lack of citability for software is a recurring annoyance when writing up work that uses that software.

I've started putting DOIs on my relevant github repos: https://guides.github.com/activities/citable-code/

There are clearly still issues with versioning and so forth, but it's a start for software that isn't suitable for writing an article about.

Have you ever taken a look at DueCredit (https://github.com/duecredit/duecredit)? It aims to address this problem by allowing you to add citations directly to your Python code using decorators. I believe there are also plans to expand to other languages in the future.
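To make the decorator idea concrete, here is a toy sketch of the general pattern. It is not DueCredit's actual API: the `cite` decorator, the `used_citations` helper, the `smooth` function and the placeholder DOI are all made up for illustration, so check the project's own documentation for the real interface.

```python
"""Toy illustration of citation-by-decorator; NOT DueCredit's real API."""
import functools

# Registry of citations for code that was actually executed.
_USED_CITATIONS = {}


def cite(doi, description):
    """Attach a citation to a function and record it when the function runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            _USED_CITATIONS[doi] = description
            return func(*args, **kwargs)
        # Keep the citation visible on the function even if it is never called.
        wrapper.citation = {"doi": doi, "description": description}
        return wrapper
    return decorator


# The DOI below is a placeholder, not a real reference.
@cite("10.0000/placeholder", "Placeholder reference for the smoothing method")
def smooth(values, window=3):
    """Simple moving average, used only to have something to cite."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def used_citations():
    """Return the citations collected so far, sorted by DOI."""
    return sorted(_USED_CITATIONS.items())
```

The point of recording at call time is that the final list only includes references for code paths the analysis actually exercised, rather than everything the package could cite.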

I'd like to see ongoing discussion of the extent to which HN-posted research conforms to being "Good Enough."

Here's a task for you: whenever you see a piece of research posted to HN, run it through a checklist from this article and post it as a comment. I'll happily upvote. It will surely keep the discussion going on HN constantly :).

Why stop there? Let's build a ML system that does this for us! How long before the Show HN submission?

Sure! But until that Show HN, 'RA_Fisher seems like a perfect protein bot for the task ;).

Not a bad idea. :)

It's important to differentiate the general concept of "Good Enough" as in "that's good enough" from the software engineering approach that, while derived from the same phrase, is surprisingly more formalized around balancing risk and effort against achieving perfection.

Agreed. I think scientists and software engineers can really teach each other quite a bit. So I'd love to see a day when the phrase is more often used in science to mean making defensible trade-offs rather than achieving basic competency.

We've been working on a data science template that bakes in a number of these practices. Currently leans Python, but could definitely be useful for scientific computing generally: https://github.com/drivendata/cookiecutter-data-science

I disagree with the part that recommends BSD instead of GPL. It's a myth that GPL doesn't allow software to be used for commercial ends. It only forces the person to contribute back the changes, letting the others profit from their work as they profited from theirs.

As someone who's worked with OSS while employed at a very large company, using BSD licensed software is infinitely easier than GPL. GPLv3 was likely to be flat out banned and even GPLv2 required many approvals and a long process from legal - all this to even use it internally where it would never directly impact a customer. GPL is simply too dangerous for many companies to touch, and those that do have a disturbingly high rate of failure to comply with all terms.

If the goal of making your software open source is to allow anyone to use it, GPL makes that _harder_.

What fields of scientific computing does this cover? E.g., biology and physics would probably have some overlap, but also some differences.

It's fairly general. The document is discussing management of scientific software, which I would think is fairly robust across disciplines. They separate the discussion into the following topics:

* data management

* software

* collaboration

* project organization

* tracking changes

* [writing] manuscripts
