
Ask HN: What are the best coding practices in research projects? - mlajszczak
Research environments differ significantly from regular software engineering:

- researchers conduct experiments that may fail

- they often produce PoCs instead of final products

- it's tempting to produce tonnes of low-quality code (since it's just an experiment/PoC!)

- often researchers are not software engineers, so they don't really care about code quality / tests

How do you find a good trade-off between high coding standards and not getting in the way of research? Is it possible to move smoothly from a PoC to a production solution without rewriting everything from scratch? How do you share code between experiments / PoCs?
======
electricslpnsld
Used to be a research scientist at a ~3k person Bay Area company. Early
versions of a project typically existed outside of the mainline code branch at
the company (we usually just tossed it into the company's private gitlab). We
tried to keep things reasonably organized and conforming to the company's
standards, but as paper deadlines approached the code usually devolved into a
mess of technical debt (just like grad school!). After we knew we had some new
algorithm or technique that worked, we typically went back and stripped away
the cruft, cleaned up the code so it conformed with the company's coding
standards, added required tests, pushed to the company's main branch, and
began the really hard work of driving adoption: convincing people in the
company that what we'd built was useful for their work.

Addressing the specific points:

> \- researchers conduct experiments that may fail

Very true! Often many, many, many experiments...

> \- they often produce PoC instead of final products

This is going to vary a lot depending on where you are working. We were
responsible for developing, deploying, and for some time supporting whatever
we developed.

> \- it's tempting to produce tonnes of low-quality code (since it's just an
> experiment/PoC!)

Yep, especially in early stages.

> \- often researchers are not software engineers so they don't really care
> about code quality / tests

This really depends on what stage in the lifecycle of a research project we
were in. We were responsible for deploying the final code, so at the end of
the day it had to be of the same quality as something someone with the title
of software engineer would generate.

------
cbanek
I think the first two are probably the right kinds of problems. Even in normal
software engineering, you want to try out PoCs to prove that they are what
your customer wants, and that things generally work the way you think they
should.

Code quality is another problem entirely. I agree code quality can get out of
control as soon as the PoC is promoted to something resembling "production."

My suggestions are:

\- First, if you get frustrated with researchers over code quality, calmly
let them know why. If they are being inefficient, many would love tips on how
to avoid it. Let them know when the things they are doing might affect large
groups of people.

\- Don't try to write tests for everything. This just slows you down, getting
away from the good things above. Write tests for things that are frequently
broken, and absolutely required to work, such as core functionality. If
something gets broken 2 or 3 times, you should definitely have a test.

\- Make your tests as high level as possible. Compute power is cheap, and
despite what you might hear from the TDD/unit testing crowd, your tests don't
need to run in 2 seconds to be useful. I like to have tests that emulate
users, because as you change the logic of how you're doing things, you still
have tests to back you up.
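
For example, a minimal pytest sketch of a user-level test; `run_experiment`
and the fixture path are hypothetical stand-ins for your own entry point:

    # test_smoke.py -- drive the pipeline the way a user would.
    # `mypipeline.run_experiment` is a hypothetical entry point.
    from mypipeline import run_experiment

    def test_end_to_end_smoke(tmp_path):
        # Core functionality that must always work: run the full
        # pipeline on a tiny fixture and sanity-check the output.
        metrics = run_experiment(input_path="tests/fixtures/small.csv",
                                 output_dir=tmp_path)
        assert (tmp_path / "results.json").exists()
        assert 0.0 <= metrics["accuracy"] <= 1.0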

\- Add lots of additional logging. This helps document the code (since the
messages should be useful and say what is going on), and provides great info
for debugging issues after they've already occurred. I've been saved by good
logging more times than I can remember, especially on different
OS/environments that aren't the test environment.
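
A minimal sketch of what that can look like in Python (the file name and
messages are illustrative):

    import logging
    import platform
    import sys

    logging.basicConfig(
        filename="experiment.log",
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    log = logging.getLogger("experiment")

    # Record the environment up front -- invaluable when a bug only
    # shows up on an OS that isn't the test environment.
    log.info("python=%s platform=%s", sys.version.split()[0], platform.platform())
    # Messages that say what is going on double as documentation.
    log.info("loading config from %s", "config.yaml")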

\- Don't worry too much about edge cases. Just print a log line or crash out
if it's something ridiculous you've gotten yourself into, which is a lot more
friendly than figuring out some horrendous bug mired in retry logic that has
masked the original issue.
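
In code, that can be as simple as checking an invariant and crashing loudly
(a sketch; the specific check is hypothetical):

    import numpy as np

    def load_matrix(path):
        data = np.load(path)
        # Something ridiculous we never expect: fail loudly here rather
        # than let retry logic downstream mask the original issue.
        if data.ndim != 2:
            raise ValueError(f"{path}: expected a 2-D matrix, got shape {data.shape}")
        return data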

\- Insist on version control, but not code reviews. Code reviews can really
slow you down. Instead, fix problems after they come up. You haven't shipped,
right?

\- Run the build and tests in a simple CI loop that runs overnight. Don't
worry about testing each commit, just know if it works or doesn't work. Fix
the problems.
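
A sketch of that simple loop as a cron-able Python script (paths and schedule
are assumptions; a hosted CI service's scheduled build does the same job):

    #!/usr/bin/env python3
    # nightly_check.py -- run the test suite once and record pass/fail.
    # Schedule it with cron, e.g.: 0 2 * * * python nightly_check.py
    import datetime
    import pathlib
    import subprocess

    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    status = "PASS" if result.returncode == 0 else "FAIL"
    with pathlib.Path("nightly_results.log").open("a") as f:
        f.write(f"{stamp} {status}\n{result.stdout[-2000:]}\n")
    print(stamp, status)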

These last two are related:

\- Feel free to just start over. Delete huge amounts of code, and try a
different approach.

\- If you have gone past the point of no return (you don't want to start
over), then start production-izing the code. Again, aim at the problems to
start, not some coverage metric. Look over all the code and reduce redundancy.
It's a lot easier to review code once it's all there, rather than bit by bit.

------
irvingprime
I used to be a software lead in a research organization. I have a LOT of
opinions on the subject but I'll give you just a few points.

\- I've seen many cases where researchers refused to share their code because
they knew it wasn't up to any reasonable standard. This is a red flag. If they
are embarrassed by their code, I tend to discount their alleged results
entirely.

\- Even in research, people should be required by the organization to follow
some kind of process: use version control (like git or even svn; this basic
step is still not universal), put in pull requests, and get code reviewed by
someone else.

\- For that purpose, every research organization should have someone on staff
who can do a competent review. They do not need to specialize in the
researcher's field. They just need to know a code smell when they see it.

\- Every researcher I have known will resist this strenuously. That is a sign
of how much they need it.

\- When publishing research results, code and data should always be required.
Otherwise, the results cannot be judged. (A lot of people like it that way.
They should not be accommodated).

I could go on but I'll be nice and stop here.

~~~
AnimalMuppet
To expand on your first and last points:

Your researcher got a result. Great. What is their objective evidence that the
result is real rather than an artifact of a bug in their code? If the code is
garbage, _you can't trust the result_, no matter how much of a breakthrough
the result would be if true.

That doesn't mean that the code needs to be production-ready. It does mean
that the code needs to be clean enough to be trustworthy. (Tests can be
included in this evaluation.)

If the code's going to be product-ized... maybe ask the researcher which parts
of the code they think are the most troublesome. Start by re-writing those
pieces, from scratch, with production levels of rigor. Then, as other parts
prove troublesome, rewrite those too. Don't band-aid them, rewrite them. Keep
the interfaces, unless the interface itself is part of the problem.

------
indescions_2018
JupyterHub allows you to set up research clusters on GCloud, AWS, and Azure.
You can set CPU / GPU resource utilization limits, disk usage, memory, and
network, and even limit scaling to your budget. Once your experiment is up and
running, it's simply another service running in a container. I've used it for
a small distributed team, but it can be scaled to corporate R&D teams with
thousands of users.

The core environment is still the Jupyter Notebook, so it should remain
familiar to most data scientists.

Zero to JupyterHub with Kubernetes

[https://zero-to-jupyterhub.readthedocs.io/en/latest/](https://zero-to-jupyterhub.readthedocs.io/en/latest/)
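
For the resource limits, a minimal `jupyterhub_config.py` sketch; note that
`cpu_limit`/`mem_limit` are standard Spawner options but are only enforced by
container-based spawners, such as the KubeSpawner used in the setup linked
above (the specific values are just examples):

    # jupyterhub_config.py -- a sketch of per-user resource caps.
    # cpu_limit / mem_limit are standard Spawner traits, but only
    # container-based spawners (e.g. KubeSpawner, DockerSpawner)
    # actually enforce them.
    c.Spawner.cpu_limit = 2.0              # at most 2 CPUs per user server
    c.Spawner.mem_limit = "4G"             # at most 4 GiB RAM per user server
    c.JupyterHub.active_server_limit = 20  # cap concurrent servers for budget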

------
hprotagonist
It's a free-for-all. (welcome to my world :( )

Something small but meaningful that I believe in is using tools like
versioneer ([https://github.com/warner/python-versioneer/](https://github.com/warner/python-versioneer/)),
which bump the version of your code on _every_ commit.

Then, embed this version string in all output. Figures, serialized data,
whatever.

It is very powerful to be able to point at a figure and say "this graph was
produced by precisely this code". If you're feeling particularly anal, include
the hashes of the datasets that generated it too.
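
A sketch of what that stamping can look like, assuming a package whose
`__version__` is managed by versioneer (`mypkg`, the data file, and the plot
itself are placeholders):

    import hashlib
    import matplotlib.pyplot as plt
    from mypkg import __version__  # bumped on every commit by versioneer

    def data_hash(path):
        # Short content hash of the input data, for the truly thorough.
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()[:12]

    fig, ax = plt.subplots()
    ax.plot([0, 1, 4, 9])  # whatever you are actually plotting
    fig.text(0.99, 0.01,
             f"mypkg {__version__} | data {data_hash('input.csv')}",
             ha="right", va="bottom", fontsize=6, color="gray")
    fig.savefig("figure1.png")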

------
throwawayjava
There is no one answer. The answer will change drastically depending on the
sort of research you're doing, and your role in that research.

Are you a mathematician simulating a dynamical system? A theoretical computer
scientist exploring the effects of parameters that are difficult to nail down
analytically?

Are you a computer scientist working on a new sort of system? Is the point of
that system to support a long-running research agenda, or to demonstrate the
feasibility of a general notion/idea?

Or are you a software engineer supporting a natural scientist (e.g., in a
large bio/neuro/chem/physics lab)?

Are you the PhD student, the research scientist, the supporting engineer, or
the PI?

But in any case, the correct answer will start with interrogating the
purpose/role of the software in your research project. And that answer could
range from "hack out the MATLAB and sanity check" all the way to "lives are on
the line; practice extreme rigor". And certainly not excluding "convince your
funding agency/PI that it's time to hire a professional"!

------
mipmap04
I've been lead on a few research projects. I found it useful to require that
all work be tied to work items in our project management tool. The work item
would need to describe the purpose and hypothesis of the work and capture
summary results. We then created a file structure that matched our work
item IDs for all work product created while working on that task. Towards the
end of an engagement, we would go through, find what was relevant, and include
those artifacts in our report with steps for reproducibility, assumptions,
and other clarifying points that might be salient.

We also made good use of tagging features in our project management toolset to
make report writing easier at the end of the project.

------
debacle
In my experience with a few scientists, the pattern seems to be:

\- Scientists try their best to be good programmers, but are scientists first.

\- Someone the scientist knows, or someone on the team with more programming
knowledge, turns what the scientist produced into something maintainable at
some point.

\- If they're lucky, the grant will have resources for a script/software
maintainer.

Scientists are scientists. I know a few who can do things with awk that
probably should never be done, but they use the tools they know to get the
data to look the way they need.

------
fundamental
In an ideal world, you start off with loose, highly manual code/processes
when validating an idea, and as you get closer to a publication, components
are restructured/rewritten to make it easy to replicate results.

Other researchers care about the final code which is used to generate the
results. So, in my book it's OK if there's a large gap between the code that
led to the initial idea and the code that was used to show the idea in
practice (i.e. the code used to generate all graphs/tables within a submitted
publication).

------
quickthrower2
> How to find a good trade-off between high coding standards and not getting
> in the way of research?

I think you can have both. Unit tests and good coding practices should make
you faster once you have more than a screen's worth of code and are relying
on human memory to navigate and maintain it.

I'm not a "unit test all the things" kind of person though.

------
tripn
Researchers should NOT have to care about anything related to "best coding
practices"; that is the engineer's job. If your organization is cheap, it
will try to put both the "researcher" and the "software engineer" hat on the
same person. Unless he/she actually wants to be both, that organization is
asking for cheap research results. Very simple, always true.

------
jononor
Basics

\- use version control

\- write tests (high level, keep it simple)

\- pull in data as if it were a dependency, versioned and stable (see the sketch after this list)

\- use a CI server/service

\- when publishing, code and data goes with the paper
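
For the data point, a sketch of pulling data in as a pinned, checksummed
dependency (the URL and checksum are placeholders):

    import hashlib
    import pathlib
    import urllib.request

    DATA_URL = "https://example.org/datasets/corpus-v1.2.tar.gz"  # placeholder
    DATA_SHA256 = "<sha256 of the pinned release>"  # fill in the real digest

    def fetch_data(dest="data/corpus-v1.2.tar.gz"):
        path = pathlib.Path(dest)
        if not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            urllib.request.urlretrieve(DATA_URL, path)
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest != DATA_SHA256:
            # Fail loudly: the experiment must not run on unknown data.
            raise RuntimeError(f"{path}: checksum {digest!r} != pinned {DATA_SHA256!r}")
        return path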

