
How to implement an algorithm from a scientific paper - jnazario
http://codecapsule.com/2012/01/18/how-to-implement-a-paper/
======
bendmorris
In my field (ecology and evolutionary biology) there's a small but concerted
push to get people to publish "executable papers" where all code and data are
available via GitHub, in an IPython notebook, etc., so that all figures and
test cases can easily be reproduced by reviewers and researchers. If you're
publishing research, it shouldn't be the reader's job to parse your paper and
reimplement the research you describe; it should be on you to make your
results replicable, and I think this is a standard we need to insist upon. I
can't count how many papers I've read lately that were missing either crucial
methods or underlying data so that replication was impossible.

edit: Here's an example:

<https://github.com/weecology/white-etal-2012-ecology>

    
    
        Run portions of the analysis pipeline:
        Empirical analyses: python mete_sads.py ./data/ empir
        Simulation analyses: python mete_sads.py ./data/ sims
        Figures: python mete_sads.py ./data/ figs

~~~
jmilloy
I currently work in method development for proteomics, and I think about this
all the time. Of _course_ methods must be reproducible, or the paper is
largely useless. But lately I've come to disagree that source code should be
required.

If the research _relies_ on a piece of "in-house" code, which isn't described,
then that's a problem. But the OP is about research where the _algorithm_ is
the advance in the field. In a way, the algorithm is the result and not the
method (even though the algorithm will probably be applied to a sample problem
and the results thereof validated).

Now, if I develop a new method at the bench, you are expected to _do_ the
method at your own bench if you want to apply it to your research. This often
starts as a pilot experiment just trying out the method, which can take weeks
to months, before you integrate it fully into your methods. I don't have to
provide a kit which consistently implements the method along with the paper.
Certainly once the method gains some traction, an outside company may begin
selling kits. Until then, it absolutely _should_ be "the reader's job to ...
reimplement the research" - that's an essential part of the process.

Similarly, an algorithmic method can be reproducible even if code isn't
provided. Just like with bleeding edge wet-lab methods, if you want to use
bleeding edge algorithms you will need to be able to code. You will need to be
able to read a description of the algorithm and implement it accurately. The
publicly funded result is the algorithm as an idea, and provided that the
review process works, you're getting that idea. Later on, if the algorithm
gains any traction, someone will implement it in a usable, robust package and
more labs will be able to use it.

Finally, it takes significant time and experience to produce (and maintain)
code that others can use and expect to work most of the time. That's a waste
of time/money for a research lab, and an inefficient use of public funds. If
you can't code yourself, leave code production to those who do it well and
don't whine that you can't just plug-in whatever hot new method just came out
without any effort or proper understanding.

~~~
FrojoS
> Finally, it takes significant time and experience to produce (and maintain)
> code that others can use and expect to work most of the time. That's a waste
> of time/money for a research lab, and an inefficient use of public funds.

I haven't seen people asking for maintained, (re)usable code. We just want the
crappy code that was used to produce the results. There is even an appropriate
license, the Community Research and Academic Programming License (CRAPL).

[1] <http://matt.might.net/articles/crapl/>

~~~
bostonpete
I would think most people would be unwilling to make their "crappy" code
public because no matter how many disclaimers they provide with it, they will
be judged by others on it.

~~~
beagle3
Why on earth would anyone trust descriptions that they cannot verify?

Trusting without the ability to verify goes against everything scientific.

If you think your code is too "crappy" for publication, why do you believe it
is bug free enough to produce dependable answers?

~~~
bostonpete
> Why on earth would anyone trust descriptions that they cannot verify?
>
> Trusting without the ability to verify goes against everything scientific.

Hasn't this always been true about scientific papers? Descriptions can be
verified by reproducing the experiment. Why is a paper any less trustworthy
just because there's code involved?

~~~
jerf
The need for reproducibility in experiments is an accident of the fact that
our universe is horrifically complicated and true reproducibility is a myth,
thus we must make a deliberate, conscious effort to come as close as possible,
or no progress can be made. When that is no longer true and it becomes
possible to run (under certain constrained circumstances) fully deterministic
experiments that can be freely replicated to the bit by anybody, it's time to
rethink the assumptions made lo these many centuries ago.

People arguing against source code release often argue as if those of us in
favor think that re-running the original simulation is the end-all, be-all of
reproducibility. Clearly that is not the case. No one simulation can truly
prove anything, and independent reverification will always have a place. But
since we do have the source artifacts and original data, why _not_ release
them and show exactly what was done and how it was done? Again, the idea that
experiments should not do so is merely an artifact of the fact that scientific
papers could only be 10 very expensive pages or so in a journal; why carry
unexamined assumptions based on that now outdated fact forward into the
future?

Accidents of the past are nothing more than accidents of the past, not holy
writ. And I'm not aware of a good argument against release of source code that
doesn't boil down to _well, that's just not how we do it_ when deeply
examined.

------
wheaties
When my other half was going through her PhD, she was attempting to implement
a signal processing approach to construct grids for FEM. Eventually adding
some constraints led to an acceptable result. As she presented her findings at
a conference the author of the paper eagerly questioned her about her
approach. Why? He had never gotten it to work himself, and had failed to
mention that alongside the glowing praise heaped upon the technique.

~~~
jmilloy
Good thing he published then! It may have been a long time before someone
thought of the approach _and_ ironed out all of the implementation details.
This sounds like exactly how science progresses... incrementally building on
the work of others.

~~~
dmpk2k
Alternatively, such a paper could lead many a reader to implement a dead end.
Each assumes they did it wrong and, after much wrangling, silently moves on.
Later, the next intellectual victim comes along.

------
bjoernbu
While peer review is flawed in many ways (not all papers accepted at top
conferences are good, and not all good papers get accepted), it still carries
some meaning.

When deciding which paper to read, where it was published can be a great
hint. The link merely claims groundbreaking work is published in the best
"journals". Especially in computer science, conference papers are where
recent, groundbreaking work appears, and good conferences are the ones that
are hard to get into. However, I agree that groundbreaking work often gets a
longer follow-up journal article. But those usually appear years later, and
for those algorithms it is likely that implementations already exist by that
time.

------
jamieb
With regard to section 3.5 "Know the definition of all terms", I found that in
a given field these terms change over time. I wasn't really aware of it, and
then I went back to a paper after reading about Domain Maps in Eric Evans's
Domain-Driven Design. Lightbulb! I quickly made a list of the terms used in
each paper, and then drew lines between the identical or related terms.

I have for some time been attempting to learn everything I can about automated
theorem proving. A key paper is Robinson's 1965 paper "A Machine-Oriented
Logic Based on the Resolution Principle". It uses quite different notations
compared to, say, Kowalski's "Logic for Problem Solving". Robinson made an
important contribution, but his work is like the assembly language of logic.
Modern papers are much higher level. Kowalski's book is from 1979, so it's like C.
My new domain map made these works much more comprehensible, especially as I
switched between them.

The other good point is patents. That's why I'm reading papers from 1965 and
books from 1979.

~~~
polskibus
What would happen if you just hosted your potentially patent-infringing API
from Europe? Apart from 100ms of lag?

~~~
jamieb
That's not a conversation I want to have when talking to investors.

------
mistercow
>If you are in the U.S., beware of software patents. Some papers are patented
and you could get into trouble for using them in commercial applications.

Is it only commercial applications that you have to worry about? I was under
the impression that even a free implementation would be infringing the patent
and make you liable for damages.

~~~
brainid
Free implementations are liable. See Madey v. Duke University. It severely
restricted the research exemption
(<http://en.wikipedia.org/wiki/Research_exemption>)

I have had university counsel tell me I cannot open-source code I have written
implementing patented algorithms. It is unfortunately more common than most
people think, and often not acknowledged in publications.

------
Gravityloss
It's interesting to note how Matlab is still so fast to prototype stuff in.
I've done it myself countless times.

Data generation, input, manipulation, output and result checking are all very
good.

Maybe things like Go can change that, or perhaps some optionally typed
language. There is no fundamental reason we can't get massive improvements.

~~~
carterschonwald
Indeed. There's plenty of room for improvement in the tools for implementing
such algorithms. I'm actually spending a bit of time on the algorithm
engineering and data provenance bits.

------
dmlorenzetti
It's kind of funny that the article lists "authors citing their own work" as
an identifying feature of groundbreaking research.

I don't know about CS, but in most scientific fields, this may be a bad sign.
It can mean they're just trying to pump up their own reference counts, or it
can mean they don't really know what other people are doing.

The only way to be sure that neither of these is the case is to know, a
priori, that they're truly doing groundbreaking work.

~~~
arethuza
I think the warning sign should really be if an author _only_ cites their own
work. From what I recall, having a small number of references to your own
previous work that you are building on is pretty standard and desirable, as
most research is incremental.

~~~
sophacles
This is true. A lot of times the process goes:

1\. Here is a paper about some interesting things we've been looking at, here
is what we know, here are some ideas we are building on. These tend to be
presented at conferences, as a "what do you guys think?" sort of introduction
for the larger community. It is nice, because other researchers can then tell
you if this is actually new, or just a repeat of an idea (that was hard to
find because it used different terms and went nowhere), or something worth
looking into, or the old "hey, here's some pitfalls I see from my expertise".

2\. A couple more conference papers with results building on the ideas in the
seminal paper from 1. These keep others in the field aware of your work, get
you feedback, and play the game right - you get much less credit
if you don't have a history of showing you've been working on this for a while
when someone comes along and "scoops" you.

3\. Actually interesting/important work. This is the type of paper that the OP
calls "groundbreaking". After all that work (see documentation over the
years), here is something pretty awesome!

Another facet of this process:

When doing research, you have no idea what you are doing. I mean, you have
expertise and goals and hypothesis/theory, but you don't know how it will pan
out. You don't know whether you'll suddenly find a spot where a left turn is required.
So publishing about these new things is a good idea. Other people in the field
can benefit from just that. Further, in my experience, each small result tends
to spawn more questions/investigatory tracks, etc than it closes. So a lot of
papers with honest "future work" sections are great places for grad students
and others new to the field to dive in and get their feet wet. They can follow
some of the tracks the original researchers just had no time for, and help
fill in gaps.

Finally, the self reference is a good sign, because it establishes you aren't
just some person coming out of left field with $BIG_IDEA (which looks a bit
crack-pot-esque...)

------
binarymax
I got into a situation a couple years ago, where I wanted to understand how
relational databases worked. I mean how they really worked. Down to the
algorithms and processes that implemented query parsing and ACID - purely so I
could try and toy around with implementing some OLAP stuff on my own, without
all the 'overhead' of the rest of the database (at the time I didn't really
care about transactions or query optimization). I ended up getting into a pile
of papers so deep I thought I would never escape, and eventually gave up. I
even ordered a thick book that was a collection of papers, tying it all
together.

Some of the papers on the subject - the groundbreaking ones by folks like
Goetz Graefe - were brilliant and very interesting reads, but at the same time
were so involved I felt like I would need to dedicate years before even
scratching the surface.

Walking away, I did learn to see the difference between good papers and bad,
and learned a heck of a lot on DB internals (good through the end of the 90's
at least). But I think I'll stick with books from now on :)

------
nullc
The site is down now, so I haven't seen the article— but I hope it just says
"(1) give up and beg the author for their implementation, which undoubtedly
contains 2 megabytes of opaque unmentioned magic constants."

------
denzil_correa
Harsh truth: sometimes the algorithm is deliberately obfuscated to give the
authors a competitive edge.

PS - Speaking from a CS perspective.

------
ChristianMarks
I recently refereed a paper in computational geosciences and recommended
substantial revision because the authors did not make their source code
available for evaluation. The paper was ultimately rejected.

------
jwr
There is one increasingly common kind of paper out there, the 'obfuscapaper'.
It's a paper which pretends to outline all the steps, but really doesn't --
key information is either missing or obfuscated. You often see that with
papers published by people working at companies, or by people who are about to
leave academia to work at a company.

Problem is, this kind of obfuscation is really difficult to spot unless you
actually understand the paper and see all the steps necessary to implement the
algorithm.

------
Cyranix
I bumped into #1.4 (patented research) not too long ago -- I wanted to try out
Bi-Normal Separation for feature space reduction in a machine learning
classifier, but BNS comes out of HP Labs and would need to be licensed. I
think the wording on this point needs to be changed; why would it only apply
to those in the US?

~~~
Gmo
Because, officially at least, software patents in the EU are not allowed.

------
Xcelerate
This is very interesting and something I've been thinking about lately.

My research is in molecular dynamics. I've only been a grad student for one
semester, but I've written a lot of code. The code I am currently working on
takes a force-field description, combines it with a listing of atom
coordinates, and completely automates the production of an input file to a
simulation program. (This is less trivial than it sounds. All stretching,
bending, torsional, and improper atom connections must be generated. Bond
orders must be determined solely from the structure. And then this is
combined with equations from the paper describing the force-field to give an
energetic potential for each connection type).
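As an illustration of just one of those steps, here's a hedged sketch (the function and names are invented for this comment, not taken from the actual code): the bending connections are every pair of bonds that share a central atom.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

// Enumerate angle (bending) triples j-i-k from a bond list:
// every pair of bonds meeting at central atom i defines one angle.
std::vector<std::tuple<int, int, int>>
angles_from_bonds(int n_atoms, const std::vector<std::pair<int, int>>& bonds) {
    std::vector<std::vector<int>> neighbors(n_atoms);
    for (const auto& bond : bonds) {
        neighbors[bond.first].push_back(bond.second);
        neighbors[bond.second].push_back(bond.first);
    }
    std::vector<std::tuple<int, int, int>> angles;
    for (int i = 0; i < n_atoms; ++i)
        for (std::size_t p = 0; p < neighbors[i].size(); ++p)
            for (std::size_t q = p + 1; q < neighbors[i].size(); ++q)
                angles.emplace_back(neighbors[i][p], i, neighbors[i][q]);
    return angles;
}
```

For a water-like fragment with bonds (0,1) and (0,2) this yields the single H-O-H angle (1,0,2); torsions fall out of a similar walk over pairs of adjacent bonds.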

I would think this code could have a lot of value to other researchers,
though I doubt many people in non-CS departments have even heard of GitHub.

------
jongraehl
"create your own type to encapsulate the underlying type (float or double,
32-bit or 64-bit), and use this type in your code. This can be done with a
define in C/C++ or a class in Java."

No. A Java class would certainly have worse performance than just float or
double, since each instance would be individually heap-allocated and carry the
usual per-Object overhead. Better to use text preprocessing (or just
"double") than that.

(otherwise, this is all fine advice)
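For what it's worth, in C++ a `using` alias (or a plain `typedef` in C) gives the one-place switch with zero runtime overhead. A minimal sketch, with `real_t` and `dot` as invented names:

```cpp
#include <cstddef>

// Switch the whole codebase between float and double in one place.
// In C, `typedef double real_t;` (or a #define) does the same job.
using real_t = double;

// Numeric routines are written against real_t, never the raw type.
real_t dot(const real_t* a, const real_t* b, std::size_t n) {
    real_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Flipping the alias to `float` recompiles everything at the other precision, which is handy for checking whether a paper's results are sensitive to rounding, and there is no per-value boxing as there would be with a Java wrapper class.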

------
omaranto
6.3 and 6.4 read funny one after the other: put references to the paper in the
comments, but change all the notation from the paper.

~~~
1wheel
Probably the worst piece of advice in the article:

> 6.4 – Avoid mathematical notations in your variable names Let’s say that
> some quantity in the algorithm is a matrix denoted A. Later, the algorithm
> requires the gradient of the matrix over the two dimensions, denoted dA =
> (dA/dx, dA/dy). Then the name of the the variables should not be “dA_dx” and
> “dA_dy”, but “gradient_x” and “gradient_y”. Similarly, if an equation system
> requires a convergence test, then the variables should not be “prev_dA_dx”
> and “dA_dx”, but “error_previous” and “error_current”. Always name things
> for what physical quantity they represent, not whatever letter notation the
> authors of the paper used (e.g. “gradient_x” and not “dA_dx”), and always
> express the more specific to the less specific from left to right (e.g.
> “gradient_x” and not “x_gradient”).

Especially when you're just starting out, creating your own naming scheme just
creates more opportunities to do something wrong.

~~~
scrumper
Have to disagree. Those derivatives are a bad example of a good point. Most
mathematical symbols aren't representable in code, at least until we're able
to use unicode identifiers and sub/superscripts in every language. When you're
forced to write 'theta' instead of θ then you might as well just say 'angle'
so your future maintenance programmer will have an easier time of it.

An equation and an algorithm might achieve the same result but the ways they
get there are so different that using different notation styles makes perfect
sense. For example, a simple finite summation is a compact block in
mathematical notation but it's a multi-line for loop in C. Trying to force the
constraints of the 'source' notation on the implementation makes no sense.

Remember, dA/dx means 'gradient'. You're not creating your own naming scheme,
you're translating the concept of 'gradient' to the appropriate notation for
the medium you're working in.
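To make that concrete, a hedged sketch (identifiers invented for this comment): a paper's compact mean `E = (1/N) * sum_i theta_i` becomes an explicit loop, with the variable named for the physical quantity rather than the Greek letter:

```cpp
#include <vector>

// Paper notation: E = (1/N) * sum_{i=1}^{N} theta_i
// Code: name the quantity ("angle"), not the symbol ("theta").
double mean_angle(const std::vector<double>& angles) {
    double total = 0.0;
    for (double angle : angles)  // the summation sign becomes a loop
        total += angle;
    return total / angles.size();
}
```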

~~~
mistercow
>Most mathematical symbols aren't representable in code, at least until we're
able to use unicode identifiers and sub/superscripts in every language.

If the Linux compose key supported more of the common mathematical symbols, I
would have so much trouble not using them in all of my JS code. It's already
hard not to use names like â to denote unit vectors.

As a side note, I really want to make a JS library called "Eta" for creating
progress bars (puns!), where the global namespace is under Η (the Greek
letter), but I think that might piss people off, even if I did allow the
visually identical H as an alias.

~~~
derleth
> If the Linux compose key supported more of the common mathematical symbols,

Google '.xcompose github' (without quotes) and see what you come up with. I
use this one, myself:

<https://github.com/kragen/xcompose>

but there are a _lot_ of others.

~~~
mistercow
That's pretty sweet. If anybody is having trouble getting it to work in KDE, I
found that these instructions were necessary:
[https://wiki.edubuntu.org/ComposeKey#Persistent_Configuratio...](https://wiki.edubuntu.org/ComposeKey#Persistent_Configuration)

------
marshray
How about this one:

Figure out if the pseudocode notation in the paper is using 0- or 1-based
array indexing.

If the paper doesn't match your implementation language, consider doing the
initial implementation using an array adapter class. The adapters can be
removed later if they reduce performance, but they will likely save you a
number of maddening errors in the meantime.
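A minimal sketch of such an adapter (a hypothetical `Array1` class, in C++ for concreteness): it lets you transcribe 1-based pseudocode line for line, and can be swapped for a raw vector later if profiling demands it.

```cpp
#include <cstddef>
#include <vector>

// Wraps a std::vector so that a[1] is the first element, matching
// 1-based pseudocode. at() throws on out-of-range access, so
// off-by-one slips fail loudly instead of corrupting memory.
template <typename T>
class Array1 {
    std::vector<T> data_;
public:
    explicit Array1(std::size_t n) : data_(n) {}
    T& operator[](std::size_t i)             { return data_.at(i - 1); }
    const T& operator[](std::size_t i) const { return data_.at(i - 1); }
    std::size_t size() const                 { return data_.size(); }
};
```

Pseudocode like `for i = 1 to n: A[i] = i*i` then transcribes directly, with no index arithmetic to get wrong on the first pass.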

------
willcodeforfoo
I'm a little ignorant on the topic: what legal rights do I have to use a
(self-)implemented academic paper?

------
smiddereens
Pray for an appendix with source code.

