How to implement an algorithm from a scientific paper (codecapsule.com)
206 points by jnazario on Jan 11, 2013 | 77 comments



In my field (ecology and evolutionary biology) there's a small but concerted push to get people to publish "executable papers" where all code and data are available via GitHub, in an IPython notebook, etc., so that all figures and test cases can easily be reproduced by reviewers and researchers. If you're publishing research, it shouldn't be the reader's job to parse your paper and reimplement the research you describe; it should be on you to make your results replicable, and I think this is a standard we need to insist upon. I can't count how many papers I've read lately that were missing either crucial methods or underlying data, making replication impossible.

edit: Here's an example:

https://github.com/weecology/white-etal-2012-ecology

    Run portions of the analysis pipeline:
    Empirical analyses: python mete_sads.py ./data/ empir
    Simulation analyses: python mete_sads.py ./data/ sims
    Figures: python mete_sads.py ./data/ figs


I currently work in method development for proteomics, and I think about this all the time. Of course methods must be reproducible, or the paper is largely useless. But lately I've come to disagree that source code should be required.

If the research relies on a piece of "in-house" code, which isn't described, then that's a problem. But the OP is about research where the algorithm is the advance in the field. In a way, the algorithm is the result and not the method (even though the algorithm will probably be applied to a sample problem and the results thereof validated).

Now, if I develop a new method at the bench, you are expected to do the method at your own bench if you want to apply it to your research. This often starts as a pilot experiment just trying out the method, which can take weeks to months, before you integrate it fully into your methods. I don't have to provide a kit which consistently implements the method along with the paper. Certainly once the method gains some traction, an outside company may begin selling kits. Until then, it absolutely should be "the reader's job to ... reimplement the research" - that's an essential part of the process.

Similarly, an algorithmic method can be reproducible even if code isn't provided. Just like with bleeding edge wet-lab methods, if you want to use bleeding edge algorithms you will need to be able to code. You will need to be able to read a description of the algorithm and implement it accurately. The publicly funded result is the algorithm as an idea, and providing that the review process works, you're getting that idea. Later on, if the algorithm gains any traction, someone will implement it in a usable, robust package and more labs will be able to use it.

Finally, it takes significant time and experience to produce (and maintain) code that others can use and expect to work most of the time. That's a waste of time/money for a research lab, and an inefficient use of public funds. If you can't code yourself, leave code production to those who do it well and don't whine that you can't just plug in whatever hot new method just came out without any effort or proper understanding.


> Finally, it takes significant time and experience to produce (and maintain) code that others can use and expect to work most of the time. That's a waste of time/money for a research lab, and an inefficient use of public funds.

I haven't seen people asking for maintained, (re)usable code. We just want the crappy code that was used to produce the results. There is even an appropriate license, the Community Research and Academic Programming License (CRAPL).

[1] http://matt.might.net/articles/crapl/


I would think most people would be unwilling to make their "crappy" code public because no matter how many disclaimers they provide with it, they will be judged by others on it.


Why on earth would anyone trust descriptions that they cannot verify?

Trusting without the ability to verify goes against everything scientific.

If you think your code is too "crappy" for publication, why do you believe it is bug free enough to produce dependable answers?


Re-running their crappy code and getting the same result they got doesn't really prove or verify anything. Re-implementing the algorithm they describe in the paper and getting the same result (or not) is far more interesting.


> Re-running their crappy code and getting the same result they got doesn't really prove or verify anything.

Yes it does.

Very often, the data selected for publication is cherry-picked. Running the same crappy code on a more complete data set (or, alternatively, on a partial one) would give a very quick indication of the robustness of the results - and unlike re-implementing, it might be doable in a day rather than in months.

Furthermore, when you actually re-implement (if you do), it is extremely helpful to compare intermediate results, which is impossible unless you have the original everything.

> Re-implementing the algorithm they describe in the paper and getting the same result (or not) is far more interesting.

Yes, but very rarely done in fields that are not CS or EE (and not very common in these either). Usually, results are just taken as gospel.

Also, there is a ridiculous amount of negligence (and even fraud) in publications. Just running the crappy code, seeing the results, and having a cursory look at the code and data would reveal a lot of that.


> Why on earth would anyone trust descriptions that they cannot verify?

> Trusting without the ability to verify goes against everything scientific.

Hasn't this always been true about scientific papers? Descriptions can be verified by reproducing the experiment. Why is a paper any less trustworthy just because there's code involved?


The need for reproducibility in experiments is an accident of the fact that our universe is horrifically complicated and true reproducibility is a myth; thus we must make a deliberate, conscious effort to come as close as possible, or no progress can be made. When that is no longer true and it becomes possible to run (under certain constrained circumstances) fully deterministic experiments that can be freely replicated to the bit by anybody, it's time to rethink the assumptions made lo these many centuries ago.

People arguing against source code release often argue as if those of us in favor think that re-running the original simulation is the end-all, be-all of reproducibility. Clearly that is not the case. No one simulation can truly prove anything, and independent reverification will always have a place. But since we do have the source artifacts and original data, why not release them and show exactly what was done and how it was done? Again, the idea that experiments should not do so is merely an artifact of the fact that scientific papers could only be 10 very expensive pages or so in a journal; why carry unexamined assumptions based on that now outdated fact forward into the future?

Accidents of the past are nothing more than accidents of the past, not holy writ. And I'm not aware of a good argument against release of source code that doesn't, when deeply examined, boil down to "well, that's just not how we do it."


> Hasn't this always been true about scientific papers? Descriptions can be verified by reproducing the experiment. Why is a paper any less trustworthy just because there's code involved?

It was always true to an extent.

Code is a force multiplier that makes it significantly harder to evaluate the paper without it (and without reproducing an equivalent).

I'm not in academia myself, but I've heard from friends more than once that when they actually received code (and/or data) they requested from an author, the code turned out not to be precisely described in the paper, and the data was often massaged to fit in ways that weren't precisely described either.

The question shouldn't be "why aren't you satisfied with what was good 20 years ago?", but rather "when sharing the bits that make everything reproducible is a 'git push' away, why isn't it considered mandatory?"

It is a common error to think that science is about proving things; the scientific method is actually about trying to disprove things and failing to do so. If what you want to do is science, why not make it as easy as possible to disprove your results?


I just want to emphasize the difference between "code that was used to produce the results" and an algorithm that is the key element of the paper.

If I can't reproduce the results because you're using methods that I can't reasonably find/replicate, then the paper should not be accepted. That's still true for code, and sometimes providing source code is the missing piece. That's definitely not what the OP is about.


> If you can't code yourself, leave code production to those who do it well and don't whine that you can't just plug in whatever hot new method just came out without any effort or proper understanding.

This is missing the point. I can code, and I'm not promoting code sharing to more easily use other people's research. What I want is to be able to verify that the research I read is accurate, and there's simply not enough time available to reproduce everything I read from scratch. Reproducibility is important; I shouldn't have to just accept the authors' word that their results are exactly as described. Independent verification should be happening at the review stage at the very least, and authors should be required to make that possible if they want to publish.


I would imagine that the algorithm would gain a lot more traction if it actually came with an implementation.


This is what I (with co-authors) argued for in Nature last year: http://blog.jgc.org/2012/02/case-for-open-computer-programs....


That's great... When I did my MSc (my thesis was on reducing error rates in OCR), of the dozens of computer science papers I surveyed, there were perhaps 2-3 where I didn't have to do extensive "reverse engineering" of the results of the papers to figure out a whole lot of unstated assumptions when I wanted to test their algorithms. It was a tremendous waste of time...

Especially when pretty much all of these papers described results that meant the authors had already implemented their algorithms in executable form. But almost none of them made the code available in any form whatsoever.


I don't understand how academic authors get away with this.

Papers that I implemented or tried to implement in the past were often sorely lacking in details. Your point about unstated assumptions is absolutely true. Some papers are nothing more than glorified abstracts.

Free access to PUBLICLY funded research should be the default.


I think the issue is that the current incentive system makes it generally much more favorable not to publish actual source.

- A trivial bug in your program can go a long way to discrediting you if someone wants to.

- If your methods get inlined into a popular library, the number of people who cite your work will drop to 0. Popular libraries (for Machine Learning at least) typically have a set of authors who will be cited, and if you implement some new method or improvement in that library, they will get the credit and you won't. Since being cited is the most common measure of your net worth, having your work be accessible only with your name attached is significantly better for your career than having your work be maximally accessible.


I've had more than a few situations where the author does release the code...and it doesn't match the paper. The equations are different, restrictions are tighter, etc. It's the single most frustrating aspect of my job (excepting autotools).

Releasing the code is another way that people can attack your conclusions, so maybe people are reluctant to do it?


There's not enough room in a 12-page paper to give details. As someone else mentioned, every detail is also an invitation for some nitpicker to try to sabotage your work.


When I did my diploma thesis, my mentoring professor explicitly told me to only include short excerpts of the code, if any. Okay, it was physics, but still. Having read the thesis a year after I handed it in, I realized the code fragments are mostly useless/trivial for the reader. Either you publish the whole code or none of it.


> Free access to PUBLICLY funded research should be the default

I agree. But some research is privately funded.


I agree completely, but as the resident Dijkstra-head I am compelled to furnish the following quote:

"In the good old days physicists repeated each other's experiments, just to be sure. Today they stick to FORTRAN, so that they can share each other's programs, bugs included." -- EWD 498 ("How do we tell truths that might hurt?")


And what's wrong with that? That only makes it easier to spot the errors in their research.


Or harder, especially if you use the existing code as a crutch when recreating the experiment. You're much more likely to overlook a bug and copy it than perfectly recreate it from scratch.

Or worse yet, if you just run experiment.sh, see the results are the exact same, and declare that you recreated the results.


If someone goes looking through the code, it's a terrific improvement. But I work with astronomers and I haven't seen that tendency here. I expect this is changing across all science, but it's probably happening faster in biology.


Two "clean room" implementations that achieve the same result is much, much stronger evidence than any degree of code reviews on one piece of code (or reviews of one experiment). Independent reproducibility is a cornerstone of science.


I absolutely agree. I also have about zero faith in untrained scientists' ability to perform a clean-room implementation. (That's neither here nor there, though.)


Good point, I hadn't considered that. But isn't the source code a part of the methodology used? As in, if someone doesn't review the source code, then he's skipping a certain part of the methodology?


Yes, logically that's true. Unfortunately, in the past programming wasn't particularly respected—remember Admiral Hopper discovered bugs—prior to that everyone had apparently assumed their programs would just be a small matter of coding or something. The scientists I know continue to regard a program as this necessary inconvenience, I suppose in part because the math is the reality.

So yes, I agree, but there's going to need to be some education and cultural change before we get there, and I think fields that have embraced a computational subfield have a big head start.


The Reproducible Research movement and the like have been trying to gain ground in many different areas since at least 2009 [1].

Too bad its adoption hasn't grown faster. Especially with all the recent focus on "big data", applied machine learning, and computational science papers. It's hard to actually measure the quality and contributions of most of them.

Not having an accepted method to cite and share datasets is also part of the problem.

1- http://www.computer.org/csdl/mags/cs/2009/01/mcs2009010005.p... [PDF]


http://figshare.com/ is gaining traction as a place to deposit data and figures, and it provides DOIs for citation.


Reproduction of work sometimes goes by the name "provenance." 1. Record what you want to do. 2. Record how, when, where, by whom it was done. 3. Include ability to run it again. (Exact reproduction is understood to be impossible given everything that can vary in the hardware, OS, and software.) There is an encoding format for provenance, called Open Provenance Model, that can function as a guideline for how to record actions faithfully.

A good example is VisTrails, a workflow application. If you use it to make an image, then you can click on that figure in a PDF, and the URL leads to an online record of what made the image, which downloads to your local machine and runs. You can pick up where the author left off (software, data, and internet permitting). Running every program under such a workflow is cumbersome or impossible, but it's work in the right direction.


This should be the default for all code and data produced by publicly funded research.


Yes, although I don't think this is just a question of open access. It's a question of research verification - and expected of all published research.


It is generally a good idea, but I am not sure how effective it would be, because scientists generally suck at programming, and wandering through a lamer's messy code is not much better than wandering through a poorly formulated scientific paper; very often it is even worse. Code written by nonprofessional programmers usually has very bad API design, very bad naming, and poor to unacceptably poor performance, and it is often written in a high-level, inefficient language like Python, so you cannot just drop it into your C++ program.

Scientists would do better to develop a standard for scientific papers, because they do not even name them consistently. Damn, people, the most advanced human knowledge is a collection of PDFs that are not good for anything except printing and forgetting. There is a program, Mendeley, and all it does is find PDFs where you point it and rename them depending on how successfully it has parsed the article's author and title. And it sucks, because it cannot even parse the author and title in 100% of cases. Will there be anything better in the 21st century for knowledge sharing and scientific collaboration than writing a bunch of disconnected PDFs? At least hyperlinks, for god's sake! Semantic web?

Also, traditional mathematical notation, where every entity is marked with a single letter from the Latin, Greek or another alphabet, is exactly like very poor programming style where all your identifiers are a, b, c, a1, a2, i3. For me, this makes parsing scientific papers much harder.


> the most advanced human knowledge is a collection of PDFs that are not good for anything except printing and forgetting

Yep, and we have no backup of science. If science is so important, why is there no backup of all these files? How important can it really be if nobody is bothering, despite the moral implications, to make a complete backup? And why can't scientists access papers?


The last person who tried is now dead, after being threatened by his state with rotting in jail for life.


When my other half was going through her PhD, she was attempting to implement a signal-processing approach to constructing grids for FEM. Eventually, adding some constraints led to an acceptable result. As she presented her findings at a conference, the author of the paper eagerly questioned her about her approach. Why? He had never gotten it to work himself, a fact he failed to mention alongside the glowing praise heaped upon the technique.


Good thing he published then! It may have been a long time before someone thought of the approach and ironed out all of the implementation details. This sounds like exactly how science progresses... incrementally building on the work of others.


Alternatively, such a paper could lead many a reader to implement a dead end. Each assumes they did it wrong and, after much wrangling, silently moves on. Later, the next intellectual victim comes along.


It seems like one should say "that's how scholarly research progresses". To be science as such it seems like one has to do some verification as well as some speculation.


While peer review is flawed in many ways (not all papers accepted at top conferences are good, nor do all good papers get accepted), it still has some meaning.

When deciding which paper to read, where it was published can be a great hint. The linked article merely claims that groundbreaking work is published in the best "journals". Especially in computer science, conference papers are where recent, groundbreaking work is published, and good conferences are the ones that are hard to get into. However, I agree that groundbreaking work often gets a longer follow-up journal article. But those usually appear years later, and for those algorithms it is likely that implementations already exist by that time.


With regard to section 3.5, "Know the definition of all terms", I found that in a given field these terms change over time. I wasn't really aware of it, and then I went back to a paper after reading about Domain Maps in Eric Evans's Domain-Driven Design. Lightbulb! I quickly made a list of the terms used in each paper, and then drew lines between the identical or related terms.

I have for some time been attempting to learn everything I can about automated theorem proving. A key paper is Robinson's 1965 paper "A Machine-Oriented Logic Based on the Resolution Principle". It uses quite different notation compared to, say, Kowalski's "Logic for Problem Solving". Robinson made an important contribution, but his work is like the assembly language of logic. Modern papers are much higher level. Kowalski's book is from 1979, so it's like C. My new domain map made these works much more comprehensible, especially as I switched between them.

The other good point is patents. That's why I'm reading papers from 1965 and books from 1979.


What would happen if you just hosted your potentially patent-breaking API from Europe? Apart from 100 ms of lag?


That's not a conversation I want to have when talking to investors.


>If you are in the U.S., beware of software patents. Some papers are patented and you could get into trouble for using them in commercial applications.

Is it only commercial applications that you have to worry about? I was under the impression that even a free implementation would be infringing the patent and make you liable for damages.


Free implementations are liable. See Madey v. Duke University. It severely restricted the research exemption (http://en.wikipedia.org/wiki/Research_exemption)

I have had university counsel tell me I cannot open-source code I have written implementing patented algorithms. It is unfortunately more common than most people think, and often not acknowledged in publications.


It's interesting to note how Matlab is still so fast to prototype stuff in. I've done it myself countless times.

Data generation, input, manipulation, output and result checking are all very good.

Maybe things like Go can change that, or perhaps some optionally typed language. There is no fundamental reason not to get massive improvements.


Indeed. There's plenty of room for improvement in the tools for implementing such algorithms. I'm actually spending a bit of time on the algorithm engineering and data provenance bits.


It's kind of funny that the article lists "authors citing their own work" as an identifying feature of groundbreaking research.

I don't know about CS, but in most scientific fields, this may be a bad sign. It can mean they're just trying to pump up their own reference counts, or it can mean they don't really know what other people are doing.

The only way to be sure that neither of these is the case is to know, a priori, that they're truly doing groundbreaking work.


I think the warning sign should really be if an author only cites their own work. From what I recall, having a small number of references to your own previous work that you are building on is pretty standard and desirable, as most research is incremental.


This is true. A lot of times the process goes:

1. Here is a paper about some interesting things we've been looking at, here is what we know, here are some ideas we are building on. These tend to be presented at conferences, as a "what do you guys think?" sort of introduction for the larger community. It is nice, because other researchers can then tell you if this is actually new, or just a repeat of an idea (that was hard to find because it used different terms and went nowhere), or something worth looking into, or the old "hey, here's some pitfalls I see from my expertise".

2. A couple more conference papers with results building on the ideas in the seminal paper from 1. These are just to keep others in the field aware of your work, get feedback, and play the game right - you get much less credit when someone comes along and "scoops" you if you don't have a history showing you've been working on this for a while.

3. Actually interesting/important work. This is the type of paper that the OP calls "groundbreaking". After all that work (see documentation over the years), here is something pretty awesome!

Another facet of this process:

When doing research, you have no idea what you are doing. I mean, you have expertise and goals and hypotheses/theories, but you don't know how it will pan out. You don't know when you'll suddenly hit a spot where a left turn is required. So publishing about these new things is a good idea. Other people in the field can benefit from just that. Further, in my experience, each small result tends to spawn more questions/investigatory tracks, etc. than it closes. So a lot of papers with honest "future work" sections are great places for grad students and others new to the field to dive in and get their feet wet. They can follow some of the tracks the original researchers just had no time for, and help fill in gaps.

Finally, the self reference is a good sign, because it establishes you aren't just some person coming out of left field with $BIG_IDEA (which looks a bit crack-pot-esque...)


You completely misread that bit -- citing your own work is noted as a sign of incremental research done over several years. It was in the discussion of different types of groundbreaking research.


Not funny at all. Referencing your own work, by itself, is not a bad sign. In many cases, the groundbreaking work takes several years to accomplish. The years of work generate several new ideas on their own, resulting in several related papers.


Yes. The OP agrees; here is the full context of the quote:

"Groundbreaking papers...out of research teams in smaller universities that have been tackling the problem for about six to ten years. The later is easy to spot: they reference their own publications in the papers, showing that they have been on the problem for some time now, and that they base their new work on a proven record of publications."


It's pretty clear when they are trying to pump references; the papers they reference may be tangentially related or not related at all, or have no new content.


> just trying to pump up their own reference counts

Don't most bibliometrics exclude self-citations?


I got into a situation a couple years ago, where I wanted to understand how relational databases worked. I mean how they really worked. Down to the algorithms and processes that implemented query parsing and ACID - purely so I could try and toy around with implementing some OLAP stuff on my own, without all the 'overhead' of the rest of the database (at the time I didn't really care about transactions or query optimization). I ended up getting into a pile of papers so deep I thought I would never escape, and eventually gave up. I even ordered a thick book that was a collection of papers, tying it all together.

Some of the papers on the subject - the groundbreaking ones by folks like Goetz Graefe - were brilliant and very interesting reads, but at the same time were so involved I felt like I would need to dedicate years before even scratching the surface.

Walking away, I did learn to see the difference between good papers and bad, and learned a heck of a lot on DB internals (good through the end of the 90's at least). But I think I'll stick with books from now on :)


The site is down now, so I haven't seen the article— but I hope it just says "(1) give up and beg the author for their implementation, which undoubtedly contains 2 megabytes of opaque unmentioned magic constants."


Harsh truth: sometimes the algorithm is deliberately obfuscated so the authors can keep a competitive edge.

PS: Speaking from a CS perspective.


I recently refereed a paper in computational geosciences and recommended substantial revision because the authors did not make their source code available for evaluation. The paper was ultimately rejected.


There is one increasingly common kind of paper out there, the 'obfuscapaper'. It's a paper which pretends to outline all the steps, but really doesn't -- key information is either missing or obfuscated. You often see that with papers published by people working at companies, or by people who are about to leave academia to work at a company.

Problem is, this kind of obfuscation is really difficult to spot unless you actually understand the paper and see all the steps necessary to implement the algorithm.


I bumped into #1.4 (patented research) not too long ago -- I wanted to try out Bi-Normal Separation for feature space reduction in a machine learning classifier, but BNS comes out of HP Labs and would need to be licensed. I think the wording on this point needs to be changed; why would it only apply to those in the US?


Because, officially at least, software patents in the EU are not allowed.


This is very interesting and something I've been thinking about lately.

My research is in molecular dynamics. I've only been a grad student for one semester, but I've written a lot of code. The code I am currently working on takes a force-field description, combines it with a listing of atom coordinates, and completely automates the production of an input file to a simulation program. (This is less trivial than it sounds. All stretching, bending, torsional, and improper atom connections must be generated. Bond-orders must be determined, solely from the structure. And then this is combined with equations from the paper describing the force-field to give an energetic potential for each connection type).

I would think this code could have a lot of value to other researchers, though I doubt anyone in non-CS departments has even heard of GitHub.
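To give a flavor of just the connectivity bookkeeping, something like this (a rough C++17 sketch of the idea, not my actual code; bond-order assignment, impropers, and the force-field parameter lookup are where the real work is):

    // Derive angle (bend) and dihedral (torsion) index tuples from a bond list.
    #include <array>
    #include <cstddef>
    #include <utility>
    #include <vector>

    using Bond = std::pair<int, int>;

    // Adjacency list: which atoms are bonded to which.
    std::vector<std::vector<int>> adjacency(int n_atoms, const std::vector<Bond>& bonds) {
        std::vector<std::vector<int>> adj(n_atoms);
        for (auto [i, j] : bonds) { adj[i].push_back(j); adj[j].push_back(i); }
        return adj;
    }

    // Angles: every pair of bonds sharing a central atom j gives a triple (i, j, k).
    std::vector<std::array<int, 3>> angles(const std::vector<std::vector<int>>& adj) {
        std::vector<std::array<int, 3>> out;
        for (int j = 0; j < (int)adj.size(); ++j)
            for (std::size_t a = 0; a < adj[j].size(); ++a)
                for (std::size_t b = a + 1; b < adj[j].size(); ++b)
                    out.push_back({adj[j][a], j, adj[j][b]});
        return out;
    }

    // Dihedrals: for each bond (j, k), pair every outside neighbor of j
    // with every outside neighbor of k, giving a quadruple (i, j, k, l).
    std::vector<std::array<int, 4>> dihedrals(const std::vector<Bond>& bonds,
                                              const std::vector<std::vector<int>>& adj) {
        std::vector<std::array<int, 4>> out;
        for (auto [j, k] : bonds)
            for (int i : adj[j])
                if (i != k)
                    for (int l : adj[k])
                        if (l != j && l != i)
                            out.push_back({i, j, k, l});
        return out;
    }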


"create your own type to encapsulate the underlying type (float or double, 32-bit or 64-bit), and use this type in your code. This can be done with a define is C/C++ or a class in Java."

No. A Java class would certainly have worse performance than just float or double, since each instance would be individually heap allocated and carry the usual per-Object overhead. Better to use text preprocessing (or just "double") than that.

(otherwise, this is all fine advice)
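On the C/C++ side the advice is essentially free, since a typedef/using alias vanishes at compile time. A minimal sketch (the alias name "real" is my own choice, not from the article):

    #include <cstdio>

    // Flip this one line to test the algorithm's sensitivity to precision.
    using real = double;   // or: typedef float real;

    real dot(const real* a, const real* b, int n) {
        real acc = 0;
        for (int i = 0; i < n; ++i)
            acc += a[i] * b[i];
        return acc;
    }

    int main() {
        real a[] = {1, 2, 3}, b[] = {4, 5, 6};
        std::printf("%f\n", (double)dot(a, b, 3));   // prints 32.000000
    }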


6.3 and 6.4 read funny one after the other: put references to the paper in the comments, but change all the notation from the paper.


Probably the worst piece of advice in the article:

> 6.4 – Avoid mathematical notations in your variable names. Let’s say that some quantity in the algorithm is a matrix denoted A. Later, the algorithm requires the gradient of the matrix over the two dimensions, denoted dA = (dA/dx, dA/dy). Then the name of the variables should not be “dA_dx” and “dA_dy”, but “gradient_x” and “gradient_y”. Similarly, if an equation system requires a convergence test, then the variables should not be “prev_dA_dx” and “dA_dx”, but “error_previous” and “error_current”. Always name things for what physical quantity they represent, not whatever letter notation the authors of the paper used (e.g. “gradient_x” and not “dA_dx”), and always express the more specific to the less specific from left to right (e.g. “gradient_x” and not “x_gradient”).

Especially when you're just starting out, creating your own naming scheme just creates more opportunities to do something wrong.


Have to disagree. Those derivatives are a bad example of a good point. Most mathematical symbols aren't representable in code, at least until we're able to use unicode identifiers and sub/superscripts in every language. When you're forced to write 'theta' instead of θ then you might as well just say 'angle' so your future maintenance programmer will have an easier time of it.

An equation and an algorithm might achieve the same result but the ways they get there are so different that using different notation styles makes perfect sense. For example, a simple finite summation is a compact block in mathematical notation but it's a multi-line for loop in C. Trying to force the constraints of the 'source' notation on the implementation makes no sense.

Remember, dA/dx means 'gradient'. You're not creating your own naming scheme, you're translating the concept of 'gradient' to the appropriate notation for the medium you're working in.
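For example, a throwaway sketch (the names are my own): the compact sum s = Σ w_i · x_i becomes

    #include <cstddef>
    #include <vector>

    // Assumes w and x have the same length.
    double weighted_sum(const std::vector<double>& w, const std::vector<double>& x) {
        double s = 0.0;                    // the accumulator has no symbol in the math at all
        for (std::size_t i = 0; i < w.size(); ++i)
            s += w[i] * x[i];              // one term of the summation per iteration
        return s;
    }

The loop variable, the accumulator, and the bounds check are all implementation artifacts with no counterpart in the notation, so insisting on a one-to-one mapping of names buys you little.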


>Most mathematical symbols aren't representable in code, at least until we're able to use unicode identifiers and sub/superscripts in every language.

If the Linux compose key supported more of the common mathematical symbols, I would have so much trouble not using them in all of my JS code. It's already hard not to use names like â to denote unit vectors.

As a side note, I really want to make a JS library called "Eta" for creating progress bars (puns!), where the global namespace is under Η (the Greek letter), but I think that might piss people off, even if I did allow the visually identical H as an alias.


> If the Linux compose key supported more of the common mathematical symbols, I would have so much trouble not using them in all of my JS code.

In case you happen to use vim, you can easily type Unicode characters by using digraphs (Ctrl-K and then two characters):

   <c-k>a* → α
   <c-k>b* → β
   <c-k>g* → γ
   <c-k>OK → ✓
   <c-k>NB → ∇
   <c-k>RT → √
   <c-k>(- → ∈
   <c-k>-> → →


> If the Linux compose key supported more of the common mathematical symbols,

Google '.xcompose github' (without quotes) and see what you come up with. I use this one, myself:

https://github.com/kragen/xcompose

but there are a lot of others.


That's pretty sweet. If anybody is having trouble getting it to work in KDE, I found that these instructions were necessary: https://wiki.edubuntu.org/ComposeKey#Persistent_Configuratio...


I think you're generally right. There are exceptions, but in many domains there are standard notations for often-encountered quantities.

You say "F" for forces, "p" for probabilities, "x" for machine learning inputs, "y" for classes, etc. In some cases, you might want to use English terms for these things, but usually you'd want to use the standard letters. Especially if the source paper that you're following uses some flavor of standard notation. (So many standard notations to chose from!)

Of course, coming up with universal rules for naming things is impossible. The advice would probably have been better left out.


How about this one:

Figure out if the pseudocode notation in the paper is using 0- or 1-based array indexing.

If the paper doesn't match your implementation language, consider doing the initial implementation using an array adapter class. The adapters can be removed later if they hurt performance, but they will likely have saved you a number of maddening errors in the meantime.
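Something along these lines, for example (just a sketch; the class name and interface are my own invention):

    // Thin wrapper so indexing matches the paper's 1-based pseudocode.
    #include <cassert>
    #include <cstddef>
    #include <vector>

    template <typename T>
    class OneBased {
    public:
        explicit OneBased(std::size_t n) : data_(n) {}
        T& operator[](std::size_t i) {                 // paper index: 1..n
            assert(i >= 1 && i <= data_.size());
            return data_[i - 1];                       // shift to the 0-based store
        }
        std::size_t size() const { return data_.size(); }
    private:
        std::vector<T> data_;
    };

    // A paper's "for i = 1 to n: A[i] <- i^2" then transcribes directly:
    //   OneBased<int> A(n);
    //   for (std::size_t i = 1; i <= A.size(); ++i) A[i] = i * i;

Once the output matches the paper, the wrapper can be stripped out and the indices shifted in one mechanical pass.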


I'm a little ignorant on the topic: what legal rights do I have to use a (self-)implemented academic paper?


Pray for an appendix with source code.



