Ask HN: Should I publish my research code?
421 points by jarenmf on Jan 14, 2022 | 353 comments
I'm looking for advice on whether I should publish my research code. The paper itself is enough to reproduce all the results. However, the implementation can easily take two months of work to get right.

In my field, many scientists tend not to publish the code or the data. They mostly write a note that code and data are available upon request.

I can see the pros of publishing the code: it's obviously better for open science, it makes the manuscript more solid, and it makes life easier for anyone trying to replicate the work.

But on the other hand, it's substantially more work to clean and organize the code for publishing, and it will increase the surface for nitpicking and criticism (e.g. coding style, etc.). Besides, many scientists look at code as a competitive advantage, so in this case publishing the code will be removing the competitive advantage.




> it's substantially more work to clean and organize the code for publishing, it will increase the surface for nitpicking and criticism (e.g. coding style, etc).

Matt Might has a solution for this that I love: Don't clean & organize! Release it under the CRAPL[0], making explicit what everyone understands, viz.:

"Generally, academic software is stapled together on a tight deadline; an expert user has to coerce it into running; and it's not pretty code. Academic code is about 'proof of concept.'"

[0] https://matt.might.net/articles/crapl/


> "Generally, academic software is stapled together on a tight deadline; an expert user has to coerce it into running; and it's not pretty code. Academic code is about 'proof of concept.'"

What do you know, it turns out the professional software developers I work with are actually scientists and academics!!


They don't call it "Computer Science" for nothing ;)


Philip Wadler famously says that computer science has two problems: "computer" and "science".

It's not about computers, and "you don't put 'science' in your name if you're a real science".

He prefers the name informatics.

source: https://youtube.com/watch?v=IOiZatlZtGU


> You don't put 'science' in your name if you're a real science

Having flashbacks to when a close friend was getting an MS in Political Science, and spent the first semester in a class devoted to whether or not political science is a science.


Well? Don't keep us hanging.


Kinda like countries that feel the need to put “democratic” in their name.


Also applies to newspapers which include some version of "truth" in the title. (Russian propaganda "Pravda", Polish tabloid "Fakt", etc.)


Or "People"...


Or "Republic"...


Or "Of China"....


"Information science" is basically long form of "informatics" so that breaks it I'd say. Also, "information" tends to imply a focus on state and places computational aspects (operations performed on information) as second hand.

I've yet to find a classification I really like, but this is an interesting take. I still tend to like CIS (Computing and Information Sciences). The problem with CS is that it focuses on computation and treats state as second class. The problem with IS is that it focuses on state and treats computing as second class. To me, both are equally important.


TBF informatics is "the study of computational systems"

I have studied informatics, but don't call myself an informatic, because I am not doing any research.

I call myself a programmer, because I am not doing engineering either.


> It's not about computers, and "you don't put 'science' in your name if you're a real science"

"Computer science is no more about computers than astronomy is about telescopes." - Edsger W. Dijkstra [0], [1].

[0] https://en.wikipedia.org/wiki/Edsger_W._Dijkstra

[1] https://en.wikiquote.org/wiki/Computer_science


Materials Science is about as sciency as you can get.


An economist friend of mine told me that once, so this isn't just a CS quip.


Informática is a common term in Spanish, as you probably know.


Maybe a little OT, but I'd rather it be called "computing science." Computers are just the tool. I believe it was Dijkstra who famously objected to it being called "computer science," because they don't call astronomy "telescope science," or something to that effect.


Peter Naur agreed with that, which is why it is called "Datalogi" in Denmark and Sweden; his English term Datalogy never really caught on, though. He proposed it in a letter to the editor in 1966 [0], and it has pretty much been used here ever since, as Naur founded the first Datalogy university faculty a few years later.

This has also led to a strange naming situation now that we have data science as well, which is called "Data videnskab" here, just a literal translation of the English term.

[0]: https://dl.acm.org/doi/10.1145/365719.366510 (sadly behind a paywall)


Informatics may be the closest English analogue. Incidentally, it's also what computer science is called in German (Informatik).


> Incidentally, it's also what computer science is called in German (Informatik)

Computerwissenschaften (literally computer science) exists, too, but it's the less common word.


> his English term Datalogy never really caught on though

Perhaps because no one had any idea about how it would be pronounced?


In Portuguese it's called "Ciência da Computação" (computing science or science of computation).


I never thought about this, but it's right. However, to get more nitpicky, most of the uses of "Comput[er|ing] Science" should be replaced with "Computer Engineering" anyway. If you are building something for a company, you are doing engineering, not science research


Be careful there. If you start calling what you're doing 'engineering', people will want you to know statics, dynamics, and thermodynamics.


The average developer isn't often doing "engineering". Until we have actual standards and a certification process, "engineer[ing]" doesn't mean anything.

The average software developer doesn't even know much math.

Right now, "software engineer" basically means "has a computer, -perhaps- knows a little bit about what goes on under the hood".


> The average software developer doesn't even know much math.

Well, I know stupid amounts of math compared to the average developer I've encountered, since I studied math in grad school. Other than basic graph traversal, I only remember one or two times I've gotten to actually use much of it.


Engineering is something like “deliberately configuring a physical or technological process to achieve some intended effect”. That applies whether you're building a bridge or writing fizzbuzz.




I'm not talking about the "average developer"; I'm talking about college graduates who have a "Computer Science" degree but in practice are "computer engineers".


College degrees aren't standardized and most of the time don't really mean anything. Ask some TAs for computer science courses about how confident they are in undergrads' ability to code.

There isn't a standard license that shows that someone is proficient in security, or accessibility, or even in how computer hardware or networking work at a basic level.

So all we're doing is diluting the term "engineer" until it doesn't mean anything.

The only thing the term "software engineer" practically means is: they have a computer. It's meaningless, just a vanity title meant to sound better than "developer".


I agree. I can't find it right now, but there was an article on HN within the past few days talking about how software engineering is often more like writing and editing a book than engineering. That makes perfect sense to me. Code should be meant primarily to communicate with people, and only incidentally to give instructions to computers. It seems to be lost to the sands of time who said this first, but it is certainly true that code is read by humans many more times than it is written. Therefore, ceteris paribus, we should optimize for readability.

Readable code is often simple code, as well. This also has practical benefits. Kernighan's law[0] states:

> Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

Dijkstra says it this way:

> The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague.[1]

Knuth has a well-documented preference for "re-editable" code over reusable code. Being "re-editable" implies that the code must be readable and understandable (i.e. communicate its intention well to any future editors), if not simple.

I know that I have sometimes had difficulty debugging overly clever code. I am not inclined to disagree with the giants on this one.

---

[0]: https://www.defprogramming.com/q/188353f216dd/

[1]: https://www.cs.utexas.edu/users/EWD/transcriptions/EWD03xx/E...


All my job titles have been of the form Software Engineer plus a modifier.

I believe they are referring to what the degree currently known as Computer Science should be called.


Yes. Specifically, it's referring to what the academic discipline ought to be called.

Incidentally, I don't think I know any SWEs who actually majored in software engineering. I know one who didn't even bother to graduate high school and therefore has no academic credential whatsoever, a couple of music majors, a few math majors, and a lot of "computer science" majors, but I can't think of anyone who actually got a "software engineering" bachelor's degree. Hell, I even know one guy with a JD. I think I know 1 or 2 who have master's degrees in "software engineering," but that's it.


Please don't use this license. Copy the language from the preamble and put it in your README if you'd like, but the permissions granted are so severely restricted as to make this license essentially useless for anything besides "validation of scientific claims." It's not an open-source license - if someone wished to include code from CRAPL in their MIT-licensed program, the license does not grant them the permission to do so. Nor does it grant permission for "general use" - the software might not be legally "suitable for any purpose," but it should be the user's choice to decide if they want to use the software for something that isn't validation of scientific claims.

I am not a lawyer, just a hardcore open-source advocate in a former life.


I'm a proponent of MIT and BSD style licenses normally, but this calls for something like AGPL: allow other researchers and engineers to improve upon your code and build amazing things with it. If someone wants to use your work to earn money, let them understand and reimplement the algorithms and concepts; that's fine too.


That's probably not viable under US copyright law, especially with the Bright Tunes Music v. Harrisongs Music precedent; if someone is going to reimplement the algorithms and concepts without a copyright license, they're better off not reading your code so they don't have to prove in court, to a computer-illiterate jury, that the aspects their code had in common with your code were really purely functional and not creative.


Yeah, this seems a bit verbose and overbearing to me. The open-source licenses I've used myself include something like this, which seems quite sufficient:

> THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

From a non-academic's point of view I might include a brief disclaimer in the README too, explaining the context for the code and why it's in the state it's in, but there's no obligation to hold the user's hand. To be frank, anybody nitpicking or criticizing code released under these circumstances with the expectation that the author do something about it can go fuck themselves.

Competitive advantage, on the other hand, is a perfectly valid reason to hold code back. There may also be some cost in academia to opening an attack surface for any sort of criticism, even irrelevant criticism made in obvious bad faith. Based on what I've heard about academia, this wouldn't surprise me.


From the post: "The CRAPL says nothing about copyright ownership or permission to commercialize. You'll have to attach another license if you want to classically open source your software."

It is explicitly the point of the license that the code is not for those purposes, because it's shitty code that should not be reused in any real code base.


That's not a good excuse for putting your readers at legal risk of copyright infringement. A real, non-shitty code base could easily be a "derivative work" of the shitty code.


In the research world, it is well established that anyone wanting to publish, utilize, or include another's research in their own work should contact the original author and receive explicit permission to do so.

More often than not, they are more than willing to help.


"Well-established" norms are a barrier to newcomers and perpetuate power structures (including racial, ethnic, national, and socio-economic).

That's not theoretical; I know many people who were or would be embarrassed to ask.

Explicitly posting things is helpful.


It looks quite explicitly designed as a short-term temporary license for the period when the main paper is unpublished and you'd be expected to keep the code non-public (due to e.g. reviewing anonymity requirements), so the basic open source freedoms are explicitly not included.

I would expect that anyone wanting to actually publish their code should publish the code with a "proper" license after the reviewing process is done and the relevant paper is published.


> the permissions granted are so severely restricted as to make this license essentially useless

Indeed, and there are also things like "By reading this sentence, You have agreed to the terms and conditions of this License." That can't hold up in court! How can I know in advance what the rest of the conditions say before agreeing to them?

Then again, I am not a lawyer either.


While I wholeheartedly agree with you, I would seriously question anyone trying to reuse research code in production without completely reimplementing it from scratch.


I'd like to piggyback and say that increasing the surface for nitpicking and criticism is exactly why OP should release his code. It improves the world's ability to map data to the phenomenon being observed. It becomes auditable.

Certainly don't clean it up unless you're going to repeat the experiment with the cleaned up code.


Agreed on both points! As somebody who bridges research and production code, I can typically clean code faster than I can read & understand it. It really helps to have a known-good (or known-"good") example so that I can verify that my cleanup isn't making undesired changes.

And, yeah. I've found some significant mistakes in research code -- authors have always been grateful that I saved them from the public embarrassment.


I do this with my code and can highly recommend it.

Supplying bad code is a lot more valuable than supplying no code.

Also in my experience, reviewers won't actually review your code, even though they like it a lot when you supply it.


The CRAPL is a "crayon license" — a license drafted by an amateur that is unclear and so will prevent downstream use by anyone for whom legal rigor of licensing is important.

https://matt.might.net/articles/crapl/

> I'm not a lawyer, so I doubt I've written the CRAPL in such a way that it would hold up well in court.

Please do release your code, but please use a standard open source license. As for which one, look to what your peers use.


I think this could be done much better by using a restrictive license like GPLv3/AGPL and then stating in the README that you don't support this project at all and will ignore everything associated with wherever you are hosting it.

Using this license would actually make me suspect that your results aren't even valid and I don't trust many experiments that don't release source code.


In case OP and others don't know: it is the copyright holders who decide on the license. The copyright holders are the people who contributed to the code. In this case, it sounds like OP is the sole author and therefore the sole copyright holder.

You cannot change the past, but as a copyright holder, you can always set a new license for future releases.

Thus, OP, if you're uncertain (I definitely was when I started out), go with a restrictive license as recommended here (GPL). That, together with publishing the code online (e.g. GitHub, GitLab, ...) as well as with your article, will give you some protection against plagiarism. Anyone who includes parts of your code in their research code will have to share their code the same way. If you later feel like you want to relax the license, you can always change it to, say, MIT.


Anecdotally most of the research papers I see and have worked on publish their code but don't really clean it up. Even papers by big companies like Microsoft Research. Still significantly better than not publishing the code at all.


A thousand times this. A working demonstrator of a useful idea that is honest about its limits is so valuable. Moreover, most commercial code is garbage! :)


Please do absolutely publish your code.

If only to help people who simply can't read academic papers because it's not a language their brain is wired to parse, but who can read code and will understand your ideas via that medium.

[EDIT]: to go further, I have - more than once - run research code under a debugger (or printf-instrumented it, whichever) to finally be able to get an actual intuitive grasp of the idea presented in the paper.

Once you stare at the actual variables while the code runs, my experience is it speaks to you in a way no equation ever will.


This is not what licenses are for!! They are not statements about the quality of your work or anything similar.

Use standard and well understood licenses e.g. GPL for code and CC for documentation. The world does not need more license fragmentation.


This has explicit usage limitations that matter in science land, which is very much the kind of thing that belongs in a license.

Eg:

   You are permitted to use the Program to validate scientific claims
   submitted for peer review, under the condition that You keep
   modifications to the Program confidential until those claims have
   been published.

Moreover, sure, lots of the license is text that isn't common in legal documents, but there's no rule that says legal text can't be quirky, funny or superfluous. It's just most practical to keep it dry.

In this particular case, however, there's very little risk of actual lawsuits happening. There is some, but the real goal of the license is not to protect anyone's ass in court (except for the obvious "no warranty" part at the end), but to clearly communicate intent. Don't forget that this is something GPL and MIT also do besides their obvious "will likely stand up in court" qualities. In fact I think that communicating intent is the key goal of GPL and MIT, and also the key goal of CRAPL.

From this perspective, IMO the only problem in this license is

    By reading this sentence, You have agreed to the terms and
    conditions of this License.
This line makes me sad because it makes a mockery of what's otherwise a pretty decent piece of communication. Obviously nobody can agree to anything just by reading a sentence in it. It should say that by using the source code in any way, you must agree to the license.


> clearly communicate intent

Again, this is not how a license works. You can express your intents, ideas and desires in a README file and in many other ways.

The license is nothing more than a contract that provides rights to the recipient under certain conditions. Standing up in court is its real power and only purpose.

That's why we should prefer licenses that have stood up in court and have been written by lawyers rather than developers or scientists.


I strongly disagree. Contracts very much primarily communicate intent, ideally in such a way that they also stand up in court. People regularly argue over details in contracts and regularly look things up in contracts, even when there is no court to be seen and no intention anywhere of going to court. The vast majority of contracts never make it to court.

Plenty of contracts aren't even written down. When you buy a loaf of bread at the bakery, you make an oral contract about that transaction.

The idea that contracts, or licenses, need to be written in dull legalese and be pretty much impenetrable to be useful or "valid" or whatever, is absolutely bonkers. Lawyers like you to think that but it's not true. It's an urban legend.

If you need to make sure that you can defend your rights in court, then sure, you're probably going to need some legalese (but even then there's little harm in also including some intent - it's just not very common). Clearly that's not the goal here. No scientist is gonna sue another scientist because they asked for support and got angry about not getting any, even though the code was CRAPL licensed.


> Plenty of contracts aren't even written down.

That's a well-known fact. And it's beside the point.

> Lawyers like you to think that but it's not true.

Is that a conspiracy theory? Writing long, detailed contracts on a persistent medium is safer: it lowers the risk of he-said-she-said scenarios and ambiguities.

That is meant to save you tons of legal expenses.

> No scientist is gonna sue another scientist

Then there is no need for such license in the first place. Just a readme file.


By existing, you have agreed to the terms and conditions.


I agree. There's a lot of confusion surrounding even the most established ones, so there's no need to further muddy the situation with newer licenses. In my opinion a "fun" license, with its untested legal ambiguity, restricts usage more than a well established license with a similar level of freedoms.


For instance, the Java license explicitly forbids use in real-time critical systems, and such limitations are good to stress in a license so that they may carry legal force, and also to protect the author(s).

Incidentally, I've seen people violate the Java "no realtime" clause.


It used to; OpenJDK has been licensed under GPLv2 with the Classpath Exception, which allows this, for years now. If you're not running an OpenJDK build, it depends on your vendor's license.


And it makes the license non-open-source.

Plus, the usual "no warranty" clause is strong enough to protect the authors anyway.


> This is not what licenses are for!!

You must be fun at parties :)


Yes, if you feel you have to make it "release ready" then you'll never publish it. I'm pretty sure a good majority of the code is never released because the original author is ashamed of it, but they shouldn't be. Everybody is in the same boat.

The only thing I would add is a description of the build environment and an example of how to use it.


I like it so far, other than

4) You recognize that any request for support for the Program will be discarded with extreme prejudice.

I think that should be a "may" rather than a "will." If I find out someone is using my obscure academic code, and they ask for help, I'd be pretty pumped to help them (on easy requests at least).


The point of the license is to set your expectations as low as possible. Then, when you actually /do/ get support, you'll be ecstatic rather than nonplussed.


Discarding a request for support with extreme prejudice might entail using LinkedIn to look up the boss of the person who asked you for support, then phoning them up to complain about the request, or it might entail filing for a restraining order against the person requesting support. The point of this clause is to intimidate people out of making the request in the first place.


That's a pretty over the top reading.


You should probably familiarize yourself with the meaning of the phrases it's alluding to, "dismissed with extreme prejudice" and "terminated with extreme prejudice".


When phrased like this,

> 4) You recognize that any request for support for the Program will be discarded with extreme prejudice.

There is no way I'd even make a request for support.


Exactly. If nothing else, a request for support has a chance of being an indication that there's somebody else in the field that cares about some aspect of the problem. I might not act on it, but it is good to have some other human-generated signal that says "look over there."


> Academic code is about "proof of concept."

Why does he think that but presumably not the same about the paper itself and the “equations”, plots, etc. contained within?

It’s really not that hard to write pretty good code for prototypes. In fact, I can only assume that he and other professors never allowed or encouraged “proof of concept” code to be submitted as course homework or projects.


I think you don't understand the realities of work in scientific/academic organisations. Unless you work in computer science, you likely never received any formal education in programming except for some numerical-methods-in-C/Matlab/Fortran course during your studies (which often also focuses more on the numerical methods than on the programming). So pretty much everyone just learned by doing.

Moreover, you are not being paid for writing reasonable programs; you're paid for doing science. Nobody would submit "prototype" papers, because papers are the currency of academic work. Lots of time is spent on polishing a paper before submission, but doing that for code is generally not appreciated because nobody will see it on your CV.


I understand it fine. Like I said, it’s really not that hard to write pretty good code for prototypes. I'm not saying the code needs to be architected for scale or performance or whatever else needless expectation. I don't have a formal education in programming or computer science and write clean code just fine, as do some other non-computer science people I've worked with in advanced R&D contexts. And then some (many?) don't. It's not really about educational background, it's more about just being organized. Even when someone is "just" doing science, a lot of times, the code explicitly defines what's going on and has major impacts on the results. (Not to mention that plenty of people with computer science backgrounds write horrible code.)

If code is a large part of your scientific work, then it's just as important as someone who does optics keeping their optics table organized and designed well. If one is embarrassed by that, then too bad. Embarrassment is how we learn.

Lastly, you're describing problems with the academic world as if they are excuses. They're reasons but most people know the academic world is not perfect, especially with what is prioritized and essentially gamified.


I'm not making excuses, I'm just talking about the realities. For the significant majority of researchers I know, version control is still program_v1.py, program_v2.py, program_final.py, program_final2.py (and this is the good case; at least they are using Python), so clean code is still quite a way off. I teach PhD students and it's hard to convince even them, because they are just looking at how to get the next publishable result. For academics it becomes even more unrealistic; most don't even have time to go to the lab, they are typically just writing grants.

I'm actually a strong supporter of open science; I release my code as OSS (and actually clean it up) and my data (when possible). But just saying it's easy and there is no cost is setting the wrong expectations. Generally, unless you enjoy it, spending significant time on your code is not smart for your career. Hopefully things are changing, but it is a slow process.

Funny that you mention optics: I know some of the most productive people in my field, and their optical tables are an absolute mess (as is their code, btw). They don't work at universities, though.


I still think it's really not that hard, and actually, it's not even about code. It's really just about organization, and as you point out, not everyone is great at it. But messy optics tables, labs, or whatever do in fact cause problems: with efficiency, with knowing what "state" of supporting tools yielded what result, and with several other derivative things. My push would be that applying even a modicum of organization to supposedly ancillary things will go a long way, rather than accepting the mess as reality.

I understand the realities and have even been a part of PowerPoint development where slides are emailed back and forth. Sometimes one just has to go with things like that. But I have also seen the reality of stubbornness. I have tried introducing source code control to scientists, or even stuff like wikis, all supported and already up and running by IT and used by other groups. Scientists and engineers, especially those with PhDs, can be a bit resistant and set in their ways. I have been told flat out by different people that they wouldn't use tools like a wiki, or that Python and C were all they ever needed. I have even noticed physicists saying "codes" for software instead of "code". It's fairly rampant; I have seen it in research papers, published books, and in industry in person. I have never seen that use of "codes" anywhere else. That alone is evidence of a certain amount of culture and institutionalization of doing things incorrectly but viewing them as acceptable within the field.

I have written code in some research contexts. I get the constraints. One just needs to take it seriously. But organization, in general, often takes a back seat in basically any field. The only way to change things like this is, as with anything, to push against the culture.


I'm not a scientist so maybe I don't get it, but it seems like code could be analogized to a chemist's laboratory (roughly). If a chemist published pictures of their lab setup in their paper, and it turned out that they were using dirty glassware, paper cups instead of beakers, duct taping joints, etc etc, wouldn't that cast some doubt on their results? They would be getting "nitpicked" but would it be unfair? Maybe their result would be reproducible regardless of their sloppy practices, but until then I would be more skeptical than I would be of a clean and orderly lab.


MIT and BSD are established and well accepted licenses, literally named after the academic institutions where they originated. Licenses are legal documents; part of what makes them "explicit what everyone understands" is their legally recognized establishment.

If you want to set expectations, this can simply be done in a README. Putting this in a license makes no sense. Copyright licenses grant exceptions to copyright law. If you're adding something else to it, you're muddying the water, not making it better.


> Don't clean & organize!

FWIW I basically did this: My thesis numbers were run on a branch based on the unstable version of an upstream project that was going through a major refactoring. I took a tarball of the VCS tree at that point in time and posted it online. Over the years 3-4 people have asked for the tarball; nobody has ever come back to me with any more questions. I can only assume they gave up trying to make it work in despair.

I think I tried to build it a couple of years ago, and it wouldn't even build because gcc has gotten a lot more strict than it used to be (and the upstream project has -Werror, so any warnings break the compilation).

I think it's definitely worth doing, but I think you need to be realistic about how much impact that kind of "publishing" is really going to have.


> Matt Might has a solution for this that I love: Don't clean & organize! Release it under the CRAPL[0]

> "Generally, academic software is stapled together on a tight deadline; an expert user has to coerce it into running; and it's not pretty code. Academic code is about 'proof of concept.'"

This is brilliant!

edit: it seems that, aside from that great snippet from the text, the license itself isn't so great. another comment [1] has a great analysis of the actual license and suggests using a superior solution (copying the preamble part yet still using MIT/(A)GPL).

[1] https://news.ycombinator.com/item?id=29937180


You don't need an esoteric license, just use a standard license like MIT.

relevant section from MIT license:

"THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE."


lol didn't know about this license, amazing!


Hi, I'm a Research Software Engineer (a person who makes software that helps academics/researchers) at a university in the UK. My recommendation is that not only do you publish the code, you also mint a DOI (digital object identifier; Zenodo is usually the go-to place for that) for the specific version that was used in your paper, and you associate the two. And you include a citation file (GitHub supports them now: https://docs.github.com/en/repositories/managing-your-reposi...) in your software repo.

Benefits: people who want to reproduce your analysis can use exactly the right software, and people who want to build on your work can find the latest in your repo. Either way, they know how to cite your work correctly.

In practice drive-by nitpicking over coding style is not that common, particularly in (some) science fields where the other coders are all other scientists who don’t have strong views on it. Nitpicks can be easily ignored anyway.

BTW should you choose to publish, the Turing Way has a section on software licenses written for researchers: https://the-turing-way.netlify.app/reproducible-research/lic...


And I would suppose it will drive more citations. Which is a plus!


/me waves from Leeds Uni. Hello fellow RSE.


As a physician who writes pretty awful code, I like this comment.


I would encourage you not to use DOIs for software. They are not made for this, and they have limitations that make them a poor fit.

Instead, use Software Heritage (https://www.softwareheritage.org/): it provides unique identifiers and actually understands repositories, versioning, history, etc. It also allows you to cite the software and even gives proper, durable links to point to the code.


Why not just link to a specific commit hash? What more do they provide?


In my view and personal experience, the pros outweigh the cons:

* You increase the impact of your work and as a consequence also might get more citations.

* It's the right thing to do for open and reproducible research.

* You can get feedback and improve the method.

* You are still the expert on your own code. It's unlikely that someone picks it up, implements an idea you also had, and publishes before you.

* I never got comments like "you could organize the code better" and don't think researchers would tend to do this.

* Via the code you can get connected to groups you haven't worked with yet.

* It's great for your CV. Companies love applicants with open-source code.


> It's the right thing to do for open and reproducible research.

Everybody here talks about how publishing code helps (or even makes possible) reproducibility, but this is not true; on the contrary, it hinders it. Reproducing the results does not mean running the same code as the author and plotting the same figures as in the paper. This is trivial and good for nothing. Reproduction is other researchers independently reimplementing the code using only the published theory, and getting the same results. If the author publishes the code, no one will bother with this, and this is bad for science.


This is a common misconception, but you are actually talking about "replicability" which is "writing and then running new software based on the description of a computational model or method provided in the original publication, and obtaining results that are similar enough" [1]. Reproducibility instead refers to running the same code on the same data to get the same results [2].

[1] Rougier et al., "Sustainable computational science: the ReScience initiative" https://arxiv.org/abs/1707.04393

[2] Plesser, "Reproducibility vs. Replicability: A Brief History of a Confused Terminology" https://doi.org/10.3389/fninf.2017.00076


It's worth noting that the two provided references disagree on their definitions of reproducibility. Plesser quotes ACM as reproducibility being tested by a "Different team, different experimental setup", consistent with use in the GP's comment, but ultimately seems to favor ignoring "replicability" and instead using "reproducibility" in combination with various adjectives. The common understanding in the physical sciences is also that merely re-running the same code is not a test of reproducibility.


First, this is simply not true. Reproducibility refers to all forms: that the paper figures can be built from code and don't have errors, that a reimplementation as new code on the same data produces the same results, and that gathering new data (e.g. by conducting the same experiment again if possible, in other words replication) produces comparable results.

Second, publishing code helps make invisible decisions visible in a far better manner than the paper text does. Try as we might to imagine that every single researcher degree of freedom is baked into the text, it isn't and it never has been.

Third, errors do occur. They occur when running author code (stochasticity in models being inadequately pinned down, software version pinning, operating system -- I had a replication error stemming from round-towards-even behaviour implementation varying across platforms). If you have access to the code, then it's far easier to determine the source of the error. If the authors made a mistake cleaning data, having their code makes it easier to reproduce their results using their exact decisions net of the mistake you fix.
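
As a concrete illustration of how subtle that kind of thing gets (this is just documented Python behaviour, not anything specific to the papers above): Python's built-in round() rounds halves to the nearest even value, and binary floating point adds its own twist.

    # Python uses round-half-to-even ("banker's rounding"),
    # not the round-half-up many people expect.
    print(round(0.5), round(1.5), round(2.5))  # -> 0 2 2, not 1 2 3

    # And 2.675 is stored as slightly less than 2.675 in binary
    # floating point, so rounding to two places gives 2.67, not 2.68.
    print(round(2.675, 2))                     # -> 2.67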

Most papers don't get replicated or reproduced. Making code available makes it more likely that, at a minimum, grad students will mechanically reproduce the result and try to play around with design parameters. That's a win.

Source: Extensive personal work in open and transparent science, including in replication; have written software to make researcher choices more transparent and help enable easier replication; published a meta-analysis of several dozen papers that involved reproducing author results from author code, producing author results with a code reimplementation, and producing variant results -- each step was needed to ensure we were doing things right; a close friend of mine branched off into doing replications professionally for the World Bank and associated entities, so we swap war stories; and I always make replication code available for my own work.


In my experience, the published paper is super vague on the approach, and implementing it without further references is really hard. I'm not necessarily arguing that papers should get longer and more detailed to counter this; expressing the details that matter in code seems like a more natural way to communicate anyway.

Why trust results if you can't see the methodology in detail and apply the approach to your own data? I once knew somebody who built a fuzz tester for a compilers project, got ahold of a previous project's compiler code that won best paper awards, and discovered a bunch of implementation bugs that completely invalidated the results.

Why is the peer review process limited to a handful of people who probably don't have access to the code and data? If your work is on Github, anybody can come along and peer review it, probably in much more detail. And as a researcher, you don't get just one chance to respond to their feedback -- you can actually start a dialogue, which other people are free to join in.

As long as a project's README makes any sort of quality / maintenance expectations clear upfront, why not publish your code?


> In my experience, the published paper is super vague on the approach, and implementing it without further references is really hard.

This is my experience, too, and in my opinion this is exactly what has to change for really reproducible research, not ready-to-run software supplied by the author.

There are many good arguments in support of publishing code, but reproducibility is not one of them, that's all I'm saying.


And just like OP said, it generally takes a couple of months to go from paper to working code. I've implemented a few papers as code as a side-gig for a while, and I wouldn't mind having a baseline from the authors to check and see if I'm following the paper correctly!


I disagree for another reason. Having access to the code allows easy comparison. I did some research in grad school on a computational method, and there was a known best implementation in the literature. I reached out to the author and he kindly supplied me with the source code of his work. I wasn't trying to replicate his results; rather, I wanted to compare his results to my implementation's results in a variety of scenarios to see if I had improved over the other method.

And to the original author's credit, when I sent him a draft of my paper and code, he loved how such a simple approach outperformed his. I always felt that was the spirit of collaboration in science. If he hadn't supplied his code, I really would never have known how they performed unless I also fully implemented the other solution -- which really wasn't the point of the research at all.


Often the text in a paper that describes some algorithm will not be completely clear or complete. Providing the code fills in those blanks. I've taken a paper with code and gone through the code line by line, comparing it with what was described in the paper. The code often ends up clarifying ambiguities. In some cases there's an outright disagreement between the paper and the code - that's where the feedback comes in: ask the author about these disagreements; it will help them clarify their approach (both the text and the code).

> Reproducing the results does not mean running the same code as the author and plotting the same figures as in the paper.

Sure, to some extent. But the code does provide a baseline, a sanity check. People who are trying to reproduce results should (as I describe above) go through both the paper and the code with a fine tooth comb. The provided code should be considered a place to start. I'll often re-implement the algorithm in a different language using the paper and the provided code as a guide.


What are your thoughts on including pseudocode within the paper directly? It seems to clear up some of the ambiguities while adding brevity since it doesn't provide the somewhat frivolous details of implementation. I think it also limits some of the potential stylistic critiques.


It's not a bad idea to include pseudocode, but my pet peeve is that there's really no standard for pseudocode. I tend to write pseudocode that looks like python. I did that in an internal paper for a group I worked in a few years back and the manager of the group made me change it to look more like C (complete with for loops as: for(i=0; i<x; i++) which made no sense to me).


Oh, haha, then disregard my intuition that it would help avoid stylistic critiques :-)


> Reproducing the results does not mean running the same code as the author and plotting the same figures as in the paper. This is trivial and good for nothing.

I agree with this statement; however, I think you may have a misunderstanding about reproducing results. It's not about reproducing their graphs from their dataset, but rather about seeing whether their code reproduces the results on your (new) dataset.

Another way to think of it is that a research paper's Methodology section describes how to set up a laboratory environment to replicate the results. By extension, the laboratory for coding research IS the code. Thus, by releasing the code along with your paper, you are effectively saying "here is a direct copy of my laboratory for you to conduct your replication in".


I guess things are a spectrum. I've worked on research projects where understanding and developing the algorithm is the research. There isn't really an "input data set" other than a handful of parameters that are really nothing more than scale factors, and the output is how well the algorithm performs. So "setting up the laboratory" by cloning the code and building it is...fine, but a reimplementation of the algorithm with "the same" results (modulo floating point, language features, etc. etc.) aligns much better with reproducibility.


"Reproducing results" in a scientific context doesn't mean taking the original author's data and going through the same process. It usually means taking different data and going through the same process. Having code on hand to put new data through the same process makes that a lot easier.


"It's the right thing to do for open and reproducible research."

I think this is the most important reason to do it. Research code is not meant to be perfect, as another commenter said, but it can be instrumental in helping others, including non-academics, understand your research.

I think the sooner it's released the better (assuming you've published and you don't need to protect any IP). There's some great advice here: https://the-turing-way.netlify.app/reproducible-research/rep...


> It's great for your CV. Companies love applicants with open-source code.

While I strongly support sharing the code, I am not sure if this is a great reason to do so. Companies are made up of many individuals, and while some might appreciate what it takes to open source code, other individuals might judge the code without full context and think it is sloppy. My suggestion is that you fully explain the context before sharing code with companies.


> The paper itself is enough to reproduce all the results.

No, this is almost never the case. It should be. But it cannot really be. There are always more details in the code than in the paper.

Note that even the code itself might not be enough to reproduce the results. Many other things can matter, like the environment, software or library versions, the hardware, etc. Ideally you should also publish log files with all such information so people could try to use at least the same software and library versions.

And random seeds. Make sure this part is at least deterministic by specifying the seed explicitly (and make sure you have that in your log as well).

Unfortunately, in some cases (e.g. deep learning) your algorithm might not be deterministic anyway, so even in your own environment, you cannot exactly reproduce some result. So make sure it is reliable (e.g. w.r.t. different random seeds).
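
For example, here is a minimal sketch of the kind of run log I mean (the specific libraries are just placeholders; record whatever your experiment actually depends on):

    import json, platform, random, sys
    import numpy as np  # placeholder dependency; log whatever you actually import

    SEED = 12345        # the seed you report alongside the results
    random.seed(SEED)
    np.random.seed(SEED)

    run_info = {
        "python": sys.version,
        "platform": platform.platform(),
        "numpy": np.__version__,
        "seed": SEED,
    }
    # Publish this file with the results so others can match versions and seeds.
    with open("run_info.json", "w") as f:
        json.dump(run_info, f, indent=2)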

> In my field many scientists tend to not publish the code nor the data.

This is bad. But this should not be a reason that you follow this practice.

> clean and organize the code for publishing

This does not make sense. You should publish exactly the code as you used it. Not a restructured or cleaned up version. It should not be changed in any way. Otherwise you would also need to redo all your experiments to verify it is still doing the same.

Ok, if you did that as well, then ok. But this extra effort is really not needed. Sure it is nicer for others, but your hacky and crappy code is still infinitely better than no code at all.

> it will increase the surface for nitpicking and criticism

If there is no code at all, this is a much bigger criticism.

> publishing the code will be removing the competitive advantage

This is a strange take. Science is not about competing against other scientists. Science is about working together with other scientists to advance the state of the art. You should do everything to accelerate the process of advancement, not try to slow it down. If such behavior is common in your field of work, I would seriously consider changing fields.


I agree with almost all of this; however, I believe that publishing random seeds is dangerous in its own way.

Ideally, if your code has a random component (MCMC, bootstrapping, etc.), your results should hold up across many random seeds and runs. I don't care about reproducing the exact same figure you had; I want to reproduce your conclusions.

In a sense, when a laboratory experiment gets reproduced, you start off with a different “random state” (equipment, environment, experimenter - all these introduce random variance). We still expect the conclusions to reproduce. We should expect the same from “computational studies”.


The thing is, if you want to ignore someone's random seed, you can if it's provided. If it's not provided and you need it to chase down why something isn't working, you're SOL.

It's zero cost to include it.


I think being able to re-run code with a paper is great, but I think we should be sure to distinguish it from scientific replication.

When replicating physics or chemistry, you build fresh the relevant apparatus, demonstrating that the paper has sufficiently communicated the ideas and that the result is robust to the noise introduced not just by that "random state" you discuss but also to the variations from a trip through human communication.

I acknowledge that this is substantially an aside, but it's something I like to surface from time to time and this seemed a reasonable opportunity.


> And random seeds. Make sure this part is at least deterministic by specifying the seed explicitly (and make sure you have that in your log as well).

> Unfortunately, in some cases (e.g. deep learning) your algorithm might not be deterministic anyway, so even in your own environment, you cannot exactly reproduce some result. So make sure it is reliable (e.g. w.r.t. different random seeds).

Publishing the weights of a trained model allows verification (and reuse) of results even before going to the effort of reproducing it. This is especially useful when training the model is prohibitively expensive.


To some extent Science is a big project to understand how the universe works. We should hope to understand the phenomena that we investigate to the point where library versions and random seeds don't matter so much -- assuming the code is not buggy, and the statistics are well done, those factors shouldn't come into play.

However, sometimes chemists find out that the solvents they use to clean their beakers are leaving trace amounts of residue, which accidentally contribute to later reactions.

> Ideally you should also publish log files with all such information so people could try to use at least the same software and library versions.

looks to me like a result that requires borrowing a particular lab's set of beakers. Not what we're looking for.


What a great question. You've come to the right community.

My main concern would be to make sure there are no passwords or secret keys in the data, not how it looks.

You'll open yourself up for comments. They may be positive or negative. You'll only know how it pans out afterwards.

Is the code something that you'll want to improve on for further research? If so, publish it on GitHub. It opens the way for others to contribute and improve the code. Be sure to include a short README saying that you welcome PRs for code cleanup, etc. That way you can turn comments criticizing your code into a request for collaboration. It really separates helpful people from drive-by commenters.


> My main concern would be to make sure there are no passwords or secret keys in the data, not how it looks.

Worth mentioning specifically: If you make a git (et al) repository public, make sure there are no passwords or secret keys in the history of the repository either. Cleaning a repository history can be tricky, so if this is an issue, best to just publish a snapshot of the latest code (or make 100% sure you've invalidated all the credentials).


The brute force way around this is to remove the .git folder and re-init the git repo.
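
Something like this, if you want to script it (a rough sketch; it assumes git is on your PATH and that you really do want to throw away the entire history, so keep a backup somewhere):

    import shutil
    import subprocess

    shutil.rmtree(".git")  # drop the old history entirely

    # Start a fresh repository with a single commit of the current tree.
    subprocess.run(["git", "init"], check=True)
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", "initial public snapshot"], check=True)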

For my 2 cents I'd prefer to see sloppy code vs no code.

If you did something wrong, you did it wrong. Hopefully someone would put in a PR to fix it


If there is sensitive data to remove and the history is important to keep, then GitHub has some recommendations for scrubbing the history

https://docs.github.com/en/authentication/keeping-your-accou...


> My main concern would be to make sure there are no passwords or secret keys in the data, not how it looks.

Also personal data of any human subjects.


Is your goal to help advance the science and our general knowledge? Publish the code. You don’t even need to clean it up. Just publish. Don’t worry about coding style nitpicks. Having the code and data available actually protects you from claims of fabrication or unseen errors in hidden parts of your research.

On the other hand, if your goal is only to advance your own career and you want to inhibit others from operating in this space any more than necessary to publish (diminish your “competitive advantage”) then I guess you wouldn’t want to publish.


Not sure if posting the paper only is even the best move. I personally never work with papers that have no code published. It's just not worth the effort to reproduce them when I can use the second-best SOTA method for nearly no performance penalty and much less effort.

All the groundbreaking papers in deep learning in the last decade had code published. So if you're aiming for thousands of citations, you need code.


> All the groundbreaking papers in deep learning in the last decade had code published. So if you're aiming for thousands of citations, you need code.

I am in this field and I would say less than 10% of the top papers have code published by the authors, and those are most of the time another 0.1% improvement on ImageNet. The libraries that you generally use were likely recreated by others in the field. The code for a lot of the most interesting work, like AlphaZero/MuZero, GPT-3, etc., never comes out.


Can confirm.

I personally look at any paper without code with great suspicion. The reviewers certainly did not try to reproduce your results, and I have no guarantee that a paper without code has enough information for me to reproduce.

I always go for the papers with code provided.


As a reviewer I have reproduced results with my own independent plasma simulation code. And I have had a reviewer write in a report about my paper: "result X seemed strange, but I ran it with my code and it does it too. I don't know why, but the result is valid". In my opinion that is even better than just rerunning the same code.


Agreed, reproducibility helps a lot, and it is very easy to get details wrong when reimplementing. Having the source code is a big plus.


This is very domain specific. OP said it is not the norm to publish code in his field. I have a PhD and in my field it is the same. So much so that I can't think of any paper in my field that has code published. Therefore, a paper with no code would not be at a disadvantage.

Personally, it is a pet peeve I have about my field. But there is no incentive for a new researcher to publish code as it decreases barriers to entry. As much as it's nice to say that researching in academia is about progressing science, as a researcher, you are your own startup trying to make it (i.e., get tenure).


> it will increase the surface for nitpicking and criticism

Anyone who programs publicly (via streaming, blogging, open source) opens themselves up for criticism, and 90% of the time the criticism is extremely helpful (and the more brutally honest, the better).

I recall an Economist magazine author made their code public, and the top comments on here were about how awful the formatting was. The criticism wasn't unwarranted, and although harsh, would have helped the author improve. What wasn't stated in the comments is that by publishing their code, the author already placed themselves ahead of 95% of people in their position who wouldn't have had the courage to do so. In the long run, the author will get a lot better and much more confident (since they are at least more aware of any weaknesses).

I'd weigh up the benefits of constructive (and possibly a little unconstructive) criticism and the resulting steepening of your trajectory against whatever downsides you expect from giving away some of your competitive advantage.


Do you really mean 90% of the criticism is extremely helpful? Or did you mean 90% was useless.

I've published 100,000s of lines of code from my research over 20 years, and I think I've had exactly one useful comment from someone who wasn't a close collaborator I would have been sharing code with anyway.

I still believe research code should be shared, but don't do it because you will get useful feedback.


I've had the same experience, but I've only been publishing for 5 years so far. I still try to publish everything openly, but I do not expect any responses anymore. In none of my papers did the reviewers appear to have even looked at the Jupyter Notebooks I attached as HTML. The papers are cited, some more, some less, but there is no reaction towards the source code. I still don't regret publishing it.


Interesting. Are the unhelpful comments coming from academics or random peanut gallery folks?


Peanut gallery, in my experience. The number of people who I've never met before who decide to complain about hardcoded file paths or run a linter and tell me my paper must therefore be garbage is frustratingly high.

This seems to depend on a paper getting a modest amount of media traction. That seems to set off the group of people who want to complain about code online.


peanut gallery are likely to give stupid feedback. Academics are likely to ask for help using my code -- which is nice but doesn't (usually) contribute anything useful to me, and takes up time I could be spending on other things.


> 90% of the time the criticism is extremely helpful

Citation needed. I have rarely seen valuable feedback from random visitors from the internet.


This. Feedback (less loaded term than "criticism") is something you should want. You can obviously ignore tabs vs spaces types of comments but if your code takes 2 months to get right then it probably still has bugs in it after 2 months and it would be a win if others started finding them for you. Also, if the style is really that bad then it could be obscuring bugs that would otherwise be easy to spot (missing braces, etc), and you might find bugs while fixing it up.

ps always use an auto formatter/linter. I can't believe we ever used to live without them. So much time used to be wasted re-wrapping lines manually and we'd still get it wrong.
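
To make the formatter point concrete (a toy Python example of my own, not anyone's actual research code), this is the kind of cleanup you get for free:

    # Before: the sort of line research code accumulates under deadline pressure.
    def score(x,y,z): return (x+y)*z if x>0 else(  x-y )*z

    # After an auto-formatter (e.g. black): identical behaviour, but a misplaced
    # parenthesis or a missing branch is now much easier to spot.
    def score(x, y, z):
        return (x + y) * z if x > 0 else (x - y) * z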


Yes you should! And not only for ethical reasons (actually reproducible research, publicly financed work, etc), even if those are good enough by themselves.

I've always published my research code. Thanks to that, one of the tools I wrote during my PhD has been re-used by other researchers and we ended up writing a paper together! In my field it was quite a nice achievement to have a published paper without my advisor as a co-author even before my PhD defense (and it most likely counted a lot for me to get a tenured position shortly after).

The tool in question was finja, an automatic verifier/prover in OCaml for counter-measures against fault-injection attacks on asymmetric cryptosystems: https://pablo.rauzy.name/sensi/finja.html

My two most recently published papers also come with published code released as Python packages:

- SeseLab, which is a software platform for teaching physical attacks (the paper and the accompanying lab sheets are in French, sorry): https://pypi.org/project/seselab/

- THC (trustable homomorphic computation), which is a generic implementation of the modular extension scheme, a simple arithmetical idea allowing to verify the integrity of a delegated computation, including over homomorphically encrypted data: https://pypi.org/project/thc/


I agree. I published code that I used for my dissertation (more than 30 years ago). I think it led to thousands of citations.


Just do a super-minimal cleaning and upload to Zenodo or similar, then stick the DOI to the code and input/output files in your paper somewhere. 99% certain your reviewers will not bother to look at your code. 10 years from now someone new looking into the same topic gets a leg up. Don't feel obligated to update, clarify, or even think about the code ever again. If you want to build a community or something, then by all means go for github, but providing code along with your paper should be something automatic and quick, not adding an unwanted burden.


Two times I have published my research code - both times I have found that many other papers/projects plagiarized my work without giving me any credit. This happens way more than you would think, especially if you are working under a less-known advisor and at a less-known university.

As the other comment said, if you care about "advancing the science", and won't mind stuff like the above happening, then go for it. In my experience, it is not worth it.


> Two times I have published my research code - both times I have found that many other papers/projects plagiarized my work without giving me any credit. This happens way more than you would think, especially if you are working under a less-known advisor and at a less-known university.

This has been very much my experience.


I wonder how often it is the case that code isn't considered an academic product per se, and so free to use. May have to make it very explicit.


> I wonder how often it is the case that code isn't considered an academic product per se, and so free to use.

As an outsider with occasional glimpses into academia, I've observed a bimodal response:

On the one hand, code seems to be seen as more available for reuse in terms of copy-and-paste, particularly in the exploratory phase.

On the other hand, code reuse is somewhat less likely to trigger an acknowledgement and citation.

This unacknowledged copy-and-paste borrowing of code seems in turn to be one influential factor inhibiting the derived code from being shared in turn.

> May have to make it very explicit.

That certainly can't hurt. I've certainly seen that reusable tools aren't cited as often as they are used, which in turn inhibits efforts from being made by academics in the production and refinement of such tools (the citation being the primary currency and reward mechanism for publication in academia).

Explicitly asking for citations (and getting them) helps with that, though the more general issue of "releasing open source software isn't seen as 'publication' for academic status purposes such as being considered for tenure" remains a problem, though it varies considerably by field, institution, and department.

But if given some thought and extra effort, a researcher might actually get more papers (possibly with different sets of collaborators) out of roughly the same body of work, especially if they look more broadly for appropriate conferences and journals.

BTW, again as an outsider with occasional glimpses into academia, it seems to me that there are many vacant niches for cross-disciplinary journals and conferences focused on reusable assets such as datasets and tools necessary for research.


This is my experience as well. How did you find it out though?


While I publish in a field where making source code available is much more common, let me just make a couple of points:

* I have never had someone come back to criticize my code style. And if they do, so what? I'll block them and not think about it again. I don't need to get my feathers ruffled over this.

* Similarly, if someone's trying to replicate my results, and they fail, it's on them to contact me for help. After that it's on me to choose how much effort to put into helping them. But if they don't contact me, or if they don't put in a good faith effort to replicate the results, that's their problem. If they try to publish a failure to replicate without having done that, it's no more valid science than publishing bad science in the first place.

Overall, I think most people who stress about publishing code do so because they haven't done it before. I've personally only ever had good consequences from having done so (people picking up the code who would never have done anything with it if it weren't already open source).


> The paper itself is enough to reproduce all the results.

No, it isn't.

Reproducing the results means that you provide the code that you used so that people can reproduce it just by running "make" (or something similar). If you do not publish the code and the input data, your research is not reproducible and it should not be accepted in a modern, decent world.

It doesn't matter that your code is ugly. Nobody is going to look at it anyway. They are only going to call it. If the code is able to produce the results of the paper with the same input data, that's enough. If the code is not able to at least do that, this means that even you are not able to reproduce your own results. In that case, you shouldn't publish the paper yet.
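
To be concrete about what "just by running make" can mean in practice (a hypothetical sketch of my own; the file names and the analysis are placeholders, not anything from your paper), the whole thing can be as small as one entry-point script that a Makefile or a reader invokes directly:

    #!/usr/bin/env python3
    """reproduce.py -- regenerate the paper's tables/figures from the raw inputs.

    Hypothetical entry point: the idea is that `python reproduce.py` (or `make`,
    wrapping exactly this) is all a reader needs to run.
    """
    from pathlib import Path

    import numpy as np

    DATA = Path("data/raw_measurements.csv")  # shipped (or downloaded) with the code
    OUT = Path("results")

    # Fix every source of randomness so the published numbers come out identically.
    rng = np.random.default_rng(seed=0)


    def run_analysis(raw: np.ndarray) -> dict:
        # Placeholder for the paper's actual analysis: here, a trivial bootstrap
        # estimate of the mean and its standard error.
        means = [rng.choice(raw, size=raw.size, replace=True).mean() for _ in range(200)]
        return {"mean": float(raw.mean()), "bootstrap_se": float(np.std(means))}


    if __name__ == "__main__":
        OUT.mkdir(exist_ok=True)
        raw = np.loadtxt(DATA, delimiter=",").ravel()
        summary = run_analysis(raw)
        (OUT / "table1.txt").write_text(
            "\n".join(f"{name}: {value:.4f}" for name, value in summary.items())
        )
        print("wrote", OUT / "table1.txt")

The point is not the contents of the script, only that there is exactly one documented entry point that regenerates what the paper reports.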


Worth noting that this is a modernization of scientific tradition, which predates code. I did publish my code, but the bulk of my work was a series of manual steps. That was 30 years ago. The closest thing to a replication involved changes to my design. The science world at large is still coming up to speed on this.


Agreed! Even if your code is just a slight modification of some other well-known tutorial, reimplementing it means that I, as a scientist, have to reverse engineer your codebase from your text - which may not be as straightforward as expected. Remember that some researchers are not native English speakers, so there's the added complexity of translating your words into a readable format.


I always published all of my code/papers/source for my publications. I never made anything "revolutionary" but I still felt it was important to produce reproducible research, even if relatively insignificant.

This was kind of a change for my advisor who was definitely less interested in that aspect of research. I think this is an issue in academia and needs to change.

Also, ultimately if someone wants to copy and publish your work as their own it will be relatively easy to show that and the community as a whole will recognize it.

Also, for me it felt good when another student/researcher was aided by my work.

https://shankarkulumani.com/publications.html

You don't need to clean it up or make the code presentable. Everyone knows it's research grade code. Most important part is that you have the code in a state that you can reuse in the future for another publication.

I've been saved multiple times by being able to easily go back to decade old work and reproduce plots.


There's a lot of advice here, but very little data to support any of it. Since you're a scientist, why not take an experimental approach to answering this question? Publish your code, for one (selected) paper. Monitor (a) the download log, and (b) the emails you get related to your code.

I hypothesize that you will see some combination of three effects: (1) you will get lots of downloads (which means people are using your code, good work!), perhaps with lots of follow-up emails and perhaps not depending on what the code does; (2) you will get lots of emails from random nutjobs looking to pick holes in your work, and you will waste your time answering them; (3) you will get almost completely ignored.

Whatever the outcome, I think a lot of people would be interested in hearing about what you learn.


Having a polished public implementation can lead to a massive increase in the number of citations a paper receives, if it is really a useful system. Some of my papers I think would have received far fewer citations if I had not released the code. Of course, if it is a really niche area with only a handful of researchers, this may not be true.


Does your publication venue have an artifact review committee? That would be a good way to share your code and (redacted or anonymized) data. I'm in security/privacy research, and our venues recently started doing this. They serve as a quality check, labeling your artifacts from merely "submitted" to "results reproduced."

https://www.usenix.org/conference/usenixsecurity22/call-for-...

https://petsymposium.org/artifacts.php


Emphatically YES. Put the code on GitHub. It doesn't have to be perfect. Especially if it will take two months for someone to "get it right" from the paper. I've been involved in projects where we were trying to reproduce results from some paper both with code and without. The description of an algorithm in a paper can sometimes be unclear, often reading the code makes the description in the paper much clearer. In the cases where no code is provided it's that much harder to reproduce the results. You want to make it as easy as possible for others to reproduce your results - give them your code - put it in a github repo. If they spot discrepancies between the code and what's described in the paper, then all the better - you can use that feedback to improve both.

I'll add: I think that we need to change the mindset in academia about code. If code was involved in producing the results in the paper that code should be considered part of the paper and (at least) as important as the text of the paper. (Same for data)


I agree with you but have found the data is much harder because organizations are so sensitive to that risk. Even though it can be anonymized, many don’t even want to allow it.

Absurd anecdote: I had a coworker who tried to present healthcare data from their research. They attempted to anonymize it humorously with names and SSNs like "Mickey Mouse 123-45-678". They were told they couldn't share the data on the chance somebody might actually have those name/SSN combinations


In my field (bio) it's fairly common not to publish code, but it's becoming more common. Biologist's code is generally crappy and I think everyone understands that. The better developers are often valued for producing tools that are reliable and people can use and get lots of attention and citations for their tool papers.

The mathematicians and computer scientists I've worked with generally wrote more complicated code, but from a bugginess and maintainability standpoint I'm not sure it was any better. I had a mentor with an applied math degree who was extremely fond of one and two character variable names.

Just publish it. Unless your paper is a _BIG_DEAL_ barely anybody is going to look at it, and some people (hopefully the right people) will respect you for showing your work. I think I'm one of the few reviewers that actually try to run and maybe glance at the code for papers I review. In the papers I've reviewed I've never seen a comment that indicated any of the other reviewers even looked at it.


This is a question near to my heart. I'm not an academic but a practicing systems software engineer. A good chunk of my work is sourcing interesting academic ideas and trying to turn them into practically useful software. Papers that don't release their source code are often not as reproducible as the authors think. Perhaps there's a bug that the results depend on, perhaps the implementation is very specific to the context the software runs in, perhaps the paper gives _most_ of what you'd need to re-implement but the fine details are missing. I've seen them all.

In a very real sense, unless the paper has a result that's so compelling I can't ignore it, if there's no published source code -- even if it's an obvious prototype! -- I'll pass it by. I'm not alone in that in my line of work. Industry folks might also be more willing to accept prototype code than academic folks, I dunno.

Worth considering, I guess, if you're interested in your work crossing the academic/industry boundary smoothly.


The Carpentries[0] provide some great resources and training for academics who are interested in this. Check out Software Carpentry[1] and Data Carpentry[2] in particular.

Publishing research code is admirable, and in an ideal world everybody would publish their code and data. That said, we shouldn't pretend that there aren't tradeoffs. Time spent polishing your code to make it presentable is time not spent on other aspects of your research. Time spent developing software development skills is time that could be spent learning new research techniques (or whatever). Reproducible research is great, but it's certainly possible to take it too far at the expense of your productivity/career.

You should also take your own personality into account. If you're a perfectionist you might struggle to let yourself publish research-quality code rather than production-quality code and consequently over-allocate the time you spend prettying up your code.

[0] https://carpentries.org/

[1] https://software-carpentry.org/

[2] https://datacarpentry.org/


As someone making a career in academia, I recognize both pros and cons here, but I think that the pros far outweigh the cons. Essentially, I think the question is one of identity - do you want your reputation to be "This investigator is the kind of person whose code is always available"? I know that as I evaluate job applications, funding proposals, or papers, I weigh this reputation highly, and consider the opposite "This investigator is the kind of person that hesitates to share their code" to be a big red flag.

BUT, I have definitely encountered the situation where I read a paper, then looked at the associated code, and found that the exciting result was entirely because of a bug. The reputation, "This investigator is someone who does shoddy, error-prone work" is probably the worst possible one.


I am not a researcher in the sense that I'm not publishing papers, but I'm a consumer of research. Every day I can find the source code for the paper is a great day. Even if it's some language I don't use, I still have something to go off of. Often it's easier to read some code to understand the method than to read the paper itself. I'm used to reading code. I do it almost every day and I'm relatively proficient at it. I'm not very good at untangling academic language or having to read 30 years' worth of papers to get all the assumptions made in a paper.

As an example, I've found a paper that promises a method to do the very thing I want to accomplish. It's not too dense, but it skips a few crucial steps, and I've been working on coding the method for a year now (on and off, of course, but still for a long time). If the code were available it probably wouldn't take as long. The paper didn't mention that the code was available upon request, but it was implemented in a piece of software. I found it eventually, but it was a version from just before the feature I'm after was added. I tracked down the author and they were a great sport about cold emails but didn't have the source any more.

So yes, please publish the code. You don't have to clean it up. It worked for the paper — it's good enough. Even the most terrible code is immeasurably better than no code.


Publishing code could be nice, if for example your code has a commercial application and a company wants to use it, a reference implementation might be nice.

Reproducibility -- I dunno. A re-implementation seems better for reproducibility. The paper is the specific set of claims, not the code. If there are built-in assumptions in your code (or even subtle bugs that somehow make it 'work' better), then someone who "reproduces" by just running your code will also have these assumptions.

Coding time -- are you sure? Professional coders are pretty good. If you have, for example, taken the true academic path and written your code in FORTRAN, there's every chance that a professional could bang out a proof of concept in Python or C++ in like a week (really depends on the type of code -- Eigen and NumPy save you from a whole layer of tedium that BLAS and LAPACK 'helpfully' provide). Really good pseudocode might be more useful than your actual code.
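
To illustrate the Eigen/NumPy point (a toy sketch, not research code): a dense linear solve that needs explicit workspace and pivot bookkeeping when you call LAPACK directly from Fortran is a single call in NumPy, which hides all of that for you.

    import numpy as np

    # Solve A x = b. Under the hood this still calls LAPACK's gesv routine, but
    # NumPy handles the workspace arrays, pivot bookkeeping, and error codes.
    rng = np.random.default_rng(42)
    A = rng.standard_normal((5, 5))
    b = rng.standard_normal(5)

    x = np.linalg.solve(A, b)
    print(np.allclose(A @ x, b))  # True (up to floating-point tolerance)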

Another note -- personally I treat my code as essentially the IP of my advisor. (He eventually open sources most things anyway). But do check on the IP situation if you want to open source it yourself. If you are working as a research assistant, some or all of your code may belong to your University. They probably don't care, but it is better to have the conversation before angering them.


> Really good pseudocode might be more useful than your actual code

Hear hear! OP, if you go this route, treat your implementation as a practice run, and write out exactly how it works in pseudocode.

My 2 cents:

I think that hiring a (good) professional for a rework/reimplementation would be productive, but it would certainly run the risk of exposing errors in your work. If that's desirable or not depends on your timeline to publish, I guess.


To be frank, nobody cares about your code. I’d be shocked and flattered if anybody read any of the code I wrote during my PhD. Publish the code in its current state and move on. If people take the time to actually read and nit pick your code you’ll have succeeded.


> it will increase the surface for nitpicking and criticism

You're supposed to welcome criticism and 'nitpicking' as a scientist.


That’s a bit of a dismissive straw man. The quote was explicitly referring to nitpicking of things unrelated to the research. You intentionally snipped the very next three words “e.g. code style”. Contrary to your implication, it is not a scientist’s job to welcome any and all nitpicking and criticism, which is why there are professional moderated platforms for relevant science critique, as opposed to criticism.


Not from untrained randos on the internet. Signal noise ratio and prior of “not a nutjob” have to be high enough to offset the cost of lost focus.


This back-and-forth has happened before on HN. [0]

As I rambled at the time, [1] it seems to me (non-scientist) that publication norms are well behind the times. Researchers shouldn't be in a position to decide how much they graciously deign to disclose for independent review. If the scientific publication process permits researchers to withhold details they fear won't withstand independent review, that means it's failing to do its job.

[0] https://news.ycombinator.com/item?id=24261706

[1] https://news.ycombinator.com/item?id=24264376


There are about 100 comments saying the same thing already, but I would highly suggest publishing the code:

1. It gives your work more visibility. If there is an easy git clone route to reproducing your work, it offers a low-effort starting point for people to build upon your work, which means they are more likely to use it. Plus you get free citations from anyone who touches it.

2. There is no reason that people should be hoarding code in academia, and the only reason people do it now is a sort of prisoner's dilemma problem (the first person to publish their code had to start from scratch, so they feel possessive and let it die when they graduate). Every researcher who releases their code chips away at the problem and pushes the community to be more open with their code, which is intrinsically more efficient.

3. If you get lucky and the community adopts your code, it will be viewed very positively by any potential future career advancement committee as being 'the guy who wrote _x'.

4. When I started in academia I based my codebase on an existing publicly available code, which saved me a huge amount of time in my work. I built upon it (not expanding the base code, but using it as a module to integrate experimental measurements to the simulations tools I wrote from scratch) in my PhD and when I graduated I handed a virtualbox image with the whole mess (yay free code--wouldn't have been possible with nonfree code) off to my successors, people in new groups, etc which ended up being the base of an entire new research group at a different university. Every once in a while I get an email asking for help, and get a notification saying that someone cited the code.


Disclaimer: Not an academic, and whilst my undergrad thesis included code, it was so broken that when others saw it I had nothing to lose except my pride.

Personally, I would. Open source is a form of peer review, and if you're wanting to stand by your paper as peer-reviewable then I believe the code should be included in that. Generally speaking, I feel more researchers need to open up their code to peer review, because research code tends to not have the same robustness against mistakes (through coding convention as well as tests) as professional software development. I shudder to think how many papers have flawed results that no one realises and are just accepted, because no one can spare the effort rebuilding the code from scratch and without any prior reference in order to verify said results.

I don't think you need to clean it up. You're not competing for a coding elegance competition, but rather allowing someone to find bugs if they exist and point it out, just as they would peer reviewing your paper.

More cynically, spaghetti code probably helps as a defense against people ripping off your code, so if you're worried about your competitive advantage then not cleaning it up is a form of security through obscurity :)


Can you ask scientists who are very experienced in your field and successful in the career track that you want to be?

Separate from that, is there fairly new chatter in your field about reproducible science, publishing code and data, etc.? If so, what's the current thinking there about how valuable this is to collective science, and how that should affect the sometimes unfortunate conflicts of interest between career and science?


You're right, it is substantially more work to clean and organize the code for publishing. Being open about your work does make the attack surface much larger and more likely to be nitpicked, criticized, have an error found, etc.

But it is more honest. Whatever you think about the effort required to do this, there's value in honesty.

Here is an example of my own scientific work:

- paper [0]

- preprint [1]

- GitHub [2]

It certainly wasn't easy to get all of this done. But doing this can also be a guide for others. They get to see exactly what you've done so that they don't waste months on the exact implementation. They can see where maybe you've made some mistakes to avoid them. They can see so much of the implicit knowledge that is left out of your paper and learn from it. Your code isn't going to be perfect, but what paper is, either?

Everyone will be a critic, anyway, so make it easy to pick up criticism of the stuff you feel the least confident in and do better next time. You won't get better if no one sees your code.

[0]: https://cancerres.aacrjournals.org/content/81/23/5833

[1]: https://www.biorxiv.org/content/10.1101/2021.01.05.425333v2

[2]: https://github.com/LupienLab/3d-reorganization-prostate-canc...


Been there, done that. I published my doctoral research code [1] so that others could inspect, verify, replicate, extend, etc. YMMV, but the feedback I received from other researchers ranged from neutral to surprisingly positive (e.g. people using it in ways that pleasantly surprised me). But let me expand on my own experiences while developing that software, trying to figure out how to replicate the then-current state of the art.

At the time there were two widely used software packages for phylogenetic inference, PAUP* [2] and MrBayes [3]. The source code for MrBayes was available, and although at the time I had some pretty strong criticisms of the code structure, it was immensely valuable to my research, and I remain very grateful to its author for sharing the code. In contrast the PAUP* source was not available, and I struggled immensely to replicate some of its algorithms. As a case in point, I needed to compute the natural log of the gamma function with similar precision, but there was no documentation for how PAUP* did this. I eventually discovered that the PAUP* author had shared some of the low-level code with another project. Based on comments in that code I pulled the original references from the 60s literature and solved these problems that had plagued me for months in a matter of days. Now, from what I could see in that shared PAUP* code, I suspect that the PAUP* code is of very high quality. But the author significantly reduced his scientific impact by keeping the source to himself.

[1]: https://github.com/canonware/crux

[2]: https://paup.phylosolutions.com/

[3]: http://nbisweden.github.io/MrBayes/


Publish your code only after you have made the journal publications / conference papers. I have witnessed a researcher getting robbed of his work when another researcher took his almost-complete code from GitHub and submitted it to a journal faster.

Now both of the researchers have to be cited, but only one of them did the discovery work.


Releasing the code is the very least you should do to make your analysis reproducible. I would be surprised if it was possible to exactly reproduce the results from the paper alone.

From Heil et al. (https://www.nature.com/articles/s41592-021-01256-7):

> Documenting implementation details without making data, models and code publicly available and usable by other scientists does little to help future scientists attempting the same analyses and less to uncover biases. Authors can only report on biases they already know about, and without the data, models and code, other scientists will be unable to discover issues post hoc.

Even better would be to containerize all software dependencies and orchestrate the analysis with a workflow manager. The authors of the above paper refer to that as "gold standard reproducibility"
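
As a very rough illustration of the orchestration idea (a plain-Python stand-in I'm making up here, with hypothetical step scripts; a real setup would use a proper workflow manager such as Snakemake or Nextflow, plus a container image for the dependencies):

    import subprocess
    from pathlib import Path

    # Each step: (command to run, output file it must produce). A real workflow
    # manager adds dependency tracking, caching, and per-step container images;
    # this sketch only captures the "one ordered, scripted pipeline" idea.
    STEPS = [
        (["python", "01_clean_data.py"], Path("intermediate/clean.csv")),
        (["python", "02_fit_model.py"], Path("intermediate/model.pkl")),
        (["python", "03_make_figures.py"], Path("results/figure1.png")),
    ]

    for cmd, expected in STEPS:
        if expected.exists():
            print("skipping (already built):", expected)
            continue
        subprocess.run(cmd, check=True)
        if not expected.exists():
            raise RuntimeError(f"{cmd} did not produce {expected}")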


It's sort of funny when the pro list includes "better for science" and there's still a need for a con list. There should be a scientific equivalent to the Hippocratic oath; a lot of us laypeople imagine that scientists default to "good for science" and "ease and possibility of replication."


Yup.

CS scientific journals should make the bar much higher in that regard: no code? no publish. unless really good excuse.


I mean this is just unrealistic. There are plenty of valid reasons not to want to publish code. For example, you want to commercialize the product. As long as you are given a description of the system you can always go and reimplement it yourself if you are not being lazy.


So it really depends upon what you want out of your research career. Part of being a successful researcher is making an impact on the community. This involves producing works that the community finds useful. I've always looked at making my code available as another avenue to help increase the impact of my work. In my case, many more people have used my public codes than have ever read my papers.

You have limited time. I'd prioritize that time on what you think others will find useful.

Don't worry about ugly code. There are research codes with 1k+ stars on GitHub that are ugly. They have so many stars because people find them useful.

You absolutely don't have to publish your code, or anything else for that matter. Don't let the drive for impact on the community force you into working on something you're not interested in.

Congrats on your publication.


This whole mindset is so shockingly wrong from an academic perspective.

Research based on or involving code/models/algorithms should always be accompanied by a code drop. Nobody expects the code to be of good quality.

Everything else is not reproducible - and against the scientific codex (IMO).

I read so many papers that claim incredible results, wondering how they implemented their models in this particular simulator (close to impossible with only what is out there), only to find that there is just nothing to be found, anywhere. No repo, no models, no patch. NIL.

Sending an E-Mail? No response.

Further, anyone could just claim anything this way. Why bother doing any real work?

What if there is a small error in the code?

Wouldn't it be better to know that? In a scientific sense, searching for "the truth"?


The economic incentive of science is for your work to be replicated and cited. Not publishing the code and data means your work is harder to reuse for subsequent studies and will hurt citations.

If it's uncommon to release code then I'd doubt anyone in the peer review will review it.


>But on the other hand it's substantially more work to clean and organize the code for publishing

It's better than nothing, it also is the only way for others to reproduce your results. I am surprised you were not asked to do that by whatever journal you chose to publish your results.

>many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage

LOL, what!? What is this crap about "competitive advantage"? Are you privately funded? Then it's fine. If you're funded by public (i.e. government) money, you are (at least ethically) obliged to share your work with everybody.


One argument against publishing code is that maybe there's an error (or more) in your code, which validates your possibly mistaken theory, and forcing an independent reimplementation by others would uncover this problem.


That's a good point I'd not considered. I suppose it's ultimately a risk scientists/researchers take to not lead humanity down a broken path however I'm of the firm belief the truth will "out" and transparency almost always trumps opaqueness


The fact that there may be an error is an argument in favor of publishing the code.


Publishing the code can increase the number of people tinkering with it, and possibly debugging it, it's true. But people just going with it, start using it without looking into the details, and blindly trusting the author (i.e. being lazy) sounds pretty realistic, too.


I assume computer science results without published artifacts to be fake. When it's so easy to publish and run code, if the researcher can't even do that then I assume the code does not work and thus the results must be fabricated. If your work has trade secrets or something, research publication with peer review is the wrong way to distribute your results.

Computer engineering on novel systems is a bit harder, but a /complete/ spec of the system (enough for someone to precisely rebuild it) should be published in that case. Remote access on request to the prototype would be better.


> it's substantially more work to clean and organize the code for publishing

Regardless of whether or not you release the code, you should do this.

It’s so common for people to think that cleaning/refactoring/documenting code is a waste of time, but it’s exactly the opposite.

The point at which the code is working, but not yet polished is exactly the prime “teachable moment” for improving your skills as a programmer and for refining your knowledge of the domain the program solves for. (This is true no matter how skilled or knowledgeable you already are).

Your brain is perfectly primed to do this now, so don’t let that go to waste.


I'm currently a student getting a master's in computer science. In my experience, having the code available in research papers is rare, but useful. Many times I find myself struggling to understand how something can be implemented or, when presented with choices, choosing one when reading research papers. When the paper has the code published I am able to follow it better.

Some papers link to the code instead of including it. Maybe I'm just unlucky, but this usually leads to dead links (but that's a different topic altogether).


I work as a researcher and I try to publish full source code for all my publications. On the point of increasing surface for nitpicking, I agree in principle that's a risk, but in practice I have not experienced any such problems in my field. I am in a field of applied natural science where most researchers write terrible code, if any, and so I suppose there isn't much in the way of expectations of, or even a concept of, coding style.

There is a nice Perspective piece in Science from 2011 [1] touching on the question of cleaning up the code. It suggests basically the same thing as several of the comments in this thread: if you don't have time or motivation to clean up the code, don't.

"even incremental steps would be a vast improvement over the current situation. To this end, I propose the following steps (in order of increasing impact and cost) that individuals and the scientific community can take. First, anyone doing any computing in their research should publish their code. It does not have to be clean or beautiful (13), it just needs to be available. Even without the corresponding data, code can be very informative and can be used to check for problems as well as quickly translate ideas. ... The next step would be to publish a cleaned-up version of the code along with the data sets in a durable non-proprietary format."

[1] Peng (2011) Science 334 1126-1127 https://doi.org/10.1126/science.1213847


Well I'm no scientist, but the one thing I do know about science is that if it can't be replicated then it ain't really science. Seeing as how the code is pretty important for expediently replicating your work, it seems like publishing it in the interest of facilitating that replication takes precedence over any sort of "competitive advantage". If that sort of transparency ain't common in your field, then maybe this is your opportunity to lead by example and change that?


Publishing the code is great, as some of the questions a reader of your paper may have can only be answered by looking into the source code, as no paper has enough space to talk about all implementation details in a real-life complex system.

There is value in scrutinizing the code - not w.r.t. coding styles or standards but to discover bugs in the implementation, which are very common. Scientists are only human, and scientific software is less often checked by a second pair of eyes. There is also value in trying to replicate a study from scratch with a fresh implementation only from the details in the paper. Many conferences - for instance the European Conference on Information Retrieval (ECIR), Europe's largest scientific search technology conference - have a replication track just for replication papers, and these are often the most interesting/insightful papers. It occasionally happens that a result is not caused by what the authors think, but is merely an artifact of the implementation code. A very famous MIT researcher (not naming him or her here on purpose) fell into this trap in their Ph.D. thesis, but it can happen to anyone, really. Scientific results become objective knowledge as others solidify the body of knowledge by carrying out replications and arriving at the same results.

Whatever your decision about past code, going forward, if you plan to release all future research code, you will likely write better code in the first place, as you will constantly be aware that people will be looking at it, and that can only be a good thing.


First, you should be proud of yourself for striving to do "the right thing".

In the field I follow the most (Computer Graphics/Rendering) I think there is a big problem with reproducibility as well, and to be honest, I think some of the major players actually have little interest in making this significantly better, since they can take advantage of the visibility of a flashy render/fps counter shown at an event while still building a "moat" between them and others that want to adopt the same methods.

Which is in the end partly an answer to your question: your paper could clearly describe all the elements needed to implement a method correctly, but by providing a sample implementation you allow others to "stand" on your shoulders, as they say, instead of having to climb there first and then proceed. You needn't worry too much about the state of your codebase; make it clear via README/documentation/license that it's still in the "proof of concept" phase.

One reasonable observation I have heard is that in some fields, during peer review, some reviewers seem to like to nitpick on the code rather than the paper, sometimes in subtle ways. Because of that, I think it can be (unfortunately) OK to release the code after acceptance or publication. But apart from this, I see only advantages.


Many conferences are starting to adopt a badge system and will evaluate your artifact. And this is becoming more and more popular, and I know many researchers that will keep these badges in mind when reading the evaluation in the paper. For example here is the artifact evaluation that was done at SOSP 2021 https://sysartifacts.github.io/sosp2021/results.html.


These badges are kinda controversial. The message they send out is "we give you an extra goodie if you do proper science, because we don't expect that to be the default".

Thus badges can become a kinda excuse for not fixing stuff by default.


You've gotten a ton of feedback already, but: Please do! Don't try to make it perfect. Just publish it. As the saying goes: "The perfect is the enemy of the good." Release the code you used to get YOUR results. You can always improve it later if it turns out people end up really interested in it.

(FWIW, I'm a professor at an R1 university. I give this advice to all of my Ph.D. students and strongly, strongly encourage them to put their code out there on our github.)


There are a lot of difficult questions posed here on HN but this is not one of them: unequivocally, you should publish the code.

It is better for science, it will be better for you and it will be better for people who want to play with your code.

Publishing is a form of advertising what you did, and helping others reproduce it makes it go viral and is a testament to how much they care. It can only help your career.

You’ll definitely get people who nitpick the code. This won’t hurt and it may even help in its own way.


Disclaimer: I'm not an academic. I cannot possibly speak to the possible benefits and implications of this from an academic point of view. Like there might only be downside to doing this. I don't know and don't pretend to know.

As an outsider looking in, many academic fields seem to have a reproducibility crisis. Many psychological studies, for example, cannot be reproduced yet they continue to be cited.

I personally feel like every academic paper should be reproducible. I should be able to email you the study and you should get the same results. Obviously clinical trials may vary (and thus the importance of statistical significance) but the real problem is data and models. If I, as someone reading your study, don't have your data, how can it possibly be reproduced? If I gather my own data will I get completely different results? If I'm solely relying on what details you give, how do I know you haven't made a fatal assumption or even just buggy code with your model?

I personally feel like a condition of all Federal funding should be that the data and any code should be made freely available.

So I support the idea of releasing it and that releasing something messy is better than releasing nothing but I can't speak to your individual circumstances.


Unambiguously, yes. If possible, release it using some sort of open source license, and grab a DOI for the initial and any subsequent release of the code - you can use Zenodo or some other tool for this.

I left the academic world a few years ago, but several of the analysis codes/models I published (either as stand-alone tools or artifacts published alongside a journal article) still regularly get used... if anything, there's probably a larger user base for one of my models today than there ever has been, and it's leading to a long-tail of publications where my initial work is either cited or I'm offered co-authorship when I have time to offer hands-on support for improving the model/code and offering my insight as a domain expert.

If you can take the time to clean up some code or author a lightweight package, that's amazing! But it's a bang-for-your-buck type thing. If you ever aspire to leave academia, it's undoubtedly worth spending some time to clean up the code, add documentation, add some unit tests, etc - great artifacts to use in supporting a hiring process if you move into a technical role somewhere in industry. But is far from necessary.


I highly recommend adding it. It doesn't have to be exposed, but is super useful for anyone who will want to reproduce or build upon your work later.

You can embed this in the PDF; e.g., see section A.1 [1] for how.

[1]: https://raw.githubusercontent.com/motiejus/wm/main/mj-msc-fu...


Publish it on GitHub (or GitLab or your code hosting service of choice).

Then answer any criticism about it by asking for a PR.

To preempt code style complaints find a code formatter for your language and run everything through that first.

Refer to the repository in your paper, but don't put a link. Create a little bit of friction to get to the repo to discourage the casual readers who don't really need the code from popping over too easily.


Research is not a zero-sum game, it’s about bringing useful knowledge to everyone. If releasing your code aids in understanding what you’re contributing, then by all means please contribute! You don’t even need to spend a ton of time cleaning it up. Releasing it “as is” is fine. Hoarding your code as a defense against criticism goes against the entire purpose of open academic work.


> The paper itself is enough to reproduce all the results.

I've heard this claim so many times, from many an author who had their brain so deep in the problem they were working on that they were 100% incapable of properly gauging the validity of this claim.

To verify that what you claim is true, wait two years to give your brain time to flush the context, pick your research paper up (and nothing else that wasn't made available to others) and try to reproduce the results on a brand new computer without any of the environment your developed your research with.

See how much blood you end up sweating.

PLEASE publish your research code. Don't worry about it being disgusting and hackish, it's research code, so by definition, no one expects it to be industrial strength.

Don't spend time cleaning it up either, your time is better spent on doing more research.

If you feel responsibility towards the community:

- put a huge disclaimer at the start of the README explaining what a mess the whole thing is *because* it's research code.

- if you really must: list requirements and provide a build.sh


The trend is in the direction of requiring open code and data. There's been a big movement that direction in economics, and most fields will likely also move that way, so it's more a question of whether you should do it now or in the future.

For the journal I edit, authors are required to include the code and data with the submission. The code and data are available along with the paper if it's published. We do replication audits of some papers to make sure you can take the materials they've included and reproduce every result in the paper. If not, the conditional acceptance changes to rejection. I've had cases where reviewers found errors in the code, so I rejected the paper.

On the argument that it's a competitive advantage: what does that mean? You should be able to claim results but not show where they came from? That's not science.

Keep in mind that this is a "source available" requirement, not an open source requirement. It is a matter of transparency. You have to let others see exactly what you did.


The code should be published, and knowing this, researchers will hopefully try to avoid certain commonly harmful practices. One of these is re-using the same script to run slightly different models by editing some of the hard-coded parameters. I myself have found more than one mistake in someone else's reported results due to this sort of thing. But identifying it was quite a bit of trouble because the record of what was run was erased when they moved on to the next model.
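
A minimal sketch of the alternative (a hypothetical script of my own, not anyone's actual code): take the varying parameters on the command line and write them out next to the results, so the record of each run survives.

    import argparse
    import json
    import time

    # Hypothetical model runner: instead of editing hard-coded constants between
    # runs, take them as arguments and log them alongside the output.
    parser = argparse.ArgumentParser(description="Fit one model configuration.")
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--n-iterations", type=int, default=1000)
    parser.add_argument("--outfile", default=None)
    args = parser.parse_args()

    result = {"params": vars(args), "timestamp": time.time()}
    # ... run the actual model here and add its metrics to `result` ...

    outfile = args.outfile or f"run_{int(result['timestamp'])}.json"
    with open(outfile, "w") as fh:
        json.dump(result, fh, indent=2)
    print("logged run to", outfile)

Each results file then carries the exact parameters that produced it.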

What I would not expect from people is code that would necessarily run in your environment. For example, in many cases, the paths are going to be hard-coded, for a variety of reasons. It might be ideal to write code that will just work, in a reproducible environment, but that often takes more work than people are willing to commit to, given all the other things they have to do.

Finally, cleaning up your code for presentation is a final opportunity for you to discover any mistakes before you publish and then later have an embarrassing public retraction.


I'd only clean it of stuff like passwords and such, and add a header that the code is provided as-is.

You could add a disclaimer that the code was worked on until it provided a satisfactory result, and no further, and is not intended for (any) use. You might even add that, except for outright, actual errors that affect the result of the research, comments are discouraged.

I often publish very bad code, terrible terrible spaghetti, it's not how I write code at my job, because at my job, I'm paid to produce not only working and correct code, but also code that is maintainable and understandable and follows certain practices.

However, my hobby is not writing corporate code, but writing code that gets done what I want to get done, nothing more, and sometimes less. It might even have actual bugs in it that I can plainly see and don't care about because they don't affect my uses.

If people can't tell the difference, I don't care, not my problem. If a future employer can't tell the difference, I won't work with them.


At the end of the day the impact and perceived quality of your research correlates to how peer reviewed it is, and how reproducible it is. Everything necessary to reproduce your research should be published, including the code. However, if you publish cleaned up versions of your code, that isn't the code you used to do your research.

I suggest publishing the code as is on something such as Github, Gitlab, etc. I suspect you have ideas on how you can improve the code, perhaps there's even a way of improving your research methodology by doing so, enabling new insights with further research. If you did a follow up experiment with improved analysis enabled by your improved code, then that's another paper, and another (more cleaned up) version of the code to push to the repository.

The above is all supposition though, as I don't know your field. If deep learning then the above seems more likely. If your field is geology, then improvements in the software might not enable better insights.


Depends on the climate of the field you're in, and where you're at in your career. There are fields where entire research groups routinely harvest preliminary ideas from graduate student publications, and then finish them and rush to publication before the student realizes what's happened.

I'd say a grad student owes nobody anything until they finish, because they're bearing the greatest risk of losing priority, and the openness of science is being used against them. Nothing is lost by waiting until they have their degree in the bag before sharing. Then clean it up and use it as part of your portfolio. Or append it to your thesis. Advancing science after you've secured your career is a fair compromise.

I love open source and open science, but also look back on my own graduate studies, and I chose a topic that was protected by virtue of a large capital investment plus domain knowledge that was not represented by code. Also, my thesis predates widespread use of the Internet. ;-)


> There are fields where entire research groups routinely harvest preliminary ideas from graduate student publications, and then finish them and rush to publication before the student realizes what's happened.

Can you provide a source, or example of this? What does the Amazon of academia look like?


Biology, and synthetic chemistry. Unfortunately all anecdotal. I live near a major research university, have lots of friends who are involved at all levels, and relatives who are even closer to it. It tends to be in areas that require minimal capital investment to pivot into a new study. Also, the student pursuing the original idea is hampered by their own emerging skills. "My student's thesis just got scooped" is something that every professor has experienced or knows about.

My field, physics, much harder. Building my experiment required a bunch of expensive equipment (maybe half a million in today's dollars), gear that I built myself, the technique of operating it, and so forth.

My career, much harder. I work in business. You learn about my ideas when a patent comes out. ;-)


I can attest to this. I myself am a victim of this. My undergraduate thesis was plagiarized by two other papers. The code was 80% the same; they just added some trivial things. No citation of my work at all.

Look at my other comment for more explanation - if you are working under a less-known advisor, or at a less-known university, there is a high chance that this will happen if your work is good.


Write to the journal they published in and call them out.


This kind of thing is best done after the thesis is in the bag. A student is racing against the clock. Grad study has many kinds of hard failures. At the most extreme, your advisor could up and die. The focus has to be on finishing. That's how you get out.


Matter of fact, I did, even with the help of my advisor. The journal did not take any action (it is a Q1 open access journal), since they came up with all kinds of mental gymnastics for why it is not copied (which boiled down to it NOT being 100% the same).

That was for the first occurrence. For the 2nd one, we just did not bother because it hurts my advisor's reputation as well. It is not in the interest of the journal to admit the mistake once they have made it -- they will fight you about it and try to keep their reputation/image up.


Would gpl have helped?


1. It might seem to you like the paper is enough to reproduce from, but in my field, more often than not it isn't. Hell, even a software version change produces different results with the same params and seed. So don't exclude the possibility that you're biased.

2. Code IS a competitive advantage. Sometimes you'll reach out to the author to ask for clarification. And after some back and forth they'll just suggest you send them the data and put them on the paper, because they don't really want to disclose the details of the method they've previously published.

3. I don’t think you’ll have issues if you share less than perfect code. Most reviewers are as bad at production code as you are.

All in all, I think sharing code advances science. Yes, there’s gatekeeping, tricks to keep the knowledge inside the lab. But didn’t you choose the field because you want to advance the knowledge, help humankind? Making your research more reproducible by sharing the source code is a step in that direction.


Consider that "cleaning and organizing" your code means that it is no longer the code that actually produced the results in your paper!

The fact that your code is a mess means that it might be buggy; if other people can see your code, someone might find a bug in it. As you said, this is a good thing for open science, and makes your work easier to reproduce.


As always I may be wrong, but the (admittedly very few) times I find an article/paper based on or revolving around code that is interesting/useful for some purpose, I read the "code is available on request" (or similar) as the (in)famous Fermat's Last Theorem note: Hanc marginis exiguitas non caperet ("this margin is too narrow to contain it").

Nowadays margins are large enough and cost nothing or next to nothing, and you probably don't have any other use for your code, so what would be the advantage for you in not publishing it?

What kind of competitive advantage does it give to you? (what many scientists think might not be as relevant as what you think about this "competitive advantage" specifically in your specific case/field)

About "cleaning it", why?

I mean, if as-is it works (but it is "ugly") it still works, what if in the process of "cleaning it" you manage to introduce a bug of some kind?

Unless you plan to also re-test it after the cleaning, I guess it would be better to not clean it at all.


> What kind of competitive advantage does it give to you?

For every paper introducing the revolutionary Algorithm X, there are a bunch of follow-up papers like "Algorithm X applied to self-driving cars", "Algorithm X applied to smartphones", "Algorithm X with some tweaks that provide marginal improvements", "Algorithm X but using consumer-grade hardware" and so on.

If every other lab has to spend several months to replicate your first paper, you and your colleagues can spam out the follow-up papers before anyone else can catch up. This makes your publication count go up.

Other means for achieving similar effects include delaying the publication of your code, or releasing undocumented spaghetti-code with missing dependencies and entirely comprised of one-character variable names.

Of course, this stuff comes at a cost: making it harder for people to use your work makes them less likely to use your work. So it might be better for your citation count to release the code - and in any case, who goes into research hoping their ideas will be ignored?


NDA requires you to share data several months after reporting it. But in many cases, data collection has not even been completed by then. Theoretically someone could scoop you by analyzing your data before all data collection is completed (e.g. N=100 instead of N=120). I'd think that would be career suicide if it were found out, but the risk of it happening doesn't exactly provide much of an incentive to make it any easier on them.


My co-authors and I argued here https://www.nature.com/articles/nature10836?proof=t%2Btarget... for open computer code in science.

"Scientific communication relies on evidence that cannot be entirely included in publications, but the rise of computational science has added a new layer of inaccessibility. Although it is now accepted that data should be made available on request, the current regulations regarding the availability of software are inconsistent. We argue that, with some exceptions, anything less than the release of source programs is intolerable for results that depend on computation. The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, but withholding code increases the chances that efforts to reproduce results will fail."


Publishing replication data and code increases the impact and citation rate of published work. For a literature review, see https://osf.io/kgnva/wiki/Open%20Science%20Literature/#Open_...


In my experience, typically researchers only publish the relevant and/or core algorithms of their research. If you would like, you can always publish the code to Github (if it isn't already), and reference it in your paper.

If it is too much work to refactor the code for publishing, you can also just publish pseudocode.

I don't think anyone will nitpick or criticize coding style or things like that unless it is particularly egregious (i.e. naming variables something vulgar, etc.). The point of research papers is to communicate new and valuable findings. If people in this conference or journal are nitpicking things like that, you may want to find a different place to submit your work.

I don't know what your field is, but in Computer Science I can't say I have ever known people to consider their code a competitive advantage. The only time they might shy from releasing code is when they think they can commercialize it or something.


I would agree with many others here who say publish it. In some fields there is an additional question of where to host it, lest your paper's impact outlast the lifetime of your current GitHub repo or whatever. There are good solutions out there. Assuming you are at a university it's worth having a chat with a librarian.


Yes, the university library is where I would try to find a publishing solution that could remain after I am dead and all my online accounts expire.

GitHub is not a viable solution; Microsoft can not be trusted to keep these important cultural artifacts safe and accessible in the near future.


I think you probably already have the answers you were seeking (yes, ideally you should) but I'd like to add some points:

Ideally, it would be nice if the code has a professional-level quality to it, but I think everyone involved in evaluating research understands that it is at best a prototype. Proper software engineering is expensive, and it is not the role of research to do this. The process, as it was explained to me: university research pushes the state of the art, industrial research labs are slightly behind this and looking to transfer into practical uses (along with this some government agencies are interested in tech transfer) and finally software engineering takes these ideas and turns them into actual products. You aren't making a product, so it is OK for the code not to be perfect (also, from experience, 'professional' industry code is not always that great either). The main point is that someone has some chance of reproducing your results.

The exception to this is if you are making a product, where the definition of product is a tool for further research. Examples might be tools for symbolic execution or formal verification, in which case it might be worth some time to make the experience of using it good for that benefit, to reduce friction so that people try and want to use your tool.

Artefact evaluation is rapidly becoming something people are encouraged to do and helps enormously in verifying results, but the point is usually to try to reproduce the results of the paper to back up the science, not to start an argument over coding style. I would hope that artefact evaluation processes make this clear and ensure that evaluations of artefacts focus on reproducibility. For outside comments that might arise, I suggest you publish the work as open source and respond to any criticisms with a fairly standard line: yes, this is research-quality code and we would like to have time to improve it; if you would like to submit a patch/pull request, we would welcome any help.


Though you don't mention this particular issue, it often comes up, and as someone who used to work as a DoD research scientist, I will say this: I think academics are largely under the impression that they should be worried about people "taking their idea" and building something amazing with it without compensating you in some way. In reality, it is vanishingly rare that a published paper gets used for anything, by anyone, and it is even rarer by an additional order of magnitude that someone successfully tries to use something without consulting the author and/or trying to bring them along. You are the expert on the thing you have made, so if someone sees massive potential in it, they will likely bring you along. Publishing some quick and dirty research code that is able to reproduce the results of the paper can only help you in the long run.

If you want real protection of course you can always try to get a patent, but then I've got you because 90% of the people I have this conversation with are worried about people stealing their idea but don't think it is patent-worthy.

A similar analogue exists in startups: ideas are really a dime a dozen. Execution is what matters. There are millions of great startup ideas floating around -- I bet almost anyone could come up with at least a few that are viable -- but actually having the follow-through and dedication to execute that idea, that is what is challenging. I can't tell you how many people I've had calls with where the exchange is basically "I want your thoughts on this amazing idea but you have to sign an NDA first". 90% of the time these people aren't willing to go all-in on their idea and stake their career on it (hence them seeking second opinions), so it makes no sense for them to worry about me "stealing" their half-baked, unrealized idea. I say to them "would you take $3M in interest-free debt to develop this idea right now" and they say "no!" to which I say "then why should I sign an NDA?"


I've posted a huge amount of academic code (I've linked to a small number at the end). I think you should, but it won't help advance your career immediately. However, I still think it's better for science.

What is useful is if you can produce code people can build on and do their own cool stuff with -- then they will cite you. However, getting something to a state where it is tested for all reasonable inputs, has some basic docs, etc. is a hard undertaking.

https://github.com/minion/minion (C++ constraint solver)

https://github.com/stacs-cp/demystify (Python puzzle solver)

https://github.com/peal/vole (Rust group theory solver)


Thanks, agreed. Small note: it is not clear what Minion is doing, from just visiting the github repo. Perhaps add "C++ constraint solver" in the github description, but it is still unclear: it could be a rigid body constraint solver for games? Maybe add a link to a paper?


Yes, I should practice making things more accessible :)

In practice Minion is generally used as a backend to Conjure ( https://conjure.readthedocs.io/en/latest/ ), which provides a much nicer input language.


Thanks, I was not familiar with Conjure or constraint programming in general. I haven't seen it in real-time applications for games or robotics (usually highly optimized domain-specific constraint solvers are used there, for rigid bodies, fluid sim, cloth, deformables, etc.).


Depends on the field. Let's assume you are not in math/CS/physics.

What can go "wrong":

- Someone may find a minor rounding error and now you have to issue a correction to the paper which, laudable as it is, is a bad thing

- You'll end up having to maintain an open-source-something and possibly forks

- Your open source code may end up as a GitHub repo in which you are just one of the contributors, not the owner, and others are leeching credit from you

- People who want to criticise you will find excuses in the coding style.

Research code is messy -- it must be messy, imho, or else it's probably insignificant. People who don't publish it are definitely shielded by the obscurity, while I have received scrutiny for entirely inconsequential details. You can choose to publish it in a less accessible way, which will thwart people with bad intentions. Even publishing it as a tarball on a web server is enough work to keep them away.


Isn't it worse for everyone if the paper but not the code is published and somebody doesn't find the error?

The world suffers for acting off of untrue information, and eventually the author suffers for having been wrong all this time.


I agree, and of course it is. But the underlying issue here is that paper production is the wrong way to do science; there simply isn't enough attention to validate all those papers.


That's a good point. However, it suggests to me that one should publish the code for an additional reason:

1) if we acknowledge that paper production is the wrong way to do science, perhaps papers with attached published code are a step in the right direction. A critical mass of papers that are easy to reproduce because anyone can execute the code attached with them could, hypothetically, come to dominate the zeitgeist and push out papers with grandiose but unverified claims.


Here is an incentive they could use: if a paper publishes the code it used, the authors can be allowed to skip the methods section about it. Describing code with words is at best awkward and usually error-prone, as people forget to update the text.

To be clear, in my field the journals require publishing the code, but in my experience the code that gets reused gets more scrutiny, often without benefit to the researcher.


I published some of my academic code, like a tool for simulating superconducting circuits [1] and a tool to manage lab instruments for quantum computing (or other) experiments [2]. It's super niche, but both tools have found users in other labs that even keep developing them (at least for [2]). And it's nice to look at your code after 10 years and realize how much you've grown as a programmer :)

[1]: https://github.com/adewes/superconductor [2]: https://github.com/adewes/pyview https://github.com/adewes/python-qubit-setup


There is a reason most free-software licenses have the following clauses:

"THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE."

So just pick one that is compatible with the 3rd party code you used to write your software (mostly pertaining to copyleft licenses like the GPL) - MIT and BSD licenses are generally "fine" - and just publish it. Just because your code is not "clean" or whatever doesn't preclude it from being free.


As a fellow scientist I would say go for it. I know people who had a vast amount of citations (>2000) for a paper accompanying a code/program release that they made at an opportune moment (they released a code for designing/analysing photonic crystals just when the field was taking off).

Now in the vast majority of cases you will only get a couple of people looking at your code (my experience so far), but still I think it's worth it. The question is, clean up the code or not. Ideally you would, because it increases the chance of someone using it by a lot. On the other hand with the realities of academic work, this is largely underappreciated.

So I recommend finding a balance: clean up enough that it is reasonably straightforward to run the code, and write a good readme that points to the paper and gives the appropriate citation.


Besides all the pros already mentioned, there is a high chance other researchers will use your code and cite you. If they have to compare their results with some previous state of the art, it will be the one with available code. The whole thing about “the paper is enough to reproduce” never happens, ever.


Source Code or it Didn't Happen.

Science that is not reproducible is not science.

If you can, publish something high-level. Matlab or Python or Julia is fine. C or Java, not so much, because the build environment will not be available any longer after a few years. Actually, if you can, publish several translations.

And don't forget to publish your data sets as well. And your data augmentation or whatever. Everything you need to reproduce your results.

And for the love of Knuth, DO NOT OPTIMIZE YOUR CODE. Dumb code is good code in science. You would not believe what kinds of havoc some algorithms wreaked on my systems in the name of optimization. Optimizations that made a ten-year-old algorithm run in two nanoseconds instead of four (vastly exaggerated). Optimizations that obfuscated otherwise perfectly reasonable algorithms.

The goal is reproducibility.


Be prepared for a metric crapton of crushing silence when you release your code. But do release your code.


FWIW, this is how I've released the crappy barely-working "academic quality" code for a paper in the past:

https://github.com/DarwinAwardWinner/cd4-histone-paper-code

The main points are that I made only a minimal attempt to organize it, and I made the state of the code clear in the README. I don't recall anyone complaining about the code or even mentioning it during review. (Though to be fair, I also don't recall whether I published the code before or after the paper was accepted.)

Looking at things from the other side, I'm at least an order of magnitude more likely to read, use the work/methods from, and therefore cite a paper that comes with code.


Yes, absolutely.

You should have confidence in the correctness of your code if you are publishing.

If your code is a shitshow, why do you trust it? Decent code is to your own advantage even if no one else ever looks at it.

In the best case, it’s possible to build a community around your code, to wide benefit and your career benefit. I’ve seen this with several peers and students.

As a hiring manager, it's very nice indeed to read a paper and scan the code of a fresh grad applicant.

My lab’s approach is to put the repo in public and put the hash of the relevant commit in the paper. Then you can keep developing there but readers can be confident they can get the exact code used to justify the claims in the paper.
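
In concrete terms, that workflow is roughly the following sketch (the tag and branch names are just illustrative, not a convention we prescribe):

    git commit -am "code as used for the experiments in the paper"
    git rev-parse HEAD        # copy this commit hash into the paper
    git tag -a paper-v1 -m "exact code behind the paper"
    git push origin main --tags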

An exception is if you plan to make a company around your IP. You should estimate how likely this is to happen before defaulting to this.


In some sense, the way you phrase your question shows how broken the incentives in science are.

The obvious answer for science is: publish. The goal of science should be to make it easy for others to reproduce your work. Not to make it theoretically possible, but hard, because of the "competitive advantage".

The right thing to do would be to publish, and the next time you review another paper that does not publish code, use that as a reason to reject it. The whole "code and data upon request" is obvious bullshit; there have been studies on it, and often enough it ends up with "well, we don't have that code/data any more", "why do you need that? We won't help you if you plan to publish something we don't like", etc.


It would be nice if everybody would publish code for their papers. But in a field where most people don't do it, releasing your code will probably not be beneficial for you, due to the loss of the competitive advantage. I know that for people with a CS background this sounds weird, but it is the reality in academia.

In your position, I would only release code which is not too hard to reproduce anyway or which only provides negligible competitive advantage for you. I mainly have "normal" papers in mind (experiments or data analysis) - if the main contribution is, for example, an algorithm which you want people to use, then you should publish an implementation, obviously.


> The paper itself is enough to reproduce all the results.

Every researcher thinks this, and it's always wrong. If you care about scientific progress, publish the code and data.

Besides, available code should cause more people to look at your work and ultimately cite it.


>> Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

That "competitive advantage" is just holding everyone back, slowing progress. This is particularly annoying to hear coming from "research", which I thought was supposed to be advancing the state of the art for the benefit of society. That's ostensibly the reason for publishing papers, right, to disseminate knowledge? Or is it really just to increase one's ego and get paid?

Not saying you should publish code, just that deliberately keeping secrets in your field seems to go against what I thought you were doing.


Agreed. Perhaps there is some competition in citations for follow-up work; releasing source code makes it easier to get 'scooped' on your future-plans section? (Not saying that optimizing for citations is a good thing.)


What is the purpose of doing research?

If the purpose is to push human knowledge forward, then it seems backwards not to publish everything.

Personally, I've found it difficult in my various careers to date when I've been put in positions where the actions that serve my immediate interests are in any way in conflict with my underlying principles or overarching goals. It's demotivating and deflating.

If I were in your position, I would publish everything and let myself feel pride in what I did. Even if we're all just insignificant specks in the grand scheme of things, pursuing a greater purpose can help make it feel like something matters.


Like my CV and AI professor said: there were hundreds of papers claiming to achieve faster or more accurate results than Viola-Jones, but they never published data sets or code, so no one believed them and all were forgotten.


If you are for open science (https://en.wikipedia.org/wiki/Open_science_data), go ahead and publish it. Would you ever publish the code on some Git platform? If you would, this would be the equivalent. A lot of researchers don't want to give their data to the public, but by locking up their data they are just making it harder for others to confirm or improve their findings. I guess sometimes there are legal issues behind that, and sometimes it is pure ego.


Here's a good example. Fisher's iris flower data was released with his work in a 1936 paper, where it was used as an example of his discriminant analysis. This data set has been used over and over to show examples of cluster analysis and segmentation, and many statistics teachers use it in their curriculum. You never know where the research could lead to growth and development in a field.

https://en.wikipedia.org/wiki/Iris_flower_data_set


> Here's a good example. Fisher's iris flower data was released with his work in a 1936 paper, where it was used as an example of his discriminant analysis. This data set has been used over and over to show examples of cluster analysis and segmentation, and many statistics teachers use it in their curriculum. You never know where the research could lead to growth and development in a field.

You raise a good tangential point:

Releasing a data set can be just as useful as releasing code, and every bit as necessary to reproducing results.

Moreover, reproducing a well-curated dataset can be just as prohibitive in terms of time and expense.

How many papers have reused datasets such as ImageNet, Celeb-A, etc. in recent years?

By all means, release your datasets even if you don't release your code.


> it's substantially more work to clean and organize the code for publishing, it will increase the surface for nitpicking and criticism (e.g. coding style, etc).

It should take a couple of hours. The code works? You know how to reproduce what you did, right? It doesn't need to be perfect. It doesn't even need to pass code review. It just needs to work.

> many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

Well depends on the field I guess, but you also want recognition and impact. What is the point of publishing a result no one uses?


If your field is not embracing open source yet, you should go for it ASAP. I believe the field will eventually recognize the benefits and move towards that, and the sooner you do, the larger the impact you will make.


> The paper itself is enough to reproduce all the results.

Unlikely. Following the algorithm from scratch may produce "similar" results, but it will not reproduce them, bugs and all. The only thing that can do that is your code.

Plus, when you set out to reproduce a paper from only the algorithmic description, it's typically not until you're 2 or 3 weeks into coding that you realise the original paper made many assumptions in the code that were not explicitly stated in the paper.

> However, the implementation can easily take two months of work to get it right.

An even more important reason why you should release your code.

> In my field many scientists tend to not publish the code nor the data.

A regrettable state of affairs indeed.

> They would mostly write a note that code and data are available upon request.

I have personally come across many cases where this promise could no longer be honoured by the time of the request. Publish the code.

> I can see the pros of publishing the code as it's obviously better for open science and it makes the manuscript more solid and easier for anyone trying to replicate the work.

It is also increasingly a requirement from funding bodies.

> But on the other hand it's substantially more work to clean and organize the code for publishing

Then don't. Release it under the CRAPL, stating as much. It is still better than nothing.

> it will increase the surface for nitpicking and criticism (e.g. coding style, etc).

If you were an entrepreneur hoping to peddle snake oil and not get found out, then I would see your point. But you're a scientist, you're supposed to welcome such criticism and opportunities for improvement. If anything, you might even get collaborations / more publications on the basis of improving on that code.

> Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

I would sincerely not feel very comfortable calling such people "scientists".


I enjoyed working with R and RStudio to keep the code and the published PDF always under version control, and I found that the reproducible-research community added readers, feedback, and interest from outside my narrowly defined subject area. I didn't enjoy that frequent package updates would break everything. What kind of coding ecosystem does your work exist in?


OP, you shouldn't worry about the state of your code. There could be criticism, but I don't think there's anything that's public and not criticized. A horrible thing that's open source is much better than something that's not. The only real things to consider here are the type of license and weighing the competitive advantage you're talking about. For the license, sites like this[0] can help.

[0] https://choosealicense.com/


(A) Publish your code as is, so the code is the actual code used in the paper.

> But on the other hand it's substantially more work to clean and organize the code for publishing

(B) Don't spend time cleaning code for publishing. Spend your time writing more papers.

> it will increase the surface for nitpicking and criticism (e.g. coding style, etc).

(C) Don't worry about this.

> Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

(D) If you do B, it will also reduce your worries about this. I am half joking.


Personally, as someone who's had to dig through academic code in the field of bioinformatics, I do appreciate code being attached to the paper, regardless of the paper's level of detail or the code's quality (or lack thereof). I don't think many researchers expect high quality code unless you're releasing a library explicitly for general use and expect contributors. That said, a brief README with at least instructions to execute is a rare but welcome addition in my personal opinion.


Please publish. Even if you are filing some patents or working on some other commercial licensing, you could publish under a source-available license. Just yesterday I saw in the 3D printing Reddit that some academics had developed an interesting approach to segmenting large panel sections and posted a paper. A number of people are interested in trying it, but no source code seems to be published - so I just moved on, as I'd have to take the trouble to reimplement the paper even just to try it.


Please do! Maybe the paper is technically enough to reproduce the results but if other researchers can start from a working example, they can both verify your results and extend them with more original research far faster.

While any published code receives some nitpicking and bikeshedding, most academic code is terrible so unless you literally use random joke/meme variable names as your only 'documentation' (I wish I were joking) you're not going to look bad to anyone who matters.


Yes, you should. Just publishing it as it is would be enough. Everybody understands that academic code is pretty experimental, and nobody will judge whether or not it is pretty. The reason you should publish it is to gain trust. Back when I was doing my PhD I found several instances of papers with results that were nearly impossible to reproduce, to the point that I sometimes believed they were just fakes. I am pretty sure that in most cases that is not the case, but...


It seems to me that if it would take two months to replicate the actual mechanism, you're doing the world a favor by publishing what that two months of work resulted in.

If you want to do the world a further favor, get a grad student to read it first and indicate where they cannot follow the code. In my brief stint in academia, I saw very little overlap between brilliant theoreticians coming up with novel approaches and code to support them, and people who knew how to write readable code.


This is the wrong forum to ask this question because the audience here is mostly in favor of open disclosure of information and open source licensing of code, which always comes at a cost to someone. For example, publishing your code may have significant impact on whether or not you can obtain ownership protection of your inventions/discoveries. If you are interested in protecting these interests, then you should consult with an intellectual property lawyer.


Let me say that if you do decide to release it it's not just scientists and academics who can stand to benefit. Chances are your paper is less approachable to those outside academia and your code would be easier to understand for an engineer. I would honestly encourage all researchers to publish their code on that basis. You don't have to clean it up or write any scripts to help build it. Just attach what you have and I second the idea to use the CRAPL license!


What benefit would you receive if you publish your code? Will that give you some privilege or earn you more money and/or more reputation?

If the answer to the above is no, and it will mostly cost you time and effort, then don't publish.

If the answer to the above is yes, then consider the return on investment for publishing your code. If you earn more reputation/money/whatever by publishing than what you expend on the work of publishing, then publish; if not, then don't.


> Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

This is probably wrong, depending on the field. At least in machine learning, the papers that get cited the most are those that other people can easily pick up and work on. They become the basis for future work, get cited as baselines more often, etc. Publishing research ML code is a competitive advantage.


As someone who has had to reproduce others' research results: it is much much better to release the unclean, unorganized code that actually produced your results than it is to release nothing. Even if it doesn't run (e.g. it depends on a hardware system that the user won't have access to) it's still better for people to be able to read your code and understand some tricky part that isn't fully explained in your paper.


After finding a mistake in a paper, having to fix it, and then publishing my code, I’ve found other people contact me for the fix rather than the author of the paper. I would recommend publishing the code rather than assuming your paper is bug free and complete.

Similarly, I’ve found papers that don’t include their complete data set in the paper, and had to try to reverse engineer it from images and so on. It is really frustrating when papers are incomplete.


> as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

I wish it wasn't viewed as a competition in the first place.


Yes, absolutely, and don't worry about how the code looks. So long as someone else can download your code and run it without issue, you're good to go. I've worked on multiple computational neuroscience papers and pushed to have the code published alongside each paper in every case. Not once has it come back to bite us, and if anything, it seems to get us significantly more citations.

Do it. There's no good reason not to.


In undergrad I was so grateful whenever a CS paper came with code. It helped me learn and comprehend so much better and I always wanted to thank the person who did it (sometimes I even did if their email was there :)).

You might be doing a young student a solid :D And don’t worry about cleaning it up!

If you use GitHub you could even disable Issues and have a note saying you don’t accept pull requests (in case you’re worried about support burden).


You could add the code as a PDF attachment, so it's available directly with the paper, but somewhat "hidden". That also answers the question about where to host it.

https://alltamedia.com/2014/04/14/how-to-make-a-link-or-butt...


Honestly, put your code out, and version control it. Benefits:

- People who use your work will cite you.

- You may get collaborators.

- It's an easy-to-get-to backup

- For non-academic jobs, it's part of your resume


If you think you've got something that will give you a competitive advantage, seize that advantage. Otherwise, it will be no more than source code on a thumb drive that you'll eventually forget the encryption password to, or gets damaged during a move, or is lost when you stop paying for your cloud storage membership, is lost when you re-partition the wrong part of your hard drive, whatever.


That you were allowed to publish the paper without the code is the core problem here.

You shouldn't even be able to ask this question. The journal should have required you to publish the code first, or along with the paper.

Unfortunately, the number of journals that do this is still small, and even the ones that do are sometimes satisfied with a "code can be obtained upon request".


Many journals now require relevant code to be published. Those journals that don't are likely to be lower impact journals, but also are probably moving towards requiring the relevant code to be published. The reviewers are likely to complain about the code not being available, so you can defeat one review hurdle by publishing it. It's generally better for science if you publish it.


Unless you are publishing a software methods paper, you don’t have to worry about cleaning the code or making it portable. In my field, publishing code (and data) is a requirement and has been for years. That doesn’t mean that the code needs to be pretty ( it usually isn’t), it just needs to support the paper.

So, yes. Please publish the code, it will make the rest of the paper stronger.


You could also paint a picture no one else will ever see.

Personally, I hate it when academics do not publish their code. Some academics publish the code but not the pretrained model, or withhold the dataset, leaving it to collect dust on their computer.

People who publish code, datasets and models become the core building blocks of future work. People who don't fade away; no one remembers their names.


In a field where it's never been so easy to replicate an experiment, I really wish more people would make their code available. It can be liberating to put your code out there, sort of setting it free, and you'll be much more forgiving of other people's code quality in the future.

The negatives are overestimated; it is unlikely many people will read the code.


This one depends on both the field you are in as well as your own academic philosophy OP.

If the paper is enough to reproduce the results AND cleaning up the code can/is tedious, then adding the "code and data are available upon request" note seems both fair and justified.

That way, whoever wants the code can still ask for it and it does not lay an unnecessary burden on the author.


“Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.”

While I appreciate this is true, it’s also quite sad. Science shouldn’t be a competitive sport to increase a couple metrics like publications and citations such that useful parts of replicating and extending studies aren’t shared. :(


> Besides, many scientists look at code as a competitive advantage so in this case publishing the code will be removing the competitive advantage.

One of my most cited papers is a relatively uninteresting one we wrote for a conference competition. But we have code so it is easy to compare your alternative approach to us. That means citations.

So it can work for your benefit as well.


Publish the code or attach it as a listing, so that in 10, 15 years someone who finds your paper can find the code, too. When everything is "hot" and "live" it can be easy to reach out and get something, but when you're digging through papers and code that have been abandoned for decades, it's nice to find source.


Yes, do it! I did it and you'll not receive criticism; that's just anxiety talking. Be clear about what the code is, why it exists, and that you may not maintain it, as it is just a proof-of-concept implementation. Most good humans understand that not every researcher is the one perfect coder. The bad ones are too busy arguing with others to even notice.


My take on this is that some code is 10X better than no code.

There have been times when I've had to abandon incorporating an idea presented in a research paper because the paper doesn't have enough information for me to implement it in code. I could've made a lot of progress with some proof of concept code, even if it wasn't clean.


Your code should have exactly the same license and distribution as your paper. Anyone who tells you different is simply wrong.

If you published a paper that uses information from the code then yes you absolutely must publish your code. Otherwise you're contributing to the decline of science via the opaqueness of papers and irreproducibility problem.


My advice is to leave industry standard formatting and style arguments to engineers.

If people want great code that runs easily and is easy to read, that's engineering work, built off the back of novel implementations.

If people want novel implementations that are likely rough around the edges and require a bit of finagling to run, leave that to the scientists.


Publishing the code does have some selfish benefits too: better chance of people building on your research (and citing it).


It is at least as likely that they'll take your code, integrate it into their own research, and never mention you at any point in the process. So you have to be OK with that.


Publish it.

Put a huge note in the readme that this is research code and only licensed for non-commercial use.

Put a note on your personal homepage that you're available to hire as a research consultant for $1000 per day.

Companies who like your research will put 1+1 together. A friend of mine got hired straight out of university at a very competitive salary with this approach.


I'm a scientist too, in computational chemistry. To me, releasing the research code that accompanies a paper is an imperative. Increasingly, journals or individual (peer) reviewers demand it. It's essential for reproducibility. I consider the work that goes into making the code releasable simply part of the job.


> But on the other hand it's substantially more work to clean and organize the code for publishing

Make sure it's all safe to publish but don't spend any effort on organizing it, unless you can find some grant money for an undergrad to work on it.

If it has users they will contribute their changes to better organize it and use it.


There are scientific venues where the focus is the software architecture and the software product.

I previously published in one of them (SoftwareX, by her majesty Elsevier the Evil), and I wish there were more venues that could bring value and recognition to the pieces of code we develop in research for other purposes.


Yes definitely, and turn it into an open source project. You'll get more citations that way too. The Debian science team can probably help with some of the process.

https://wiki.debian.org/DebianScience


Here is my checklist for publishing research code:

1. Is it encumbered by pending or active patent(s)?

2. Does it need a release of proprietary holds from corporate partners or participants?

3. Does it have any tangible market value worth pursuing? If so, keep it to yourself.

4. Any conflict with trademarks, copyrights, or domain holds? Rename it.

Those are just some of the points. Contact your local VCs if it has any traction.


If you publish your paper with code, you'll get more citations I would assume. When I look at research papers, one of the first things I look at is code and/or data availability. It would be even better if it's easy to run though and that's definitely not always the case.


> it will increase the surface for nitpicking and criticism

This is unfortunate. In one of my articles I linked to my GitHub repo where I had implemented the algorithm in C. One of my reviewers complained that I had used C instead of C++. It's probably advisable not to publish code before peer review.


Publish it; if it's interesting, people will clean and improve it. That's the beauty of open source.


Yes, please, the state of affairs currently is that it's impossible to get code, data, and pretty much anything besides the actual paper.

To me at least it sends a signal of people hiding stuff. That's not good. It has made me distrust some papers in the past. I tried to reach out, with no success.


In the past I've chosen to publish key algorithms. Publishing your entire code base can create a substantial demand for your support. As an open source project supported by one person, that can be very demanding.

So identify what's most critical or novel about your work and publish that.


I absolutely LOVE research that has code released with it. Just because then I can quickly explore the code and play around + tinker with it.

Like others have said, research code isn't meant to be production quality code so I wouldn't worry about "quality" in that way.


My 2 cents: you should publish the code as you used it in your research, so that it's possible to review your code. If there is a bug in your code, that could impact your results, and that problem would be much harder to find/reproduce without your source code.


TL;DR YES!!!

Frankly any paper which can't offer the basics of reproducibility is adding to the current problems across many fields.

Real-world data may be restricted by copyright, so be careful if this applies. If it does, consider publishing with some MC data demonstrating how things are supposed to work. (You did verify your code's behaviour, didn't you?)

Don't clean and organise code for publishing. It is a tool; it is not 100% perfect, but it is supposed to work. Unfortunately, after years in the field, sometimes the correct response to nitpicking is "I don't care".

This is the trap between "writing code that was intended to give an answer" and "code that was intended to be re-used by others". Scientists often write code that fits into the former and this code should be published (in-case of mistakes and in the interests of reproducibility). But this code should never be taken to be of the quality that it should be built on by others unless this was the express intent. People who mistake it for that haven't understood the point of the work the author is engaging in.

With regard to the license, I tend to use DWTFYWWI or just the GPL, but frankly you can pick some wonderfully closed thing if you think your code might revolutionise something, which in principle stops commercial entities from ripping it off directly.


Having written a couple papers myself, I think it's entirely fair to release the code as is. Code written for research purposes is obviously not suitable for production but can still serve as a great tool for others to understand and build upon your work.


Balaji[0] goes into a long thread on why reproducibility is important. [0] https://twitter.com/balajis/status/1337620554971410434


Yes, you should absolutely publish it. I wrote a paper about modelling radio wave propagation through the ionosphere; all my code for it is on my GitHub. The reason you should is simple: you are providing proof that your numbers aren't just made up.


Genome Research made me publish the code used for the data analysis, requiring a zip of the repo for archiving.

The thing is that I was required to provide a way to reproduce the results, so obfuscated and/or uncommented code was not a problem. I provided clean code anyway.


Given the enormous amount of papers that come out, I personally tend to read papers that come with code (and data) first.

For me, it shows the authors are confident yet also open to critique. Which is a wonderful thing.

Secondly, I usually need the code to really understand the paper.


If you're proud of what you achieved, then publish the code; but if you fear it's not good enough and would rather stay under the guise of the ambiguity of "imagine what could be", don't publish the code. Publishing the code is the lesser evil.


I've written a few paragraphs on the topic. Maybe it's helpful: https://nymity.ch/book/#make-your-code-public


Depends if you want the public to be able to apply your research or if you want to keep the "competitive advantage" to yourself. If your research was funded by public grant money, then I think you owe it to the public.


An excellent paper on this issue here: https://aclanthology.org/J08-3010.pdf

Agree with other comments on CRAPL, but you should release it.


Release the code as-is. It's alright if it's not clean and organized, research code is usually crappy code (no offense given).

Worst case scenario, it will end up in a star-less github repo that nobody reads.


Publish it under AGPL, after your paper has been accepted. If criticism surfaces after your paper has been published, great, you can now write a paper about your V2.

Science progresses by criticism, after all.


Yes, you should publish. Mainly because it will give you a sense of accomplishing something; also, nobody really cares :), especially about old code. If they do, that's even better.


I personally do not trust research that does not have reasonably polished publicly available code behind it.

A strong result isn’t just the final number, it’s also the process how you arrived there.


Yes, you should publish it. Don't bother cleaning it up if you don't feel like it. No one will judge you for the code quality.

Published terrible code is far better than unpublished code.


Definitely publish! Check out the Journal of Open Source Software - https://joss.theoj.org/


Publish the code.

If someone has comments about style ask them to improve it for you.

Worry about maintaining things after someone asks for maintenance, the vast majority of code is never read again.


I don't know your field, but personally when I read a paper the code makes things 100x clearer and resolves my questions. Are you afraid people will use your code?


Publishing it gives an additional advantage that people are more likely to use and cite your work over a peer's who has a similar paper but no code.


You absolutely should. Papers should always have reproducible code; otherwise there is no practical use for the community. Crap is better than nothing.


I don't think you have to do any work at all on it if you don't want to. Just release it and let people fork it, let it grow.


Publish the code. At worst no one will look at it, at best you will draw more attention to your work and maybe get some good tips.


Was the research funded with public money? If so, then the public interest would be a reason to publish the research code.


Why not? It's a loss to humanity's progress if all researchers make it difficult to find the code and data.


The code is more important than the paper


How do you even know it works if you haven't been able to create a working implementation as the author?


If you publish it, please make sure it works today, and can work tomorrow. Pin the versions of any dependencies, or bundle them if feasible.
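
For a Python project, for example, one blunt but workable way to pin everything, assuming pip is the package manager:

    pip freeze > requirements.txt   # records the exact versions you actually ran with
    # a reader can then recreate the environment with:
    pip install -r requirements.txt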

Also, include basic instructions for running your code.

I helped my wife with a replication study that should have been straightforward, and I was unable to get the code running after about a week. I don’t necessarily believe the research was suspect, but broken code does draw more suspicion.


I disagree that this is a requirement. It is a strong nice-to-have, but even if the code isn't maintained or runnable on other systems, it is incredibly valuable as a reference when reading through ambiguous sections of a paper, to go look at what the author was trying to do.

Will it take some work to reproduce the results with broken code? Sure, but it's better to have the code than not. Cleaning it up might be the straw that prevents that code from getting released at all.


Nice to have. But it is often the case that it is difficult to avoid things like hard-coded paths, system-specific environment variables, etc. that simply won't translate to another environment. Most of the time, those things don't interfere with comprehension.


> please make sure it works today, and can work tomorrow.

This would fall in the "nice to have but not required" column.

Publishing messy code that's hard to run is a million times better than no code at all.


You'll probably get more references if you have code, which will probably help your research career.


Most code in the world is crap code, so don't worry; putting it out there lets people make it better.


If public money paid for you to develop it, make the source public under a liberal open-source license.



Can you provide a description of what the software does and what language(s) it uses?


Well the saying is "publish or perish" so I would definitely choose publish.


I have published research code a couple of times. I did this out of principle because I believe in knowledge sharing and collaborative science.

But to be honest, I am truly underwhelmed by the response. For several papers I created Jupyter notebooks that reproduce every single figure in the paper. It has been a huge amount of work. But even though the papers with code are cited reasonably often, I've been getting only minimal feedback.

So it's really difficult to judge whether properly preparing the code is worth the effort.

On the other hand, I have run into several papers that turned out not to be reproducible without the code. Chances are that these particular papers would not have been reproducible with the code either :D (there were just too many things not adding up). But it would have saved us a lot of time if the code had been available.

Tl;dr: make the code available, but don't invest too much time in polishing it. Hardly anyone is going to thank you.

One exception: if you want to impress future employers, polishing code is worth it. A good portfolio on GitHub can open doors.


Maybe do it and see what happens. If something bad then don’t do it again…


Check with your admins if you are actually allowed to publish it.


Publish the code as is and move on.


Yes you should publish your code.


Yes, absolutely. Next question?


Absolutely, yes. The other comments here have some fantastic reasons for doing this, and several do a good job of weighing the pros vs cons.

The paper alone is almost never enough to fully reproduce the result. I've been bitten by this almost every time I've tried to implement someone else's computational model. It comes down to this: relying only on your paper to explain your code leaves a LOT of room for errors. I've experienced all of these when trying to implement someone else's computational work without their code being published:

    1. Despite your best efforts, you include fundamental, result-breaking typos in the equations you write up to explain the math of what you're doing. This WILL happen to you at some point in your career, and in my experience, it's a problem in >>50% of computational modeling papers.
    2. There are assumptions in the logic of the code that you don't include in the writeup, since they're obvious to you, but you don't realize that someone else trying to understand your paper won't necessarily be starting with those same assumptions. This happens frequently with neural models that use complicated synapse-computation schemes.
    3. Your codebase may be big enough that you think code part X works a certain kind of way from memory, but you forget that you changed the logic late in the project to work in a different way.
    4. Publishing your code at the time of publication prevents "Which version did I use?" problems. It's very common for people to continue to work on their science code for new work, but they don't bother to save/tag a SPECIFIC version of their code that was used for the actual paper. The result is that even the author doesn't know what exact values were used for the results in the paper!
Any "competitive advantage" has to be weighed versus "positive exposure". If your code is the primary research object (as opposed to the data), then it's technically possible that someone may grab your code, extend it to do the next, interesting use of it, and then scoop you before you can do it yourself. However, even if this happens (which it probably won't), consider the following:

    1. You can't build a successful career out of just small extensions to the same piece of code, and so that codebase won't be the main kernel of your career, but rather your understanding of it.
    2. For every 1 person that tries to use that to scoop you, IMHO there's going to be at least 10 other people who see your code and reach out to you for help with it, or just to ask a question about it, or reach out for potential collaboration! In other words, depending on the field, if you publish the code, I think you're likely to gain new/future collaborators at a MUCH faster rate than people who compete against you. You'll be surprised at how many researchers on the other side of the planet are interested in your software!
    3. Even if someone scoops you with your own code, if they give any indication it came from you, you still get to count that as a publication that built off of your software work when you're applying to jobs :)
    4. At least with US federal government funding, it's gradually becoming required to do this anyways, and I believe/hope that it's going to become the standard anyways very soon.
Finally, don't fret about polishing/cleaning/organizing the code, especially style. For others trying to reproduce your results or just investigating how you did things, the main thing that matters is that your code runs "correctly", i.e. how you ran it to get the results that you did.

One idea is to publish it "as is" for the CORRECTNESS of the paper, put a git tag indicating "original version", and THEN clean it up on GitHub/wherever. This helps prevent any new "organizing" of the code from potentially breaking something, which is counterproductive. This way, when people go to your code page, the first thing they see is a nicely-organized version, and it gives you time to test that it works the same.

Honestly, if you care enough about this at all, then your code is probably significantly more organized than 95% of research code out there; the standards of code quality in science are VERY low, which is completely different from private-sector software engineering.

* edits are for markup


Yes


publish; bad code is far better than no code

someone might clean it up for you, too


1. create a github repo

2. push it there
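
Roughly, that can be as little as the following (the repository URL and branch name are placeholders):

    git init && git add -A
    git commit -m "research code, as used for the paper"
    git remote add origin https://github.com/<user>/<repo>.git   # placeholder URL
    git push -u origin main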


Yes


Yes


Yes.


As an amateur who reads journal papers (maybe not the audience you're most concerned about), the two most important things to helping my understanding of the paper's results are, in order:

1. a program that I can run against the data in the paper (where I can modify the data to see how that changes the results the program generates); and

2. the source code to that program, that I can read to understand what it does.

For #1, I'd encourage you to publish something like a Docker image of your built binary, to a permanent public Docker image host; to use that Docker image version of your program to do the actual experiment/data processing for your paper; and then to cite, in your paper, the specific fully-qualified Docker image ID (e.g. hub.docker.com/foo/bar@sha256:abcdef0123...6789) that was used to create the results.

I would also encourage you to, if possible, publish your data in some repository, e.g. GitHub; and to cite the data using a fixed hash (e.g. Git commit hash) as well.

With these two pieces of information, anyone can easily do the simplest possible kind of "reproduction" of your results: namely, they can fetch the same Docker image used in the paper, and then run it against the same data used in the paper, to — hopefully — produce the same results shown in the paper.
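
In rough command-line terms (the image name here is made up; the digest is whatever the registry reports when you push):

    docker build -t yourlab/paper-model:v1 .
    docker push yourlab/paper-model:v1     # the push output includes the sha256 digest
    # cite that digest in the paper; a reader then reproduces with exactly that image:
    docker run --rm yourlab/paper-model@sha256:<digest-from-the-paper>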

---

As for #2...

If you're really worried about "trade secrets", you can just solve #2 by making the code itself only "available upon request."

But don't underestimate the number of people in your field who say they're hoarding their code for reasons of "competitive advantage", but who are really doing so out of personal shame at the state the code is in, and fear that a bug might be found there that will invalidate their result.

These people are, IMHO, not embracing the spirit that led them to become scientists. You should want any bugs in your papers — including in the code — to be found! That's what the pursuit of (academic) science is about — everyone checking each other's work so that we can all believe more strongly in the results!

You don't need to clean up your code. Maybe get an "alpha reader" to go over it first, like self-published authors do, if you're worried about nitpickers. But the only thing code really "needs" to be valuable, is to compile and run and do something useful.

Personally, all I'd want from your repo is for there to be a Dockerfile in there that will, within its fiddly little internal build environment, manage to output the exact Docker image cited in the paper.
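
Such a Dockerfile can be tiny. A purely illustrative sketch (the base image and entry-point script are made-up placeholders, not anything from your project):

    # illustrative only: use whatever base image your project actually needs
    FROM python:3.9-slim
    WORKDIR /app
    # install pinned dependencies first so the layer is cached
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    COPY . .
    # run_experiments.py is a hypothetical entry-point script
    ENTRYPOINT ["python", "run_experiments.py"]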

If I cared about modifying the code, I could take the rest from there.


Yes


You absolutely should publish the code and dump it on GitHub somewhere. I did that 20 years ago on sourceforge and it backs up a lot of claims that would otherwise get dismissed as me making s** up. Plan ahead and make sure you have the receipts because if your research ever becomes relevant you want to have all the receipts.



