Ask HN: Should I publish my research code?
420 points by jarenmf 5 days ago | 352 comments
I'm looking for advice on whether I should publish my research code. The paper itself is enough to reproduce all the results. However, the implementation can easily take two months of work to get right.

In my field, many scientists tend not to publish either the code or the data. They mostly just write a note that code and data are available upon request.

I can see the pros of publishing the code: it's obviously better for open science, it makes the manuscript more solid, and it makes things easier for anyone trying to replicate the work.

But on the other hand, it's substantially more work to clean and organize the code for publishing, and it increases the surface for nitpicking and criticism (e.g. coding style, etc.). Besides, many scientists treat code as a competitive advantage, so publishing the code would mean giving that advantage away.






> it's substantially more work to clean and organize the code for publishing, it will increase the surface for nitpicking and criticism (e.g. coding style, etc).

Matt Might has a solution for this that I love: Don't clean & organize! Release it under the CRAPL[0], making explicit what everyone understands, viz.:

"Generally, academic software is stapled together on a tight deadline; an expert user has to coerce it into running; and it's not pretty code. Academic code is about 'proof of concept.'"

[0] https://matt.might.net/articles/crapl/


> "Generally, academic software is stapled together on a tight deadline; an expert user has to coerce it into running; and it's not pretty code. Academic code is about 'proof of concept.'"

What do you know, it turns out the professional software developers I work with are actually scientists and academics!!


They don't call it "Computer Science" for nothing ;)

Philip Wadler notoriously says that computer science has two problems: computer and science.

It's not about computers, and "you don't put 'science' in your name if you're a real science".

He prefers the name informatics.

source: https://youtube.com/watch?v=IOiZatlZtGU


You don't put 'science' in your name if you're a real science

Having flashbacks to when a close friend was getting an MS in Political Science, and spent the first semester in a class devoted to whether or not political science is a science.


Well? Don't keep us hanging.

Kinda like countries that feel the need to put “democratic” in their name.

Also applies to newspapers which include some version of "truth" in the title. (Russian propaganda "Pravda", Polish tabloid "Fakt", etc.)

Or "People"...

Or "Republic"...

Or "Of China"....

> It's not about computers, and "you don't put 'science' in your name if you're a real science"

"Computer science is no more about computers than astronomy is about telescopes." - Edsger W. Dijkstra [0], [1].

[0] https://en.wikipedia.org/wiki/Edsger_W._Dijkstra

[1] https://en.wikiquote.org/wiki/Computer_science


"Information science" is basically long form of "informatics" so that breaks it I'd say. Also, "information" tends to imply a focus on state and places computational aspects (operations performed on information) as second hand.

I've yet to find a classification I really like but this is an interesting take. I still tend to like CIS (Computing and Information Sciences). The problem with CS is it focuses on computation and puts state as second class. The problem with IS is it focuses on state and puts computing as second class. To me, both are equally important.


TBF informatics is "the study of computational systems"

I have studied informatics, but don't call myself an informatic, because I am not doing any research.

I call myself a programmer, because I am not doing engineering either.


An economist friend of mine told me that once, so this isn't just a CS quip.

Informática is a common term in Spanish, as you probably know.

Materials Science is about as sciency as you can get.

Maybe a little OT, but I'd rather it be called "computing science." Computers are just the tool. I believe it was Dijkstra who famously objected to it being called "computer science," because they don't call astronomy "telescope science," or something to that effect.

Peter Naur agreed with that, which is why it is called "Datalogi" in Denmark and Sweden; his English term "Datalogy" never really caught on, though. He proposed it in a letter to the editor in 1966 [0], and it has pretty much been used here ever since, as Naur founded the first Datalogy university faculty a few years later.

This also led to some strange naming now that we have data science as well, which here is called "Data videnskab", just a literal translation of the English term.

[0]: https://dl.acm.org/doi/10.1145/365719.366510 (sadly behind a paywall)


informatics may be the closest english analogue. incidentally also what computer science is called in german (informatik)

> incidentally also what computer science is called in german (informatik)

Computerwissenschaften (literally computer science) exists, too, but it's the less common word.


> his English term Datalogy never really caught on though

Perhaps because no one had any idea about how it would be pronounced?


In Portuguese it's called "Ciência da Computação" (computing science or science of computation).

I never thought about this, but it's right. However, to get more nitpicky, most of the uses of "Comput[er|ing] Science" should be replaced with "Computer Engineering" anyway. If you are building something for a company, you are doing engineering, not science research

Be careful there. If you start calling what you're doing 'engineering', people will want you to know statics, dynamics, and thermodynamics.

The average developer isn't often doing "engineering". Until we have actual standards and a certification process, "engineer[ing]" doesn't mean anything.

The average software developer doesn't even know much math.

Right now, "software engineer" basically means "has a computer, -perhaps- knows a little bit about what goes on under the hood".


> The average software developer doesn't even know much math.

Well, I know stupid amounts of math compared to the average developer I've encountered, since I studied math in grad school. Other than basic graph traversal, I only remember one or two times I've gotten to actually use much of it.


Engineering is something like “deliberately configuring a physical or technological process to achieve some intended effect”. That applies whether you’re building a bridge or writing fizzbuzz


I'm not talking about the "average developer", I'm talking about college graduates having a "Computer Science" degree but in practice being "Computer engineers".

College degrees aren't standardized and most of the time don't really mean anything. Ask some TAs for computer science courses about how confident they are in undergrads' ability to code.

There isn't a standard license that shows that someone is proficient in security, or accessibility, or even in how computer hardware or networking work at a basic level.

So all we're doing is diluting the term "engineer" until it doesn't mean anything.

The only thing the term "software engineer" practically means is: they have a computer. It's meaningless, just a vanity title meant to sound better than "developer".


I agree. I can't find it right now, but there was an article on HN within the past few days talking about how software engineering is often more like writing and editing a book than engineering. That makes perfect sense to me. Code should be meant primarily to communicate with people, and only incidentally to give instructions to computers. It seems to be lost to the sands of time who said this first, but it is certainly true that code is read by humans many more times than it is written. Therefore, ceteris paribus, we should optimize for readability.

Readable code is often simple code, as well. This also has practical benefits. Kernighan's law[0] states:

> Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

Dijkstra says it this way:

> The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague.[1]

Knuth has a well-documented preference for "re-editable" code over reusable code. Being "re-editable" implies that the code must be readable and understandable (i.e. communicate its intention well to any future editors), if not simple.

I know that I have sometimes had difficulty debugging overly clever code. I am not inclined to disagree with the giants on this one.

---

[0]: https://www.defprogramming.com/q/188353f216dd/

[1]: https://www.cs.utexas.edu/users/EWD/transcriptions/EWD03xx/E...


All my job titles have been of the form Software Engineer plus a modifier.

I believe they are referring to what the degree currently known as Computer Science should be called.


Yes. Specifically, it's referring to what the academic discipline ought to be called.

Incidentally, I don't think I know any SWEs who actually majored in software engineering. I know one who didn't even bother to graduate high school and therefore has no academic credential whatsoever, a couple of music majors, a few math majors, and a lot of "computer science" majors, but I can't think of anyone who actually got a "software engineering" bachelor's degree. Hell, I even know one guy with a JD. I think I know 1 or 2 who have master's degrees in "software engineering," but that's it.


Please don't use this license. Copy the language from the preamble and put it in your README if you'd like, but the permissions granted are so severely restricted as to make this license essentially useless for anything besides "validation of scientific claims." It's not an open-source license - if someone wished to include code from CRAPL in their MIT-licensed program, the license does not grant them the permission to do so. Nor does it grant permission for "general use" - the software might not be legally "suitable for any purpose," but it should be the user's choice to decide if they want to use the software for something that isn't validation of scientific claims.

I am not a lawyer, just a hardcore open-source advocate in a former life.


I'm a proponent of MIT- and BSD-style licenses normally, but this calls for something like AGPL: allow other researchers and engineers to improve upon your code and build amazing things with it. If someone wants to use your work to earn money, let them understand and reimplement the algorithms and concepts; that's fine too.

That's probably not viable under US copyright law, especially with the Bright Tunes Music v. Harrisongs Music precedent; if someone is going to reimplement the algorithms and concepts without a copyright license, they're better off not reading your code so they don't have to prove in court, to a computer-illiterate jury, that the aspects their code had in common with your code were really purely functional and not creative.

Yeah, this seems a bit verbose and overbearing to me. The open-source licenses I've used myself include something like this, which seems quite sufficient:

> THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

From a non-academic's point of view I might include a brief disclaimer in the README too, explaining the context for the code and why it's in the state it's in, but there's no obligation to hold the user's hand. To be frank, anybody nitpicking or criticizing code released under these circumstances with the expectation that the author do something about it can go fuck themselves.

Competitive advantage, on the other hand, is a perfectly valid reason to hold code back. There may also be some cost in academia to opening an attack surface for any sort of criticism, even irrelevant criticism made in obvious bad faith. Based on what I've heard about academia, this wouldn't surprise me.


From the post: "The CRAPL says nothing about copyright ownership or permission to commercialize. You'll have to attach another license if you want to classically open source your software."

It is explicitly the point of the license that the code is not for those purposes, because it's shitty code that should not be reused in any real code base.


That's not a good excuse for putting your readers at legal risk of copyright infringement. A real, non-shitty code base could easily be a "derivative work" of the shitty code.

In the research industry, it is well established that anyone wanting to publish or utilize/include another's research in their own should contact the source author and receive explicit permission to do so.

More often than not, they are more than willing to help.


"Well-established" norms are a barrier to newcomers and perpetuate power structures (including racial, ethnic, national, and socio-economic).

That's not theoretical; I know many people who were or would be embarrassed to ask.

Explicitly posting things is helpful.


It looks quite explicitly designed as a short-term temporary license for the period when the main paper is unpublished and you'd be expected to keep the code non-public (due to e.g. reviewing anonymity requirements), so the basic open source freedoms are explicitly not included.

I would expect that anyone wanting to actually publish their code should publish the code with a "proper" license after the reviewing process is done and the relevant paper is published.


> the permissions granted are so severely restricted as to make this license essentially useless

Indeed, and there are also things like "By reading this sentence, You have agreed to the terms and conditions of this License." That can't hold up in court! How can I know in advance what the rest of the conditions say before agreeing to them?

Then again, I am not a lawyer either.


While I wholeheartedly agree with you, I would seriously question anyone trying to reuse research code in production without completely reimplementing it from scratch.

I'd like to piggyback and say that increasing the surface for nitpicking and criticism is exactly why OP should release his code. It improves the world's ability to map data to the phenomenon being observed. It becomes auditable.

Certainly don't clean it up unless you're going to repeat the experiment with the cleaned up code.


Agreed on both points! As somebody who bridges research and production code, I can typically clean code faster than I can read & understand it. It really helps to have a known-good (or known-"good") example so that I can verify that my cleanup isn't making undesired changes.
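
For what it's worth, the check I lean on is dead simple, something like this (the two functions here are made-up stand-ins, not anyone's actual research code):

    import numpy as np

    def original_estimate(x):
        # stand-in for the messy but known-good research version
        return sum(v * v for v in x) / len(x)

    def cleaned_estimate(x):
        # stand-in for the tidied-up rewrite
        return float(np.mean(np.square(x)))

    rng = np.random.default_rng(0)
    x = rng.normal(size=10_000)
    assert np.isclose(original_estimate(x), cleaned_estimate(x))

Keep the original version around until the cleaned-up one matches it on the same inputs, then delete it.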

And, yeah. I've found some significant mistakes in research code -- authors have always been grateful that I saved them from the public embarrassment.


I do this with my code and can highly recommend it.

Supplying bad code is a lot more valuable than supplying no code.

Also in my experience, reviewers won't actually review your code, even though they like it a lot when you supply it.


The CRAPL is a "crayon license" — a license drafted by an amateur that is unclear and so will prevent downstream use by anyone for whom legal rigor of licensing is important.

https://matt.might.net/articles/crapl/

> I'm not a lawyer, so I doubt I've written the CRAPL in such a way that it would hold up well in court.

Please do release your code, but please use a standard open source license. As for which one, look to what your peers use.


I think this could be done much better by using a very restrictive license like GPLv3/AGPL and then stating in the README that you don't support this project at all and will ignore everything associated with wherever you are hosting it.

Using this license would actually make me suspect that your results aren't even valid and I don't trust many experiments that don't release source code.


In case OP and others don't know: it is the copyright holders who decide on the license. The copyright holders are the people who contributed to the code. In this case, it sounds like OP is the sole author and therefore the sole copyright holder.

You cannot change the past, but as a copyright holder, you can always set a new license for future releases.

Thus, OP, if you're uncertain (I definitely was when I started out), go with a restrictive license as recommended here (GPL). That, together with publishing the code online (e.g. GitHub, GitLab, ...) as well as with your article, will give you some protection against plagiarism. Anyone who includes parts of your code in their research code will have to share their code the same way. If you later feel like you want to relax the license, you can always change it to, say, MIT.


Anecdotally most of the research papers I see and have worked on publish their code but don't really clean it up. Even papers by big companies like Microsoft Research. Still significantly better than not publishing the code at all.

A thousand times this. A working demonstrator of a useful idea that is honest about its limits is so valuable. Moreover, most commercial code is garbage! :)

Please do absolutely publish your code.

If only to help people who simply can't read academic papers because it's not a language their brain is wired to parse, but who can read code and will understand your ideas via that medium.

[EDIT]: to go further, I have - more than once - run research code under a debugger (or printf-instrumented it, whichever) to finally be able to get an actual intuitive grasp of the idea presented in the paper.

Once you stare at the actual variables while the code runs, my experience is it speaks to you in a way no equation ever will.


This is not what licenses are for!! They are not statements about the quality of your work or anything similar.

Use standard and well understood licenses e.g. GPL for code and CC for documentation. The world does not need more license fragmentation.


This has explicit usage limitations that matter in science land, which is very much the kind of thing that belongs in a license.

Eg:

   You are permitted to use the Program to validate scientific claims
   submitted for peer review, under the condition that You keep
   modifications to the Program confidential until those claims have
   been published.

Moreover, sure, lots of the license is text that isn't common in legal documents, but there's no rule that says legal text can't be quirky, funny or superfluous. It's just most practical to keep it dry.

In this particular case, however, there's very little risk of actual law suits happening. There is some, but the real goal of the license is not to protect anyone's ass in court (except for the obvious "no warranty" part at the end), but to clearly communicate intent. Don't forget that this is something GPL and MIT also do besides their obvious "will likely stand up in court" qualities. In fact I think that communicating intent is the key goal of GPL and MIT, and also the key goal of CRAPL.

From this perspective, IMO the only problem in this license is

    By reading this sentence, You have agreed to the terms and
    conditions of this License.

This line makes me sad because it makes a mockery of what's otherwise a pretty decent piece of communication. Obviously nobody can agree to anything just by reading a sentence in it. It should say that by using the source code in any way, you must agree to the license.

> clearly communicate intent

Again, this is not how a license works. You can express your intents, ideas and desires in a README file and in many other ways.

The license is nothing more than a contract that provides rights to the recipient under certain conditions. Standing up in court is its real power and only purpose.

That's why we should prefer licenses that stood up in court and have been written by lawyers rather than developers or scientists.


I strongly disagree. Contracts very much primarily communicate intent, ideally in such a way that they also stand up in court. People regularly argue over details in contracts, people regularly look up things in contracts, also when there is no court to be seen and no intention anywhere to go to court. The vast vast vast majority of contracts never make it to court.

Plenty of contracts aren't even written down. When you buy a loaf of bread at the bakery, you make an oral contract about that transaction.

The idea that contracts, or licenses, need to be written in dull legalese and be pretty much impenetrable to be useful or "valid" or whatever, is absolutely bonkers. Lawyers like you to think that but it's not true. It's an urban legend.

If you need to make sure that you can defend your rights in court, then sure, you're probably going to need some legalese (but even then there's little harm in also including some intent - it's just not very common). Clearly that's not the goal here. No scientist is gonna sue another scientist who asked for support and got angry about not getting any even though the code was CRAPL licensed.


> Plenty of contracts aren't even written down.

That's a well-known fact, and it's beside the point.

> Lawyers like you to think that but it's not true.

Is that a conspiracy theory? Writing long, detailed contracts on a persistent medium is safer: it lowers the risk of he-said-she-said scenarios and ambiguities.

That is meant to save you tons of legal expenses.

> No scientist is gonna sue another scientist

Then there is no need for such license in the first place. Just a readme file.


By existing, you have agreed to the terms and conditions.

I agree. There's a lot of confusion surrounding even the most established ones, so there's no need to further muddy the situation with newer licenses. In my opinion a "fun" license, with its untested legal ambiguity, restricts usage more than a well established license with a similar level of freedoms.

For instance, the Java license explicitly forbids use in real-time critical systems, and such limitations are good to stress in a license so that they may carry legal force, also to protect the author(s).

Incidentally, I've seen people violate the Java "no realtime" clause.


It used to; OpenJDK has been licensed under GPLv2 with the Classpath Exception, which allows this, for years now. If you're not running an OpenJDK build, it depends on your vendor's license.

And it makes the license non-opensource.

Plus, the usual "no warranty" is strong enough to protect the authors anyways.


> This is not what licenses are for!!

You must be fun at parties :)


Yes, if you feel you have to make it "release ready" then you'll never publish it. I'm pretty sure a good majority of the code is never released because the original author is ashamed of it, but they shouldn't be. Everybody is in the same boat.

The only thing I would add is a description of the build environment and an example of how to use it.


I like it so far, other than

4) You recognize that any request for support for the Program will be discarded with extreme prejudice.

I think that should be a "may" rather than a "will." If I find out someone is using my obscure academic code, and they ask for help, I'd be pretty pumped to help them (on easy requests at least).


The point of the license is to set your expectations as low as possible. Then, when you actually /do/ get support, you'll be ecstatic rather than non-plussed.

Discarding a request for support with extreme prejudice might entail using LinkedIn to look up the boss of the person who asked you for support, then phoning them up to complain about the request for support, or it might entail filing for a restraining order against the person requesting support. The point of this clause is to intimidate people out of making the request in the first place.

That's a pretty over the top reading.

You should probably familiarize yourself with the meaning of the phrases it's alluding to, "dismissed with extreme prejudice" and "terminated with extreme prejudice".

When phrased like this,

> 4) You recognize that any request for support for the Program will be discarded with extreme prejudice.

There is no way I'd even make a request for support.


Exactly. If nothing else, a request for support has a chance of being an indication that there's somebody else in the field that cares about some aspect of the problem. I might not act on it, but it is good to have some other human-generated signal that says "look over there."

> Academic code is about "proof of concept."

Why does he think that but presumably not the same about the paper itself and the “equations”, plots, etc. contained within?

It’s really not that hard to write pretty good code for prototypes. In fact, I can only assume that he and other professors never allowed or encouraged “proof of concept” code to be submitted as course homework or projects.


I think you don't understand the realities of work in scientific/academic organisations. Unless you work in computer science, you likely never received any formal education in programming, except perhaps a numerical methods in C/MATLAB/Fortran course during your studies (which often also focuses more on the numerical methods than on the programming). So pretty much everyone just learned by doing.

Moreover, you are not being paid for writing reasonable programs; you're paid for doing science. Nobody would submit "prototype" papers, because papers are the currency of academic work. Lots of time is spent on polishing a paper before submission, but doing that for code is generally not appreciated, because nobody will see it on your CV.


I understand it fine. Like I said, it’s really not that hard to write pretty good code for prototypes. I'm not saying the code needs to be architected for scale or performance or whatever else needless expectation. I don't have a formal education in programming or computer science and write clean code just fine, as do some other non-computer science people I've worked with in advanced R&D contexts. And then some (many?) don't. It's not really about educational background, it's more about just being organized. Even when someone is "just" doing science, a lot of times, the code explicitly defines what's going on and has major impacts on the results. (Not to mention that plenty of people with computer science backgrounds write horrible code.)

If code is a large part of your scientific work, then it's just as important as someone who does optics keeping their optics table organized and designed well. If one is embarrassed by that, then too bad. Embarrassment is how we learn.

Lastly, you're describing problems with the academic world as if they are excuses. They're reasons but most people know the academic world is not perfect, especially with what is prioritized and essentially gamified.


I'm not making excuses; I'm just talking about the realities. For the significant majority of researchers I know, version control is still program_v1.py, program_v2.py, program_final.py, program_final2.py (and that's the good case; at least they are using Python), so talking about clean code is still quite a bit away. I'm teaching PhD students and it's hard to convince even them, because they just look at how to get the next publishable result. For academics it becomes even more unrealistic; most don't even have time to go to a lab, they are typically just writing grants.

I'm actually a strong supporter of open science; I release my code as OSS (and actually clean it up) and my data (when possible). But just saying it's easy and there is no cost is setting the wrong expectations. Generally, unless you enjoy it, spending significant time on your code is not smart for your career. Hopefully things are changing, but it is a slow process.

Funny that you mention optics: I know some of the most productive people in my field, and their optical tables are an absolute mess (as is their code, btw). They don't work at universities, though.


I still think it's really not that hard, and actually, it's really not even about code. It's really just about organization, because as you point out, not everyone is great at it. But for example, messy optics tables, labs, or whatever do in fact cause problems: with efficiency, with knowing what "state" of supporting tools yielded what result, and with several other derivative issues. I think my push would be that applying even a modicum of organization to supposedly ancillary things will go a long way, rather than accepting them as reality.

I understand the realities and have even been a part of PowerPoint development where slides are emailed back and forth. Sometimes one just has to go with things like that. But I have also seen the reality of stubbornness. I have tried introducing source code control to scientists or even stuff like wikis, all supported and already up and running by IT and used by other groups. Scientists and engineers, especially those with PhDs, can be a bit resistant and set in their ways. I have been told flat out by different people that they wouldn't use tools like a wiki or that Python and C were all they ever needed. I have even noticed physicists saying "codes" for software instead of "code". It's fairly rampant, and I have seen it in research papers, published books, and in industry in person. I have never seen that use of "codes" anywhere else. That alone is evidence of a certain amount of culture and institutionalization of doing things incorrectly but viewed as acceptable within the field.

I have written code in some research contexts. I get the constraints. One just needs to take it seriously. But organization, in general, often takes a back seat in basically any field. The only way to change things like this is, as with anything, to push against culture.


I'm not a scientist so maybe I don't get it, but it seems like code could be analogized to a chemist's laboratory (roughly). If a chemist published pictures of their lab setup in their paper, and it turned out that they were using dirty glassware, paper cups instead of beakers, duct taping joints, etc etc, wouldn't that cast some doubt on their results? They would be getting "nitpicked" but would it be unfair? Maybe their result would be reproducible regardless of their sloppy practices, but until then I would be more skeptical than I would be of a clean and orderly lab.

MIT and BSD are established and well accepted licenses, literally named after the academic institutions where they originated. Licenses are legal documents, part of what makes them "explicit what everyone understands" is their legally recognized establishment.

If you want to set expectations, this can simply be done in a README. Putting this in a license makes no sense. Copyright licenses grant exceptions to copyright law. If you're adding something else to it, you're muddying the water, not making it better.


> Don't clean & organize!

FWIW I basically did this: My thesis numbers were run on a branch based on the unstable version of an upstream project that was going through a major refactoring. I took a tarball of the VCS tree at that point in time and posted it online. Over the years 3-4 people have asked for the tarball; nobody has ever come back to me with any more questions. I can only assume they gave up trying to make it work in despair.

I think I tried to build it a couple of years ago, and it wouldn't even build because gcc has gotten a lot more strict than it used to be (and the upstream project has -Werror, so any warnings break the compilation).

I think it's definitely worth doing, but I think you need to be realistic about how much impact that kind of "publishing" is really going to have.


> Matt Might has a solution for this that I love: Don't clean & organize! Release it under the CRAPL[0]

> "Generally, academic software is stapled together on a tight deadline; an expert user has to coerce it into running; and it's not pretty code. Academic code is about 'proof of concept.'"

This is brilliant!

edit: it seems that, aside from that great snippet from the text, the license itself isn't so great. another comment [1] has a great analysis of the actual license and suggests using a superior solution (copying the preamble part yet still using MIT/(A)GPL).

[1] https://news.ycombinator.com/item?id=29937180


You don't need an esoteric license, just use a standard license like MIT.

relevant section from MIT license:

"THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE."


lol didn't know about this license, amazing!

Hi, I’m a Research Software Engineer (a person who makes software that helps academics/researchers) at a university in the UK. My recommendation is that not only do you publish the code, you mint a DOI (digital object identifier, Zenodo is usually the go to place for that) for the specific version that was used in your paper and you associate them. And you include a citation file (GitHub supports them now: https://docs.github.com/en/repositories/managing-your-reposi...) in your software repo.

Benefits: people who want to reproduce your analysis can use exactly the right software, and people who want to build on your work can find the latest version in your repo. Either way, they know how to cite your work correctly.

In practice drive-by nitpicking over coding style is not that common, particularly in (some) science fields where the other coders are all other scientists who don’t have strong views on it. Nitpicks can be easily ignored anyway.

BTW should you choose to publish, the Turing Way has a section on software licenses written for researchers: https://the-turing-way.netlify.app/reproducible-research/lic...


And I would suppose it will drive more citations. Which is a plus!

/me waves from Leeds Uni. Hello fellow RSE.

As a physician who writes pretty awful code, I like this comment.

I would encourage you not to use DOIs for software. They are not made for this, and have limitations which are not appropriate.

Instead, use Software Heritage https://www.softwareheritage.org/ ; it provides unique identifiers and actually understands repositories, versioning, history, etc. It also allows you to cite the software and even gives proper, durable links to point to the code.


Why not just link to a specific commit hash? What more do they provide?

In my view and personal experience, the pros outweigh the cons:

* You increase the impact of your work and as a consequence also might get more citations.

* It's the right thing to do for open and reproducible research.

* You can get feedback and improve the method.

* You are still the expert on your own code. That someone picks it up, implements an idea that you also had and publishes before you is unlikely.

* I never got comments like "you could organize the code better" and don't think researchers would tend to do this.

* Via the code you can get connected to groups you haven't worked with yet.

* It's great for your CV. Companies love applicants with open-source code.


> It's the right thing to do for open and reproducible research.

Everybody here talks about how publishing code helps (or even makes possible) reproducibility, but this is not true, on the contrary, it hinders it. Reproducing the results does not mean running the same code as the author and plotting the same figures as in the paper. This is trivial and good for nothing. Reproduction is other researchers independently reimplementing the code using only the published theory, and getting the same results. If the author publishes the code, no one will bother with this, and this is bad for science.


This is a common misconception, but you are actually talking about "replicability" which is "writing and then running new software based on the description of a computational model or method provided in the original publication, and obtaining results that are similar enough" [1]. Reproducibility instead refers to running the same code on the same data to get the same results [2].

[1] Rougier et al., "Sustainable computational science: the ReScience initiative" https://arxiv.org/abs/1707.04393

[2] Plesser, "Reproducibility vs. Replicability: A Brief History of a Confused Terminology" https://doi.org/10.3389/fninf.2017.00076


It's worth noting that the two provided references disagree on their definitions of reproducibility. Plesser quotes ACM as reproducibility being tested by a "Different team, different experimental setup", consistent with use in the GP's comment, but ultimately seems to favor ignoring "replicability" and instead using "reproducibility" in combination with various adjectives. The common understanding in the physical sciences is also that merely re-running the same code is not a test of reproducibility.

First, this is overtly not true. Reproducibility refers to all forms: that the paper figures can be built from code and don't have errors, that a reimplementation of new code on the same data produces the same results, and that gathering new data (e.g. by conducting the same experiment again if possible, in other words replication) produces comparable results.

Second, publishing code helps make invisible decisions visible in a far better manner than the paper text does. Try as we might to imagine that every single researcher degree of freedom is baked into the text, it isn't and it never has been.

Third, errors do occur. They occur when running author code (stochasticity in models being inadequately pinned down, software version pinning, operating system -- I had a replication error stemming from round-towards-even behaviour implementation varying across platforms). If you have access to the code, then it's far easier to determine the source of the error. If the authors made a mistake cleaning data, having their code makes it easier to reproduce their results using their exact decisions net of the mistake you fix.

Most papers don't get replicated or reproduced. Making code available makes it more likely that, at a minimum, grad students will mechanically reproduce the result and try to play around with design parameters. That's a win.

Source: Extensive personal work in open and transparent science, including in replication; have written software to make researcher choices more transparent and help enable easier replication; published a meta-analysis of several dozen papers that involved reproducing author results from author code, reproducing author results with a code reimplementation, and producing variant results -- each step was needed to ensure we were doing things right; a close friend of mine branched off into doing replications professionally for World Bank and associated entities and so we swap war stories; always make replication code available for my own work.


In my experience, the published paper is super vague on the approach, and implementing it without further references is really hard. I'm not necessarily arguing that papers should get longer and more detailed to counter this; expressing the details that matter in code seems like a more natural way to communicate anyway.

Why trust results if you can't see the methodology in detail and apply the approach to your own data? I once knew somebody who built a fuzz tester for a compilers project, got ahold of a previous project's compiler code that won best paper awards, and discovered a bunch of implementation bugs that completely invalidated the results.

Why is the peer review process limited to a handful of people who probably don't have access to the code and data? If your work is on Github, anybody can come along and peer review it, probably in much more detail. And as a researcher, you don't get just one chance to respond to their feedback -- you can actually start a dialogue, which other people are free to join in.

As long as a project's README makes any sort of quality / maintenance expectations clear upfront, why not publish your code?


> In my experience, the published paper is super vague on the approach, and implementing it without further references is really hard.

This is my experience, too, and in my opinion this is exactly what has to change for really reproducible research, not ready to run software supplied by the author.

There are many good arguments in support of publishing code, but reproducibility is not one of them, that's all I'm saying.


And just like OP said, it generally takes a couple of months to go from paper to working code. I've implemented a few papers as code as a side-gig for a while, and I wouldn't mind having a baseline from the authors to check and see if I'm following the paper correctly!

I disagree for another reason. Having access to the code allows easy comparison. I did some research in grad school on a computational method, and there was a known best implementation in the literature. I reached out to the author and he kindly supplied me with the source code of his work. I wasn't trying to replicate his results, but rather I wanted to compare his results to my implementation's results in a variety of scenarios to see if I had improved over the other method.

And to the original author's credit, when I sent him a draft of my paper and code, he loved how such a simple approach outperformed his. I always felt that was the spirit of collaboration in science. If he hadn't supplied his code, I really would never have known how they performed unless I also fully implemented the other solution -- which really wasn't the point of the research at all.


Often the text in a paper that describes some algorithm will not be completely clear or complete. Providing the code fills in those blanks. I've taken a paper with code and gone through the code line by line comparing it with what was described in the paper. The code often ends up clarifying ambiguities. In some cases there's an outright disagreement between the paper and the code - that's where the feedback comes in, ask the author about these disagreements, it will help them clarify their approach (both the text and the code).

> Reproducing the results does not mean running the same code as the author and plotting the same figures as in the paper.

Sure, to some extent. But the code does provide a baseline, a sanity check. People who are trying to reproduce results should (as I describe above) go through both the paper and the code with a fine tooth comb. The provided code should be considered a place to start. I'll often re-implement the algorithm in a different language using the paper and the provided code as a guide.


What are your thoughts on including pseudocode within the paper directly? It seems to clear up some of the ambiguities while adding brevity since it doesn't provide the somewhat frivolous details of implementation. I think it also limits some of the potential stylistic critiques.

It's not a bad idea to include pseudocode, but my pet peeve is that there's really no standard for pseudocode. I tend to write pseudocode that looks like python. I did that in an internal paper for a group I worked in a few years back and the manager of the group made me change it to look more like C (complete with for loops as: for(i=0; i<x; i++) which made no sense to me).

Oh, haha, then disregard my intuition that it would help avoid stylistic critiques :-)

> Reproducing the results does not mean running the same code as the author and plotting the same figures as in the paper. This is trivial and good for nothing.

I agree with this statement; however, I think you may have a misunderstanding about reproducing results. It's not about reproducing their graphs from their dataset, but rather about seeing if their code reproduces the results on your (new) dataset.

Another way to think of it is that the research paper's Methodology section describes how to set up a laboratory environment to replicate the results. By extension, the laboratory for coding research IS the code. Thus, by releasing the code along with your paper, you are effectively saying "here is a direct copy of my laboratory for you to conduct your replication in".


I guess things are a spectrum. I've worked on research projects where understanding and developing the algorithm is the research. There isn't really an "input data set" other than a handful of parameters that are really nothing more than scale factors, and the output is how well the algorithm performs. So "setting up the laboratory" by cloning the code and building it is...fine, but a reimplementation of the algorithm with "the same" results (modulo floating point, language features, etc. etc.) aligns much better with reproducibility.

"Reproducing results" in a scientific context doesn't mean taking the original author's data and going through the same process. It usually means taking different data and going through the same process. Having code on hand to put new data through the same process makes that a lot easier.

"It's the right thing to do for open and reproducible research."

I think this is the most important reason to do it. Research code is not meant to be perfect as another op said, but it can be instrumental in helping others, including non-academics, understand your research.

I think the sooner it's released the better (assuming you've published and you're not needing to protect any IP.) There's some great advice here: https://the-turing-way.netlify.app/reproducible-research/rep...


> It's great for your CV. Companies love applicants with open-source code.

While I strongly support sharing the code, I am not sure if this is a great reason to do so. Companies are made up of many individuals, and while some might appreciate what it takes to open source code, other individuals might judge the code without full context and think it is sloppy. My suggestion is that you fully explain the context before sharing code with companies.


> The paper itself is enough to reproduce all the results.

No, this is almost never the case. It should be. But it cannot really be. There are always more details in the code than in the paper.

Note that even the code itself might not be enough to reproduce the results. Many other things can matter, like the environment, software or library versions, the hardware, etc. Ideally you should also publish log files with all such information so people could try to use at least the same software and library versions.

And random seeds. Make sure this part is at least deterministic by specifying the seed explicitly (and make sure you have that in your log as well).

Unfortunately, in some cases (e.g. deep learning) your algorithm might not be deterministic anyway, so even in your own environment, you cannot exactly reproduce some result. So make sure it is reliable (e.g. w.r.t. different random seeds).
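
A minimal sketch of the bookkeeping I mean (the file name and the exact set of packages logged are just placeholders):

    import json, logging, platform, random, sys

    import numpy as np

    SEED = 12345                      # fixed seed, also written to the log
    random.seed(SEED)
    np.random.seed(SEED)

    env = {
        "seed": SEED,
        "python": sys.version,
        "platform": platform.platform(),
        "numpy": np.__version__,
    }
    logging.basicConfig(filename="run.log", level=logging.INFO)
    logging.info("run environment: %s", json.dumps(env))

Whatever you actually use, the point is that the seed and the versions end up in the same log as the results.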

> In my field many scientists tend to not publish the code nor the data.

This is bad. But this should not be a reason that you follow this practice.

> clean and organize the code for publishing

This does not make sense. You should publish exactly the code as you used it. Not a restructured or cleaned up version. It should not be changed in any way. Otherwise you would also need to redo all your experiments to verify it is still doing the same.

Ok, if you did that as well, then ok. But this extra effort is really not needed. Sure it is nicer for others, but your hacky and crappy code is still infinitely better than no code at all.

> it will increase the surface for nitpicking and criticism

If there is no code at all, this is a much bigger criticism.

> publishing the code will be removing the competitive advantage

This is a strange take. Science is not about competing against other scientists. Science is about working together with other scientists to advance the state of the art. You should do everything to accelerate the process of advancement, not try to slow it down. If such behavior is common in your field of work, I would seriously consider changing fields.


I agree with almost all of this, however I believe that publishing random seeds is dangerous in its own way.

Ideally, if your code has a random component (MCMC, bootstrapping, etc), your results should hold up across many random seeds and runs. I don’t care about reproducing the exact same figure you had, I want to reproduce your conclusions.

In a sense, when a laboratory experiment gets reproduced, you start off with a different “random state” (equipment, environment, experimenter - all these introduce random variance). We still expect the conclusions to reproduce. We should expect the same from “computational studies”.
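
As a toy sketch of that point (run_experiment is just a placeholder for whatever the paper actually computes), the conclusion should survive sweeping the seed rather than depend on one particular value:

    import numpy as np

    def run_experiment(seed):
        # placeholder analysis: estimate an effect from noisy synthetic data
        rng = np.random.default_rng(seed)
        sample = rng.normal(loc=1.0, scale=0.5, size=1000)
        return sample.mean()

    results = [run_experiment(seed) for seed in range(20)]
    print(f"effect = {np.mean(results):.3f} +/- {np.std(results):.3f}")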


The thing is, if you want to ignore someone's random seed, you can if it's provided. If it's not provided and you need it to chase down why something isn't working, you're SOL.

It's zero cost to include it.


I think being able to re-run code with a paper is great, but I think we should be sure to distinguish it from scientific replication.

When replicating physics or chemistry, you build fresh the relevant apparatus, demonstrating that the paper has sufficiently communicated the ideas and that the result is robust to the noise introduced not just by that "random state" you discuss but also to the variations from a trip through human communication.

I acknowledge that this is substantially an aside, but it's something I like to surface from time to time and this seemed a reasonable opportunity.


> And random seeds. Make sure this part is at least deterministic by specifying the seed explicitly (and make sure you have that in your log as well).

> Unfortunately, in some cases (e.g. deep learning) your algorithm might not be deterministic anyway, so even in your own environment, you cannot exactly reproduce some result. So make sure it is reliable (e.g. w.r.t. different random seeds).

Publishing the weights of a trained model allows verification (and reuse) of results even before going to the effort of reproducing it. This is especially useful when training the model is prohibitively expensive.


To some extent Science is a big project to understand how the universe works. We should hope to understand the phenomena that we investigate to the point where library versions and random seeds don't matter so much -- assuming the code is not buggy, and the statistics are well done, those factors shouldn't come into play.

However, sometimes chemists find out that the solvents they use to clean their beakers are leaving trace amounts of residue, which accidentally contribute to later reactions.

> Ideally you should also publish log files with all such information so people could try to use at least the same software and library versions.

looks to me like a result that requires borrowing a particular lab's set of beakers. Not what we're looking for.


What a great question. You've come to the right community.

My main concern would be to make sure there are no passwords or secret keys in the data, not how it looks.

You'll open yourself up for comments. They may be positive or negative. You'll only know how it pans out afterwards.

Is the code something that you'll want to improve on for further research? If so, publish it on GitHub. It opens the way for others to contribute and improve the code. Be sure to include a short README saying that you welcome PRs for code cleanup, etc. That way you can turn comments criticizing your code into a request for collaboration. It really separates helpful people from drive-by commenters.


> My main concern would be to make sure there are no passwords or secret keys in the data, not how it looks.

Worth mentioning specifically: If you make a git (et al) repository public, make sure there are no passwords or secret keys in the history of the repository either. Cleaning a repository history can be tricky, so if this is an issue, best to just publish a snapshot of the latest code (or make 100% sure you've invalidated all the credentials).


The brute-force way around this is to remove the .git folder and re-init the git repo.

For my 2 cents I'd prefer to see sloppy code vs no code.

If you did something wrong, you did it wrong. Hopefully someone would put in a PR to fix it


If there is sensitive data to remove and the history is important to keep, then GitHub has some recommendations for scrubbing the history

https://docs.github.com/en/authentication/keeping-your-accou...


> My main concern would be to make sure there are no passwords or secret keys in the data, not how it looks.

Also personal data of any human subjects.


Is your goal to help advance the science and our general knowledge? Publish the code. You don’t even need to clean it up. Just publish. Don’t worry about coding style nitpicks. Having the code and data available actually protects you from claims of fabrication or unseen errors in hidden parts of your research.

On the other hand, if your goal is only to advance your own career and you want to inhibit others from operating in this space any more than necessary to publish (diminish your “competitive advantage”) then I guess you wouldn’t want to publish.


Not sure if posting the paper only is even the best move. I personally never work with papers with no code published. Just not worth the effort to reproduce them, when I can use the second SOTA for nearly no performance penalty and much less effort.

All the groundbreaking papers in deep learning in the last decade had code published. So if you're aiming for thousands of citations, you need code.


> All the groundbreaking papers in deep learning in the last decade had code published. So if you're aiming for thousands of citations, you need code.

I am in this field, and I would say less than 10% of the top papers have code published by the authors, and those are most of the time another 0.1% improvement on ImageNet. All the libraries that you generally use are likely to have been recreated by others in this field. A lot of the most interesting work's code never comes out, like AlphaZero/MuZero, GPT-3, etc.


Can confirm.

I personally look at any paper without code with great suspicion. The reviewers certainly did not try to reproduce your results, and I have no guarantee that a paper without code has enough information for me to reproduce.

I always go for the papers with code provided.


As a reviewer I have reproduced results with my own independent plasma simulation code. And I have had a reviewer write in a report about my paper "result X seemed strange, but I ran it with my code and it does it too. I don't know why, but the result is valid". In my opinion that is even better than just rerunning the same code.

Agreed, reproducibility helps a lot, and it is very easy to get details wrong when reimplementing. Having the source code is a big plus.

This is very domain specific. OP said it is not the norm to publish code in his field. I have a PhD and in my field it is the same. So much so that I can't think of any paper in my field that has code published. Therefore, a paper with no code would not be at a disadvantage.

Personally, it is a pet peeve I have about my field. But there is no incentive for a new researcher to publish code as it decreases barriers to entry. As much as it's nice to say that researching in academia is about progressing science, as a researcher, you are your own startup trying to make it (i.e., get tenure).


> it will increase the surface for nitpicking and criticism

Anyone who programs publicly (via streaming, blogging, open source) opens themselves up for criticism, and 90% of the time the criticism is extremely helpful (and the more brutally honest, the better).

I recall an Economist magazine author made their code public, and the top comments on here were about how awful the formatting was. The criticism wasn't unwarranted, and although harsh, would have helped the author improve. What wasn't stated in the comments is that by publishing their code, the author already placed themselves ahead of 95% of people in their position who wouldn't have had the courage to do so. In the long run, the author will get a lot better and much more confident (since they are at least more aware of any weaknesses).

I'd weigh up the benefits of constructive (and possibly a little unconstructive) criticism and the resulting steepening of your trajectory against whatever downsides you expect from giving away some of your competitive advantage.


Do you really mean 90% of the criticism is extremely helpful? Or did you mean 90% was useless?

I've published 100,000s of lines of code from my research over 20 years, and I think I've had exactly one useful comment from someone who wasn't a close collaborator I would have been sharing code with anyway.

I still believe research code should be shared, but don't do it because you will get useful feedback.


I have had the same experience, though I've only been publishing for 5 years so far. I still try to publish everything openly, but I no longer expect any responses. For none of my papers did the reviewers appear to have even looked at the Jupyter Notebooks I attached as HTML. The papers are cited, some more, some less, but there is no reaction to the source code. I still don't regret publishing it.

Interesting. Are the unhelpful comments coming from academics or random peanut gallery folks?

Peanut gallery, in my experience. The number of people who I've never met before who decide to complain about hardcoded file paths or run a linter and tell me my paper must therefore be garbage is frustratingly high.

This seems to depend on a paper getting a modest amount of media traction. That seems to set off the group of people who want to complain about code online.


The peanut gallery is likely to give stupid feedback. Academics are likely to ask for help using my code -- which is nice but doesn't (usually) contribute anything useful to me, and takes up time I could be spending on other things.

> 90% of the time the criticism is extremely helpful

Citation needed. I have rarely seen valuable feedback from random visitors from the internet.


This. Feedback (a less loaded term than "criticism") is something you should want. You can obviously ignore tabs-vs-spaces comments, but if your code takes two months to get right, it probably still has bugs in it after those two months, and it would be a win if others started finding them for you. Also, if the style is really that bad, it could be obscuring bugs that would otherwise be easy to spot (missing braces, etc.), and you might find bugs while fixing it up.

PS: always use an auto-formatter/linter. I can't believe we ever used to live without them. So much time used to be wasted re-wrapping lines manually, and we'd still get it wrong.
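
This is also easy to automate. Below is a minimal sketch of a pre-publish check script, assuming a Python codebase and assuming black and ruff as the formatter/linter; both tools (and the script itself) are just example choices, not anything the OP or parent mentioned.

    # check.py -- reformat the repo and lint it before publishing.
    # Assumes `black` and `ruff` are installed and available on PATH.
    import subprocess
    import sys

    def main() -> int:
        for cmd in (["black", "."], ["ruff", "check", "."]):
            print("running:", " ".join(cmd))
            result = subprocess.run(cmd)
            if result.returncode != 0:
                # Stop at the first tool that reports a problem.
                return result.returncode
        return 0

    if __name__ == "__main__":
        sys.exit(main())

Wiring the same two commands into a pre-commit hook or a CI job gives you the "never think about formatting again" effect described above.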


Yes you should! And not only for ethical reasons (actually reproducible research, publicly financed work, etc), even if those are good enough by themselves.

I've always published my research code. Thanks to that, one of the tools I wrote during my PhD has been re-used by other researchers and we ended up writing a paper together! In my field it was quite a nice achievement to have a published paper without my advisor as a co-author even before my PhD defense (and it most likely counted a lot toward me getting a tenured position shortly after).

The tool in question was finja, an automatic verifier/prover in OCaml for counter-measures against fault-injection attacks on asymmetric cryptosystems: https://pablo.rauzy.name/sensi/finja.html

My two most recently published papers also come with published code released as Python packages:

- SeseLab, which is a software platform for teaching physical attacks (the paper and the accompanying lab sheets are in French, sorry): https://pypi.org/project/seselab/

- THC (trustable homomorphic computation), which is a generic implementation of the modular extension scheme, a simple arithmetical idea that allows verifying the integrity of a delegated computation, including over homomorphically encrypted data (a toy sketch of the general idea is just below): https://pypi.org/project/thc/
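
For readers unfamiliar with modular extension, here is a toy illustration of the general principle as I understand it (verify a computation mod N by redoing it cheaply mod a small check modulus r and comparing against the result computed mod N*r). This is not the THC package's API; all names and parameters below are made up for the example.

    # Toy illustration of a modular-extension integrity check (not the THC API).
    # To verify y = x^e mod N, compute z = x^e mod (N*r) for a small extra
    # modulus r. By construction, z mod r must equal the cheap recomputation
    # of x^e mod r; if it doesn't, the big computation was faulted or tampered.
    def checked_modexp(x, e, N, r=0x10001):
        z = pow(x, e, N * r)           # the "extended" (delegated) computation
        if z % r != pow(x % r, e, r):  # cheap local integrity check
            raise ValueError("integrity check failed")
        return z % N                   # the verified result mod N

    # Example: verify 7^65537 modulo a toy modulus.
    print(checked_modexp(7, 65537, 3233))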


I agree. I published code that I used for my dissertation (more than 30 years ago). I think it led to thousands of citations.

The Carpentries[0] provide some great resources and training for academics who are interested in this. Check out Software Carpentry[1] and Data Carpentry[2] in particular.

Publishing research code is admirable, and in an ideal world everybody would publish their code and data. That said, we shouldn't pretend that there aren't tradeoffs. Time spent polishing your code to make it presentable is time not spent on other aspects of your research. Time spent developing software development skills is time that could be spent learning new research techniques (or whatever). Reproducible research is great, but it's certainly possible to take it too far at the expense of your productivity/career.

You should also take your own personality into account. If you're a perfectionist you might struggle to let yourself publish research-quality code rather than production-quality code and consequently over-allocate the time you spend prettying up your code.

[0] https://carpentries.org/

[1] https://software-carpentry.org/

[2] https://datacarpentry.org/


Just do a super-minimal cleanup and upload to Zenodo or similar, then stick the DOI for the code and input/output files in your paper somewhere. It's 99% certain your reviewers will not bother to look at your code. 10 years from now, someone new looking into the same topic gets a leg up. Don't feel obligated to update, clarify, or even think about the code ever again. If you want to build a community or something, then by all means go for GitHub, but providing code along with your paper should be something automatic and quick, not an unwanted burden.

While I publish in a field where making source code available is much more common, let me just make a couple of points:

* I have never had someone come back to criticize my code style. And if they do, so what? I'll block them and not think about it again. I don't need to get my feathers ruffled over this.

* Similarly, if someone's trying to replicate my results, and they fail, it's on them to contact me for help. After that it's on me to choose how much effort to put into helping them. But if they don't contact me, or if they don't put in a good faith effort to replicate the results, that's their problem. If they try to publish a failure to replicate without having done that, it's no more valid science than publishing bad science in the first place.

Overall, I think most people who stress about publishing code do so because they haven't done it before. I've personally only ever had good consequences from having done so (people picking up the code who would never have done anything with it if it weren't already open source).


> The paper itself is enough to reproduce all the results.

No, it isn't.

Reproducing the results means that you provide the code that you used so that people can reproduce it just by running "make" (or something similar). If you do not publish the code and the input data, your research is not reproducible and it should not be accepted in a modern, decent world.

It doesn't matter that your code is ugly. Nobody is going to look at it anyway. They are only going to run it. If the code is able to produce the results of the paper with the same input data, that's enough. If the code is not able to at least do that, it means that even you are not able to reproduce your own results. In that case, you shouldn't publish the paper yet.
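
To make the point concrete, here is a hypothetical sketch of what a one-command entry point can look like; every file name and the analysis function are invented for illustration and have nothing to do with the OP's actual project.

    # reproduce.py -- hypothetical single entry point: read the archived
    # input data, rerun the analysis, and write out the numbers reported
    # in the paper.
    import json
    from pathlib import Path

    def run_analysis(records):
        # Stand-in for the paper's actual computation: average one field.
        values = [r["value"] for r in records]
        return {"n": len(values), "mean": sum(values) / len(values)}

    def main():
        records = json.loads(Path("data/input.json").read_text())
        results = run_analysis(records)
        Path("results").mkdir(exist_ok=True)
        Path("results/summary.json").write_text(json.dumps(results, indent=2))
        print("wrote results/summary.json:", results)

    if __name__ == "__main__":
        main()

If "python reproduce.py" (or "make") regenerates the paper's numbers from the published inputs, the ugliness of everything inside run_analysis stops mattering.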


Worth noting that this is a modernization of scientific tradition, which predates code. I did publish my code, but the bulk of my work was a series of manual steps. That was 30 years ago. The closest thing to a replication involved changes to my design. The science world at large is still coming up to speed on this.

Agreed! Even if your code is just a slight modification of some other well-known tutorial, reimplementing it means that I, as a scientist, have to reverse-engineer your codebase from your text - which may not be as straightforward as expected. Remember that some researchers are not native English speakers, so there's the added complexity of translating your words into working code.

I always published all of my code/papers/source for my publications. I never made anything "revolutionary" but I still felt it was important to produce reproducible research, even if relatively insignificant.

This was kind of a change for my advisor, who was definitely less interested in that aspect of research. I think this is an issue in academia and needs to change.

Also, ultimately if someone wants to copy and publish your work as their own it will be relatively easy to show that and the community as a whole will recognize it.

Also, for me it felt good when another student/researcher was aided by my work.

https://shankarkulumani.com/publications.html

You don't need to clean it up or make the code presentable. Everyone knows it's research-grade code. The most important part is having the code in a state where you can reuse it in the future for another publication.

I've been saved multiple times by being able to easily go back to decade old work and reproduce plots.


Two times I have published my research code - both times I have found many other papers/projects plagiarized my work without giving me any credit. This happens way more than you would think, especially if you are working under less known advisor, and at less known university.

As the other comment said, if you care about "advancing the science", and won't mind stuff like the above happening, then go for it. In my experience, it is not worth it.


> Two times I have published my research code - both times I have found many other papers/projects plagiarized my work without giving me any credit. This happens way more than you would think, especially if you are working under less known advisor, and at less known university.

This has been very much my experience.


I wonder how often it is the case that code isn't considered an academic product per se, and so free to use. May have to make it very explicit.

> I wonder how often it is the case that code isn't considered an academic product per se, and so free to use.

As an outsider with occasional glimpses into academia, I've observed a bimodal response:

On the one hand, code seems to be seen as more available for reuse in terms of copy-and-paste, particularly in the exploratory phase.

On the other hand, code reuse is somewhat less likely to trigger an acknowledgement and citation.

This unacknowledged copy-and-paste borrowing of code seems in turn to be one influential factor inhibiting the derived code from being shared in turn.

> May have to make it very explicit.

That certainly can't hurt. I've certainly seen that reusable tools aren't cited as often as they are used, which in turn inhibits efforts from being made by academics in the production and refinement of such tools (the citation being the primary currency and reward mechanism for publication in academia).

Explicitly asking for citations (and getting them) helps with that. The more general issue of "releasing open source software isn't seen as 'publication' for academic status purposes such as being considered for tenure" remains a problem, though it varies considerably by field, institution, and department.

But if given some thought and extra effort, a researcher might actually get more papers (possibly with different sets of collaborators) out of roughly the same body of work, especially if they look more broadly for appropriate conferences and journals.

BTW, again as an outsider with occasional glimpses into academia, it seems to me that there are many vacant niches for cross-disciplinary journals and conferences focused on reusable assets such as datasets and tools necessary for research.


There's a lot of advice here, but very little data to support any of it. Since you're a scientist, why not take an experimental approach to answering this question? Publish your code, for one (selected) paper. Monitor (a) the download log, and (b) the emails you get related to your code.

I hypothesize that you will see some combination of three effects: (1) you will get lots of downloads (which means people are using your code, good work!), perhaps with lots of follow-up emails and perhaps not depending on what the code does; (2) you will get lots of emails from random nutjobs looking to pick holes in your work, and you will waste your time answering them; (3) you will get almost completely ignored.

Whatever the outcome, I think a lot of people would be interested in hearing what you learn.


Having a polished public implementation can lead to a massive increase in the number of citations a paper receives, if it is a really useful system. Some of my papers would, I think, have received far fewer citations if I had not released the code. Of course, if it is a really niche area with only a handful of researchers, this may not be true.

Does your publication venue have an artifact review committee? That would be a good way to share your code and (redacted or anonymized) data. I'm in security/privacy research, and our venues recently started doing this. They serve as a quality check, labeling your artifacts anywhere from merely "submitted" to "results reproduced."

https://www.usenix.org/conference/usenixsecurity22/call-for-...

https://petsymposium.org/artifacts.php

