
Practices in source code sharing in astrophysics - ngoldbaum
http://arxiv.org/abs/1304.6780v1
======
Blahah
I'm a computational biologist. If I read a paper where the results of an
analysis are presented, but the code used is not, that analysis is worthless.
Thankfully, in computational biology & bioinformatics at least, open source is
widespread and growing.

Every programmer knows that every programmer makes mistakes. Lots of them.
Mistakes that can completely change the outcome of an analysis. And if the
results of your analysis are to be taken seriously, people have to be able to
trust them, which means they have to be able to check your working.

Not publishing your code is anti-science, selfish, and IMHO should be
disallowed in the literature.

~~~
dalke
"Not publishing your code is anti-science" ... "open source is widespread and
growing"

I've noticed that few people realize that "publishing your code" and "open
source" are two overlapping issues, but they are not the same and the needs of
one field are sometimes opposed to the other field.

Open source necessarily implies that the people who receive the software have
the right to modify the software, redistribute it with or without changes, and
be able to do so for a charge.

No one has been able to tell me why software for scientific publications
requires the third of those abilities. Everyone is agreed that access to the
code and the ability to modify it makes it much easier to review and
understand what it does. I can also understand the need to share modifications
with collaborators, in order to help carry out the analysis. But I don't see
why published software can't have a prohibition on charging a fee for using
modified versions of the software, or services rendered which use the
software.

My first question for you, then, is: do you need the ability to commercialize
someone else's software in order to provide good scientific review of their
publication? More specifically, what sorts of review would that prohibition
eliminate?

In the other direction, I have scientific software which I sell for about
$30K. (This is not hypothetical - I really do have this). Customers get the
source code under the BSD license. This falls into every standard definition
of "free software" and "open source" software. There's even an essay (at
<http://www.gnu.org/philosophy/selling.html> ) encouraging people to sell free
software.

If I publish a paper which uses the software, then am I obligated to give my
peers access to the source code for no fee? Or can I publish the paper, say
it's available under a BSD license, and charge $30K for access to it?

So my second question is, are there limits on what I can charge people in
order to get access to my open source software, which was used for a paper? If
so, what are they, and what is the ethical basis for that judgment? (For
example, should it be "fair, reasonable and non-discriminatory"? Can it cover
development costs? Distribution costs? Web site development costs?)

~~~
Blahah
To clarify, I'm saying that publishing the source code so that others can run
and modify it is the crucial thing. I'm not saying that not open sourcing your
code is anti-science, but not making it available for scrutiny is.

So to answer your first question, I don't think the right to sell someone
else's code is related to the scientific process. I can't think of a way it
affects the progress of science to use (or not use) commercialisation-friendly
licenses, except in some indirect economic ways I don't know enough about to
predict.

As to the second question, I think if you tried to charge reviewers for access
to your code, your paper would never be accepted at a journal. Science is
largely public funded. When public funds are used for research, the fruits of
that research should be made public. That includes your code.

~~~
dalke
Overall then it looks like you agree with me. The philosophical underpinnings
of free software are different than the underpinnings of good science.

 _nods_ to your first answer. I would also say that being able to distribute
source (with or without modifications) to others is important, though not as
crucial. This prevents someone ( _cough_ Elsevier _cough_ ACS) from having a
monopoly on providing the source. But a restriction on, say, military use or
wiretapping would not inhibit scientific progress, even if such a restriction
makes it something other than free or open source software.

As for the second question, the scientific software I sell is not nor ever has
been publicly funded. The people who have paid for it are all for-profit
organizations. I don't even get R&D tax credits for it. While you're correct
that much of science is publicly funded, my software isn't.

Going back to that question: are there limits on what I can charge people in
order to get access to my open source software, which was used for a paper? If
so, what are they, and what is the ethical basis for that judgment?

I can tell you that I would rather not publish a paper, in order to earn
income selling my software, than to describe the techniques I used in making
the software and the corrections and improvements to the existing literature
that I developed. Which is better for overall scientific progress, and why?

This is especially important should I choose to publish as open access, since
that journal in my field costs about $1,200, so I'm already planning to pay
over a month's rent in order to publish. Should I also expect to lose the
equivalent of a year's salary, in the hopes that publishing the paper serves as
a big enough advertisement for my services?

~~~
Blahah
Yours is an interesting situation. Without public funding, in my opinion you
have no moral obligation to provide free (as in beer) access to your software.
Indeed, the greater social benefit (if there's a binary choice between nothing
being published and there being a paper but no free (as in beer) source code),
is gained by you publishing a description of the work. Others can then at
least benefit from your theoretical advances. To directly answer your first
question: I don't think there are any ethical limits on what you can charge
(but there are economic ones).

Which outcome is better for overall scientific progress depends entirely on
what your software does, how large the need for it is, etc. However, one thing
that is almost certainly true is that providing free (all senses) access to
the software will maximise the social benefit. So if you can find a model that
allows you to profit whilst still doing this, I encourage you to do so (see
last point).

If I were in your situation I would say the issue of whether or not to publish
is economic: publishing a paper about the software will bring it to the
attention of the scientific community. That should lead to increased custom
for you, provided the software is good and priced appropriately. An example of
this is Robert Edgar of <http://www.drive5.com>. He has written several pieces
of software which have really advanced computational biology, especially
USEARCH. He makes the 32-bit version available freely, and there's a ~$800 per
machine license for the 64-bit version. He also makes some tools available
freely. This mixed model seems to have performed very well for him: he is
very widely cited, which gets him a high profile, and many large bioinformatics
institutions buy licenses for his software.

Again, whether publishing open access is a good idea depends on which journal
you publish in, how many people will benefit from learning of your advances,
etc.

Since you say all the people who have paid for your software are for-profits,
you could consider an unrestricted academic license with a paid commercial
license (similar to all the baseclear products:
<http://www.baseclear.com/landingpages/basetools-a-wide-range-of-bioinformatics-solutions/sspacev12/>).

~~~
dalke
"I don't think there are any ethical limits on what you can charge"

I think you can see that some others hold a different viewpoint from you and
me. I've heard people say that the ability to review the software is essential
for science and that the software must, in all cases, be available for trivial
if not no cost, in order to allow that review. (I think this view doesn't have
a moral justification.)

"the issue of whether or not to publish is economic"

Absolutely. It's advertising. It's then a question for me to decide how to
maximize my profits AND maximize improvements to the field. (Only somewhat
apropos, I always loved reading the PNAS blurb for each article: "The
publication costs of this article were defrayed in part by page charge
payment. This article must therefore be hereby marked "advertisement" in
accordance with 18 U.S.C. §1734 solely to indicate this fact.")

Getting back to the topic, suppose that a group uses 64-bit USEARCH to do
research. You wrote "I'm saying that publishing the source code so that others
can run and modify it is the crucial thing." However, that group is unable to
provide the source code used to do their analysis. ("Licensee will not allow
copies of the Software to be made or used by others", says the 32-bit
license.)

If publishing the source code is crucial, then this secondary group, which
uses the algorithm, is obligated to publish the software they used, no? Or
does that obligation only apply to the primary developer of the algorithm? If
the primary developer never publishes the source code, then should all
secondary users be prohibited from using it in order to develop new science?

Does that prohibition extend to using Excel? Oracle? Built-in software in
sequencing hardware? I don't see an obvious bright-line demarcation.

As for "unrestricted academic license", I disagree with some of the
distinction between academic license/commercial license. There are academic
labs with a lot more money than I have. The group I was in, in the 1990s, had
a NeXT or IRIX box on each student's desk, for example. There are also
academic labs which act as a front, of sorts, for a professor's commercial
interests. Also, the software I'm working on makes things fast - some 40x faster
than what people would do on their own. Any group can make the time/money
tradeoff, and an academic group may easily have more money than time.

I've decided on a different view. Anyone can get access to the older versions,
at no cost and under a BSD license. The newer versions are available to
anyone, for a fee, and under the BSD license. I'm not dependent on this
product for revenue, so it's a test to see how successful this business model
might be.

------
gituliar
I'm a particle physics theorist. In our field there are a lot of open-source
projects (see <http://www.hepforge.org/projects>), as well as some private ones. I
believe that source code should be published/shared in tandem with the results it
produces. Otherwise, those results have no scientific value and I personally
tend not to trust them, even when they are in agreement with well-known
predictions. What matters is the algorithm the results are obtained with, not
the results themselves. In other words, scientists should form a hacker-like
community and not fall into the trap of a developer-user relationship where
some develop while others use software.

I actually wanted to know your opinion concerning the migration of open-source
to private software. The most common open-source and free software licenses,
like MIT or GNU, allow this kind of migration. That is, Bob could modify
Alice's public project, produce results, and publish them without sharing
sources with anybody. That looks unfair, since Alice did a great job and most
likely would like to see the changes. How can Alice be protected from such
situations? Is there any kind of license which forces one to share modified
software when some results it produces are published or made available to the
general public?

~~~
tripzilch
Isn't that what GPL does?

~~~
gituliar
I don't think so. The GPL permits keeping one's changes private and doesn't
force making them public, which is bad in the case of scientific software.
However, it forbids distributing changes under the terms of other licenses,
i.e. non-free ones, which is good.

I believe that a requirement to share source code should be imposed on
private modifications of open-source scientific software.

------
ngoldbaum
As an astronomer, I hope this idea takes off. Unfortunately, judging by the
reactions my colleagues have had when I've brought up this idea in the past, I
don't think it's very likely.

~~~
A_Allen
I hope the idea takes off, too.

Providing codes is imperative; they are part of the methods and should be
available for examination to ensure the integrity of the science. Those
conducting research funded with public monies should be required by the
funding agencies to release the products of that research, not just the
results, but the data and codes, too (absent truly compelling reasons, such as
national security). Eventually, I expect funding agencies will indeed require
this for astronomy, just as they do for some other sciences; journals could
help the field along, could improve the transparency and reproducibility of
research, by requiring code release upon publication.

Absent funding agencies and journals insisting on code release and the moral
argument of reproducibility, what incentives would help convince code authors
to release their software?

~~~
nilx
> Absent funding agencies and journals insisting on code release and the moral
> argument of reproducibility, what incentives would help convince code
> authors to release their software?

Impact Factor? -> Patrick Vandewalle, "Code Sharing Is Associated with
Research Impact in Image Processing", CiSE 2012
<http://doi.ieeecomputersociety.org/10.1109/MCSE.2012.63>

It could also be the pressure of colleagues who, as anonymous reviewers, would
always ask for the code whenever a paper depends on computation. Journal
policies will not switch to REQUIRING the code anytime soon, but peer-review
can add some pressure.

------
lutusp
> While software and algorithms have become increasingly important in
> astronomy, the majority of authors who publish computational astronomy
> research do not share the source code they develop, making it difficult to
> replicate and reuse the work.

This is troubling. There's no field involving computation in which withholding
source code is routinely accepted. The four-color map theorem proof (Appel &
Haken, 1976) would never have been accepted without source code. Modern
mathematics, to the degree to which it relies on computer results, also relies
on the publication of source.

Another example is the recent revelation involving an error in an Excel
spreadsheet and its effect on an economic analysis -- the correction wouldn't
have been possible without publication of the spreadsheet alongside the
conclusions drawn from it.

Also, replication is a cornerstone of serious science. Without replication,
astrophysics becomes psychology, where replication is rare.

I hope this paper has the effect of correcting this systematic flaw in
astrophysics publication.

------
anon_barcode
As an astronomer I hope this does not take off.

The threat to job security is not just "perceived". Astronomers are frequently
kept in a state of constant job-induced anxiety by the prevalent practice of
fighting for 6 month - 2 year "postdocs" for the first 15-20 years of their
careers (seriously, go to an astronomical conference, you would not believe
the number of people greying under 35). To "keep" your job (in reality, get
another postdoc) you must do two things: author papers and get papers cited.

This practice makes collaboration actually detrimental to a researcher's
career unless one of two things happens: 1) they lead the collaboration and
get their name as first author 2) the collaboration allows some kind of
recompense for the time invested into the collaboration.

For this reason, many collaborations have a period of proprietarity, a time
when the collaboration data is available exclusively to the people who have
sunk their time into the collaboration. Imagine, if the people behind the
Millennium or Aquarius simulations were forced to publish their code as soon as
they had run the simulations (this is true for any theorist)-- now these
people have spent the past years of their lives, tweaking and perfecting code
_for which they get no publications or citations_ and before they can even
begin to analyze the easiest results ("low hanging fruit", the simplistic
papers that usually go to collaboration members), the simulations are being
run and analyzed all over the world. They have gained nothing by their efforts
in actually developing the code (our world does not work like industry, we
don't get a pat on the back and a raise for being team players).

For theorists, often entire careers revolve around codes that have been
developed over the researcher's whole career-- to be forced to hand it over to
all the first year PhDs in the world is a tremendous slap in the face.

For me, as an "experimentalist" (read, data-analyst), I also maintain job
attractiveness by having sets of code that no one else has. I'm experienced in
a variety of pattern finding algorithms, and have even invented a few myself
for very specific problems-- and within my little sphere, people know this
about me, I'm the person people come to for certain things. I spent years of
my life perfecting these, learning about algorithms, learning the intricacies
and quirks of the various datasets we use-- if someone wants to take my place
in this community as "that guy" then I expect them to devote as much time as I
have to learning these techniques inside out and then to do something better
than me. I have no desire to pass my code off to a masters student and let him
naively plug some dataset into it-- firstly because no one should ever rely on
a black box in this field, and secondly because these codes all need to be
tweaked to account for the different instruments and data structures being
used.

Perhaps if my contract didn't have a built in expiration I would be willing to
care-bear-share my fractal search methods and spend my Thursday afternoons
showing you how Savitzky-Golay smoothing works, but as long as I'm out of a
job next July, and you're the guy applying for my job, you can keep your
filthy hands off my code.

~~~
tripzilch
> For me, as an "experimentalist" (read, data-analyst), I also maintain job
> attractiveness by having sets of code that no one else has. I'm experienced
> in a variety of pattern finding algorithms, and have even invented a few
> myself for very specific problems-- and within my little sphere, people know
> this about me, I'm the person people come to for certain things. I spent
> years of my life perfecting these, learning about algorithms, learning the
> intricacies and quirks of the various datasets we use-- if someone wants to
> take my place in this community as "that guy" then I expect them to devote
> as much time as I have to learning these techniques inside out and then to
> do something better than me. I have no desire to pass my code off to a
> masters student and let him naively plug some dataset into it-- firstly
> because no one should ever rely on a black box in this field, and secondly
> because these codes all need to be tweaked to account for the different
> instruments and data structures being used.

> Perhaps if my contract didn't have a built in expiration I would be willing
> to care-bear-share my fractal search methods and spend my Thursday
> afternoons showing you how Savitzky-Golay smoothing works, but as long as
> I'm out of a job next July, and you're the guy applying for my job, you can
> keep your filthy hands off my code.

First, WOW. I can't imagine any sector (except indeed academia) where a "keep
your filthy hands off my code" attitude like that is remotely acceptable.

Second question is, _whose_ code, exactly? In most lines of work, if you write
code for an employer (university in this case, I suppose), the copyright for
that code is implicitly given to the employer. In this particular case, it
means they could, as well as _should_, force you to open this code. It could
very well be that your contract explicitly states otherwise; it has to be
clear about _code_ though, which is not just implied together with writings, for
instance.

Imagine if you were to work for, say, a security analysis consultancy firm,
and you write a lot of cool machine learning and analysis code to detect
intrusions or leakage. Regardless of whether your contract expires, if you
leave, you're not leaving with your code. And if you refuse to document it
properly so that the next guy can't use it, expect the contract to expire
prematurely.

I can imagine that would seem frustrating and scary NOW, but only because they
made it seem like yours was a proper approach for all those years you gave it.
Of course it wasn't--and deep down inside you know this to be true--if only
everything had been open from the start.

~~~
xyzzy123
> First, WOW. I can't imagine any sector (except indeed academia) where a
> "keep your filthy hands off my code" attitude like that is remotely
> acceptable.

Actually, keeping code "close to your chest" is pretty common in the security
industry. You'll see fuzzer frameworks get released all the time, but fuzzers
which find "real bugs worth money" tend to get hoarded.

If you work for a consultancy and have private tools, you are not expected to
hand those over to the company. They will appreciate it if you release some
results with their name on it (as well as yours) every now and then though.

~~~
tripzilch
I see your point. I was actually trying to think of a counterexample where
"keep your hands off my code" would indeed be a valid attitude, and the first
that came to mind was the security industry (because what you describe indeed
makes sense). But then I remembered the way copyrights work under employment,
making it a bit of a convoluted (countercounter) example, sorry about that.

What you describe, however, is if someone _already_ has developed these tools
and then joins a security consultancy company.

Wouldn't it be _very_ different if one developed the tools _while_ under
employment of a certain company (and it was your job to develop such type of
tools)?

Cause that's the whole point of the way copyright works here, if you develop
this tool on company time, I really doubt they'd let you walk away with that
IP. _Especially_ because they are worth that money. And even if you develop it
in your own spare evenings, the law is pretty specific that this doesn't matter,
not least because it's impossible to prove (and often quite unlikely) that you
didn't use any company resources or knowledge to do so.

There is a good possibility that university contracts have different rules
about the IP you produce while doing research though.

~~~
xyzzy123
> Wouldn't it be very different if one developed the tools while under
> employment of a certain company

Basically, yes, stuff done on company time is definitely theirs under 'work
for hire'. Totally agree with you.

A lot of projects are done as evening/weekend work which is on shaky legal
ground - similar to bootstrapping a company while employed. If you have a side
activity which is making money, it's simplest to stay quiet about it.

Among the limited sample set of "people I know", it's considered extremely
crass for a company to try to pull an undeserved IP grab. Such a company would
find it really hard to get self-motivated people after pulling a move like
that. Because _everyone_ has side projects.

Although some are diligent about getting "my stuff versus your stuff" spelled
out in contract, it's often a "gentleman's agreement"...

As an aside, I wanted to point out that it's really common for security
companies (or simply, "groups") to keep internal tools and only release them
to the public when they've wrung all the "juice" out of them - i.e. when the
publicity value exceeds the value of the results.

I think this model is not too far off what was described originally.

------
crntaylor
Have you also seen the Open Exoplanet Catalogue proposal by Hanno Rein?
(<http://arxiv.org/abs/1211.7121>)

I have no idea if you know him or not, but if not, you should consider getting
in touch!

~~~
teuben
Isn't that exactly what the virtual observatory already has? They base their
tables on XML, and call it a VOTable. Lots of fabulous software can read and
write those.

