Practices in source code sharing in astrophysics (arxiv.org)
73 points by ngoldbaum 1523 days ago | 63 comments



I'm a computational biologist. If I read a paper where the results of an analysis are presented, but the code used is not, that analysis is worthless. Thankfully, in computational biology & bioinformatics at least, open source is widespread and growing.

Every programmer knows that every programmer makes mistakes. Lots of them. Mistakes that can completely change the outcome of an analysis. And if the results of your analysis are to be taken seriously, people have to be able to trust them, which means they have to be able to check your working.

Not publishing your code is anti-science, selfish, and IMHO should be disallowed in the literature.


"Not publishing your code is anti-science" ... "open source is widespread and growing"

I've noticed that few people realize that "publishing your code" and "open source" are two overlapping issues; they are not the same, and the needs of one are sometimes at odds with the needs of the other.

Open source necessarily implies that the people who receive the software have the right to modify the software, redistribute it with or without changes, and be able to do so for a charge.

No one has been able to tell me why software for scientific publications requires the third of those abilities. Everyone is agreed that access to the code and the ability to modify it makes it much easier to review and understand what it does. I can also understand the need to share modifications with collaborators, in order to help carry out the analysis. But I don't see why published software can't have a prohibition on charging a fee for using modified versions of the software, or services rendered which use the software.

My first question for you, then, is: do you need the ability to commercialize someone else's software in order to provide good scientific review of their publication? More specifically, what sorts of review would that prohibition eliminate?

In the other direction, I have scientific software which I sell for about $30K. (This is not hypothetical - I really do have this). Customers get the source code under the BSD license. This falls into every standard definition of "free software" and "open source" software. There's even an essay (at http://www.gnu.org/philosophy/selling.html ) encouraging people to sell free software.

If I publish a paper which uses the software, then am I obligated to give my peers access to the source code for no fee? Or can I publish the paper, say it's available under a BSD license, and charge $30K for access to it?

So my second question is, are there limits on what I can charge people in order to get access to my open source software, which was used for a paper? If so, what are they, and what is the ethical basis for that judgment? (For example, should it be "fair, reasonable and non-discriminatory"? Can it cover development costs? Distribution costs? Web site development costs?)


To clarify, I'm saying that publishing the source code so that others can run and modify it is the crucial thing. I'm not saying that not open sourcing your code is anti-science, but not making it available for scrutiny is.

So to answer your first question, I don't think the right to sell someone else's code is related to the scientific process. I can't think of a way it affects the progress of science to use (or not use) commercialisation-friendly licenses, except in some indirect economic ways I don't know enough about to predict.

As to the second question, I think if you tried to charge reviewers for access to your code, your paper would never be accepted at a journal. Science is largely public funded. When public funds are used for research, the fruits of that research should be made public. That includes your code.


Overall then it looks like you agree with me. The philosophical underpinnings of free software are different than the underpinnings of good science.

nods to your first answer. I would also say that being able to distribute source (with or without modifications) to others is important, though not as crucial. This prevents someone (cough Elsevier cough ACS) from having a monopoly on providing the source. But a restriction on, say, military use or wiretapping would not inhibit scientific progress, even if such a restriction makes it something other than free or open source software.

As for the second question, the scientific software I sell is not nor ever has been publicly funded. The people who have paid for it are all for-profit organizations. I don't even get R&D tax credits for it. While you're correct that much of science is publicly funded, my software isn't.

Going back to that question: are there limits on what I can charge people in order to get access to my open source software, which was used for a paper? If so, what are they, and what is the ethical basis for that judgment?

I can tell you that I would rather not publish a paper at all, and keep earning income from selling my software, than describe the techniques I used in making the software and the corrections and improvements to the existing literature that I developed. Which is better for overall scientific progress, and why?

This is especially important should I choose to publish as open access, since that journal in my field costs about $1,200, so I'm already planning to pay over a month's rent in order to publish. Should I also expect to lose the equivalent of a year's salary, in the hope that publishing the paper serves as a big enough advertisement for my services?


Yours is an interesting situation. Without public funding, in my opinion you have no moral obligation to provide free (as in beer) access to your software. Indeed, the greater social benefit (if there's a binary choice between nothing being published and there being a paper but no free (as in beer) source code), is gained by you publishing a description of the work. Others can then at least benefit from your theoretical advances. To directly answer your first question: I don't think there are any ethical limits on what you can charge (but there are economic ones).

Which outcome is better for overall scientific progress depends entirely on what your software does, how large the need for it is, etc. However, one thing that is almost certainly true is that providing free (all senses) access to the software will maximise the social benefit. So if you can find a model that allows you to profit whilst still doing this, I encourage you to do so (see last point).

If I were in your situation I would say the issue of whether or not to publish is economic: publishing a paper about the software will bring it to the attention of the scientific community. That should lead to increased custom for you, provided the software is good and priced appropriately. An example of this is Robert Edgar of http://www.drive5.com. He has written several pieces of software which have really advanced computational biology, especially USEARCH. He makes the 32-bit version available freely, and there's a ~$800 per-machine license for the 64-bit version. He also makes some other tools available freely. This mixed model seems to have performed very well for him: he is very widely cited, which gives him a high profile, and many large bioinformatics institutions buy licenses for his software.

Again, whether publishing open access is a good idea depends on which journal you publish in, how many people will benefit from learning of your advances, etc.

Since you say all the people who have paid for your software are for-profits, you could consider an unrestricted academic license with a paid commercial license (similar to all the baseclear products: http://www.baseclear.com/landingpages/basetools-a-wide-range...).


"I don't think there are any ethical limits on what you can charge"

I think you can see that some others hold a different viewpoint from you and me. I've heard people say that the ability to review the software is essential for science and that the software must, in all cases, be available for trivial if not no cost, in order to allow that review. (I think this view doesn't have a moral justification.)

"the issue of whether or not to publish is economic"

Absolutely. It's advertising. It's then a question for me to decide how to maximize my profits AND maximize improvements to the field. (Only somewhat apropos, I always loved reading the PNAS blurb for each article: "The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.")

Getting back to the topic, suppose that a group uses 64-bit USEARCH to do research. You wrote "I'm saying that publishing the source code so that others can run and modify it is the crucial thing." However, that group is unable to provide the source code used to do their analysis. ("Licensee will not allow copies of the Software to be made or used by others", says the 32-bit license.)

If publishing the source code is crucial, then this secondary group, which uses the algorithm, is obligated to publish the software they used, no? Or does that obligation only apply to the primary developer of the algorithm? If the primary developer never publishes the source code, then should all secondary users be prohibited from using it in order to develop new science?

Does that prohibition extend to using Excel? Oracle? Built-in software in sequencing hardware? I don't see an obvious bright-line demarcation.

As for "unrestricted academic license", I disagree with some of the distinction between academic license/commercial license. There are academic labs with a lot more money than I have. The group I was in, in the 1990s, had a NeXT or IRIX box on each student's desk, for example. There are also academic labs which act as a front, of sorts, for a professor's commercial interests. Also, the software I'm working on makes things fast - some 40x more than what people would do on their own. Any group can make the time/money tradeoff, and an academic group may easily have more money than time.

I've decided on a different view. Anyone can get access to the older versions, at no cost and under a BSD license. The newer versions are available to anyone, for a fee, and under the BSD license. I'm not dependent on this product for revenue, so it's a test to see how successful this business model might be.


If the paper is based on the analysis of sequence data, it's almost certainly been processed by closed or "shared source" primary data analysis tools. That's true of Illumina/Roche/ABI. Biotech vendors have quite a long way to go in this regard.


In astronomy we are still fighting for public data. Verification of results comes from another researcher collecting his own data and running his own analysis.

Out of curiosity, do you work for industry or academia?


That's an extremely inefficient and costly way to validate a result! It would be rendered unnecessary in cases where just looking at the code invalidated the result.

I'm in academia (see my profile for details).


It's good and useful for independent researchers to collect their own data and analyse it. That's a check on the data collection methods.

It's also useful for others to check the analysis of trusted data. If they do it with their own methods and get a different result, then it's time to actually compare the methods.

It's a good thing for the original researchers to release their code too. But looking for bugs in the code is no substitute for doing independent analysis and independent data collection. It's an important supplement, but it isn't enough.


I agree with your points. The saving in the case of released source code is that, if inspecting the code shows the original analysis is simply wrong, there is no need to repeat it, and a large amount of expense and effort is saved. Of course, if the original analysis holds up given the data and code, independent replication is a necessary thing.


Sometimes validation can be done without knowing the actual algorithm or being able to rerun it.

For example, if a paper publishes a multiple sequence alignment of 100 sequences, then all you need to do is verify that it's at least as good as what other MSA programs generate. You don't need to be able to rerun the program.
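(As a toy sketch -- the scoring scheme and sequences below are made up for illustration -- a sum-of-pairs comparison needs nothing from the original program:)

    # Toy sum-of-pairs scorer: compare a published alignment against one from
    # any other MSA tool, without access to the original program.
    # The match/mismatch/gap scores and the sequences are invented.
    from itertools import combinations

    def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
        """Score an MSA given as a list of equal-length gapped strings."""
        score = 0
        for row_a, row_b in combinations(alignment, 2):   # every pair of rows
            for x, y in zip(row_a, row_b):                 # column by column
                if x == '-' and y == '-':
                    continue                               # skip gap-gap columns
                elif x == '-' or y == '-':
                    score += gap
                elif x == y:
                    score += match
                else:
                    score += mismatch
        return score

    published = ["ACG-T", "AC-GT", "ACGGT"]   # alignment reported in the paper
    reference = ["ACG-T", "ACG-T", "ACGGT"]   # alignment from another program

    print(sum_of_pairs(published), sum_of_pairs(reference))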

Indeed, sometimes you can't. Someone could have manually aligned the sequence, and that process isn't reproducible.

You see this as well with fold prediction software, where it can be easy to show that a predicted fold is likely correct (ie, matches experimental observations), and where you don't necessarily care about the method used to get that fold prediction.

A genetic algorithm might be very sensitive to the compute environment. For example, the order of float operations generated by two different compiler settings, or by network traffic timings in a distributed system. This can lead to different minima, where the overall effectiveness is the same but the actual configuration is different. The GA search doesn't need to be reproducible; it's only the final effectiveness which is important.
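(A quick illustration of the float-ordering point: summing the same made-up numbers in a different order already gives a slightly different answer, which is all a stochastic search needs to wander toward a different minimum.)

    # Floating-point addition is not associative, so two compilers or two
    # reduction orders in a distributed sum can legitimately disagree in the
    # last bits -- enough to nudge a stochastic search onto a different path.
    import random

    random.seed(0)
    xs = [random.uniform(-1.0, 1.0) for _ in range(100000)]

    forward = sum(xs)
    backward = sum(reversed(xs))

    print(forward == backward)    # usually False
    print(forward - backward)     # tiny, but nonzero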

For those cases, I don't see the need for access to the underlying software in order to validate the result.


The problem with the MSA question is that there is no objective measure of goodness. Most people compare them to a subjective 'gold standard', which is a manual alignment. Black box comparisons are not progressive: the only non-subjective advances can come from incorporating the results of sequence evolution experiments into the algorithms, and then the algorithm needs to be trusted, for which it needs to be open to scrutiny.

I agree about the genetic algorithm.


I'm a particle physics theorist. In our field there are a lot of open-source projects (see http://www.hepforge.org/projects), as well as some private ones. I believe that source code should be published/shared in tandem with the results it produces. Otherwise, those results have no scientific value and I personally tend not to trust them, even when they are in agreement with well-known predictions. What matters is the algorithm the results are obtained with, not the results themselves. In other words, scientists should form a hacker-like community and not fall into the trap of a developer-user relationship where some develop while others merely use software.

I actually wanted to know your opinion concerning the migration of open-source to private software. The most common open-source and free software licenses, like MIT or GNU, allow this kind of migration. That is, Bob could modify Alice's public project, produce results, and publish them without sharing his sources with anybody. That looks unfair, since Alice did a great job and would most likely want to see the changes. How can Alice be protected from such situations? Is there any kind of license that forces you to share modified software when results it produces are published or made available to the general public?


Isn't that what GPL does?


I think not. The GPL permits keeping one's changes private and doesn't force making them public, which is bad in the case of scientific software. However, it forbids distributing changes under the terms of other, non-free licenses, which is good.

I believe that a requirement to share source code should be imposed on private modifications of open-source scientific software.


As an astronomer, I hope this idea takes off. Unfortunately, judging by the reactions my colleagues have had when I've brought up this idea in the past, I don't think it's very likely.


I hope the idea takes off, too.

Providing codes is imperative; they are part of the methods and should be available for examination to ensure the integrity of the science. Those conducting research funded with public monies should be required by the funding agencies to release the products of that research, not just the results, but the data and codes, too (absent truly compelling reasons, such as national security). Eventually, I expect funding agencies will indeed require this for astronomy, just as they do for some other sciences; journals could help the field along, could improve the transparency and reproducibility of research, by requiring code release upon publication.

Absent funding agencies and journals insisting on code release and the moral argument of reproducibility, what incentives would help convince code authors to release their software?


> Absent funding agencies and journals insisting on code release and the moral argument of reproducibility, what incentives would help convince code authors to release their software?

Impact Factor? -> Patrick Vandewalle, "Code Sharing Is Associated with Research Impact in Image Processing", CiSE 2012 http://doi.ieeecomputersociety.org/10.1109/MCSE.2012.63

It could also be the pressure of colleagues who, as anonymous reviewers, would always ask for the code whenever a paper depends on computation. Journal policies will not switch to REQUIRING the code anytime soon, but peer-review can add some pressure.


Is it a code quality or a code portability issue?


As discussed in the linked article, it's mostly because of code quality, but with a healthy dose of perceived threat to job security. Unfortunately the academic citation economy isn't very nurturing of community codes, so a lot of work gets repeated when secretive research groups compete.

That's not the whole story, and the fact that this article is getting published speaks to that. However, I still don't think it's likely that things will change in the short term.

My friend Matt Turk wrote about a similar topic recently: http://arxiv.org/abs/1301.7064


Repetition isn't necessarily a bad thing. Whether compatible results can be obtained with independent analyses of the same data is surely an important indicator of success.

Of course that should not preclude the possibility of replication as described here.

I find the idea of threats to job security misguided. In the particle physics world some significant code (not everything, granted) has been made available for some time; e.g. Monte-Carlo simulators like Geant or lattice gauge theory codes like those from MILC, SciDAC or FermiQCD. Even highly optimised code has been available. Users are requested to cite the authors. No careers have been harmed, as far as I can tell.

Also, the code itself is not much use for replicating results without making the data available too. This can be more problematic politically and technically, but even here there are good precedents in the particle physics world.


Unfortunately the world of scientific collaboration follows a strict pattern: The Article

Which is good as a way of 'settling' information into a medium, but an awful fit for today. It's the equivalent of stone tablets compared to what's now available.

It would be much better if the products of research could be shared in a more flexible format. For example, don't get me started on reading two-column PDFs on a computer screen.

Flexible formatting, attachable metadata, reusable "source code" (that could be an excel spreadsheet, a txt list of data, or source code per se), redrawable graphs, etc

We need a faster, more flexible, albeit official way of sharing data. Yes, there are ownership issues, IP issues, etc. But maybe having a more easily searchable/traceable system may help with this.


> We need a faster, more flexible, albeit official way of sharing data.

Metajournals are trying to do it, with an "Article" overlay to data/software publication: http://metajnl.com/. IPOL also tries to publish image processing software: http://www.ipol.im/. And there is also some work to interlink science assets (more than articles) at http://linkedscience.org/. I welcome publication models not based on the "PDF standard" (i.e. an electronic clone of the centuries-old ink-and-paper text article), but simply sharing data/software is not sufficient. I think it is important to have an editorial line and peer review, otherwise anything online would qualify as scientific communication.

Moreover, software and data is hardly sufficient by itself; it needs to be explained, documented, illustrated and discussed to be more than a "harddisk dump".


I think this is a deeper problem that affects the whole scientific academic community. Maybe the solution is the growth of a new parallel movement of amateur scientists who pursue research out of passion and the desire to advance human knowledge, not because their job or position depends on it. In such an environment, sharing your tools would probably be the norm.


> While software and algorithms have become increasingly important in astronomy, the majority of authors who publish computational astronomy research do not share the source code they develop, making it difficult to replicate and reuse the work.

This is troubling. There's hardly another field involving computation in which withholding source code is routinely accepted. The four-color map theorem proof (Appel & Haken, 1976) would never have been accepted without source code. Modern mathematics, to the degree to which it relies on computer results, also relies on the publication of source.

Another example is the recent revelation involving an error in an Excel spreadsheet and its effect on an economic analysis -- the correction wouldn't have been possible without publication of the spreadsheet alongside the conclusions drawn from it.

Also, replication is a cornerstone of serious science. Without replication, astrophysics becomes psychology, where replication is rare.

I hope this paper has the effect of correcting this systematic flaw in astrophysics publication.


As an astronomer I hope this does not take off.

The threat to job security is not just "perceived". Astronomers are frequently kept in a state of constant job-induced anxiety by the prevalent practice of fighting for 6-month to 2-year "postdocs" for the first 15-20 years of their careers (seriously, go to an astronomical conference; you would not believe the number of people going grey under 35). To "keep" your job (in reality, get another postdoc) you must do two things: author papers and get papers cited.

This practice makes collaboration actually detrimental to a researcher's career unless one of two things happens: 1) they lead the collaboration and get their name as first author 2) the collaboration allows some kind of recompense for the time invested into the collaboration.

For this reason, many collaborations have a proprietary period, a time when the collaboration data is available exclusively to the people who have sunk their time into the collaboration. Imagine if the people behind the Millennium or Aquarius simulations were forced to publish their code as soon as they had run the simulations (this is true for any theorist)-- now these people have spent the past years of their lives tweaking and perfecting code _for which they get no publications or citations_, and before they can even begin to analyze the easiest results ("low hanging fruit", the simplistic papers that usually go to collaboration members), the simulations are being run and analyzed all over the world. They have gained nothing by their efforts in actually developing the code (our world does not work like industry; we don't get a pat on the back and a raise for being team players).

For theorists, often entire careers revolve around codes that have been developed over the researcher's whole career-- to be forced to hand it over to all the first year PhDs in the world is a tremendous slap in the face.

For me, as an "experimentalist" (read, data-analyst), I also maintain job attractiveness by having sets of code that no one else has. I'm experienced in a variety of pattern finding algorithms, and have even invented a few myself for very specific problems-- and within my little sphere, people know this about me, I'm the person people come to for certain things. I spent years of my life perfecting these, learning about algorithms, learning the intricacies and quirks of the various datasets we use-- if someone wants to take my place in this community as "that guy" then I expect them to devote as much time as I have to learning these techniques inside out and then to do something better than me. I have no desire to pass my code off to a masters student and let him naively plug some dataset into it-- firstly because no one should ever rely on a black box in this field, and secondly because these codes all need to be tweaked to account for the different instruments and data structures being used.

Perhaps if my contract didn't have a built-in expiration I would be willing to care-bear-share my fractal search methods and spend my Thursday afternoons showing you how Savitzky-Golay smoothing works, but as long as I'm out of a job next July, and you're the guy applying for my job, you can keep your filthy hands off my code.


Wow. This reply has only served to illustrate how broken the current system is.

If you are publishing articles with no source code, how is that even science? Why not just skip the articles and publish unsubstantiated assertions? After all, if you publish details of how you worked stuff out, even without source code someone else might reproduce your work.

If the new rule was that you had to publish source then there would still be the same number of jobs in astronomy. I don't see how _everyone's_ careers could be negatively impacted. And don't you think that if you are using the software from another group you would cite the authors?


"If you are publishing articles with no source code, how is that even science?" - We have a methods section in the paper, just like in the science fair.

"After all, if you publish details of how you worked stuff out, even without source code someone else might reproduce your work." - hopefully they do! and hopefully they do it in a different way, doing the same thing twice is pointless. This is why the Hubble Constant started off being 150, then was 50, and now hovers around 69. Because everyone did the problem in their own unique way and didn't borrow each other's code and procedures.

"there would still be the same number of jobs in astronomy" - yes, that's the problem, there aren't enough. So the job goes to the guy with the most citations and publications-- I want that guy to be me.

"And don't you think if you are using the software from another group you would cite the authors?" - that's not the whole point. Writing a simulation may take 3 years, then analysis can take another 3 years. If I wrote it, I want first dibs on analysing it, I don't want someone else taking the results I created and publishing them before me.

The system is cutthroat, sure, but it does have benefits if you know how to play the game. Politics is surprisingly omnipresent here.


> ... doing the same thing twice is pointless.

That's what psychologists say (according to Richard Feynman in his now-famous criticism of the practice in "Cargo Cult Science"), but it's a fundamentally flawed view of science. Replication -- an exact duplication of a study and hopefully its results -- is a cornerstone of science. And without full disclosure of the original study's method and results, replication isn't possible.

> This is why the Hubble Constant started off being 150, then was 50, and now hovers around 69. Because everyone did the problem in their own unique way and didn't borrow each other's code and procedures.

That isn't an argument for the practice of withholding methods, it's an argument against it. Arriving at a realistic value for the Hubble constant wasn't accelerated by withholding methods, it was slowed. Imagine Einstein claiming that mass and energy are bound together in a clearly defined relationship, but not publishing the relationship or its derivation.

Arthur C. Clarke said, "Any sufficiently advanced technology is indistinguishable from magic." Apropos the present topic, a result unaccompanied by the methods used to produce it is also indistinguishable from magic, and certainly isn't science.

> If I wrote it, I want first dibs on analysing it, I don't want someone else taking the results I created and publishing them before me.

We're not taking about someone beating the originator to publication, we're talking about the originator publishing his methods along with his results and getting credit for both.

> Politics is surprisingly omnipresent here.

Perhaps, but always at the expense of science. Science relies on full disclosure -- both methods and results. How can a theoretical consensus be arrived at if the participants don't know what they're agreeing about?


"Replication -- an exact duplication of a study and hopefully its results -- is a cornerstone of science. And without full disclosure of the original study's method and results, replication isn't possible."

My first quarter in grad school in statistics, a girl who was a couple of years ahead of me quit. She was doing an internship with an MD, and every time he got a new diagnosed kid in the project he pondered which group to put the kid in. "This one's going to die, where do I put him to make the results come out right?"

The institution was considered second-class in that field, so he got the money to replicate a Harvard result. If he got the right results it would make him look more competent and he might get better grants. So he was doing his best to fudge the results so they would come out right.

She tried to argue that he should do his statistics correctly and he disagreed. She was so upset that her work was pointless, and that her career would be pointless, that she quit completely.

Ideally scientists would be rewarded for doing open science correctly. In some ways that is not the case now, and we should look at ways to fix that.

How can average mediocre scientists get job security without keeping secrets? Perhaps we are giving too many people a chance to be professional scientists, so that too many of them must lose out?


What a sad account. What a negative outcome for science. And I'm sure it's common in many fields.

> How can average mediocre scientists get job security without keeping secrets?

By way of systematic changes, not by anything they necessarily can do themselves at the level you're describing.

By way of people near the situation willing to speak out, like those who brought Jan Hendrik Schön down:

http://en.wikipedia.org/wiki/Sch%C3%B6n_scandal

Or in a similar way, by the actions of those who brought several highly regarded psychologists down:

http://science.nbcnews.com/_news/2013/02/20/17032396-scandal...

Title: "Scandals force psychologists to do some soul-searching"

Quote relevant to the present topic: "For instance, right now there's no incentive for researchers to share their data, and a 2006 study found that of 141 researchers who had previously agreed to share their data, only 38 did so when asked."

But the existence of the linked article and others like it means something is being done -- many psychologists have been outed as frauds, or fired, or forced into retirement.

So there's reason to hope that, if enough daylight shines on these practices and laboratories, things will change.


> "If you are publishing articles with no source code, how is that even science?" - We have a methods section in the paper, just like in the science fair.

yes but the pipeline goes METHODS -> (CODE) -> RESULTS.

you can read the methods and they check out. fine. you can't see the code. the results show something new and amazing. how do you know this is because of the methods used or because somebody flipped a minus sign in the code?

well, you don't. unless you reimplement the code from the methods and see if it's reproducible. except that nobody does this. but let's say they did and found completely different results, that are also new and amazing. who is right? the answer is: NOBODY KNOWS, because nobody is publishing their code!

so astrophysicist number 3 comes around, wants to know who is right, and has only these methods to go on, but no code. he has to start from scratch. because the universe hates us, he will find a third set of completely different new and amazing results that aren't even of the same datatype as the first two--but nobody knows who is right, because everyone is hiding their code and claiming it's based on the same methodology.
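(a toy illustration with made-up numbers: the methods section says "subtract the background, then take the mean", and one flipped sign later you have a different, equally plausible-looking number that no reader can distinguish from the real one)

    # Toy example: identical written-up "methods", one flipped minus sign in
    # the code. The paper only shows the output, so nobody can tell which ran.
    import numpy as np

    rng = np.random.default_rng(1)
    flux = rng.normal(10.0, 1.0, 1000)   # made-up measurements
    background = 2.0

    mean_as_described = np.mean(flux - background)   # what the methods say
    mean_sign_flipped = np.mean(flux + background)   # one wrong character

    print(mean_as_described, mean_sign_flipped)      # ~8 vs ~12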

> doing the same thing twice is pointless.

No, that's science.

What is pointless is doing the same thing twice, in the hope you're doing the same thing as the last guy, because you don't know what the last guy did. Also, hoping that if you make errors (like the last guy undoubtedly did, because hey this is unreviewed closed source code we're talking about), that they'll be different errors. Except you can't know, because nobody publishes their code.

All of this reminds me a little bit of the parable of the Stone Soup. And if the above behaviour is causing people stress and grey hairs about their job security, well I don't wish that upon anybody, but hoarding source code isn't the way forward either.


> well, you don't. unless you reimplement the code from the methods and see if it's reproducible. except that nobody does this. but let's say they did and found completely different results, that are also new and amazing. who is right? the answer is: NOBODY KNOWS, because nobody is publishing their code!

No one is going to spend time auditing the code to find a mistake either. In academia the way these things are, and should be, validated is by comparing the output to known results. In simulation this means do my simulations match experimental reality. In data analysis this means you check that you can verify something that's known to be true about a particular dataset.

That kind of validation in itself doesn't happen very often (it's often missing for example in computational chemistry). But in all honesty having the code, wouldn't (and doesn't) help.

I'm not arguing that the code shouldn't be open source, but that if you're presenting "new and amazing" results the way you back those up is not by saying you checked your code really well, but by showing that your method and implementation are consistent with known facts while presenting something new.


> you can read the methods and they check out.

I'm a materials scientist specializing in microscopy and image analysis, not an astrophysicist. In my area of interest, there is rarely sufficient information published to confidently repeat results. Indeed, the current reproducible research community had origins in computer vision (and was then embraced by the biostatistics community). I applaud the expectation of openness - especially if funded by government grants. Too often, we taxpayers pay for research that get trapped behind paywalls.

I agree that the academic astrophysics community has it tough. This is symptomatic of the entire academic community, which is producing too many Ph.D.s chasing too few academic positions. Guess what - it will only get worse when the unsustainable government spending and debt causes the education bubble to burst.


> …hopefully they do it in a different way, doing the same thing twice is pointless.

I believe Newton said, "If I have seen further it is by standing on the shoulders of giants.", not, "I have rediscovered everything in a slightly different way because no-one gave me more than a vague method to go on"

It is no bad thing to have several independent implementations of an algorithm. If there is a problem with one, then the others are likely to show that there is a problem. However, without open source code, all you can do is say, "Mmm, something's wrong somewhere" and write another version. You may end up with several papers that agree and one that doesn't, but you still can't draw any conclusions without looking at the code. Ultimately, you get a situation where everyone has to repeat the same work and see what they get, when what they should be doing is poring over the original sources and discussing which bits could/should have been implemented differently and how that affects the result.

The code is part of the method. If you can't show us the code, don't expect me to believe your 'results'. That is science.


Newton also said, "If others would think as hard as I did, then they would get similar results." This seems to be part of the attitude OP is taking, that if others want his best, most unique results, they should arrive at them independently. But then when someone did that with Newton (Leibniz), the resulting fallout "redounded to the discredit of all concerned." (Here is a nice summary of the beef: http://www.ams.org/notices/200905/rtx090500602p.pdf)


I once looked and could not find any evidence that Newton said that. Since the quote is designed to sound good to modern ears, the attribution to Newton is probably bogus.


"no one should ever rely on a black box in this field"

This is deeply hypocritical. When someone publishes results generated by code but doesn't publish the code, they are asking all readers to rely on the output of a black box.


Valid.

But they are presenting the methodology they claim went into the code and allowing you to agree or disagree with the methodology. If you agree, you take their answer; if you disagree, you implement a different methodology (which necessitates new code anyway) and publish your contra-finding.

We usually converge on answers over years of varying methods and attempts. No one in their right mind reads a paper and says "well, they found out the sun is actually in M31, I believe that now." They look at the culmination of the literature (fun fact: we still aren't positive exactly where the sun is, or how fast it's moving), and I suppose that is the black box.

What I was really referring to though was students who walk up to me and say they gaussian smoothed a sample, and have no idea what I mean when I ask if 3 sigma outliers were used in the fit or trashed. They just used some gaussSmooth algorithm and may or may not know what a gaussian even is.
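(A toy illustration, nothing to do with my actual code and with invented numbers: trashing 3-sigma outliers before smoothing versus feeding everything straight to the filter gives visibly different answers, and "we smoothed the data" in a methods section hides that choice.)

    # Toy sketch: Gaussian smoothing with and without first trashing
    # 3-sigma outliers. The signal and thresholds here are made up.
    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    rng = np.random.default_rng(42)
    signal = np.sin(np.linspace(0, 10, 500)) + rng.normal(0, 0.1, 500)
    signal[::50] += 5.0                         # inject a few gross outliers

    # version 1: smooth everything as-is
    smooth_naive = gaussian_filter1d(signal, sigma=5)

    # version 2: replace >3-sigma outliers with the median, then smooth
    dev = np.abs(signal - np.median(signal))
    clipped = np.where(dev > 3 * signal.std(), np.median(signal), signal)
    smooth_clipped = gaussian_filter1d(clipped, sigma=5)

    print(np.max(np.abs(smooth_naive - smooth_clipped)))   # noticeably nonzero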


(edit the "code" I'm talking about is complex stuff like numerical simulations etc, if we were talking about the code used for plotting a simple graph or histogram, that's a different matter--although it wouldn't hurt anyone publishing a full paper open source as a make-file with some TeX sources calling a few plotting libraries)

> But they are presenting the methodology they claim went into the code and allowing you to agree or disagree with the methodology. If you agree, you take their answer--

No. And this is the whole point. You can agree with the methodology, but as long as you haven't inspected the code, you cannot just accept that they actually implemented it right!

In fact, chances are they probably didn't. The best programmers write buggy and incorrect code. And from what I've seen in the field of physics, scientific code is anything but an exception to that. I'd be surprised if it was very different for astrophysics.

So by publishing the methodology and the results, but not the code, they put up a nice show. But that's it; it's not reproducible. So all that's really brought to the table is the methodology and some "say-so" results that nobody can check are accurate.


True. Nobody can check if your data + your methodology = the results you claim.

But it is quite common for someone to take the same problem, collect their own data and run their own analysis (both steps can be completely different from the original statement-- collecting data of a different type, from a different instrument, etc. Running an analysis of a different paradigm, running the same analysis to a different precision, running the same analysis via a different algorithm, etc.).

If you read the arxiv on a daily basis you can see huge academic arguments unfolding over the course of months and years.

There seems to be this idea that the conversation goes: "Yo I found this hypervelocity star" "Dope, let's move on"

It's actually more like: "Yo I found this hypervelocity star" "Nope, I got spectra and you're wrong" "Well I got ultra high-res spectra and I think he's right" "Actually all of you are forgetting asymmetric drift, this is just a geometry problem, l2angles" "Hey, I sit in my basement and play with MOND, it might help" "My simulations show something completely different"

Authors are called out and proven right or wrong on a daily basis, even if we can't watch them code over the shoulder. I actually think that's one of the beauties of it-- most of our methods are invented by trying to prove or disprove something in a new way.

The monster codes like GADGET (which ran the Millennium sim everyone's seen) are usually made public after ~5 years of being proprietary.


> It's actually more like: "Yo I found this hypervelocity star" "Nope, I got spectra and you're wrong" "Well I got ultra high-res spectra and I think he's right" "Actually all of you are forgetting asymmetric drift, this is just a geometry problem, l2angles" "Hey, I sit in my basement and play with MOND, it might help" "My simulations show something completely different"

Haha, that actually sounds almost exactly like the Stone Soup parable :)

( http://en.wikipedia.org/wiki/Stone_soup )


I certainly agree that people shouldn't be using techniques they don't understand.

But to address the main point - there are two separate things to consider when assessing a paper's methods. First is the methodology, which as you say is always included in the paper. But secondly, and equally importantly, is the implementation.

Agreeing with the methodology does not make me confident in the results. Someone wrote (probably) a lot of code to generate the analysis, and the likelihood that it contains bugs is high. They may or may not affect the outcome. Without seeing the code, I'm not going to trust the results.

Of course, I don't expect to read the source code of every analysis, but if it's open to scrutiny by the community, and the results are of any importance, it will be validated.

The problem is actually worse in many experimental methods, where the results rely completely on the practitioner having done exactly what they say they did and done it correctly. No source code to publish there, but that doesn't excuse not publishing analytical code when it is available.


This is disappointing. I'll say that I wish the situation would change so that you would be able to get funding without the dog-eat-dog, allowing you to share source code and ideas. Think how much the next version of you would be able to do if they had access to the ideas you invented! Not to admonish you; obviously it's the situation that's causing this.

It's utterly bizarre that there would be any field of science that actively deterred collaboration because without collaboration it seems incredibly difficult to achieve great things.


I'm confused by one of your scenarios:

Imagine if the people behind the Millennium or Aquarius simulations were forced to publish their code as soon as they had run the simulations (this is true for any theorist)-- now these people have spent the past years of their lives tweaking and perfecting code _for which they get no publications or citations_, and before they can even begin to analyze the easiest results ("low hanging fruit", the simplistic papers that usually go to collaboration members), the simulations are being run and analyzed all over the world. They have gained nothing by their efforts in actually developing the code (our world does not work like industry; we don't get a pat on the back and a raise for being team players).

My confusion: who said they had to release their code when it was finished? My understanding is that they would only release their code when they first publish. So I don't see a problem here: when they finish their code, but before they can analyze their first results, they have no external pressures.

However, once they publish those preliminary results? Yes, there will be external pressures, because others can start using their code to also look for interesting results. But that's the way things should progress. The original authors will still have an enormous benefit from having a deep understanding of the code, and the techniques they used in it.

I'm a computer scientist. I work on systems software research. I work in industry, so I can't release most of my code. When I was a PhD student, I always released my code. I wanted people to try to build on what I did. In fact, releasing my code is what has actually gotten me citations - people have either used my code directly in a larger project, or they have extended and improved upon it.

In systems software, academics always release their code. They want people to use it and build on it. The frequency with which other groups are able to beat out the original authors is almost zero - understanding a non-trivial code base takes time, and if the original authors are still working on the same problem, they have an enormous advantage.

Your attitude protects you, but prevents others from learning from you.


> For me, as an "experimentalist" (read, data-analyst), I also maintain job attractiveness by having sets of code that no one else has. I'm experienced in a variety of pattern finding algorithms, and have even invented a few myself for very specific problems-- and within my little sphere, people know this about me, I'm the person people come to for certain things.

> I spent years of my life perfecting these, learning about algorithms, learning the intricacies and quirks of the various datasets we use-- if someone wants to take my place in this community as "that guy" then I expect them to devote as much time as I have to learning these techniques inside out and then to do something better than me.

> I have no desire to pass my code off to a masters student and let him naively plug some dataset into it-- firstly because no one should ever rely on a black box in this field, and secondly because these codes all need to be tweaked to account for the different instruments and data structures being used.

> Perhaps if my contract didn't have a built-in expiration I would be willing to care-bear-share my fractal search methods and spend my Thursday afternoons showing you how Savitzky-Golay smoothing works, but as long as I'm out of a job next July, and you're the guy applying for my job, you can keep your filthy hands off my code.

First, WOW. I can't imagine any sector (except indeed academia) where a "keep your filthy hands off my code" attitude like that is remotely acceptable.

The second question is: whose code, exactly? In most lines of work, if you write code for an employer (the university in this case, I suppose), the copyright for that code implicitly belongs to the employer. In this particular case, that means they could, and arguably should, force you to open this code. It could very well be that your contract explicitly states otherwise, but it has to be explicit about code; that isn't just implied along with writings, for instance.

Imagine if you were to work for, say, a security analysis consultancy firm, and you write a lot of cool machine learning and analysis code to detect intrusions or leakage. Regardless of whether your contract expires, if you leave, you're not leaving with your code. And if you'd refuse to document it properly so that the next guy can't use it, expect the contract to expire prematurely.

I can imagine that would seem frustrating and scary NOW, but only because they made it seem like yours was a proper approach for all those years you gave it. Of course it wasn't--and deep down inside you know this to be true--if only everything had been open from the start.


> First, WOW. I can't imagine any sector (except indeed academia) where a "keep your filthy hands off my code" attitude like that is remotely acceptable.

Actually, keeping code "close to your chest" is pretty common in the security industry. You'll see fuzzer frameworks get released all the time, but fuzzers which find "real bugs worth money" tend to get hoarded.

If you work for a consultancy and have private tools, you are not expected to hand those over to the company. They will appreciate it if you release some results with their name on it (as well as yours) every now and then though.


I see your point. I was actually trying to think of a counterexample where "keep your hands off my code" would indeed be a valid attitude, and the first that came to mind was the security industry (because what you describe indeed makes sense). But then I remembered the way copyrights work under employment, making it a bit of a convoluted (countercounter) example, sorry about that.

What you describe, however, is if someone already has developed these tools and then joins a security consultancy company.

Wouldn't it be very different if one developed the tools while under employment of a certain company (and it was your job to develop such type of tools)?

Cause that's the whole point of the way copyright works here: if you develop this tool on company time, I really doubt they'd let you walk away with that IP, especially because those tools are worth that money. And even if you develop it in your own spare evenings, the law is pretty specific that that doesn't matter, for one thing because it's impossible to prove (and often quite unlikely) that you didn't use any company resources or knowledge to do so.

There is a good possibility that university contracts have different rules about the IP you produce while doing research though.


> Wouldn't it be very different if one developed the tools while under employment of a certain company

Basically, yes, stuff done on company time is definitely theirs under 'work for hire'. Totally agree with you.

A lot of projects are done as evening/weekend work which is on shaky legal ground - similar to bootstrapping a company while employed. If you have a side activity which is making money, it's simplest to stay quiet about it.

Among the limited sample set of "people I know", it's considered extremely crass for a company to try to pull an undeserved IP grab. Such a company would find it really hard to get self-motivated people after pulling a move like that. Because everyone has side projects.

Although some are diligent about getting "my stuff versus your stuff" spelled out in contract, it's often a "gentleman's agreement"...

As an aside, I wanted to point out that it's really common for security companies (or simply, "groups") to keep internal tools and only release them to the public when they've wrung all the "juice" out of them - e.g. publicity value exceeds the value of the results.

I think this model is not too far off what was described originally.


The problem is that in terms of getting jobs, promotions, and tenure, producing code has no value. Except for the people that actually write it, "programming" is generally considered to be a trivial task. So if you invest time writing data analysis code, the only way you can get a return on that investment is to keep it proprietary to your group and use it to get results before your competitors do.

We can't expect source code publication until we change the perception that it's trivial. If published code were as valuable as refereed papers for career advancement, the problem would be solved instantly. But if that change does not occur, requiring that analysis code be published will just stop the development of any new code (at least by rational, career-minded people) and probably hurt the field more than it helps.


> Perhaps if my contract didn't have a built-in expiration I would be willing to care-bear-share my fractal search methods and spend my Thursday afternoons showing you how Savitzky-Golay smoothing works, but as long as I'm out of a job next July, and you're the guy applying for my job, you can keep your filthy hands off my code.

This seems to be getting closer to the root of the problem, any insight into why the job market is set up that way?


Several reasons.

Firstly, it's a boys' club up top-- a lot of PhD requirements are built around what are basically hazing rituals. It's designed to make you "prove" yourself-- instead of remembering a frat's founding members, we memorize moments of inertia and rotation matrices, but it's really the same thing, seeing as how I don't deal with those items on a daily basis at all, and a quick duckduckgo will give you more information than I ever could.

Secondly, astronomy and astrophysics is arguably the least essential of the sciences, so our funding shrinks accordingly when economies crash (what do we really make, coffee table books?). We don't make quantum cryptography, and we can't levitate objects... although we have some remarkably military-grade trajectory maps for all your satellites...

Thirdly I think it really is an attempt to make us work well past what any other industry would consider remotely acceptable. For a bit of insight into the situation, check this completely serious correspondence sent to an unnamed faculty which is now famous in the astrophysical community: http://jjcharfman.tumblr.com/post/33151387354/a-motivational...

I personally work with a 40-year-old doctor, a leader in their field, who has not published a major paper in the last 3 years. They're being deported and will be moving back in with their family if they can't get a job in the next month or so, after having had two or three 3-month "pity" contracts. Of course this is just a hearsay example, not the norm, but it illustrates the point.

Another thing to remember is that we basically cannot have real families-- our jobs frequently require us to change states or countries at the terminus of our postdocs, and our salaries are on the order of an entry-level BSc coder's, usually 50-80k depending on skill. The hotshots in the field get 5-year contracts. At age 30, a 5-year contract is pretty much the best you can possibly do.


The tone of that letter reminds me why I left academic research...


Sounds like an honest assessment. You are neither a slave nor without choices. You are free to examine your job skills and transform yourself into someone who can get a more stable position in another field. As someone working through that process now to get out of my own no-win situation, I can tell you it is neither easy nor fun, but at the end of each day I sleep well knowing that I tried to make the most prudent choice presented to me.


"I spent years of my life perfecting these, learning about algorithms, learning the intricacies and quirks of the various datasets we use-- if someone wants to take my place in this community as "that guy" then I expect them to devote as much time as I have to learning these techniques inside out and then to do something better than me. I have no desire to pass my code off to a masters student and let him naively plug some dataset into it-- firstly because no one should ever rely on a black box in this field, and secondly because these codes all need to be tweaked to account for the different instruments and data structures being used."

The result is that you are important, and no one else can check on you. That's kind of good for you, but there should be a way you can do better.

My junior year in college some psychologists told me about a statistician who would once a year write a paper demonstrating ways that psychologists misused some statistical technique. He would quote 20 or 30 psychology papers and explain why their statistics were wrong and so their conclusions were wrong. They lived in fear of him.

If there existed robust and reasonably transparent code to do what you do, along with great documentation to show users where the pitfalls are, and you got a valued publication every time you showed that a significant result was done wrong and you gave the better version, very likely you would be better off. I'm pretty sure astronomy would be better off.

I don't know how we could get from here to there, but it's something to consider.


I can't help putting this comment in the context of "Top Ten Reasons to Not Share Your Code" by Randall J Leveque (http://faculty.washington.edu/rjl/icerm2012/icerm_reproducib...).

"The gist of the article is to urge readers to reconsider current attitudes about sharing code related to publications by pondering an “alternative universe” in which mathematics papers are not expected to contain the proofs of theorems. Many of the objections I hear repeatedly to sharing code can be applied to such a universe.[...]

4. Giving the proof to my competitors would be unfair to me. It took years to prove this theorem, and the same idea can be used to prove other theorems. I should be able to publish at least 5 more papers before sharing the proof. If I share it now my competitors can use the ideas in it without having to do any work, and perhaps without even giving me credit since they won’t have to reveal their proof technique in their papers."


I'm sorry, but this attitude is just wrong. In order to trust results, scientific source needs to be verified. I'm a physicist and I've seen people not share code for a few reasons:

1) The code smells bad -- I've had code that's useful, but that I feel is really ugly; even that I've released when people have asked, because it's better than having it just sit around.

2) To gain papers -- for example, I once needed maxent for something and a guy wouldn't release his source; he wanted to run that portion of the problem himself and be a co-author, so I just wrote my own.

3) Competitive advantage -- like the OP says, he wants to keep an edge by having code that other people don't. Personally, I prefer to keep an edge by doing better science. I think if you're writing software that others use, you still get some indirect credit from it, but things progress better if the code is available and everyone isn't wasting time reinventing the wheel -- it also helps with error-checking.


Contrast with the CPU simulator SimpleScalar, whose source code is freely available to academic users. Google Scholar lists hundreds of citations for the original tech report (you wouldn't do work based on someone else's simulation code without citing them, would you?), so it seems safe to say the work has done quite a bit for the author's professional status.


I had a similar dilemma in my second semester at university. I changed my career to business and now write astronomical software as a hobby. I probably write more astronomical software than I would have if I'd stayed in the field, and it's more relaxing; I can focus on long-term projects.


As a first-year master's student in astronomy, I fully agree with you.


Have you also seen the Open Exoplanet Catalogue proposal by Hanno Rein? (http://arxiv.org/abs/1211.7121)

I have no idea if you know him or not, but if not, you should consider getting in touch!


Isn't that exactly what the virtual observatory already has? They base their tables on XML, and call it a VOTable. Lots of fabulous software can read and write those.
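(For instance, a minimal sketch with astropy -- "catalogue.vot" is just a placeholder file name:)

    # Minimal sketch of reading a VOTable with astropy.
    # "catalogue.vot" stands in for whatever file a VO service returns.
    from astropy.io.votable import parse_single_table

    table = parse_single_table("catalogue.vot").to_table()  # -> astropy Table
    print(table.colnames)
    print(table[:5])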



