Every programmer knows that every programmer makes mistakes. Lots of them. Mistakes that can completely change the outcome of an analysis. And if the results of your analysis are to be taken seriously, people have to be able to trust them, which means they have to be able to check your working.
Not publishing your code is anti-science, selfish, and IMHO should be disallowed in the literature.
I've noticed that few people realize that "publishing your code" and "open source" are overlapping but distinct issues, and that the needs of one are sometimes opposed to the needs of the other.
Open source necessarily implies that the people who receive the software have the right to modify it, redistribute it with or without changes, and charge a fee for doing so.
No one has been able to tell me why software for scientific publications requires the third of those abilities. Everyone agrees that access to the code, and the ability to modify it, makes it much easier to review and understand what it does. I can also understand the need to share modifications with collaborators in order to help carry out the analysis. But I don't see why published software can't carry a prohibition on charging a fee for using modified versions of the software, or for services rendered which use the software.
My first question for you, then, is: do you need the ability to commercialize someone else's software in order to provide good scientific review of their publication? More specifically, what sorts of review would that prohibition eliminate?
In the other direction, I have scientific software which I sell for about $30K. (This is not hypothetical - I really do have this). Customers get the source code under the BSD license. This falls into every standard definition of "free software" and "open source" software. There's even an essay (at http://www.gnu.org/philosophy/selling.html ) encouraging people to sell free software.
If I publish a paper which uses the software, then am I obligated to give my peers access to the source code for no fee? Or can I publish the paper, say it's available under a BSD license, and charge $30K for access to it?
So my second question is, are there limits on what I can charge people in order to get access to my open source software, which was used for a paper? If so, what are they, and what is the ethical basis for that judgment? (For example, should it be "fair, reasonable and non-discriminatory"? Can it cover development costs? Distribution costs? Web site development costs?)
So to answer your first question, I don't think the right to sell someone else's code is related to the scientific process. I can't think of a way it affects the progress of science to use (or not use) commercialisation-friendly licenses, except in some indirect economic ways I don't know enough about to predict.
As to the second question, I think if you tried to charge reviewers for access to your code, your paper would never be accepted at a journal. Science is largely public funded. When public funds are used for research, the fruits of that research should be made public. That includes your code.
nods to your first answer. I would also say that being able to distribute source (with or without modifications) to others is important, though not as crucial. This prevents someone (cough Elsevier cough ACS) from having a monopoly on providing the source. But a restriction on, say, military use or wiretapping would not inhibit scientific progress, even if such a restriction makes it something other than free or open source software.
As for the second question, the scientific software I sell is not, nor ever has been, publicly funded. The people who have paid for it are all for-profit organizations. I don't even get R&D tax credits for it. While you're correct that much of science is publicly funded, my software isn't.
Going back to that question: are there limits on what I can charge people in order to get access to my open source software, which was used for a paper? If so, what are they, and what is the ethical basis for that judgment?
I can tell you that I would rather forgo publishing a paper, in order to keep earning income from selling my software, than describe the techniques I used in making it and the corrections and improvements to the existing literature that I developed. Which is better for overall scientific progress, and why?
This is especially important should I choose to publish as open access, since that journal in my field charges about $1,200, so I'm already planning to pay over a month's rent in order to publish. Should I also expect to lose the equivalent of a year's salary, in the hope that publishing the paper serves as a big enough advertisement for my services?
Which outcome is better for overall scientific progress depends entirely on what your software does, how large the need for it is, etc. However, one thing that is almost certainly true is that providing free (all senses) access to the software will maximise the social benefit. So if you can find a model that allows you to profit whilst still doing this, I encourage you to do so (see last point).
If I were in your situation I would say the issue of whether or not to publish is economic: publishing a paper about the software will bring it to the attention of the scientific community. That should lead to increased custom for you, provided the software is good and priced appropriately. An example of this is Robert Edgar of http://www.drive5.com. He has written several pieces of software which have really advanced computational biology, especially USEARCH. He makes the 32-bit version available freely, and there's a ~$800 per machine license for the 64-bit version. He also makes some tools available freely. This mixed model seems to have performed very well for him, as he is very widely cited which gets him a high profile and many large bioinformatics institutions buy licenses for his software.
Again, whether publishing open access is a good idea depends on which journal you publish in, how many people will benefit from learning of your advances, etc.
Since you say all the people who have paid for your software are for-profits, you could consider an unrestricted academic license with a paid commercial license (similar to all the baseclear products: http://www.baseclear.com/landingpages/basetools-a-wide-range...).
I think you can see that some others hold a different viewpoint from you and me. I've heard people say that the ability to review the software is essential for science and that the software must, in all cases, be available at trivial cost, if not for free, in order to allow that review. (I think this view doesn't have a moral justification.)
"the issue of whether or not to publish is economic"
Absolutely. It's advertising. It's then a question for me to decide how to maximize my profits AND maximize improvements to the field. (Only somewhat apropos, I always loved reading the PNAS blurb for each article: "The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.")
Getting back to the topic, suppose that a group uses 64-bit USEARCH to do research. You wrote "I'm saying that publishing the source code so that others can run and modify it is the crucial thing." However, that group is unable to provide the source code used to do their analysis. ("Licensee will not allow copies of the Software to be made or used by others", says the 32-bit license.)
If publishing the source code is crucial, then this secondary group, which uses the algorithm, is obligated to publish the software they used, no? Or does that obligation only apply to the primary developer of the algorithm? If the primary developer never publishes the source code, then should all secondary users be prohibited from using it in order to develop new science?
Does that prohibition extend to using Excel? Oracle? Built-in software in sequencing hardware? I don't see an obvious bright-line demarcation.
As for "unrestricted academic license", I disagree with some of the distinction between academic license/commercial license. There are academic labs with a lot more money than I have. The group I was in, in the 1990s, had a NeXT or IRIX box on each student's desk, for example. There are also academic labs which act as a front, of sorts, for a professor's commercial interests. Also, the software I'm working on makes things fast - some 40x more than what people would do on their own. Any group can make the time/money tradeoff, and an academic group may easily have more money than time.
I've decided on a different view. Anyone can get access to the older versions, at no cost and under a BSD license. The newer versions are available to anyone, for a fee, and under the BSD license. I'm not dependent on this product for revenue, so it's a test to see how successful this business model might be.
Out of curiosity, do you work for industry or academia?
I'm in academia (see my profile for details).
It's also useful for others to check the analysis of trusted data. If they do it with their own methods and get a different result, then it's time to actually compare the methods.
It's a good thing for the original researchers to release their code too. But looking for bugs in the code is no substitute for doing independent analysis and independent data collection. It's an important supplement, but it isn't enough.
For example, if a paper publishes a multiple sequence alignment of 100 sequences, then all you need to do is verify that it's at least as good as what other MSA programs generate. You don't need to be able to rerun the program.
Indeed, sometimes you can't. Someone could have manually aligned the sequence, and that process isn't reproducible.
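To make the alignment-quality check concrete, here's a toy sum-of-pairs scorer; the scoring scheme and the sequences are made up for illustration, and a real comparison would use a proper substitution matrix and affine gap penalties:

```python
def sum_of_pairs_score(alignment, match=1, mismatch=-1, gap=-2):
    """Toy sum-of-pairs score for a multiple sequence alignment
    given as equal-length strings; '-' denotes a gap."""
    cols = len(alignment[0])
    assert all(len(s) == cols for s in alignment)
    score = 0
    for col in range(cols):
        chars = [s[col] for s in alignment]
        for i in range(len(chars)):
            for j in range(i + 1, len(chars)):
                a, b = chars[i], chars[j]
                if a == '-' and b == '-':
                    continue  # gap-gap pairs are conventionally ignored
                elif a == '-' or b == '-':
                    score += gap
                else:
                    score += match if a == b else mismatch
    return score

# A published alignment can be compared against one from another
# tool without ever rerunning the original program:
published = ["AC-GT", "ACGGT", "AC-GT"]
rerun     = ["ACG-T", "ACGGT", "ACG-T"]
print(sum_of_pairs_score(published), sum_of_pairs_score(rerun))
```

The point is that the scoring function, not the aligner, is what the verification needs: if the published alignment scores at least as well as alternatives, the result stands regardless of how it was produced.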
You see this as well with fold prediction software, where it can be easy to show that a predicted fold is likely correct (ie, matches experimental observations), and where you don't necessarily care about the method used to get that fold prediction.
A genetic algorithm might be very sensitive to the compute environment. For example, the order of float operations generated by two different compiler settings, or by network traffic timings in a distributed system. This can lead to different minima, where the overall effectiveness is the same but the actual configuration is different. The GA search doesn't need to be reproducible; it's only the final effectiveness which is important.
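The float-ordering point is easy to demonstrate; a minimal Python sketch (the values are illustrative, not taken from any real GA):

```python
# Floating-point addition is not associative, so the order in which
# partial sums arrive (e.g. under different compiler optimisations,
# or from a distributed reduce whose message order varies) can change
# an accumulated fitness value.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # cancellation happens first, then c survives
right = a + (b + c)  # c is absorbed into b (below its precision) first

print(left)   # 1.0
print(right)  # 0.0
```

Two runs that accumulate the same terms in different orders can therefore land in different minima, which is why only the final effectiveness, not the bit-for-bit trajectory, is worth trying to reproduce.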
For those cases, I don't see the need for access to the underlying software in order to validate the result.
I agree about the genetic algorithm.
I actually wanted to know your opinion concerning the migration of open-source software to private software. The most common open-source and free software licenses, like MIT or the GNU GPL, allow this kind of migration. That is, Bob could modify Alice's public project, produce results, and publish them without sharing his sources with anybody. That looks unfair, since Alice did a great job and would most likely want to see the changes. How can Alice be protected from such situations? Is there any kind of license which forces you to share modified software when results it produces are published or made available to the general public?
I believe that a requirement to share source code should be imposed on private modifications of open-source scientific software.
Providing codes is imperative; they are part of the methods and should be available for examination to ensure the integrity of the science. Those conducting research funded with public monies should be required by the funding agencies to release the products of that research, not just the results, but the data and codes, too (absent truly compelling reasons, such as national security). Eventually, I expect funding agencies will indeed require this for astronomy, just as they do for some other sciences; journals could help the field along, could improve the transparency and reproducibility of research, by requiring code release upon publication.
Absent funding agencies and journals insisting on code release and the moral argument of reproducibility, what incentives would help convince code authors to release their software?
-> Patrick Vandewalle, "Code Sharing Is Associated with Research Impact in Image Processing", CiSE 2012
It could also be the pressure of colleagues who, as anonymous reviewers, would always ask for the code whenever a paper depends on computation. Journal policies will not switch to REQUIRING the code anytime soon, but peer-review can add some pressure.
That's not the whole story, and the fact that this article is getting published speaks to that; however, I still don't think it's likely that things will change in the short term.
My friend Matt Turk wrote about a similar topic recently: http://arxiv.org/abs/1301.7064
Of course that should not preclude the possibility of replication as described here.
I find the idea of threats to job security misguided. In the particle physics world some significant code (not everything, granted) has been made available for some time; e.g. Monte-Carlo simulators like Geant or lattice gauge theory codes like those from MILC, SciDAC or FermiQCD. Even highly optimised code has been available. Users are requested to cite the authors. No careers have been harmed, as far as I can tell.
Also, the code itself is not much use for replicating results without making the data available too. This can be more problematic politically and technically, but even here there are good precedents in the particle physics world.
Which is good as a way of 'settling information' into a medium, but an awful one for today's needs: it's the equivalent of stone tablets compared to what's now available.
It would be much better if the products of research could be shared in a more flexible format. For example, don't get me started on reading two-column PDFs on a computer screen.
Flexible formatting, attachable metadata, reusable "source code" (that could be an excel spreadsheet, a txt list of data, or source code per se), redrawable graphs, etc
We need a faster, more flexible, albeit official way of sharing data. Yes, there are ownership issues, IP issues, etc. But maybe having a more easily searchable/traceable system may help with this.
Metajournals are trying to do it, with an "Article" overlay to data/software publication: http://metajnl.com/. IPOL also tries to publish image processing software: http://www.ipol.im/. And there is also some work to interlink science assets (more than articles) on http://linkedscience.org/.
I welcome publication models not based on the "PDF standard" (i.e., an electronic clone of the centuries-old ink-and-paper text article), but simply sharing data/software is not sufficient. I think it is important to have an editorial line and a peer review, otherwise anything online would qualify as scientific communication.
Moreover, software and data are hardly sufficient by themselves; they need to be explained, documented, illustrated and discussed to be more than a "harddisk dump".
This is troubling. There's scarcely another field involving computation in which withholding source code is routinely accepted. The four-color map theorem proof (Appel & Haken, 1976) would never have been accepted without source code. Modern mathematics, to the degree to which it relies on computer results, also relies on the publication of source.
Another example is the recent revelation involving an error in an Excel spreadsheet and its effect on an economic analysis -- the correction wouldn't have been possible without publication of the spreadsheet alongside the conclusions drawn from it.
Also, replication is a cornerstone of serious science. Without replication, astrophysics becomes psychology, where replication is rare.
I hope this paper has the effect of correcting this systematic flaw in astrophysics publication.
The threat to job security is not just "perceived". Astronomers are frequently kept in a state of constant job-induced anxiety by the prevalent practice of fighting for 6-month to 2-year "postdocs" for the first 15-20 years of their careers (seriously, go to an astronomical conference; you would not believe the number of people greying under 35). To "keep" your job (in reality, get another postdoc) you must do two things: author papers and get papers cited.
This practice makes collaboration actually detrimental to a researcher's career unless one of two things happens: 1) they lead the collaboration and get their name as first author 2) the collaboration allows some kind of recompense for the time invested into the collaboration.
For this reason, many collaborations have a proprietary period, a time when the collaboration data is available exclusively to the people who have sunk their time into the collaboration. Imagine if the people behind the Millennium or Aquarius simulations were forced to publish their code as soon as they had run the simulations (this is true for any theorist)-- now these people have spent the past years of their lives tweaking and perfecting code _for which they get no publications or citations_, and before they can even begin to analyze the easiest results ("low-hanging fruit", the simplistic papers that usually go to collaboration members), the simulations are being run and analyzed all over the world. They have gained nothing by their efforts in actually developing the code (our world does not work like industry; we don't get a pat on the back and a raise for being team players).
For theorists, often entire careers revolve around codes that have been developed over the researcher's whole career-- to be forced to hand it over to all the first year PhDs in the world is a tremendous slap in the face.
For me, as an "experimentalist" (read, data-analyst), I also maintain job attractiveness by having sets of code that no one else has. I'm experienced in a variety of pattern finding algorithms, and have even invented a few myself for very specific problems-- and within my little sphere, people know this about me, I'm the person people come to for certain things. I spent years of my life perfecting these, learning about algorithms, learning the intricacies and quirks of the various datasets we use-- if someone wants to take my place in this community as "that guy" then I expect them to devote as much time as I have to learning these techniques inside out and then to do something better than me. I have no desire to pass my code off to a masters student and let him naively plug some dataset into it-- firstly because no one should ever rely on a black box in this field, and secondly because these codes all need to be tweaked to account for the different instruments and data structures being used.
Perhaps if my contract didn't have a built-in expiration I would be willing to care-bear-share my fractal search methods and spend my Thursday afternoons showing you how Savitzky-Golay smoothing works, but as long as I'm out of a job next July, and you're the guy applying for my job, you can keep your filthy hands off my code.
If you are publishing articles with no source code, how is that even science? Why not just skip the articles and publish unsubstantiated assertions? After all, if you publish details of how you worked stuff out, even without source code someone else might reproduce your work.
If the new rule was that you had to publish source then there would still be the same number of jobs in astronomy. I don't see how _everyone's_ careers could be negatively impacted. And don't you think if you are using the software from another group you would cite the authors?
"After all, if you publish details of how you worked stuff out, even without source code someone else might reproduce your work." - hopefully they do! and hopefully they do it in a different way, doing the same thing twice is pointless. This is why the Hubble Constant started off being 150, then was 50, and now hovers around 69. Because everyone did the problem in their own unique way and didn't borrow each other's code and procedures.
"there would still be the same number of jobs in astronomy" - yes, that's the problem, there aren't enough. So the job goes to the guy with the most citations and publications-- I want that guy to be me.
"And don't you think if you are using the software from another group you would cite the authors?" - that's not the whole point. Writing a simulation may take 3 years, then analysis can take another 3 years. If I wrote it, I want first dibs on analysing it, I don't want someone else taking the results I created and publishing them before me.
The system is cutthroat, sure, but it does have benefits if you know how to play the game. Politics is surprisingly omnipresent here.
That's what psychologists say (according to Richard Feynman in his now-famous criticism of the practice in "Cargo Cult Science"), but it's a fundamentally flawed view of science. Replication -- an exact duplication of a study and hopefully its results -- is a cornerstone of science. And without full disclosure of the original study's method and results, replication isn't possible.
> This is why the Hubble Constant started off being 150, then was 50, and now hovers around 69. Because everyone did the problem in their own unique way and didn't borrow each other's code and procedures.
That isn't an argument for the practice of withholding methods, it's an argument against it. Arriving at a realistic value for the Hubble constant wasn't accelerated by withholding methods, it was slowed. Imagine Einstein claiming that mass and energy are bound together in a clearly defined relationship, but not publishing the relationship or its derivation.
Arthur C. Clarke said, "Any sufficiently advanced technology is indistinguishable from magic." Apropos the present topic, a result unaccompanied by the methods used to produce it is also indistinguishable from magic, and certainly isn't science.
> If I wrote it, I want first dibs on analysing it, I don't want someone else taking the results I created and publishing them before me.
We're not taking about someone beating the originator to publication, we're talking about the originator publishing his methods along with his results and getting credit for both.
> Politics is surprisingly omnipresent here.
Perhaps, but always at the expense of science. Science relies on full disclosure -- both methods and results. How can a theoretical consensus be arrived at if the participants don't know what they're agreeing about?
My first quarter in grad school in statistics, a girl who was a couple of years ahead of me quit. She was doing an internship with an MD, and every time he got a newly diagnosed kid in the project he pondered which group to put the kid in. "This one's going to die, where do I put him to make the results come out right?"
The institution was considered second-class in that field, so he got the money to replicate a Harvard result. If he got the right results it would make him look more competent and he might get better grants. So he was doing his best to fudge the results so they would come out right.
She tried to argue that he should do his statistics correctly and he disagreed. She was so upset that her work was pointless, and that her career would be pointless, that she quit completely.
Ideally scientists would be rewarded for doing open science correctly. In some ways that is not the case now, and we should look at ways to fix that.
How can average mediocre scientists get job security without keeping secrets? Perhaps we are giving too many people a chance to be professional scientists, so that too many of them must lose out?
> How can average mediocre scientists get job security without keeping secrets?
By way of systematic changes, not by anything they necessarily can do themselves at the level you're describing.
By way of people near the situation willing to speak out, like those who brought Jan Hendrik Schön down:
Or in a similar way, by the actions of those who brought several highly regarded psychologists down:
Title: "Scandals force psychologists to do some soul-searching"
Quote relevant to the present topic: "For instance, right now there's no incentive for researchers to share their data, and a 2006 study found that of 141 researchers who had previously agreed to share their data, only 38 did so when asked."
But the existence of the linked article and others like it means something is being done -- many psychologists have been outed as frauds, or fired, or forced into retirement.
So there's reason to hope that, if enough daylight shines on these practices and laboratories, things will change.
yes but the pipeline goes METHODS -> (CODE) -> RESULTS.
you can read the methods and they check out. fine. you can't see the code. the results show something new and amazing. how do you know this is because of the methods used or because somebody flipped a minus sign in the code?
well, you don't. unless you reimplement the code from the methods and see if it's reproducible. except that nobody does this. but let's say they did and found completely different results, that are also new and amazing. who is right? the answer is: NOBODY KNOWS, because nobody is publishing their code!
so astrophysicist number 3 comes around, wants to know who is right, and has only these methods to go on, but no code. he has to start from scratch. because the universe hates us, he will find a third set of completely different new and amazing results that aren't even of the same datatype as the first two--but nobody knows who is right, because everyone is hiding their code and claiming it's based on the same methodology.
> doing the same thing twice is pointless.
No that's science.
What is pointless is doing the same thing twice, in the hope you're doing the same thing as the last guy, because you don't know what the last guy did. Also, hoping that if you make errors (like the last guy undoubtedly did, because hey this is unreviewed closed source code we're talking about), that they'll be different errors. Except you can't know, because nobody publishes their code.
All of this reminds me a little bit of the parable of the Stone Soup. And if the above behaviour is causing people stress and grey hairs about their job security, well I don't wish that upon anybody, but hoarding source code isn't the way forward either.
No one is going to spend time auditing the code to find a mistake either. In academia, the way these things are, and should be, validated is by comparing the output to known results. In simulation, this means: do my simulations match experimental reality? In data analysis, this means checking that you can verify something that's known to be true about a particular dataset.
That kind of validation in itself doesn't happen very often (it's often missing, for example, in computational chemistry). But in all honesty, having the code wouldn't (and doesn't) help.
I'm not arguing that the code shouldn't be open source, but that if you're presenting "new and amazing" results the way you back those up is not by saying you checked your code really well, but by showing that your method and implementation are consistent with known facts while presenting something new.
I'm a materials scientist specializing in microscopy and image analysis, not an astrophysicist. In my area of interest, there is rarely sufficient information published to confidently repeat results. Indeed, the current reproducible research community had its origins in computer vision (and was then embraced by the biostatistics community). I applaud the expectation of openness - especially if funded by government grants. Too often, we taxpayers pay for research that gets trapped behind paywalls.
I agree that the academic astrophysics community has it tough. This is symptomatic of the entire academic community that is producing too many Ph.D.s desiring too few academic positions. Guess what - it will only get worse when the unsustainable government spending and debt causes the education bubble to burst.
I believe Newton said, "If I have seen further it is by standing on the shoulders of giants.", not, "I have rediscovered everything in a slightly different way because no-one gave me more than a vague method to go on"
It is no bad thing to have several independent implementations of an algorithm. If there is a problem with one, then the others are likely to show that there is a problem. However, without open source code, all you can do is say, "Mmm, something's wrong somewhere" and write another version. You may end up with several papers that agree and one that doesn't, but you still can't draw any conclusions without looking at the code. Ultimately, you get a situation where everyone has to repeat the same work and see what they get, when what they should be doing is poring over the original sources and discussing which bits could/should have been implemented differently and how that affects the result.
The code is part of the method. If you can't show us the code, don't expect me to believe your 'results'. That is science.
This is deeply hypocritical. When someone publishes results generated by code, but doesn't publish the code, they are asking all readers to rely on the output of a black box.
But they are presenting the methodology they claim went into the code and allowing you to agree or disagree with the methodology. If you agree, you take their answer, if you disagree, you implement a different methodology, which necessitates new code anyways, and publish your contra-finding.
We usually converge on answers over years of varying different methods and attempts. No one in their right mind reads a paper and says "well, they found out the sun is actually in M31, I believe that now" They look at the culmination of the literature (fun fact, we still aren't positive exactly where the sun is, or how fast it's moving) and I suppose that is the black box.
What I was really referring to though was students who walk up to me and say they gaussian smoothed a sample, and have no idea what I mean when I ask if 3 sigma outliers were used in the fit or trashed. They just used some gaussSmooth algorithm and may or may not know what a gaussian even is.
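The distinction I'm asking those students about matters numerically. A minimal NumPy sketch of iterative 3-sigma clipping (the function name, thresholds, and test data are illustrative, not anyone's published pipeline):

```python
import numpy as np

def sigma_clip(data, nsigma=3.0, iters=5):
    """Iteratively discard points more than nsigma standard
    deviations from the mean before fitting or smoothing."""
    data = np.asarray(data, dtype=float)
    for _ in range(iters):
        mu, sd = data.mean(), data.std()
        keep = np.abs(data - mu) <= nsigma * sd
        if keep.all():
            break  # converged: nothing left to reject
        data = data[keep]
    return data

rng = np.random.default_rng(0)
# 1000 well-behaved points plus two gross outliers
sample = np.concatenate([rng.normal(0.0, 1.0, 1000), [50.0, 60.0]])
print(sample.mean())              # pulled away from 0 by the outliers
print(sigma_clip(sample).mean())  # close to the true mean of 0
```

Whether the outliers are clipped before the fit or left in is exactly the kind of choice a canned gaussSmooth call hides from a student who never reads the code.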
> But they are presenting the methodology they claim went into the code and allowing you to agree or disagree with the methodology. If you agree, you take their answer--
No. And this is the whole point. You can agree with the methodology, but as long as you haven't inspected the code, you cannot just accept they actually implemented it right!
In fact, chances are, they probably didn't. The best programmers write buggy and incorrect code, and from what I've seen in the field of physics, scientific code is no exception. I'd be surprised if it was very different for astrophysics.
So by publishing the methodology and the results, but not the code, they put up a nice show. But that's all it is: it's not reproducible. All that's really brought to the table is the methodology and some "say-so" results that nobody can check are accurate.
But it is quite common for someone to take the same problem, collect their own data, and run their own analysis (both steps can be completely different from the original study-- collecting data of a different type, from a different instrument, etc.; running an analysis of a different paradigm, running the same analysis to a different precision, running the same analysis via a different algorithm, etc.).
If you read the arxiv on a daily basis you can see huge academic arguments unfolding over the course of months and years.
There seems to be this idea that the conversation goes: "Yo I found this hypervelocity star" "Dope, let's move on"
It's actually more like: "Yo I found this hypervelocity star" "Nope, I got spectra and you're wrong" "Well I got ultra high-res spectra and I think he's right" "Actually all of you are forgetting asymmetric drift, this is just a geometry problem, l2angles" "Hey, I sit in my basement and play with MOND, it might help" "My simulations show something completely different"
Authors are called out and proven right or wrong on a daily basis, even if we can't watch them code over the shoulder. I actually think that's one of the beauties of it-- most of our methods are invented by trying to prove or disprove something in a new way.
The monster codes like GADGET (which ran the Millennium simulation everyone's seen) are usually made public after about five years of being proprietary.
Haha, that actually sounds almost exactly like the Stone Soup parable :)
( http://en.wikipedia.org/wiki/Stone_soup )
But to address the main point - there are two separate things to consider when assessing a paper's methods. First is the methodology, which as you say is always included in the paper. But secondly, and equally importantly, is the implementation.
Agreeing with the methodology does not make me confident in the results. Someone wrote (probably) a lot of code to generate the analysis, and the likelihood that it contains bugs is high. They may or may not affect the outcome. Without seeing the code, I'm not going to trust the results.
Of course, I don't expect to read the source code of every analysis, but if it's open to scrutiny by the community, and the results are of any importance, it will be validated.
The problem is actually worse in many experimental methods, where the results rely completely on the practitioner having done exactly what they say they did and done it correctly. No source code to publish there, but that doesn't excuse not publishing analytical code when it is available.
It's utterly bizarre that there would be any field of science that actively deterred collaboration because without collaboration it seems incredibly difficult to achieve great things.
Imagine if the people behind the Millennium or Aquarius simulations were forced to publish their code as soon as they had run the simulations (this would be true for any theorist). These people have spent the past years of their lives tweaking and perfecting code _for which they get no publications or citations_, and before they could even begin to analyze the easiest results (the "low hanging fruit", the simplistic papers that usually go to collaboration members), the simulations would be run and analyzed all over the world. They would have gained nothing by their efforts in actually developing the code (our world does not work like industry; we don't get a pat on the back and a raise for being team players).
My confusion: who said they had to release their code when it was finished? My understanding is that they would only release their code when they first publish. So I don't see a problem here: when they finish their code, but before they can analyze their first results, they have no external pressures.
However, once they publish those preliminary results? Yes, there will be external pressures, because others can start using their code to also look for interesting results. But that's the way things should progress. The original authors will still have an enormous benefit from having a deep understanding of the code, and the techniques they used in it.
I'm a computer scientist. I work on systems software research. I work in industry, so I can't release most of my code. When I was a PhD student, I always released my code. I wanted people to try to build on what I did. In fact, releasing my code is what has actually gotten me citations - people have either used my code directly in a larger project, or they have extended and improved upon it.
In systems software, academics always release their code. They want people to use it and build on it. The frequency with which other groups are able to beat out the original authors is almost zero - understanding a non-trivial code base takes time, and if the original authors are still working on the same problem, they have an enormous advantage.
Your attitude protects you, but prevents others from learning from you.
> Perhaps if my contract didn't have a built in expiration I would be willing to care-bear-share my fractal search methods and spend my Thursday afternoons showing you how to Savitsky-Golay smoothing works, but as long as I'm out of a job next July, and you're the guy applying for my job, you can keep your filthy hands off my code.
First, WOW. I can't imagine any sector (except, indeed, academia) where a "keep your filthy hands off my code" attitude like that is remotely acceptable.
Second question is: whose code, exactly? In most lines of work, if you write code for an employer (the university, in this case, I suppose), the copyright for that code implicitly belongs to the employer. In this particular case, that means they could, and arguably should, force you to open this code. It could very well be that your contract explicitly states otherwise, but it has to be explicit about code; that isn't just implied along with writings, for instance.
Imagine if you were to work for, say, a security analysis consultancy, and you wrote a lot of cool machine learning and analysis code to detect intrusions or leakage. Regardless of whether your contract expires, if you leave, you're not leaving with your code. And if you refused to document it properly, so that the next person couldn't use it, expect the contract to end prematurely.
I can imagine that would seem frustrating and scary NOW, but only because they made it seem like yours was a proper approach for all those years you gave it. Of course it wasn't, and deep down inside you know this to be true; if only everything had been open from the start.
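For what it's worth, the "secret" Savitzky-Golay smoothing mentioned earlier in this thread is textbook material: a sliding least-squares polynomial fit. A bare-bones version (my own sketch in plain NumPy, not anyone's hoarded implementation) fits in a dozen lines:

```python
import numpy as np

def savgol_coeffs(window, polyorder):
    """Least-squares smoothing coefficients for the window's center point."""
    half = window // 2
    x = np.arange(-half, half + 1)
    A = np.vander(x, polyorder + 1, increasing=True)  # columns: 1, x, x^2, ...
    # Row 0 of the pseudo-inverse gives the fitted value at x = 0.
    return np.linalg.pinv(A)[0]

def savgol_smooth(y, window=7, polyorder=2):
    """Savitzky-Golay smooth: fit a polynomial in each sliding window,
    keep the fitted center value. Edges use zero-padding here."""
    c = savgol_coeffs(window, polyorder)
    return np.convolve(np.asarray(y, dtype=float), c[::-1], mode="same")
```

What's scarce isn't the method, it's the labor of a tested, validated implementation, and that's exactly the part that hoarding keeps out of review.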
Actually, keeping code "close to your chest" is pretty common in the security industry. You'll see fuzzer frameworks get released all the time, but fuzzers which find "real bugs worth money" tend to get hoarded.
If you work for a consultancy and have private tools, you are not expected to hand those over to the company. They will appreciate it if you release some results with their name on it (as well as yours) every now and then though.
What you describe, however, is if someone already has developed these tools and then joins a security consultancy company.
Wouldn't it be very different if one developed the tools while under employment of a certain company (and it was your job to develop such type of tools)?
Because that's the whole point of the way copyright works here: if you develop this tool on company time, I really doubt they'd let you walk away with that IP, especially when the tools are worth that much money. And even if you develop it in your own spare evenings, the law is often specific that this doesn't matter, one reason being that it's impossible to prove (and often quite unlikely) that you didn't use any company resources or knowledge to do so.
There is a good possibility that university contracts have different rules about the IP you produce while doing research though.
Basically, yes, stuff done on company time is definitely theirs under 'work for hire'. Totally agree with you.
A lot of projects are done as evening/weekend work which is on shaky legal ground - similar to bootstrapping a company while employed. If you have a side activity which is making money, it's simplest to stay quiet about it.
Among the limited sample set of "people I know", it's considered extremely crass for a company to try to pull an undeserved IP grab. Such a company would find it really hard to get self-motivated people after pulling a move like that. Because everyone has side projects.
Although some are diligent about getting "my stuff versus your stuff" spelled out in contract, it's often a "gentleman's agreement"...
As an aside, I wanted to point out that it's really common for security companies (or simply, "groups") to keep internal tools and only release them to the public when they've wrung all the "juice" out of them - e.g. publicity value exceeds the value of the results.
I think this model is not too far off what was described originally.
We can't expect source code publication until we change the perception that it's trivial. If published code were as valuable as refereed papers for career advancement, the problem would be solved instantly. But if that change does not occur, requiring that analysis code be published will just stop the development of any new code (at least by career-rational people) and probably hurt the field more than it helps.
This seems to be getting closer to the root of the problem, any insight into why the job market is set up that way?
Firstly, it's a boys' club up top: a lot of PhD requirements are built around what are basically hazing rituals. They're designed to make you "prove" yourself. Instead of memorizing a frat's founding members, we memorize moments of inertia and rotation matrices, but it's really the same thing, seeing as I don't deal with those items on a daily basis at all, and a quick duckduckgo will give you more information than I ever could.
Secondly, astronomy and astrophysics is arguably the least essential of the sciences, so our funding shrinks accordingly when economies crash (what do we really make, coffee-table books?). We don't make quantum cryptography, and we can't levitate objects... although we do have some remarkably military-grade trajectory maps for all your satellites...
Thirdly I think it really is an attempt to make us work well past what any other industry would consider remotely acceptable. For a bit of insight into the situation, check this completely serious correspondence sent to an unnamed faculty which is now famous in the astrophysical community: http://jjcharfman.tumblr.com/post/33151387354/a-motivational...
I personally work with a 40-year-old doctor, a leader in their field, who has not published a major paper in the last 3 years. They're being deported and will move back in with their family if they can't get a job in the next month or so, after having had two or three 3-month "pity" contracts. Of course this is just a hearsay example, not the norm, but it illustrates the point.
Another thing to remember is that we basically cannot have real families: our jobs frequently require us to change states or countries at the end of our postdocs, and our salaries are on the order of an entry-level BSc coder's, usually $50-80k depending on skill. The hotshots in the field get 5-year contracts. At age 30, a 5-year contract is pretty much the best you can possibly do.
The result is that you are important, and no one else can check on you. That's kind of good for you, but there should be a way you can do better.
My junior year in college, some psychologists told me about a statistician who would, once a year, write a paper demonstrating ways that psychologists misused some statistical technique. He would quote 20 or 30 psychology papers and explain why their statistics were wrong and therefore their conclusions were wrong. They lived in fear of him.
If there existed robust and reasonably transparent code to do what you do, along with great documentation showing users where the pitfalls are, and you got a valued publication every time you showed that a significant result was done wrong and supplied the better version, very likely you would be better off. I'm pretty sure astronomy would be better off.
I don't know how we could get from here to there, but it's something to consider.
"The gist of the article is to urge readers to reconsider current attitudes about sharing code related to publications by pondering an "alternative universe" in which mathematics papers are not expected to contain the proofs of theorems. Many of the objections I hear repeatedly to sharing code can be applied to such a universe. [...]
4. Giving the proof to my competitors would be unfair to me.
It took years to prove this theorem, and the same idea can be used to prove other theorems. I should be able to publish at least 5 more papers before sharing the proof. If I share it now my competitors can use the ideas in it without having to do any work, and perhaps without even giving me credit since they won’t have to reveal their proof technique in their papers."
I have no idea if you know him or not, but if not, you should consider getting in touch!