On a less polemical note: to address the reproducibility of computational experiments, Andrew Davison presented his Sumatra project at Euroscipy 2010 (http://www.euroscipy.org/talk/1960). I thought I'd throw this in here as the OP's gripes are a problem in all of modern science, not just computer science. It's also a reminder that making the code available is but one element of the problem of reproducibility/falsifiability in modern science (though probably the biggest one).
I think, however, that most journals should at least ask for a shell script that downloads the code and data, runs the experiments and regenerates the graphs and tables as seen in the paper. This is not always practical (a lot of papers deal with over a terabyte of data, for example), but it is so more often than not. At least for my papers this is a nearly-attained goal.
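To make that concrete, here's a minimal sketch of what such a top-to-bottom script could look like. Everything in it is invented for illustration: the generated data stands in for a download step, and the one-line awk "experiment" stands in for the real analysis.

```shell
#!/bin/sh
set -eu  # abort on the first failing step

# Stand-in for "download the data": generate a toy dataset.
# In a real paper this would be a curl/wget of an archived dataset.
printf '1\n2\n3\n4\n' > data.txt

# Stand-in for "run the experiments": compute a single statistic.
mean=$(awk '{ s += $1 } END { printf "%.1f", s / NR }' data.txt)

# Stand-in for "regenerate the tables as seen in the paper".
printf 'metric\tvalue\nmean\t%s\n' "$mean" | tee table.tsv
```

The point isn't the contents, it's the shape: one command, no manual steps, and the final artifact (`table.tsv` here) is exactly what appears in the paper.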
I absolutely understand your feelings on hacky code. Every academic produces hacky code; there are precious few who don't. I myself, when I started, did not want to release my code for the same reason.
However, once I began to realize that we were all in the same boat of HMS Hacked Together, that feeling began to dissipate. My advisor calls it "research code", and it's fine, because as academics, we're all used to it!
That's why I usually just ask for source. I assume the build won't run on my Mac, and that's OK. I'm not really interested in running the tool, but in finding out exactly how you solved the problem.
I've asked people for code a few times, but in my experience it rarely helps once I actually get it, so I don't really ask anymore. What I really want in most cases is a clear enough English writeup, perhaps with pseudocode, so that I can understand how they solved their problem and, ideally, reimplement it myself. At least, that's the case when it's at a scale where reimplementing is feasible; if they built something absolutely gigantic it might be another story, but then their megabytes of messy research code I can't grok aren't very useful to me either, and I have no real choice but to wait for the cleaned-up release.
In short, I think "can this be reimplemented by a third party from the published literature?" is a better test for reproducibility than .tar.gzs are. And there's certainly a ways to go on that front, not least because in areas where 6-to-8-page conference papers are the norm, even well-meaning authors can't include enough details, and most don't get around to writing the detail-laden tech report version. But I guess I find code mostly useless for that purpose; it might as well be an asm dump for all the good I usually get out of it.
If I'm going to propose a probably impossible sea change, taking the baby step of saying "just show me what you've already done" instead of "now write another 12-20 page set of documentation" is the more likely of the impossible two :) In a perfect world, we'd have both!
And I think part of the reason this is worse with research code is that the meaty part of the code tends to be (at least in ML/NLP) a few equations from the paper, implemented in a hacky and convoluted way, and unless you're a world-class expert at keeping track of indices and one-Greek-letter variable names, there's very little to gain from the code over a well-written paper. I make an exception for tuning parameters, constants, tweaks, etc., but these shouldn't matter much anyway.
I agree with another poster here that having somebody else repeat the experiment with their own implementation is a better test of validity - if a second paper just copies the source code from the first and makes a few tweaks, mistakes could easily carry over.
But having the data available would be great.
Coming back to the topic of source code. I think there are three additional reasons source code is often not published:
1. Some scientists (I am trying to avoid making overgeneralizations) are bad programmers. Sometimes just enough hacks are stacked to produce results, but the result is not something to be particularly proud of.
2. Rinse and repeat. It's often possible to get more out of a discovery by applying it to multiple datasets. If the source is published, others could be eating your lunch.
3. There is a contract that prevents publishing the source code.
My PhD project is financed by the Dutch national science foundation. Fortunately, since software developed in my project adds to existing work under the LGPL (creating derivative works), my work is under the LGPL too. Copyleft can help you if (3) applies.
I try to follow the same strategy as the author: make software public on Github once a paper is accepted.
Publishing scripts for the complete workflow, starting with the raw data and ending with printing the table of results, would be best. But I've seen academics working in a way that is completely orthogonal to this - copying & pasting data to Excel or Matlab (or even re-typing them) and doing the analysis by hand in the GUI... I don't have any doubt they would be able to learn how to write the scripts, but I'm very sure they would put up heavy resistance to doing so.
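As a toy illustration of what scripting that manual step buys you (the raw file and the one-line "analysis" are invented for the example), the copy-paste-into-a-GUI workflow collapses to something like:

```shell
#!/bin/sh
set -eu

# Invented raw export, standing in for whatever the instrument produces.
cat > raw.csv <<'EOF'
sample,value
a,10
b,20
c,30
EOF

# The step otherwise done by hand in the GUI:
# skip the header, aggregate, and print the summary row.
awk -F, 'NR > 1 { s += $2; n++ } END { printf "n=%d mean=%.1f\n", n, s / n }' raw.csv
```

Re-typing numbers into a spreadsheet is exactly where transcription errors creep in; a script like this makes the analysis both repeatable and diffable.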
There is also the issue of preventing competitors (other researchers, in this case) from getting a free ride on your work - getting and preparing the data is a huge part of the researcher's work in some fields.
There's a story (unfortunately I forget where, perhaps someone else can jog my memory?) about a string theory PhD student who, after getting annoyed wasting months on a set of hundreds of straightforward but tedious and time consuming calculations, decided to just do it all once and for all and spend the time to put together a table of results for all of them. Of course, this helped his future work immensely.
When he went to his PhD advisor to ask what he thought about publishing that table, the guy looked at him like he was crazy. Again, I don't remember the quote, but it was along the lines of "What you have there will give you a 1000% speed advantage pushing out papers in this field compared to your peers - you'd have to be crazy to share that sort of competitive advantage with everyone else when you could keep it to yourself, this is your golden ticket!"
Which brings to light very clearly the source of the problem - the ideal of academia is to advance the overall state of knowledge as fast as possible, but once you start using the word "competitor" in a serious way that actually has bearing on whether you publish something useful or not, that ideal has been perverted.
Obviously it's the "publish or perish" mindset that causes this, and I absolutely understand why people would be tempted to see their supposed colleagues as competitors instead of collaborators (in the general sense, when they're not actively collaborating on a paper); it's one of the main reasons I decided not to go into academia, in fact - I saw too much political bullshit flying around even in the harder fields like physics and math. I have no idea how to solve any of this, but it's a serious breakdown in the system, and I suspect it (publish-or-perish, not just information hiding) hinders the long-term progression of academic knowledge in some of these fields by a large amount, not least because it rewards herd-like behavior and punishes exploration. That's another can of worms for another day, though...
Yes, but the person you are replying to is right to note the competitive aspects of research. A lot of people might say "well, this project is on-going, and I don't want people scooping/stealing it from me." It's a sad thing, but most research labs are in a competitive relationship with other ones, and a citation is less useful in those oh-so-important tenure reviews than a publication. I wish it were more about "standing on the shoulders of giants"!
That's partly why I didn't call on academics themselves to release code, but for some sort of authoritative institution instead, to level the field for everyone. That should remove the competitive aspects (I understand "that should" is a very naïve outlook ;) )
The reason CS researchers do not usually publish their code has nothing to do with dishonesty - nobody is trying to hide their code because it does not really work, or anything like that. It's not even that people are worried about being scooped, though that sometimes happens.
The main problem is that any time spent on cleaning up the code, packaging examples, writing instructions, answering bug complaints, etc. is time not spent on things that matter in academia - doing research, presenting it, and teaching students.
It might help if some conferences required source code submissions - but people might just submit to different conferences instead. The only real solution would be if funding agencies like NSF required that any projects funded through them have to release source code. This makes sense from a taxpayer's point of view, and would make the extra work acceptable (since everyone would have to do it).
The NSF is a great point.
One of the things that bugs me about this, which I didn't go into in the post for brevity's sake, is that a lot of research is funded by some government institution under the banner of public interest. If you are paid to create something, and then you lock it away for whatever reason, that's not in the public interest. Worse still, more money has to be spent for someone else to re-implement the exact same thing if they liked it!
I hadn't thought this out to the logical conclusion of having the funding body also ask for the code to be released, but I think it's a great idea.
Aside from legal issues, it seems to me the business proposition is the same as the open-source business proposition: you know the most about the system you've created, so you're in the best position to consult on it. If you want a startup, I think guys like Cloudera show that even if you give away what was traditionally thought of as the family jewels, you can still very effectively monetize. That's what the university should leverage.
Anyway, for most projects, you've already given the game away in the paper (or at least, should have done): the expensive thing was the idea. Reimplementation is cheap.
It's funny because the public (like you) demands access to the research because they paid for it, but the politicians view you as an economic investment and demand that you monetize and produce returns (e.g. tax revenue, employment from startups). The public doesn't write the rules.
In academic settings, the competitive aspects of research likely produce the same sorts of issues, and in compsci there are likely commercial interests involved too. Come up with a clever new NLP technique for semantic search and the VCs come out of the woodwork.
However I agree with the author and commenters that both code and data ought to be available to all, mainly because that's the only way to make progress. Research is hard, very hard, and building on the half-baked ideas, good and bad, is the only way progress is made.
Contrasting it with maths: it would seem stupid for a maths paper to describe a new proof without putting it forward in mathematical notation in the paper.
And for these other things, it might be good to see code a couple of times, but mostly at first, and to get up to speed in an area.
I think a problem is that, if you force all papers to include source code, then because science builds upon itself, you'd find that the average paper length for an area would climb a bit (going back down when a paradigm shift happens, because then the tricks stop working, but climbing all over again), and most of it would just be repeats of what's already there. Comparing to maths, it's like asking every paper proving a theorem to prove its lemmas, even very basic ones: sure, it'd help a novice understand what's going on, but it'll hinder progress more often than it'd help it. There's a place for introductory writing and a place for stand-on-the-shoulders-of-giants writing, and papers are mostly of the latter sort.
For example, a couple of decades ago every paper that used a naive bayes classifier would derive the equations, describe feature selection, weighting, etc; today, most just say "I use a naive bayes classifier for this, that, and that" and move on. Likewise for SVMs---you don't want to see the full code for most papers that use that, since it's a mess of kernel caches and dual variables that mean nothing whatsoever to the problem at hand (but the algorithm won't work without it).
You're right about the classifier not needing a mention now. My problem, starting out on my honours thesis, was that I would read something like that and then have to go research the thing they had just mentioned in passing, because their core audience already knows all about it. So I guess there's an aspect of knowing the best starting point for what you want to research as well.
It's not (exactly) my community, but SIGMOD has been trying to do this since 2008 with the "repeatability/workability committee." See this:
There's also an interesting FAQ about the repeatability requirements specifically here:
... as well as various follow-ups from that work, if you search on Google Scholar.
The issues as I understand them are:
1. Research code is often of poor quality, usually thousands of lines of unchecked output by a single graduate student. There are probably bugs, some of which may change the results. In the absence of code, results are assumed to be correct. It takes substantially longer to write good code, and hurts researcher "output" to do so. As a result, good quality code is actively discouraged.
2. Releasing code and/or data usually makes it substantially easier for others to duplicate or catch up to your research program, which is seen as a disadvantage if you are still in an area. It may lead to more citations, but that probably isn't enough for the effort. Other researchers also have an incentive to try to find bugs in the code/data, which they may overstate as they try to get their own work accepted.
3. The code and/or data itself may be copyrighted or have unclear distribution terms, for example, if you are doing experiments on a web crawl.
4. Actual production quality code that does something useful can be used as the basis of a startup or other venture, especially if the researcher is the only one who has and understands it. Furthermore, research groups can make money licensing their code to outside companies if they do not release it openly.
5. Many (most?) industrial papers involve code or data that cannot be released. Ultimately, in highly competitive conferences, it is hard to balance "unverifiable" papers written by industry with academia papers. A blanket ban on papers without code or data would remove a huge number of industry contributions, but an optional requirement for code or data mostly continues the status quo. Many of the most interesting recent papers (e.g., MapReduce) might not have been published with a code/data requirement.