(Disclaimer: Author of the OP) I absolutely understand your feelings on hacky co...

_delirium · on July 24, 2010

(Disclaimer: I work in your lab. ;-))

I've asked people for code a few times, but I don't really ask for it anymore, because I've never found it to help me. What I really want in most cases is a clear enough English writeup, perhaps with pseudocode, so that I can understand how they solved their problems and ideally reimplement the solution myself. At least, that's the case if it's at a scale where reimplementing is feasible; if they built something absolutely gigantic then it might be another story, but then their megabytes of messy research code I can't grok aren't very useful to me either, and I have no real choice but to wait for the cleaned-up release.

In short, I think "can this be reimplemented by a third party from the published literature?" is a better test for reproducibility than .tar.gzs are. And there's certainly a ways to go on that front, not least because in areas where 6-to-8-page conference papers are the norm, even well-meaning authors can't include enough details, and most don't get around to writing the detail-laden tech report version. But I guess I find code mostly useless for that purpose; it might as well be an asm dump for all the good I usually get out of it.

Lewisham · on July 24, 2010

I agree with your TL;DR, but as you say, we're in a culture of 6-to-8 pages. I actually quite like the 8-page limit for most papers: it forces authors to a brevity of expression that aids focus. But you're right that details are the first thing jettisoned.

If I'm going to propose a probably impossible sea change, taking the baby step of saying "just show me what you've already done" instead of "now write another 12-20 page set of documentation" is the more likely of the impossible two :) In a perfect world, we'd have both!

alextp · on July 25, 2010

This is also very true.

And I think part of the reason this is worse with research code is that the meaty part of the code tends to be (at least in ML/NLP) a few equations from the paper, implemented in a hacky and convoluted way, and unless you're a world-class expert on keeping track of indexes and one-Greek-letter variable names, there's very little to get from the code that a well-written paper wouldn't give you. I make an exception for tuning parameters, constants, tweaks, etc., but these shouldn't matter much anyway.